diff --git a/README.md b/README.md
index 916e5200b29841028652c861c49dbb3650baea3c..ef5bdc66ef03131318e1dde627e0224cca9137fd 100644
--- a/README.md
+++ b/README.md
@@ -4,9 +4,10 @@
 
 -----------------
 
-| **`Linux CPU`** | **`Linux GPU`** | **`Mac OS CPU`** | **`Windows CPU`** | **`Android`** |
-|-----------------|---------------------|------------------|-------------------|---------------|
-| [![Build Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-cpu)](https://ci.tensorflow.org/job/tensorflow-master-cpu) | [![Build Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-linux-gpu)](https://ci.tensorflow.org/job/tensorflow-master-linux-gpu) | [![Build Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-mac)](https://ci.tensorflow.org/job/tensorflow-master-mac) | [![Build Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-win-cmake-py)](https://ci.tensorflow.org/job/tensorflow-master-win-cmake-py) | [![Build Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-android)](https://ci.tensorflow.org/job/tensorflow-master-android) [ ![Download](https://api.bintray.com/packages/google/tensorflow/tensorflow/images/download.svg) ](https://bintray.com/google/tensorflow/tensorflow/_latestVersion) |
+
+| **`Documentation`** | **`Linux CPU`** | **`Linux GPU`** | **`Mac OS CPU`** | **`Windows CPU`** | **`Android`** |
+|-----------------|---------------------|------------------|-------------------|---------------|---------------|
+| [![Documentation](https://img.shields.io/badge/api-reference-blue.svg)](https://www.tensorflow.org/api_docs/) | [![Build Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-cpu)](https://ci.tensorflow.org/job/tensorflow-master-cpu) | [![Build Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-linux-gpu)](https://ci.tensorflow.org/job/tensorflow-master-linux-gpu) | [![Build Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-mac)](https://ci.tensorflow.org/job/tensorflow-master-mac) | [![Build Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-win-cmake-py)](https://ci.tensorflow.org/job/tensorflow-master-win-cmake-py) | [![Build Status](https://ci.tensorflow.org/buildStatus/icon?job=tensorflow-master-android)](https://ci.tensorflow.org/job/tensorflow-master-android) [ ![Download](https://api.bintray.com/packages/google/tensorflow/tensorflow/images/download.svg) ](https://bintray.com/google/tensorflow/tensorflow/_latestVersion) |
 
 **TensorFlow** is an open source software library for numerical computation using
 data flow graphs.  The graph nodes represent mathematical operations, while
@@ -21,20 +22,6 @@ organization for the purposes of conducting machine learning and deep neural
 networks research.  The system is general enough to be applicable in a wide
 variety of other domains, as well.
 
-**If you want to contribute to TensorFlow, be sure to review the [contribution
-guidelines](CONTRIBUTING.md). This project adheres to TensorFlow's
-[code of conduct](CODE_OF_CONDUCT.md). By participating, you are expected to
-uphold this code.**
-
-**We use [GitHub issues](https://github.com/tensorflow/tensorflow/issues) for
-tracking requests and bugs. So please see
-[TensorFlow Discuss](https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss) for general questions
-and discussion, and please direct specific questions to [Stack Overflow](https://stackoverflow.com/questions/tagged/tensorflow).**
-
-The TensorFlow project strives to abide by generally accepted best practices in open-source software development:
-
-[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/1486/badge)](https://bestpractices.coreinfrastructure.org/projects/1486)
-
 ## Installation
 *See [Installing TensorFlow](https://www.tensorflow.org/get_started/os_setup.html) for instructions on how to install our release binaries or how to build from source.*
 
@@ -75,6 +62,22 @@ $ python
 >>> sess.close()
 ```
 
+## Contribution guidelines
+
+**If you want to contribute to TensorFlow, be sure to review the [contribution
+guidelines](CONTRIBUTING.md). This project adheres to TensorFlow's
+[code of conduct](CODE_OF_CONDUCT.md). By participating, you are expected to
+uphold this code.**
+
+**We use [GitHub issues](https://github.com/tensorflow/tensorflow/issues) for
+tracking requests and bugs. So please see
+[TensorFlow Discuss](https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss) for general questions
+and discussion, and please direct specific questions to [Stack Overflow](https://stackoverflow.com/questions/tagged/tensorflow).**
+
+The TensorFlow project strives to abide by generally accepted best practices in open-source software development:
+
+[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/1486/badge)](https://bestpractices.coreinfrastructure.org/projects/1486)
+
 ## For more information
 
 * [TensorFlow Website](https://www.tensorflow.org)
diff --git a/tensorflow/SECURITY.md b/SECURITY.md
similarity index 97%
rename from tensorflow/SECURITY.md
rename to SECURITY.md
index 6ddac1f964dfba3afd240441e2a036bc24ee6d91..fea24b273920885ba8a1ae96aafbf7710df46e1f 100644
--- a/tensorflow/SECURITY.md
+++ b/SECURITY.md
@@ -233,7 +233,7 @@ v//Fw6ZeY+HmRDFdirjD7wXtIuER4vqCryIqR6Xe9X8oJXz9L/Jhslc=
 
 ### Known vulnerabilities
 
-| Type | Versions affected | Reported by | Additional Information |
-|------|:-----------------:|---------------------------------------|
-| out of bounds read| <=1.4 | TenCent Blade Team | [issue report](https://github.com/tensorflow/tensorflow/issues/14959) |
+| Type              | Versions affected |        Reported by | Additional Information      |
+|-------------------|:-----------------:|--------------------|-----------------------------|
+| out of bounds read|             <=1.4 | TenCent Blade Team | [issue report](https://github.com/tensorflow/tensorflow/issues/14959) |
 
diff --git a/configure b/configure
index 9c21d2b03a27714f05094667691e74c16fa89f35..66b66ba54ed68a9aa0ee556f84f68c3a83a495ab 100755
--- a/configure
+++ b/configure
@@ -8,7 +8,8 @@ if [ -z "$PYTHON_BIN_PATH" ]; then
 fi
 
 # Set all env variables
-"$PYTHON_BIN_PATH" configure.py
+CONFIGURE_DIR=$(dirname "$0")
+"$PYTHON_BIN_PATH" "${CONFIGURE_DIR}/configure.py" "$@"
 
 echo "Configuration finished"
 
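Review note: the wrapper now resolves `configure.py` relative to its own location and forwards `"$@"`, so flags given to `./configure` reach the Python script. The sketch below shows one way a forwarded `--workspace` flag could be consumed, modeled on the parser this patch adds to `main()` in `configure.py`. The helper name `parse_workspace` and the example paths are hypothetical and are not part of the patch.

```python
import argparse


def parse_workspace(argv, default_root):
  """Hypothetical helper: return the workspace path selected by argv."""
  parser = argparse.ArgumentParser()
  parser.add_argument('--workspace',
                      type=str,
                      default=default_root,
                      help='The absolute path to your active Bazel workspace.')
  return parser.parse_args(argv).workspace


# Without the flag, the default (the directory holding configure.py in the
# patch, a made-up path here) is used.
print(parse_workspace([], '/src/tensorflow'))
# With the flag, the caller's workspace wins.
print(parse_workspace(['--workspace', '/src/my_project'], '/src/tensorflow'))
```

Invoking `./configure --workspace /src/my_project` from another directory would then take the second path, assuming the wrapper forwards its arguments as shown in the hunk above.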
diff --git a/configure.py b/configure.py
index 3aa1a3e956c6a559b89cdeb593a96a95188c32ae..97f46757ee241b1532e1c3da7c567e9af8f559f0 100644
--- a/configure.py
+++ b/configure.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import argparse
 import errno
 import os
 import platform
@@ -32,10 +33,6 @@ except ImportError:
   from distutils.spawn import find_executable as which
 # pylint: enable=g-import-not-at-top
 
-_TF_BAZELRC = os.path.join(os.path.dirname(os.path.abspath(__file__)),
-                           '.tf_configure.bazelrc')
-_TF_WORKSPACE = os.path.join(os.path.dirname(os.path.abspath(__file__)),
-                             'WORKSPACE')
 _DEFAULT_CUDA_VERSION = '9.0'
 _DEFAULT_CUDNN_VERSION = '7'
 _DEFAULT_CUDA_COMPUTE_CAPABILITIES = '3.5,5.2'
@@ -51,6 +48,11 @@ _SUPPORTED_ANDROID_NDK_VERSIONS = [10, 11, 12, 13, 14, 15]
 
 _DEFAULT_PROMPT_ASK_ATTEMPTS = 10
 
+_TF_WORKSPACE_ROOT = os.path.abspath(os.path.dirname(__file__))
+_TF_BAZELRC_FILENAME = '.tf_configure.bazelrc'
+_TF_BAZELRC = os.path.join(_TF_WORKSPACE_ROOT, _TF_BAZELRC_FILENAME)
+_TF_WORKSPACE = os.path.join(_TF_WORKSPACE_ROOT, 'WORKSPACE')
+
 
 class UserInputError(Exception):
   pass
@@ -119,22 +121,6 @@ def sed_in_place(filename, old, new):
     f.write(newdata)
 
 
-def remove_line_with(filename, token):
-  """Remove lines that contain token from file.
-
-  Args:
-    filename: string for filename.
-    token: string token to check if to remove a line from file or not.
-  """
-  with open(filename, 'r') as f:
-    filedata = f.read()
-
-  with open(filename, 'w') as f:
-    for line in filedata.strip().split('\n'):
-      if token not in line:
-        f.write(line + '\n')
-
-
 def write_to_bazelrc(line):
   with open(_TF_BAZELRC, 'a') as f:
     f.write(line + '\n')
@@ -245,25 +231,30 @@ def setup_python(environ_cp):
   environ_cp['PYTHON_BIN_PATH'] = python_bin_path
 
   # Write tools/python_bin_path.sh
-  with open('tools/python_bin_path.sh', 'w') as f:
+  with open(os.path.join(
+      _TF_WORKSPACE_ROOT, 'tools', 'python_bin_path.sh'), 'w') as f:
     f.write('export PYTHON_BIN_PATH="%s"' % python_bin_path)
 
 
-def reset_tf_configure_bazelrc():
+def reset_tf_configure_bazelrc(workspace_path):
   """Reset file that contains customized config settings."""
   open(_TF_BAZELRC, 'w').close()
-
-  home = os.path.expanduser('~')
-  if not os.path.exists('.bazelrc'):
-    if os.path.exists(os.path.join(home, '.bazelrc')):
-      with open('.bazelrc', 'a') as f:
-        f.write('import %s/.bazelrc\n' % home.replace('\\', '/'))
+  bazelrc_path = os.path.join(workspace_path, '.bazelrc')
+
+  data = []
+  if os.path.exists(bazelrc_path):
+    with open(bazelrc_path, 'r') as f:
+      data = f.read().splitlines()
+  with open(bazelrc_path, 'w') as f:
+    for line in data:
+      if _TF_BAZELRC_FILENAME in line:
+        continue
+      f.write('%s\n' % line)
+    if is_windows():
+      tf_bazelrc_path = _TF_BAZELRC.replace('\\', '/')
     else:
-      open('.bazelrc', 'w').close()
-
-  remove_line_with('.bazelrc', 'tf_configure')
-  with open('.bazelrc', 'a') as f:
-    f.write('import %workspace%/.tf_configure.bazelrc\n')
+      tf_bazelrc_path = _TF_BAZELRC
+    f.write('import %s\n' % tf_bazelrc_path)
 
 
 def cleanup_makefile():
@@ -271,7 +262,8 @@ def cleanup_makefile():
 
   These files could interfere with Bazel parsing.
   """
-  makefile_download_dir = 'tensorflow/contrib/makefile/downloads'
+  makefile_download_dir = os.path.join(
+      _TF_WORKSPACE_ROOT, 'tensorflow', 'contrib', 'makefile', 'downloads')
   if os.path.isdir(makefile_download_dir):
     for root, _, filenames in os.walk(makefile_download_dir):
       for f in filenames:
@@ -456,7 +448,7 @@ def check_bazel_version(min_version):
   if which('bazel') is None:
     print('Cannot find bazel. Please install bazel.')
     sys.exit(0)
-  curr_version = run_shell(['bazel', '--batch', 'version'])
+  curr_version = run_shell(['bazel', '--batch', '--bazelrc=/dev/null', 'version'])
 
   for line in curr_version.split('\n'):
     if 'Build label: ' in line:
@@ -502,7 +494,8 @@ def set_cc_opt_flags(environ_cp):
   for opt in cc_opt_flags.split():
     write_to_bazelrc('build:opt --copt=%s' % opt)
   # It should be safe on the same build host.
-  write_to_bazelrc('build:opt --host_copt=-march=native')
+  if not is_ppc64le():
+    write_to_bazelrc('build:opt --host_copt=-march=native')
   write_to_bazelrc('build:opt --define with_default_optimizations=true')
   # TODO(mikecase): Remove these default defines once we are able to get
   # TF Lite targets building without them.
@@ -916,7 +909,7 @@ def set_tf_cudnn_version(environ_cp):
     tf_cudnn_version = get_from_env_or_user_or_default(
         environ_cp, 'TF_CUDNN_VERSION', ask_cudnn_version,
         _DEFAULT_CUDNN_VERSION)
-    tf_cudnn_version = reformat_version_sequence(str(tf_cudnn_version) ,1)
+    tf_cudnn_version = reformat_version_sequence(str(tf_cudnn_version), 1)
 
     default_cudnn_path = environ_cp.get('CUDA_TOOLKIT_PATH')
     ask_cudnn_path = (r'Please specify the location where cuDNN %s library is '
@@ -1078,7 +1071,7 @@ def set_tf_tensorrt_install_path(environ_cp):
           break
 
     # Reset and Retry
-    if len(possible_files):
+    if possible_files:
       print('TensorRT libraries found in one the following directories',
             'are not compatible with selected cuda and cudnn installations')
       print(trt_install_path)
@@ -1087,7 +1080,8 @@ def set_tf_tensorrt_install_path(environ_cp):
       if search_result:
         print(libnvinfer_path_from_ldconfig)
     else:
-      print('Invalid path to TensorRT. None of the following files can be found:')
+      print(
+          'Invalid path to TensorRT. None of the following files can be found:')
       print(trt_install_path)
       print(os.path.join(trt_install_path, 'lib'))
       print(os.path.join(trt_install_path, 'lib64'))
@@ -1228,7 +1222,7 @@ def set_host_c_compiler(environ_cp):
       environ_cp,
       var_name='HOST_C_COMPILER',
       var_default=default_c_host_compiler,
-      ask_for_var=('Please specify which C compiler should be used as the host'
+      ask_for_var=('Please specify which C compiler should be used as the host '
                    'C compiler.'),
       check_success=os.path.exists,
       error_msg='Invalid C compiler path. %s cannot be found.',
@@ -1372,13 +1366,20 @@ def config_info_line(name, help_text):
 
 
 def main():
+  parser = argparse.ArgumentParser()
+  parser.add_argument('--workspace',
+                      type=str,
+                      default=_TF_WORKSPACE_ROOT,
+                      help='The absolute path to your active Bazel workspace.')
+  args = parser.parse_args()
+
   # Make a copy of os.environ to be clear when functions and getting and setting
   # environment variables.
   environ_cp = dict(os.environ)
 
   check_bazel_version('0.5.4')
 
-  reset_tf_configure_bazelrc()
+  reset_tf_configure_bazelrc(args.workspace)
   cleanup_makefile()
   setup_python(environ_cp)
 
@@ -1433,8 +1434,10 @@ def main():
     if is_linux():
       set_tf_tensorrt_install_path(environ_cp)
     set_tf_cuda_compute_capabilities(environ_cp)
-    if 'LD_LIBRARY_PATH' in environ_cp and environ_cp.get('LD_LIBRARY_PATH') != '1':
-      write_action_env_to_bazelrc('LD_LIBRARY_PATH', environ_cp.get('LD_LIBRARY_PATH'))
+    if 'LD_LIBRARY_PATH' in environ_cp and environ_cp.get(
+        'LD_LIBRARY_PATH') != '1':
+      write_action_env_to_bazelrc('LD_LIBRARY_PATH',
+                                  environ_cp.get('LD_LIBRARY_PATH'))
 
     set_tf_cuda_clang(environ_cp)
     if environ_cp.get('TF_CUDA_CLANG') == '1':
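Review note: the rewritten `reset_tf_configure_bazelrc` no longer appends blindly. It reads the workspace `.bazelrc`, drops every line that mentions `.tf_configure.bazelrc`, and writes a single fresh `import` line. Below is a minimal standalone sketch of that filter-then-append behavior, assuming a temporary directory as the workspace. The names `rewrite_bazelrc` and `demo` and the example paths are hypothetical, and the Windows path normalization from the patch is omitted.

```python
import os
import tempfile

TF_BAZELRC_FILENAME = '.tf_configure.bazelrc'


def rewrite_bazelrc(workspace_path, tf_bazelrc_path):
  """Hypothetical stand-in: drop stale imports, then append exactly one."""
  bazelrc_path = os.path.join(workspace_path, '.bazelrc')
  lines = []
  if os.path.exists(bazelrc_path):
    with open(bazelrc_path, 'r') as f:
      lines = f.read().splitlines()
  with open(bazelrc_path, 'w') as f:
    for line in lines:
      # Skip any earlier import of the generated config file.
      if TF_BAZELRC_FILENAME in line:
        continue
      f.write('%s\n' % line)
    f.write('import %s\n' % tf_bazelrc_path)
  return bazelrc_path


def demo():
  """Builds a throwaway workspace with a stale import and rewrites it."""
  workspace = tempfile.mkdtemp()
  with open(os.path.join(workspace, '.bazelrc'), 'w') as f:
    f.write('build --verbose_failures\n')
    f.write('import /old/checkout/.tf_configure.bazelrc\n')
  path = rewrite_bazelrc(workspace, '/new/checkout/.tf_configure.bazelrc')
  with open(path, 'r') as f:
    return f.read().splitlines()


print(demo())
```

Running the rewrite twice leaves the file unchanged after the first pass, which is what makes re-running `./configure` safe for a user-maintained `.bazelrc`.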
diff --git a/tensorflow/BUILD b/tensorflow/BUILD
index dc995d231d3e591771f801e28024a76610cdba26..d152281d5d760d5afb8bc1605441fdfcb9c919bf 100644
--- a/tensorflow/BUILD
+++ b/tensorflow/BUILD
@@ -682,6 +682,7 @@ filegroup(
         "//tensorflow/tools/docs:all_files",
         "//tensorflow/tools/git:all_files",
         "//tensorflow/tools/graph_transforms:all_files",
+        "//tensorflow/tools/integration_tests/gcs_smoke_test:all_files",
         "//tensorflow/tools/mlpbtxt:all_files",
         "//tensorflow/tools/proto_text:all_files",
         "//tensorflow/tools/quantization:all_files",
@@ -787,6 +788,7 @@ tf_cc_shared_object(
     }),
     deps = [
         "//tensorflow/c:c_api",
+        "//tensorflow/c:c_api_experimental",
         "//tensorflow/c:exported_symbols.lds",
         "//tensorflow/c:version_script.lds",
         "//tensorflow/c/eager:c_api",
diff --git a/tensorflow/c/BUILD b/tensorflow/c/BUILD
index 9060c58c1395f07eff0ccef7bd430b3402f8c826..29ed957c9aa8cbe515f5f43bdccbf8c94f47c459 100644
--- a/tensorflow/c/BUILD
+++ b/tensorflow/c/BUILD
@@ -12,18 +12,15 @@ load(
     "tf_custom_op_library",
 )
 
-# For platform specific build config
-load(
-    "//tensorflow/core:platform/default/build_config.bzl",
-    "tf_kernel_tests_linkstatic",
-)
-
 # -----------------------------------------------------------------------------
 # Public targets
 
 filegroup(
     name = "headers",
-    srcs = ["c_api.h"],
+    srcs = [
+        "c_api.h",
+        "c_api_experimental.h",
+    ],
     visibility = ["//tensorflow:__subpackages__"],
 )
 
@@ -34,7 +31,11 @@ filegroup(
             "*.cc",
             "*.h",
         ],
-        exclude = ["*test*"],
+        exclude = [
+            "c_api_experimental.cc",
+            "c_api_experimental.h",
+            "*test*",
+        ],
     ),
     visibility = ["//visibility:public"],
 )
@@ -101,6 +102,24 @@ tf_cuda_library(
     }),
 )
 
+tf_cuda_library(
+    name = "c_api_experimental",
+    srcs = [
+        "c_api_experimental.cc",
+    ],
+    hdrs = [
+        "c_api_experimental.h",
+    ],
+    copts = tf_copts(),
+    visibility = ["//visibility:public"],
+    deps = [
+        ":c_api",
+        ":c_api_internal",
+        "//tensorflow/compiler/jit/legacy_flags:mark_for_compilation_pass_flags",
+        "//tensorflow/core:protos_all_cc",
+    ],
+)
+
 exports_files(
     [
         "version_script.lds",
@@ -148,7 +167,7 @@ tf_cuda_library(
     ],
     deps = [
         ":c_api",
-        "//tensorflow/compiler/jit/legacy_flags:mark_for_compilation_pass_flags",
+        ":c_api_experimental",
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core:session_options",
diff --git a/tensorflow/c/c_api_experimental.cc b/tensorflow/c/c_api_experimental.cc
new file mode 100644
index 0000000000000000000000000000000000000000..be7f85a5bb06dce84579b109d506ded049042b50
--- /dev/null
+++ b/tensorflow/c/c_api_experimental.cc
@@ -0,0 +1,39 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/c/c_api_experimental.h"
+
+#include "tensorflow/c/c_api_internal.h"
+#include "tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.h"
+#include "tensorflow/core/protobuf/config.pb.h"
+
+void TF_EnableXLACompilation(TF_SessionOptions* options, unsigned char enable) {
+  tensorflow::ConfigProto& config = options->options.config;
+  auto* optimizer_options =
+      config.mutable_graph_options()->mutable_optimizer_options();
+  if (enable) {
+    optimizer_options->set_global_jit_level(tensorflow::OptimizerOptions::ON_1);
+
+    // These XLA flags are needed to trigger XLA properly from C (more generally
+    // non-Python) clients. If this API is called again with `enable` set to
+    // false, it is safe to keep these flag values as is.
+    tensorflow::legacy_flags::MarkForCompilationPassFlags* flags =
+        tensorflow::legacy_flags::GetMarkForCompilationPassFlags();
+    flags->tf_xla_cpu_global_jit = true;
+    flags->tf_xla_min_cluster_size = 1;
+  } else {
+    optimizer_options->set_global_jit_level(tensorflow::OptimizerOptions::OFF);
+  }
+}
diff --git a/tensorflow/c/c_api_experimental.h b/tensorflow/c/c_api_experimental.h
new file mode 100644
index 0000000000000000000000000000000000000000..5a7b007e40aa199889b2d00b2bde5976c19e2966
--- /dev/null
+++ b/tensorflow/c/c_api_experimental.h
@@ -0,0 +1,66 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_C_C_API_EXPERIMENTAL_H_
+#define TENSORFLOW_C_C_API_EXPERIMENTAL_H_
+
+#include <stddef.h>
+#include <stdint.h>
+
+#include "tensorflow/c/c_api.h"
+
+// --------------------------------------------------------------------------
+// Experimental C API for TensorFlow.
+//
+// The API here is subject to change in the future.
+
+// Macro to control visibility of exported symbols in the shared library (.so,
+// .dylib, .dll).
+// This duplicates the TF_EXPORT macro definition in
+// tensorflow/core/platform/macros.h in order to keep this .h file independent
+// of any other includes.
+#ifdef SWIG
+#define TF_CAPI_EXPORT
+#else
+#if defined(COMPILER_MSVC)
+#ifdef TF_COMPILE_LIBRARY
+#define TF_CAPI_EXPORT __declspec(dllexport)
+#else
+#define TF_CAPI_EXPORT __declspec(dllimport)
+#endif  // TF_COMPILE_LIBRARY
+#else
+#define TF_CAPI_EXPORT __attribute__((visibility("default")))
+#endif  // COMPILER_MSVC
+#endif  // SWIG
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// When `enable` is true, set
+// tensorflow.ConfigProto.OptimizerOptions.global_jit_level to ON_1, and also
+// set XLA flag values to prepare for XLA compilation. Otherwise set
+// global_jit_level to OFF.
+//
+// This API is syntactic sugar over TF_SetConfig(), and is used by clients that
+// cannot read/write the tensorflow.ConfigProto proto.
+TF_CAPI_EXPORT extern void TF_EnableXLACompilation(TF_SessionOptions* options,
+                                                   unsigned char enable);
+
+#ifdef __cplusplus
+} /* end extern "C" */
+#endif
+
+#endif  // TENSORFLOW_C_C_API_EXPERIMENTAL_H_
diff --git a/tensorflow/c/c_test_util.cc b/tensorflow/c/c_test_util.cc
index a55af46ae2baef1cd4f55f478ec234551f370503..53346a8cdf26d98683579bfd5f0514d4b5fcc86b 100644
--- a/tensorflow/c/c_test_util.cc
+++ b/tensorflow/c/c_test_util.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/c/c_test_util.h"
 
-#include "tensorflow/compiler/jit/legacy_flags/mark_for_compilation_pass_flags.h"
+#include "tensorflow/c/c_api_experimental.h"
 #include "tensorflow/core/framework/function.pb.h"
 #include "tensorflow/core/framework/op_def.pb.h"
 #include "tensorflow/core/framework/tensor.pb.h"
@@ -34,6 +34,10 @@ static void DoubleDeallocator(void* data, size_t, void* arg) {
   delete[] static_cast<double*>(data);
 }
 
+static void FloatDeallocator(void* data, size_t, void* arg) {
+  delete[] static_cast<float*>(data);
+}
+
 TF_Tensor* Int8Tensor(const int64_t* dims, int num_dims, const char* values) {
   int64_t num_values = 1;
   for (int i = 0; i < num_dims; ++i) {
@@ -78,13 +82,21 @@ TF_Tensor* DoubleTensor(double v) {
                       &DoubleDeallocator, nullptr);
 }
 
+TF_Tensor* FloatTensor(float v) {
+  const int num_bytes = sizeof(float);
+  float* values = new float[1];
+  values[0] = v;
+  return TF_NewTensor(TF_FLOAT, nullptr, 0, values, num_bytes,
+                      &FloatDeallocator, nullptr);
+}
+
 // All the *Helper methods are used as a workaround for the restrictions that
 // one cannot call ASSERT_* methods in non-void-returning functions (when
 // exceptions are disabled during compilation)
 void PlaceholderHelper(TF_Graph* graph, TF_Status* s, const char* name,
-                       TF_Operation** op) {
+                       TF_DataType dtype, TF_Operation** op) {
   TF_OperationDescription* desc = TF_NewOperation(graph, "Placeholder", name);
-  TF_SetAttrType(desc, "dtype", TF_INT32);
+  TF_SetAttrType(desc, "dtype", dtype);
   *op = TF_FinishOperation(desc, s);
   ASSERT_EQ(TF_OK, TF_GetCode(s)) << TF_Message(s);
   ASSERT_NE(*op, nullptr);
@@ -92,7 +104,14 @@ void PlaceholderHelper(TF_Graph* graph, TF_Status* s, const char* name,
 
 TF_Operation* Placeholder(TF_Graph* graph, TF_Status* s, const char* name) {
   TF_Operation* op;
-  PlaceholderHelper(graph, s, name, &op);
+  PlaceholderHelper(graph, s, name, TF_INT32, &op);
+  return op;
+}
+
+TF_Operation* PlaceholderFloat(TF_Graph* graph, TF_Status* s,
+                               const char* name) {
+  TF_Operation* op;
+  PlaceholderHelper(graph, s, name, TF_FLOAT, &op);
   return op;
 }
 
@@ -126,6 +145,12 @@ TF_Operation* ScalarConst(double v, TF_Graph* graph, TF_Status* s,
   return Const(tensor.get(), graph, s, name);
 }
 
+TF_Operation* ScalarConst(float v, TF_Graph* graph, TF_Status* s,
+                          const char* name) {
+  unique_tensor_ptr tensor(FloatTensor(v), TF_DeleteTensor);
+  return Const(tensor.get(), graph, s, name);
+}
+
 void AddOpHelper(TF_Operation* l, TF_Operation* r, TF_Graph* graph,
                  TF_Status* s, const char* name, TF_Operation** op,
                  bool check) {
@@ -404,19 +429,7 @@ std::vector<string> GetFuncNames(const tensorflow::GraphDef& graph_def) {
 
 CSession::CSession(TF_Graph* graph, TF_Status* s, bool use_XLA) {
   TF_SessionOptions* opts = TF_NewSessionOptions();
-  tensorflow::legacy_flags::MarkForCompilationPassFlags* flags =
-      tensorflow::legacy_flags::GetMarkForCompilationPassFlags();
-  flags->tf_xla_cpu_global_jit = use_XLA;
-  if (use_XLA) {
-    tensorflow::ConfigProto config;
-    config.mutable_graph_options()
-        ->mutable_optimizer_options()
-        ->set_global_jit_level(tensorflow::OptimizerOptions::ON_1);
-    std::string contents;
-    contents.resize(config.ByteSizeLong());
-    config.SerializeToArray(&contents[0], contents.size());
-    TF_SetConfig(opts, contents.data(), contents.size(), s);
-  }
+  TF_EnableXLACompilation(opts, use_XLA);
   session_ = TF_NewSession(graph, opts, s);
   TF_DeleteSessionOptions(opts);
 }
diff --git a/tensorflow/c/c_test_util.h b/tensorflow/c/c_test_util.h
index 2a70177c724c569844a5d8ad42b99bed20209946..8cf060f73f1187f8f5281211785451be74882828 100644
--- a/tensorflow/c/c_test_util.h
+++ b/tensorflow/c/c_test_util.h
@@ -44,8 +44,14 @@ TF_Tensor* Int32Tensor(int32_t v);
 
 TF_Tensor* DoubleTensor(double v);
 
+TF_Tensor* FloatTensor(float v);
+
+// TODO(hongm): Change Placeholder() to take in a TF_DataType parameter, and
+// unify with PlaceholderFloat.
 TF_Operation* Placeholder(TF_Graph* graph, TF_Status* s,
                           const char* name = "feed");
+TF_Operation* PlaceholderFloat(TF_Graph* graph, TF_Status* s,
+                               const char* name = "feed");
 
 TF_Operation* Const(TF_Tensor* t, TF_Graph* graph, TF_Status* s,
                     const char* name = "const");
@@ -56,6 +62,9 @@ TF_Operation* ScalarConst(int32_t v, TF_Graph* graph, TF_Status* s,
 TF_Operation* ScalarConst(double v, TF_Graph* graph, TF_Status* s,
                           const char* name = "scalar");
 
+TF_Operation* ScalarConst(float v, TF_Graph* graph, TF_Status* s,
+                          const char* name = "scalar");
+
 TF_Operation* Add(TF_Operation* l, TF_Operation* r, TF_Graph* graph,
                   TF_Status* s, const char* name = "add");
 
diff --git a/tensorflow/c/eager/c_api.cc b/tensorflow/c/eager/c_api.cc
index 8e834eb99c13d1f26da9f0860897267efc2fd01c..4b619dc4e162baaede56a0d96a95acba288bb22e 100644
--- a/tensorflow/c/eager/c_api.cc
+++ b/tensorflow/c/eager/c_api.cc
@@ -31,8 +31,10 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/copy_tensor.h"
 #include "tensorflow/core/common_runtime/device_factory.h"
 #include "tensorflow/core/common_runtime/device_mgr.h"
+#include "tensorflow/core/common_runtime/device_set.h"
 #include "tensorflow/core/common_runtime/function.h"
 #include "tensorflow/core/common_runtime/rendezvous_mgr.h"
+#include "tensorflow/core/framework/node_def_util.h"
 #include "tensorflow/core/framework/rendezvous.h"
 #include "tensorflow/core/framework/tensor_shape.pb.h"
 #include "tensorflow/core/framework/types.h"
@@ -67,6 +69,18 @@ std::atomic_int_fast64_t func_id_generator(0);
 #endif  // TENSORFLOW_EAGER_USE_XLA
 }  // namespace
 
+TFE_ContextDevicePlacementPolicy PlacementPolicy(
+    bool soft_placement, TFE_ContextDevicePlacementPolicy original_policy) {
+  if (!soft_placement) {
+    return original_policy;
+  }
+  if (original_policy == TFE_DEVICE_PLACEMENT_EXPLICIT ||
+      original_policy == TFE_DEVICE_PLACEMENT_SILENT_FOR_INT32) {
+    return TFE_DEVICE_PLACEMENT_SILENT;
+  }
+  return original_policy;
+}
+
 extern "C" {
 
 TFE_ContextOptions* TFE_NewContextOptions() { return new TFE_ContextOptions; }
@@ -145,7 +159,7 @@ TFE_TensorHandle* TFE_NewTensorHandle(TF_Tensor* t, TF_Status* status) {
   tensorflow::Tensor tensor;
   status->status = tensorflow::TF_TensorToTensor(t, &tensor);
   if (!status->status.ok()) return nullptr;
-  return new TFE_TensorHandle(tensor, nullptr);
+  return new TFE_TensorHandle(tensor, nullptr, nullptr);
 }
 
 void TFE_DeleteTensorHandle(TFE_TensorHandle* h) { delete h; }
@@ -154,17 +168,22 @@ TF_DataType TFE_TensorHandleDataType(TFE_TensorHandle* h) {
   return static_cast<TF_DataType>(h->t.dtype());
 }
 
-int TFE_TensorHandleNumDims(TFE_TensorHandle* h) { return h->t.dims(); }
+int TFE_TensorHandleNumDims(TFE_TensorHandle* h, TF_Status* status) {
+  status->status = tensorflow::Status::OK();
+  return h->t.dims();
+}
 
-int64_t TFE_TensorHandleDim(TFE_TensorHandle* h, int dim_index) {
+int64_t TFE_TensorHandleDim(TFE_TensorHandle* h, int dim_index,
+                            TF_Status* status) {
+  status->status = tensorflow::Status::OK();
   return h->t.dim_size(dim_index);
 }
 
-const char* TFE_TensorHandleDeviceName(TFE_TensorHandle* h) {
-  // This might be a bit confusing as a tensor on CPU can sometimes return
-  // "CPU:0" and sometimes "/job:localhost/replica:0/task:0/cpu:0".
-  // TODO(ashankar): Figure out which one would be nicer.
-  return (h->d == nullptr) ? "CPU:0" : h->d->name().c_str();
+const char* TFE_TensorHandleDeviceName(TFE_TensorHandle* h, TF_Status* status) {
+  status->status = tensorflow::Status::OK();
+  return (h->op_device == nullptr)
+             ? "/job:localhost/replica:0/task:0/device:CPU:0"
+             : h->op_device->name().c_str();
 }
 
 TF_Tensor* TFE_TensorHandleResolve(TFE_TensorHandle* h, TF_Status* status) {
@@ -201,7 +220,8 @@ TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
   // has device type XLA_CPU, and the other CPU.
   const bool both_on_cpu = src_cpu && dst_cpu;
   if (is_same_device || both_on_cpu) {
-    return new TFE_TensorHandle(h->t, dst_cpu ? nullptr : dstd);
+    dstd = dst_cpu ? nullptr : dstd;
+    return new TFE_TensorHandle(h->t, dstd, dstd);
   }
   tensorflow::Tensor* src = &(h->t);
   if (!dst_cpu && (src->dtype() != tensorflow::DT_VARIANT &&
@@ -220,7 +240,8 @@ TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
   }
   tensorflow::Tensor dst(dstd->GetAllocator(attr), src->dtype(), src->shape());
   if (src->shape().num_elements() == 0) {
-    return new TFE_TensorHandle(dst, dst_cpu ? nullptr : dstd);
+    dstd = dst_cpu ? nullptr : dstd;
+    return new TFE_TensorHandle(dst, dstd, dstd);
   }
   tensorflow::DeviceContext* src_device_context = nullptr;
   if (!src_cpu) {
@@ -248,7 +269,8 @@ TFE_TensorHandle* TFE_TensorHandleCopyToDevice(TFE_TensorHandle* h,
                                  });
   n.WaitForNotification();
   return (TF_GetCode(status) == TF_OK)
-             ? new TFE_TensorHandle(dst, dst_cpu ? nullptr : dstd)
+             ? new TFE_TensorHandle(dst, dst_cpu ? nullptr : dstd,
+                                    dst_cpu ? nullptr : dstd)
              : nullptr;
 }
 
@@ -296,16 +318,15 @@ void TFE_OpSetXLACompilation(TFE_Op* op, unsigned char enable) {
 
 void TFE_OpAddInput(TFE_Op* op, TFE_TensorHandle* h, TF_Status* status) {
   // Questionable heuristic ...
-  //
-  // Motivation: After an 'op' is placed on GPU because some of its earlier
-  // inputs are on GPU, we want to keep the 'op' there, even if some later
-  // inputs of it are not on GPU.
-  if (IsCPU(op->device) && !IsCPU(h->d)) {
+  // - If a device was explicitly set on the op, always use that.
+  // - If not, place on the first non-host device seen.
+  if (op->device == nullptr && !IsCPU(h->d)) {
     op->device = h->d;
   }
   if (!status->status.ok()) return;
   op->inputs.push_back(h->t);
   op->input_devices.push_back(h->d);
+  op->input_op_devices.push_back(h->op_device);
   op->attrs.NumInputs(op->inputs.size());
 }
 
@@ -521,7 +542,8 @@ tensorflow::Status ValidateInputTypeAndPlacement(
       }
       // We are only here if the policy is warn or silent copies, so we should
       // trigger a copy.
-      TFE_TensorHandle original{op->inputs[i], op->input_devices[i]};
+      TFE_TensorHandle original{op->inputs[i], op->input_devices[i],
+                                op->device};
       TF_Status* s = TF_NewStatus();
       TFE_TensorHandle* copied_tensor = TFE_TensorHandleCopyToDevice(
           &original, ctx, expected_device->name().c_str(), s);
@@ -725,6 +747,7 @@ std::unique_ptr<TFE_Op> BuildXlaLaunch(TFE_Op* op, TF_Status* status) {
   // via `op_input_to_func_input`, adjust the actual inputs accordingly.
   launch_op->inputs = op->inputs;
   launch_op->input_devices = op->input_devices;
+  launch_op->input_op_devices = op->input_op_devices;
   if (!op_input_to_func_input.empty()) {
     DCHECK_EQ(op->inputs.size(), op_input_to_func_input.size());
     if (!op->input_devices.empty()) {
@@ -771,15 +794,38 @@ std::unique_ptr<TFE_Op> BuildXlaLaunch(TFE_Op* op, TF_Status* status) {
   return launch_op;
 }
 #endif  // TENSORFLOW_EAGER_USE_XLA
+
+tensorflow::Device* SelectDevice(const tensorflow::NodeDef& ndef,
+                                 TFE_Context* ctx, TF_Status* status) {
+  tensorflow::DeviceSet ds;
+  for (tensorflow::Device* d : ctx->devices()) {
+    ds.AddDevice(d);
+  }
+  tensorflow::DeviceTypeVector final_devices;
+  status->status = tensorflow::SupportedDeviceTypesForNode(
+      ds.PrioritizedDeviceTypeList(), ndef, &final_devices);
+  if (!status->status.ok()) {
+    return nullptr;
+  }
+  if (final_devices.empty()) {
+    status->status = tensorflow::errors::Internal(
+        "Could not find valid device for node ", ndef.DebugString());
+    return nullptr;
+  }
+  for (tensorflow::Device* d : ctx->devices()) {
+    if (d->device_type() == final_devices[0].type_string()) {
+      return d;
+    }
+  }
+  status->status = tensorflow::errors::Unknown(
+      "Could not find a device for node ", ndef.DebugString());
+  return nullptr;
+}
+
 }  // namespace
 
 void TFE_Execute(TFE_Op* op, TFE_TensorHandle** retvals, int* num_retvals,
                  TF_Status* status) {
-  TFE_Context* ctx = op->ctx;
-  // TODO(ashankar): ASSUMPTION: ctx->devices()[0] is always CPU
-  tensorflow::Device* device =
-      (op->device == nullptr) ? ctx->devices()[0] : op->device;
-
 #ifdef TENSORFLOW_EAGER_USE_XLA
   std::unique_ptr<TFE_Op> xla_launch_op;
   if (op->use_xla && op->name != "_XlaLaunch") {
@@ -790,10 +836,33 @@ void TFE_Execute(TFE_Op* op, TFE_TensorHandle** retvals, int* num_retvals,
     op = xla_launch_op.get();
   }
 #endif  // TENSORFLOW_EAGER_USE_XLA
+  TFE_Context* ctx = op->ctx;
+  tensorflow::Device* device = op->device;
+  // Ensure all resource-touching ops run on the device where the resource
+  // lives, regardless of anything else that has been specified. This is
+  // identical to the graph mode behavior.
+  for (int i = 0; i < op->inputs.size(); ++i) {
+    if (op->inputs[i].dtype() == tensorflow::DT_RESOURCE &&
+        op->input_op_devices[i] != device) {
+      tensorflow::Device* d = op->input_op_devices[i] == nullptr
+                                  ? ctx->devices()[0]
+                                  : op->input_op_devices[i];
+      VLOG(1) << "Changing device of operation " << op->name << " to "
+              << d->name() << " because input #" << i
+              << " is a resource in this device.";
+      device = d;
+      op->device = d;
+    }
+  }
+  if (!ctx->soft_placement && device == nullptr) {
+    // TODO(ashankar): ASSUMPTION: ctx->devices()[0] is always CPU
+    device = ctx->devices()[0];
+  }
 
   std::vector<tensorflow::Tensor> outputs(1);
   const tensorflow::MemoryTypeVector* output_memory_types = nullptr;
-  tensorflow::Fprint128 cache_key = op->attrs.CacheKey(device->name());
+  tensorflow::Fprint128 cache_key =
+      op->attrs.CacheKey(device == nullptr ? "unspecified" : device->name());
   tensorflow::KernelAndDevice* kernel;
   {
     tensorflow::tf_shared_lock l(ctx->cache_mu);
@@ -801,6 +870,17 @@ void TFE_Execute(TFE_Op* op, TFE_TensorHandle** retvals, int* num_retvals,
   }
   if (kernel == nullptr) {
     const tensorflow::NodeDef& ndef = op->attrs.BuildNodeDef();
+    if (ctx->soft_placement && device == nullptr) {
+      device = SelectDevice(ndef, ctx, status);
+      if (!status->status.ok()) {
+        return;
+      }
+    }
+    CHECK(device != nullptr);
+    if (ctx->log_device_placement) {
+      LOG(INFO) << "Executing op " << ndef.op() << " in device "
+                << device->name();
+    }
     kernel = new tensorflow::KernelAndDevice(ctx->rendezvous);
     // Knowledge of the implementation of Init (and in-turn
     // FunctionLibraryRuntime::CreateKernel) tells us that ctx->func_lib_def
@@ -814,9 +894,34 @@ void TFE_Execute(TFE_Op* op, TFE_TensorHandle** retvals, int* num_retvals,
       delete kernel;
       return;
     }
+    // Update output_dtypes inside `kernel`.
+    const tensorflow::OpDef* op_def = nullptr;
+    const tensorflow::FunctionDef* function_def =
+        ctx->func_lib_def.Find(ndef.op());
+    if (function_def != nullptr) {
+      op_def = &(function_def->signature());
+    }
+    if (op_def == nullptr) {
+      status->status = OpDefForOp(ndef.op().c_str(), &op_def);
+      if (!status->status.ok()) {
+        return;
+      }
+    }
+    tensorflow::DataTypeVector input_dtypes;
+    status->status = InOutTypesForNode(ndef, *op_def, &input_dtypes,
+                                       kernel->output_dtypes());
+    if (!status->status.ok()) {
+      return;
+    }
     tensorflow::mutex_lock ml(ctx->cache_mu);
     tensorflow::gtl::InsertOrUpdate(&(ctx->kernel_cache), cache_key, kernel);
   }
+  if (device == nullptr) {
+    // TODO(apassos) debug how the assignment below might return a different
+    // device from the one requested above.
+    device = kernel->device();
+  }
+
   std::vector<TFE_TensorHandle*> copied_tensors;
   status->status = ValidateInputTypeAndPlacement(
       ctx, ctx->devices()[0], device, op, kernel->kernel(), &copied_tensors);
@@ -882,7 +987,7 @@ void TFE_Execute(TFE_Op* op, TFE_TensorHandle** retvals, int* num_retvals,
         (*output_memory_types)[i] == tensorflow::HOST_MEMORY) {
       d = nullptr;
     }
-    retvals[i] = new TFE_TensorHandle(outputs[i], d);
+    retvals[i] = new TFE_TensorHandle(outputs[i], d, device);
   }
 }
 
@@ -908,7 +1013,7 @@ void TFE_ContextAddFunction(TFE_Context* ctx, TF_Function* function,
 }  // extern "C"
 
 TFE_TensorHandle* TFE_NewTensorHandle(const tensorflow::Tensor& t) {
-  return new TFE_TensorHandle(t, nullptr);
+  return new TFE_TensorHandle(t, nullptr, nullptr);
 }
 
 const tensorflow::Tensor* TFE_TensorHandleUnderlyingTensorInHostMemory(
diff --git a/tensorflow/c/eager/c_api.h b/tensorflow/c/eager/c_api.h
index 7a321b54da343fd2b8912187bc620c1e7456db0c..9610ca1b3bd6c0a77268709abaa9f899d476bde9 100644
--- a/tensorflow/c/eager/c_api.h
+++ b/tensorflow/c/eager/c_api.h
@@ -61,7 +61,8 @@ TF_CAPI_EXPORT extern void TFE_ContextOptionsSetConfig(
 // Controls how to act when we try to run an operation on a given device but
 // some input tensors are not on that device.
 typedef enum TFE_ContextDevicePlacementPolicy {
-  // Running operations with input tensors on the wrong device will fail.
+  // Running operations with input tensors on the wrong device will fail. When
+  // soft placement is enabled, this acts like TFE_DEVICE_PLACEMENT_SILENT.
   TFE_DEVICE_PLACEMENT_EXPLICIT = 0,
   // Copy the tensor to the right device but log a warning.
   TFE_DEVICE_PLACEMENT_WARN = 1,
@@ -69,7 +70,8 @@ typedef enum TFE_ContextDevicePlacementPolicy {
   // operation will be blocked till the copy completes.
   TFE_DEVICE_PLACEMENT_SILENT = 2,
   // Default placement policy which silently copies int32 tensors but not other
-  // dtypes.
+  // dtypes. When soft placement is enabled, this acts like
+  // TFE_DEVICE_PLACEMENT_SILENT.
   TFE_DEVICE_PLACEMENT_SILENT_FOR_INT32 = 3,
 } TFE_ContextDevicePlacementPolicy;
 
@@ -119,11 +121,13 @@ TF_CAPI_EXPORT extern TFE_TensorHandle* TFE_NewTensorHandle(TF_Tensor* t,
                                                             TF_Status* status);
 TF_CAPI_EXPORT extern void TFE_DeleteTensorHandle(TFE_TensorHandle* h);
 TF_CAPI_EXPORT extern TF_DataType TFE_TensorHandleDataType(TFE_TensorHandle* h);
-TF_CAPI_EXPORT extern int TFE_TensorHandleNumDims(TFE_TensorHandle* h);
+TF_CAPI_EXPORT extern int TFE_TensorHandleNumDims(TFE_TensorHandle* h,
+                                                  TF_Status* status);
 TF_CAPI_EXPORT extern int64_t TFE_TensorHandleDim(TFE_TensorHandle* h,
-                                                  int dim_index);
+                                                  int dim_index,
+                                                  TF_Status* status);
 TF_CAPI_EXPORT extern const char* TFE_TensorHandleDeviceName(
-    TFE_TensorHandle* h);
+    TFE_TensorHandle* h, TF_Status* status);
 TF_CAPI_EXPORT extern TF_Tensor* TFE_TensorHandleResolve(TFE_TensorHandle* h,
                                                          TF_Status* status);
 
diff --git a/tensorflow/c/eager/c_api_internal.h b/tensorflow/c/eager/c_api_internal.h
index 7b9f1db02ed9c53a280c7bd1284165cac4fb6353..145e4c95cf07373261f81912fe1c35f8db2f9ebd 100644
--- a/tensorflow/c/eager/c_api_internal.h
+++ b/tensorflow/c/eager/c_api_internal.h
@@ -43,15 +43,23 @@ struct TFE_ContextOptions {
       TFE_DEVICE_PLACEMENT_SILENT_FOR_INT32};
 };
 
+TFE_ContextDevicePlacementPolicy PlacementPolicy(
+    bool soft_placement, TFE_ContextDevicePlacementPolicy original_policy);
+
 struct TFE_Context {
   explicit TFE_Context(const TFE_ContextOptions& opts, TF_Session* s)
-      : policy(opts.policy),
+      : soft_placement(
+            opts.session_options.options.config.allow_soft_placement()),
+        policy(PlacementPolicy(soft_placement, opts.policy)),
         session(s),
         rendezvous(new tensorflow::IntraProcessRendezvous(s->device_mgr)),
         pflr(new tensorflow::ProcessFunctionLibraryRuntime(
             session->device_mgr, opts.session_options.options.env,
-            TF_GRAPH_DEF_VERSION, &func_lib_def, {})) {}
+            TF_GRAPH_DEF_VERSION, &func_lib_def, {})),
+        log_device_placement(
+            opts.session_options.options.config.log_device_placement()) {}
 
+  const bool soft_placement;
   const TFE_ContextDevicePlacementPolicy policy;
 
   // Note: we cannot use C++11 thread_local here as there is no concept of a
@@ -88,11 +96,14 @@ struct TFE_Context {
   std::atomic<bool> should_store_metadata{false};
   tensorflow::mutex metadata_mu;
   tensorflow::RunMetadata run_metadata GUARDED_BY(metadata_mu);
+
+  const bool log_device_placement;
 };
 
 struct TFE_TensorHandle {
-  TFE_TensorHandle(const tensorflow::Tensor& t, tensorflow::Device* d)
-      : t(t), d(d) {}
+  TFE_TensorHandle(const tensorflow::Tensor& t, tensorflow::Device* d,
+                   tensorflow::Device* op_device)
+      : t(t), d(d), op_device(op_device) {}
 
   tensorflow::Tensor t;
   // TODO(ashankar): d == nullptr iff local CPU
@@ -104,6 +115,10 @@ struct TFE_TensorHandle {
   // TODO(ashankar): Reference count TFE_Context to ensure that 'd' of a
   // TFE_TensorHandle does not outlive the TFE_Context from which it came?
   tensorflow::Device* d;
+
+  // Device in which the op producing this tensor was executed. Equal to d for
+  // constant tensors.
+  tensorflow::Device* op_device;
 };
 
 struct TFE_Op {
@@ -120,6 +135,7 @@ struct TFE_Op {
   const tensorflow::AttrTypeMap* attr_types;
   std::vector<tensorflow::Tensor> inputs;
   std::vector<tensorflow::Device*> input_devices;
+  std::vector<tensorflow::Device*> input_op_devices;
   tensorflow::Device* device;
   bool use_xla = false;
 };
diff --git a/tensorflow/c/eager/c_api_test.cc b/tensorflow/c/eager/c_api_test.cc
index 4a3ecbc0abb16296a84c0d2184dc3fc9f7f3ebb4..00fb7e68d00dd2ef316bf89b8f253cf6c7c63f00 100644
--- a/tensorflow/c/eager/c_api_test.cc
+++ b/tensorflow/c/eager/c_api_test.cc
@@ -932,7 +932,8 @@ TEST(CAPI, Variables) {
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   ASSERT_EQ(1, num_retvals);
   EXPECT_EQ(TF_FLOAT, TFE_TensorHandleDataType(value_handle));
-  EXPECT_EQ(0, TFE_TensorHandleNumDims(value_handle));
+  EXPECT_EQ(0, TFE_TensorHandleNumDims(value_handle, status));
+  ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
   float value = 0.0f;
   TF_Tensor* t = TFE_TensorHandleResolve(value_handle, status);
   ASSERT_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
@@ -974,7 +975,8 @@ void BM_ReadVariable(int iters) {
     CHECK_EQ(1, num_retvals);
     CHECK(h);
     CHECK_EQ(TF_FLOAT, TFE_TensorHandleDataType(h));
-    CHECK_EQ(0, TFE_TensorHandleNumDims(h));
+    CHECK_EQ(0, TFE_TensorHandleNumDims(h, status));
+    CHECK_EQ(TF_OK, TF_GetCode(status)) << TF_Message(status);
     h = nullptr;
   }
   tensorflow::testing::StopTiming();
diff --git a/tensorflow/c/eager/runtime.cc b/tensorflow/c/eager/runtime.cc
index f77a937f1ffc2d146224cb3191a5ca127daefc22..4bf24fec2cbceab3da0c6a39a2d68bcda5915de9 100644
--- a/tensorflow/c/eager/runtime.cc
+++ b/tensorflow/c/eager/runtime.cc
@@ -41,17 +41,26 @@ const uint32 kIsList = 1U << 31;
 
 }  // namespace
 
+Status OpDefForOp(const char* op_name, const OpDef** op_def) {
+  const OpRegistrationData* op_reg_data = nullptr;
+  Status s = OpRegistry::Global()->LookUp(op_name, &op_reg_data);
+  if (s.ok()) {
+    *op_def = &op_reg_data->op_def;
+  }
+  return s;
+}
+
 Status AttrTypeMapForOp(const char* op_name, const AttrTypeMap** out) {
   mutex_lock l(g_op_name_to_attr_type_map_lock);
   *out = gtl::FindPtrOrNull(*OpNameToAttrTypeMap(), op_name);
   if (*out != nullptr) return Status::OK();
-  const OpRegistrationData* op_reg_data = nullptr;
-  Status s = OpRegistry::Global()->LookUp(op_name, &op_reg_data);
+  const OpDef* op_def = nullptr;
+  Status s = OpDefForOp(op_name, &op_def);
   if (!s.ok()) return s;
   std::unique_ptr<AttrTypeMap> m(new AttrTypeMap);
   // TODO(agarwal): Avoid having to create this "registry" at runtime,
   // perhaps can be done at op registration time?
-  for (const auto& attr : op_reg_data->op_def.attr()) {
+  for (const auto& attr : op_def->attr()) {
     string type = attr.type();
     const bool is_list = (type.length() > 6 && type.compare(0, 4, "list") == 0);
     if (is_list) {
diff --git a/tensorflow/c/eager/runtime.h b/tensorflow/c/eager/runtime.h
index 4d20b5244a46fcde2eed0a429dced2a77b86aedd..985ed96735ea578d738f36bddb6a70647e200906 100644
--- a/tensorflow/c/eager/runtime.h
+++ b/tensorflow/c/eager/runtime.h
@@ -39,6 +39,9 @@ namespace tensorflow {
 // represent the TF_AttrType type of the values in the list.
 typedef std::unordered_map<string, uint32> AttrTypeMap;
 
+// Look up OpDef for `op_name`.
+Status OpDefForOp(const char* op_name, const OpDef** op_def);
+
 // Returns the AttrTypeMap for the TensorFlow operation named op_name.
 Status AttrTypeMapForOp(const char* op_name, const AttrTypeMap** out);
 
@@ -180,12 +183,17 @@ class KernelAndDevice {
 
   const OpKernel* kernel() const { return kernel_.get(); }
 
+  Device* device() const { return device_; }
+
+  DataTypeVector* output_dtypes() { return &output_dtypes_; }
+
  private:
   std::unique_ptr<OpKernel> kernel_;
   Device* device_;
   FunctionLibraryRuntime* flib_;
   checkpoint::TensorSliceReaderCacheWrapper slice_reader_cache_;
   Rendezvous* rendez_;
+  DataTypeVector output_dtypes_;
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/c/python_api.cc b/tensorflow/c/python_api.cc
index 6e37cdb5f4beea53d4a2ded0705ae482d0bc2d68..f553142d15f476ad2c1af68016a4254ed211b9b2 100644
--- a/tensorflow/c/python_api.cc
+++ b/tensorflow/c/python_api.cc
@@ -99,4 +99,9 @@ void RemoveAllControlInputs(TF_Graph* graph, TF_Operation* op) {
   }
 }
 
+void SetRequireShapeInferenceFns(TF_Graph* graph, bool require) {
+  mutex_lock l(graph->mu);
+  graph->refiner.set_require_shape_inference_fns(require);
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/c/python_api.h b/tensorflow/c/python_api.h
index aa9d9e06b28c54cb8869eb547d36ee3cb0d4e6b8..542d70f42c2a5df8309a722b32d850dd249e496f 100644
--- a/tensorflow/c/python_api.h
+++ b/tensorflow/c/python_api.h
@@ -37,6 +37,10 @@ void UpdateEdge(TF_Graph* graph, TF_Output new_src, TF_Input dst,
 
 void RemoveAllControlInputs(TF_Graph* graph, TF_Operation* op);
 
+// Sets whether ops missing a shape inference function should trigger an
+// error. The default is true.
+void SetRequireShapeInferenceFns(TF_Graph* graph, bool require);
+
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_C_PYTHON_API_H_
diff --git a/tensorflow/cc/gradients/nn_grad.cc b/tensorflow/cc/gradients/nn_grad.cc
index 13a3bba5e6d5ca19ff3f0eca76665ba7d3ab628d..63a67f09f6f7c2b39da8cf082c2a36179014ac6f 100644
--- a/tensorflow/cc/gradients/nn_grad.cc
+++ b/tensorflow/cc/gradients/nn_grad.cc
@@ -196,6 +196,70 @@ Status MaxPoolGradV2Helper(const Scope& scope, const Operation& op,
 }
 REGISTER_GRADIENT_OP("MaxPoolV2", MaxPoolGradV2Helper);
 
+Status MaxPool3DGradHelper(const Scope& scope, const Operation& op,
+                           const std::vector<Output>& grad_inputs,
+                           std::vector<Output>* grad_outputs) {
+  std::vector<int32> ksize;
+  std::vector<int32> strides;
+  string padding;
+  string data_format;
+  auto attrs = op.output(0).node()->attrs();
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "ksize", &ksize));
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "strides", &strides));
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "data_format", &data_format));
+  MaxPool3DGrad::Attrs grad_attrs;
+  grad_attrs.DataFormat(data_format);
+  auto dx = MaxPool3DGrad(scope, op.input(0), op.output(0), grad_inputs[0],
+                          ksize, strides, padding, grad_attrs);
+  grad_outputs->push_back(dx);
+  return scope.status();
+}
+REGISTER_GRADIENT_OP("MaxPool3D", MaxPool3DGradHelper);
+
+Status AvgPoolGradHelper(const Scope& scope, const Operation& op,
+                         const std::vector<Output>& grad_inputs,
+                         std::vector<Output>* grad_outputs) {
+  std::vector<int32> ksize;
+  std::vector<int32> strides;
+  string padding;
+  string data_format;
+  auto attrs = op.output(0).node()->attrs();
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "ksize", &ksize));
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "strides", &strides));
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "data_format", &data_format));
+  internal::AvgPoolGrad::Attrs grad_attrs;
+  grad_attrs.DataFormat(data_format);
+  auto dx =
+      internal::AvgPoolGrad(scope, Shape(scope, op.input(0)), grad_inputs[0],
+                            ksize, strides, padding, grad_attrs);
+  grad_outputs->push_back(dx);
+  return scope.status();
+}
+REGISTER_GRADIENT_OP("AvgPool", AvgPoolGradHelper);
+
+Status AvgPool3DGradHelper(const Scope& scope, const Operation& op,
+                           const std::vector<Output>& grad_inputs,
+                           std::vector<Output>* grad_outputs) {
+  std::vector<int32> ksize;
+  std::vector<int32> strides;
+  string padding;
+  string data_format;
+  auto attrs = op.output(0).node()->attrs();
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "ksize", &ksize));
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "strides", &strides));
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "padding", &padding));
+  TF_RETURN_IF_ERROR(GetNodeAttr(attrs, "data_format", &data_format));
+  AvgPool3DGrad::Attrs grad_attrs;
+  grad_attrs.DataFormat(data_format);
+  auto dx = AvgPool3DGrad(scope, Shape(scope, op.input(0)), grad_inputs[0],
+                          ksize, strides, padding, grad_attrs);
+  grad_outputs->push_back(dx);
+  return scope.status();
+}
+REGISTER_GRADIENT_OP("AvgPool3D", AvgPool3DGradHelper);
+
 Status LRNGradHelper(const Scope& scope, const Operation& op,
                      const std::vector<Output>& grad_inputs,
                      std::vector<Output>* grad_outputs){
diff --git a/tensorflow/cc/gradients/nn_grad_test.cc b/tensorflow/cc/gradients/nn_grad_test.cc
index 0cfe5f6e3c49f7c4a3cafbf48ff4e54a0ffd0d47..c4eba7ecb017fe4628140d75a63bc7f0f09deb7f 100644
--- a/tensorflow/cc/gradients/nn_grad_test.cc
+++ b/tensorflow/cc/gradients/nn_grad_test.cc
@@ -31,8 +31,11 @@ using ops::Elu;
 using ops::L2Loss;
 using ops::LogSoftmax;
 using ops::LRN;
+using ops::AvgPool;
+using ops::AvgPool3D;
 using ops::MaxPool;
 using ops::MaxPoolV2;
+using ops::MaxPool3D;
 using ops::Placeholder;
 using ops::Relu;
 using ops::Relu6;
@@ -70,9 +73,9 @@ class NNGradTest : public ::testing::Test {
 
   // Sets tensor with random values, ensuring that the max value is largest by
   // a reasonable amount.
-  // This is an issue for MaxPool and MaxPoolV2, in which perturbations by the
-  // numeric gradient computation in the gradient checker can change the max
-  // value if values are too close together.
+  // This is an issue for MaxPool, MaxPoolV2 and MaxPool3D, in which
+  // perturbations by the numeric gradient computation in the gradient checker
+  // can change the max value if values are too close together.
   template <typename T>
   void SetRandomValuesWithBumpedMax(Tensor* tensor) {
     auto tensor_flat = tensor->flat<T>();
@@ -203,6 +206,41 @@ TEST_F(NNGradTest, MaxPoolGradV2Helper) {
   RunTest(x, x_init_value, y, y_shape);
 }
 
+TEST_F(NNGradTest, MaxPool3DGradHelper) {
+  TensorShape x_shape({1, 3, 3, 3, 1});
+  TensorShape y_shape({1, 1, 1, 1, 1});
+  auto x = Placeholder(scope_, DT_FLOAT, Placeholder::Shape(x_shape));
+  // Set up window and strides so that we only do one MaxPool3D.
+  const std::vector<int> ksize{1, 3, 3, 3, 1};
+  const std::vector<int> strides{1, 3, 3, 3, 1};
+  auto y = MaxPool3D(scope_, x, ksize, strides, "VALID");
+  Tensor x_init_value = Tensor(DT_FLOAT, x_shape);
+  SetRandomValuesWithBumpedMax<float>(&x_init_value);
+  RunTest(x, x_init_value, y, y_shape);
+}
+
+TEST_F(NNGradTest, AvgPoolGradHelper) {
+  TensorShape x_shape({1, 2, 2, 1});
+  TensorShape y_shape({1, 1, 1, 1});
+  auto x = Placeholder(scope_, DT_FLOAT, Placeholder::Shape(x_shape));
+  // Set up window and strides so that we only do one AvgPool.
+  const std::vector<int> ksize{1, 2, 2, 1};
+  const std::vector<int> strides{1, 2, 2, 1};
+  auto y = AvgPool(scope_, x, ksize, strides, "SAME");
+  RunTest(x, x_shape, y, y_shape);
+}
+
+TEST_F(NNGradTest, AvgPool3DGradHelper) {
+  TensorShape x_shape({1, 3, 3, 3, 1});
+  TensorShape y_shape({1, 1, 1, 1, 1});
+  auto x = Placeholder(scope_, DT_FLOAT, Placeholder::Shape(x_shape));
+  // Set up window and strides so that we only do one AvgPool3D.
+  const std::vector<int> ksize{1, 3, 3, 3, 1};
+  const std::vector<int> strides{1, 3, 3, 3, 1};
+  auto y = AvgPool3D(scope_, x, ksize, strides, "SAME");
+  RunTest(x, x_shape, y, y_shape);
+}
+
 TEST_F(NNGradTest, LRN){
   TensorShape x_shape({1, 1, 2, 1});
   auto x = Placeholder(scope_, DT_FLOAT, Placeholder::Shape(x_shape));
diff --git a/tensorflow/cc/profiler/profiler.h b/tensorflow/cc/profiler/profiler.h
index 6077c45c5854fd5812ccb7c91522f93ed4e54883..64edbb5766c3604fbe0f15c2299843718381aa3f 100644
--- a/tensorflow/cc/profiler/profiler.h
+++ b/tensorflow/cc/profiler/profiler.h
@@ -61,18 +61,18 @@ class Profiler {
   /// Adds tracing information `run_meta` to profiler. A `run_meta` is
   /// generated by a TensorFlow session run call. `step` is the key
   /// to the `run_meta`. When calling ProfileXXX methods, caller can specify
-  /// `step` in `options` to seletively profile the corresponding `run_meta`.
+  /// `step` in `options` to selectively profile the corresponding `run_meta`.
   /// Multiple different `run_meta` can be keyed by the same `step` in order
   /// to group them together.
   void AddStep(int64 step, const RunMetadata& run_meta);
 
   /// Profiles the model by organizing nodes in graph structure.
-  /// Each node is an op and the nodes are contected by the op inputs/outputs.
+  /// Each node is an op and the nodes are connected by the op inputs/outputs.
   GraphNodeProto ProfileGraph(const Options& options);
 
   /// Profiles the model by organizing nodes in name scope structure.
   /// Each node is an op, and nodes are organized by the ops' name
-  /// scope, similar to a filesystem tree.
+  /// scope, similar to a file system tree.
   /// E.g. /foo is the root of operation /foo/matmul_1 and foo/conv_2.
   GraphNodeProto ProfileNameScope(const Options& options);
 
diff --git a/tensorflow/cc/tools/BUILD b/tensorflow/cc/tools/BUILD
index 97f66e79b8ad9f383b22f56e9385fc6d2080e1f8..f413a5cc52e9eb4bc393b8186f5b591681fa2e5e 100644
--- a/tensorflow/cc/tools/BUILD
+++ b/tensorflow/cc/tools/BUILD
@@ -32,6 +32,7 @@ tf_cc_test(
     deps = [
         ":freeze_saved_model",
         "//tensorflow/cc:cc_ops",
+        "//tensorflow/cc:resource_variable_ops",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:framework_internal",
         "//tensorflow/core:protos_all_cc",
diff --git a/tensorflow/cc/tools/freeze_saved_model.cc b/tensorflow/cc/tools/freeze_saved_model.cc
index ddf372cdef21e1b3892c9a03714478d5a5785517..4ddddcb5863c9ffb1e5367db750b0d2ffd29cd5e 100644
--- a/tensorflow/cc/tools/freeze_saved_model.cc
+++ b/tensorflow/cc/tools/freeze_saved_model.cc
@@ -75,16 +75,13 @@ void GetNodeNameToNodeDefMap(
 // variable nodes to convert.
 void GetReachableNodesAndVariables(
     GraphDef* graph_def, const std::unordered_set<string>& outputs,
+    const std::unordered_map<string, NodeDef*>& name_to_node_map,
     std::unordered_set<string>* reachable_node_names,
     std::unordered_set<string>* variable_node_names) {
   // TODO(suharshs): Add support for ResourceVariables.
   static const std::unordered_set<string>* kVariableTypes =
-      new std::unordered_set<string>({"Variable", "VariableV2"});
-  // name_to_node_map is needed to get the inputs from the NodeDef corresponding
-  // the a string node name. These inputs are used when doing our backwards
-  // traversal.
-  std::unordered_map<string, NodeDef*> name_to_node_map;
-  GetNodeNameToNodeDefMap(graph_def, &name_to_node_map);
+      new std::unordered_set<string>({"Variable", "VariableV2", "VarHandleOp"});
+
   std::queue<string> nodes_to_visit;
   for (const string& tensor_name : outputs) {
     // We need to strip off the tensor part to get the node name.
@@ -99,7 +96,7 @@ void GetReachableNodesAndVariables(
       continue;
     }
     reachable_node_names->insert(node_name);
-    NodeDef* node = name_to_node_map[node_name];
+    NodeDef* node = name_to_node_map.at(node_name);
     if (kVariableTypes->find(node->op()) != kVariableTypes->end()) {
       variable_node_names->insert(node->name());
     }
@@ -111,7 +108,9 @@ void GetReachableNodesAndVariables(
 
 // Gets a map from variable name to variable value.
 Status GetVariableNameToTensorMap(
-    Session* session, std::unordered_set<string> variable_names_set,
+    Session* session,
+    const std::unordered_map<string, NodeDef*>& name_to_node_map,
+    std::unordered_set<string> variable_names_set,
     std::unordered_map<string, Tensor>* variable_name_to_value_map) {
   if (variable_names_set.empty()) {
     return Status::OK();
@@ -120,8 +119,14 @@ Status GetVariableNameToTensorMap(
   std::vector<string> tensor_names;
   for (const string& node_name : variable_names_set) {
     variable_names.push_back(node_name);
-    // We need to run tensors, so append ":0".
-    tensor_names.push_back(node_name + ":0");
+    NodeDef* node_def = name_to_node_map.at(node_name);
+    if (node_def->op() == "VarHandleOp") {
+      // If this is a resource variable, we have to run the corresponding
+      // ReadVariableOp.
+      tensor_names.push_back(node_name + "/Read/ReadVariableOp:0");
+    } else {
+      tensor_names.push_back(node_name + ":0");
+    }
   }
   std::vector<Tensor> outputs;
   TF_RETURN_IF_ERROR(
@@ -143,6 +148,15 @@ void ConvertVariableToConstant(const NodeDef& variable_node,
       (*const_node->mutable_attr())["value"].mutable_tensor());
 }
 
+// Converts a ReadVariableOp NodeDef to an Identity NodeDef.
+void ConvertReadVariableOpToIdentity(const NodeDef& node,
+                                     NodeDef* identity_node) {
+  identity_node->set_name(node.name());
+  identity_node->set_op("Identity");
+  (*identity_node->mutable_attr())["T"] = node.attr().at("dtype");
+  identity_node->add_input(node.input(0));
+}
+
 // Freezes the subgraph of all nodes needed by `outputs`.
 Status FreezeGraphDef(const SavedModelBundle& saved_model_bundle,
                       const std::unordered_set<string>& outputs,
@@ -155,14 +169,19 @@ Status FreezeGraphDef(const SavedModelBundle& saved_model_bundle,
   if (graph_def.node_size() == 0) {
     return Status::OK();
   }
+  // name_to_node_map is needed to get the inputs from the NodeDef corresponding
+  // to a string node name. These inputs are used when doing our backwards
+  // traversal.
+  std::unordered_map<string, NodeDef*> name_to_node_map;
+  GetNodeNameToNodeDefMap(&graph_def, &name_to_node_map);
   std::unordered_set<string> reachable_node_names;
   std::unordered_set<string> variable_node_names;
-  GetReachableNodesAndVariables(&graph_def, outputs, &reachable_node_names,
-                                &variable_node_names);
+  GetReachableNodesAndVariables(&graph_def, outputs, name_to_node_map,
+                                &reachable_node_names, &variable_node_names);
   std::unordered_map<string, Tensor> variable_to_value_map;
-  TF_RETURN_IF_ERROR(
-      GetVariableNameToTensorMap(saved_model_bundle.session.get(),
-                                 variable_node_names, &variable_to_value_map));
+  TF_RETURN_IF_ERROR(GetVariableNameToTensorMap(
+      saved_model_bundle.session.get(), name_to_node_map, variable_node_names,
+      &variable_to_value_map));
   // We copy the nodes in the same order they were in the original graph_def.
   for (const NodeDef& node : graph_def.node()) {
     if (reachable_node_names.find(node.name()) == reachable_node_names.end()) {
@@ -171,6 +190,12 @@ Status FreezeGraphDef(const SavedModelBundle& saved_model_bundle,
     if (variable_node_names.find(node.name()) != variable_node_names.end()) {
       ConvertVariableToConstant(node, variable_to_value_map[node.name()],
                                 frozen_graph_def->add_node());
+    } else if (node.op() == "ReadVariableOp" &&
+               variable_node_names.find(node.input(0)) !=
+                   variable_node_names.end()) {
+      // If the node is a ReadVariableOp, its input VarHandleOp will be
+      // converted to a Constant, so we will need to convert it to an Identity.
+      ConvertReadVariableOpToIdentity(node, frozen_graph_def->add_node());
     } else {
       // If the node isn't a variable, just copy the node as-is.
       *frozen_graph_def->add_node() = node;
diff --git a/tensorflow/cc/tools/freeze_saved_model_test.cc b/tensorflow/cc/tools/freeze_saved_model_test.cc
index 52a81a50284aec36bba4e56a0232c886cb0cb6cf..cd35fd3b95deec669218cfa4f25fea2c3ac9e56e 100644
--- a/tensorflow/cc/tools/freeze_saved_model_test.cc
+++ b/tensorflow/cc/tools/freeze_saved_model_test.cc
@@ -15,6 +15,7 @@ limitations under the License.
 
 #include "tensorflow/cc/tools/freeze_saved_model.h"
 
+#include "tensorflow/cc/ops/resource_variable_ops.h"
 #include "tensorflow/cc/ops/standard_ops.h"
 #include "tensorflow/core/framework/function_testlib.h"
 #include "tensorflow/core/framework/graph.pb.h"
@@ -113,6 +114,160 @@ class FreezeTest : public ::testing::Test {
 
     test::ExpectTensorEqual<float>(unfrozen_outputs[0], frozen_outputs[0]);
   }
+
+  void TestFreezeGraphWithoutDependentVariables(bool use_resource) {
+    // Test freezing a graph with variables that are not needed by the outputs
+    // in the SignatureDef. The resulting graph shouldn't be frozen, but
+    // non-dependent nodes should be pruned.
+    SavedModelBundle saved_model_bundle;
+    GraphDef graph_def;
+    Scope scope = Scope::NewRootScope();
+    Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
+    Output b = ops::Const(scope.WithOpName("b"), 10.0f, {});
+    Output c = ops::Mul(scope.WithOpName("c"), a, b);
+    if (use_resource) {
+      Output var =
+          ops::VarHandleOp(scope.WithOpName("var"), DataType::DT_FLOAT, {});
+      Output read_var = ops::ReadVariableOp(
+          scope.WithOpName("var/Read/ReadVariableOp"), var, DataType::DT_FLOAT);
+      auto assign = ops::AssignVariableOp(scope.WithOpName("assign"), var, a);
+    } else {
+      Output var =
+          ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
+      Output assign = ops::Assign(scope.WithOpName("assign"), var, a);
+    }
+
+    TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
+    // "c" isn't dependent on the variable, so nothing should be frozen.
+    TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
+        graph_def, {"c:0"}, "assign", &saved_model_bundle));
+
+    GraphDef frozen_graph_def;
+    std::unordered_set<string> inputs;
+    std::unordered_set<string> outputs;
+    TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def,
+                                  &inputs, &outputs));
+
+    GraphDef expected_graph_def;
+    Scope expected_scope = Scope::NewRootScope();
+    Output expected_a = ops::Const(expected_scope.WithOpName("a"), 10.0f, {});
+    Output expected_b = ops::Const(expected_scope.WithOpName("b"), 10.0f, {});
+    Output expected_c =
+        ops::Mul(expected_scope.WithOpName("c"), expected_a, expected_b);
+    TF_ASSERT_OK(expected_scope.ToGraphDef(&expected_graph_def));
+
+    GraphDefEqual(frozen_graph_def, expected_graph_def);
+
+    RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
+                                         frozen_graph_def, "c:0");
+  }
+
+  void TestFreezeGraphWithDependentVariables(bool use_resource) {
+    // Test freezing a graph with variables that are needed by outputs in the
+    // SignatureDef. The variables should be frozen.
+    SavedModelBundle saved_model_bundle;
+    GraphDef graph_def;
+    Scope scope = Scope::NewRootScope();
+    Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
+    Output read_var;
+    if (use_resource) {
+      Output var =
+          ops::VarHandleOp(scope.WithOpName("var"), DataType::DT_FLOAT, {});
+      read_var = ops::ReadVariableOp(
+          scope.WithOpName("var/Read/ReadVariableOp"), var, DataType::DT_FLOAT);
+      auto assign = ops::AssignVariableOp(scope.WithOpName("assign"), var, a);
+    } else {
+      read_var =
+          ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
+      Output assign = ops::Assign(scope.WithOpName("assign"), read_var, a);
+    }
+    Output c = ops::Mul(scope.WithOpName("c"), a, read_var);
+    TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
+    // "c" depends on the variable, so the variable should be frozen.
+    TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
+        graph_def, {"c:0"}, "assign", &saved_model_bundle));
+
+    GraphDef frozen_graph_def;
+    std::unordered_set<string> inputs;
+    std::unordered_set<string> outputs;
+    TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def,
+                                  &inputs, &outputs));
+
+    // If using normal variables there should be 3 nodes in the resulting
+    // graph_def. If using resource variables there should be 4 nodes in the
+    // resulting graph_def.
+    // In both cases, none should be variables.
+    size_t expected_nodes = use_resource ? 4 : 3;
+    EXPECT_EQ(frozen_graph_def.node_size(), expected_nodes);
+    for (const NodeDef& node : frozen_graph_def.node()) {
+      EXPECT_NE(node.op(), "Variable") << node.name();
+      EXPECT_NE(node.op(), "VariableV2") << node.name();
+      EXPECT_NE(node.op(), "VarHandleOp") << node.name();
+      EXPECT_NE(node.op(), "ReadVariableOp") << node.name();
+    }
+
+    RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
+                                         frozen_graph_def, "c:0");
+  }
+
+  void TestFreezeGraphWithAndWithoutDependentVariables(bool use_resource) {
+    // Test freezing a graph in which some variables are needed by the outputs
+    // in the SignatureDef and others are not. The resulting graph should only
+    // freeze the dependent variables; the non-dependent ones should be
+    // pruned.
+    SavedModelBundle saved_model_bundle;
+    GraphDef graph_def;
+    Scope scope = Scope::NewRootScope();
+    Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
+    Output read_var;
+
+    if (use_resource) {
+      Output var =
+          ops::VarHandleOp(scope.WithOpName("var"), DataType::DT_FLOAT, {});
+      read_var = ops::ReadVariableOp(
+          scope.WithOpName("var/Read/ReadVariableOp"), var, DataType::DT_FLOAT);
+      auto assign = ops::AssignVariableOp(scope.WithOpName("assign"), var, a);
+      Output var_1 =
+          ops::VarHandleOp(scope.WithOpName("var_1"), DataType::DT_FLOAT, {});
+      Output read_var_1 =
+          ops::ReadVariableOp(scope.WithOpName("var_1/Read/ReadVariableOp"),
+                              var_1, DataType::DT_FLOAT);
+      auto assign_1 =
+          ops::AssignVariableOp(scope.WithOpName("assign_1"), var_1, a);
+    } else {
+      read_var = ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
+      Output assign = ops::Assign(scope.WithOpName("assign"), read_var, a);
+      Output var_1 =
+          ops::Variable(scope.WithOpName("var_1"), {}, DataType::DT_FLOAT);
+      Output assign_1 = ops::Assign(scope.WithOpName("assign_1"), var_1, a);
+    }
+
+    Output c = ops::Mul(scope.WithOpName("c"), a, read_var);
+    TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
+    // "c" depends on "var" but not on "var_1", so only "var" should be frozen.
+    TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
+        graph_def, {"c:0"}, "assign", &saved_model_bundle));
+
+    GraphDef frozen_graph_def;
+    std::unordered_set<string> inputs;
+    std::unordered_set<string> outputs;
+    TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def,
+                                  &inputs, &outputs));
+
+    // There should be 3 nodes in the resulting graph_def with normal variables
+    // and 4 with resource variables. In both cases, none should be variables.
+    size_t expected_nodes = use_resource ? 4 : 3;
+    EXPECT_EQ(frozen_graph_def.node_size(), expected_nodes);
+    for (const NodeDef& node : frozen_graph_def.node()) {
+      EXPECT_NE(node.op(), "Variable") << node.name();
+      EXPECT_NE(node.op(), "VariableV2") << node.name();
+      EXPECT_NE(node.op(), "VarHandleOp") << node.name();
+      EXPECT_NE(node.op(), "ReadVariableOp") << node.name();
+    }
+
+    RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
+                                         frozen_graph_def, "c:0");
+  }
 };
 
 TEST_F(FreezeTest, InputsAndOutputsSingleSignatureDef) {
@@ -196,111 +351,28 @@ TEST_F(FreezeTest, GraphDefWithNoVariables) {
   GraphDefEqual(frozen_graph_def, graph_def);
 }
 
-TEST_F(FreezeTest, GraphDefWithVariablesNotNeededByOutputs) {
-  // Test freezing a graph with variables that are not needed by the outputs in
-  // the SignatureDef. The resulting graph shouldn't be frozen, but
-  // non-dependent nodes should be pruned.
-  SavedModelBundle saved_model_bundle;
-  GraphDef graph_def;
-  Scope scope = Scope::NewRootScope();
-  Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
-  Output b = ops::Const(scope.WithOpName("b"), 10.0f, {});
-  Output c = ops::Mul(scope.WithOpName("c"), a, b);
-  Output var = ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
-  Output assign = ops::Assign(scope.WithOpName("assign"), var, a);
-  TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
-  // "c" isnt dependent on the variable, so nothing should be frozen.
-  TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
-      graph_def, {"c:0"}, assign.name(), &saved_model_bundle));
-
-  GraphDef frozen_graph_def;
-  std::unordered_set<string> inputs;
-  std::unordered_set<string> outputs;
-  TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def, &inputs,
-                                &outputs));
-
-  GraphDef expected_graph_def;
-  Scope expected_scope = Scope::NewRootScope();
-  Output expected_a = ops::Const(expected_scope.WithOpName("a"), 10.0f, {});
-  Output expected_b = ops::Const(expected_scope.WithOpName("b"), 10.0f, {});
-  Output expected_c =
-      ops::Mul(expected_scope.WithOpName("c"), expected_a, expected_b);
-  TF_ASSERT_OK(expected_scope.ToGraphDef(&expected_graph_def));
-
-  GraphDefEqual(frozen_graph_def, expected_graph_def);
-
-  RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
-                                       frozen_graph_def, "c:0");
+TEST_F(FreezeTest, GraphDefWithoutDependentVariables) {
+  TestFreezeGraphWithoutDependentVariables(false);
 }
 
-TEST_F(FreezeTest, GraphDefWithVariablesNeededByOutputs) {
-  // Test freezing a graph with variables that are needed by outputs in the
-  // SignatureDef. The variables should be frozen.
-  SavedModelBundle saved_model_bundle;
-  GraphDef graph_def;
-  Scope scope = Scope::NewRootScope();
-  Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
-  Output var = ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
-  Output c = ops::Mul(scope.WithOpName("c"), a, var);
-  Output assign = ops::Assign(scope.WithOpName("assign"), var, a);
-  TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
-  // "c" isnt dependent on the variable, so nothing should be frozen.
-  TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
-      graph_def, {"c:0"}, assign.name(), &saved_model_bundle));
-
-  GraphDef frozen_graph_def;
-  std::unordered_set<string> inputs;
-  std::unordered_set<string> outputs;
-  TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def, &inputs,
-                                &outputs));
-
-  // There should be 3 nodes in the resulting graph_def, and none should be
-  // variables.
-  EXPECT_EQ(frozen_graph_def.node_size(), 3);
-  for (const NodeDef& node : frozen_graph_def.node()) {
-    EXPECT_NE(node.op(), "Variable") << node.name();
-    EXPECT_NE(node.op(), "VariableV2") << node.name();
-  }
-
-  RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
-                                       frozen_graph_def, "c:0");
+TEST_F(FreezeTest, GraphDefWithoutDependentResourceVariables) {
+  TestFreezeGraphWithoutDependentVariables(true);
 }
 
-TEST_F(FreezeTest, GraphDefWithVariablesNeededAndNotNeededByOutputs) {
-  // Test freezing a graph with some variables that are needed and not needed by
-  // the outputs in the SignatureDef. The resulting graph should only freeze
-  // dependent variables.
-  SavedModelBundle saved_model_bundle;
-  GraphDef graph_def;
-  Scope scope = Scope::NewRootScope();
-  Output a = ops::Const(scope.WithOpName("a"), 10.0f, {});
-  Output var = ops::Variable(scope.WithOpName("var"), {}, DataType::DT_FLOAT);
-  Output c = ops::Mul(scope.WithOpName("c"), a, var);
-  Output assign = ops::Assign(scope.WithOpName("assign"), var, a);
-  Output var_1 =
-      ops::Variable(scope.WithOpName("var_1"), {}, DataType::DT_FLOAT);
-  Output assign_1 = ops::Assign(scope.WithOpName("assign_1"), var, a);
-  TF_ASSERT_OK(scope.ToGraphDef(&graph_def));
-  // "c" isnt dependent on the variable, so nothing should be frozen.
-  TF_ASSERT_OK(AddGraphDefWithOutputsToSavedModelBundle(
-      graph_def, {"c:0"}, assign.name(), &saved_model_bundle));
+TEST_F(FreezeTest, GraphDefWithDependentVariables) {
+  TestFreezeGraphWithDependentVariables(false);
+}
 
-  GraphDef frozen_graph_def;
-  std::unordered_set<string> inputs;
-  std::unordered_set<string> outputs;
-  TF_ASSERT_OK(FreezeSavedModel(saved_model_bundle, &frozen_graph_def, &inputs,
-                                &outputs));
+TEST_F(FreezeTest, GraphDefWithDependentResourceVariables) {
+  TestFreezeGraphWithDependentVariables(true);
+}
 
-  // There should be 3 nodes in the resulting graph_def, and none should be
-  // variables.
-  EXPECT_EQ(frozen_graph_def.node_size(), 3);
-  for (const NodeDef& node : frozen_graph_def.node()) {
-    EXPECT_NE(node.op(), "Variable") << node.name();
-    EXPECT_NE(node.op(), "VariableV2") << node.name();
-  }
+TEST_F(FreezeTest, GraphDefWithAndWithoutDependentVariables) {
+  TestFreezeGraphWithAndWithoutDependentVariables(false);
+}
 
-  RunAndCompareFrozenAndUnfrozenGraphs(saved_model_bundle.session.get(),
-                                       frozen_graph_def, "c:0");
+TEST_F(FreezeTest, GraphDefWithAndWithoutDependentResourceVariables) {
+  TestFreezeGraphWithAndWithoutDependentVariables(true);
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/jit/BUILD b/tensorflow/compiler/jit/BUILD
index a711319607f4ff2b83aa0ebe50e215b3d0e2258e..955d12dc203faa567b51a7b0b6f50ad6d8a94a54 100644
--- a/tensorflow/compiler/jit/BUILD
+++ b/tensorflow/compiler/jit/BUILD
@@ -29,7 +29,10 @@ load("@local_config_cuda//cuda:build_defs.bzl", "if_cuda_is_configured")
 # Target that bundles up the XLA CPU and GPU JIT devices.
 cc_library(
     name = "jit",
-    visibility = [":friends"],
+    visibility = [
+        ":friends",
+        "//learning/tfx:__subpackages__",
+    ],
     deps = [
         ":xla_cpu_device",
         ":xla_cpu_jit",
@@ -102,12 +105,17 @@ cc_library(
 cc_library(
     name = "xla_interpreter_device",
     srcs = ["xla_interpreter_device.cc"],
+    visibility = [":friends"],
     deps = [
+        ":jit_compilation_passes",
         ":xla_device",
         "//tensorflow/compiler/jit/kernels:xla_launch_op",
         "//tensorflow/compiler/tf2xla:xla_compiler",
+        "//tensorflow/compiler/tf2xla/kernels:xla_ops",
+        "//tensorflow/compiler/xla/service:interpreter_plugin",  # buildcleaner: keep
+        "//tensorflow/core:lib",
     ],
-    alwayslink = True,
+    alwayslink = 1,
 )
 
 cc_library(
@@ -200,6 +208,7 @@ cc_library(
     name = "graph_to_functiondef",
     srcs = ["graph_to_functiondef.cc"],
     hdrs = ["graph_to_functiondef.h"],
+    visibility = [":friends"],
     deps = [
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:framework",
diff --git a/tensorflow/compiler/jit/xla_interpreter_device.cc b/tensorflow/compiler/jit/xla_interpreter_device.cc
index 2614deefd8823dcb8f38e9e22ae4e78145d0d96a..a329451b14a785b17913e3838a6571b62b422804 100644
--- a/tensorflow/compiler/jit/xla_interpreter_device.cc
+++ b/tensorflow/compiler/jit/xla_interpreter_device.cc
@@ -25,8 +25,8 @@ namespace tensorflow {
 const char* const DEVICE_XLA_INTERPRETER = "XLA_INTERPRETER";
 const char* const DEVICE_INTERPRETER_XLA_JIT = "XLA_INTERPRETER_JIT";
 
-constexpr std::array<DataType, 5> kExecAllTypes = {
-    {DT_INT32, DT_FLOAT, DT_BOOL, DT_DOUBLE, DT_INT64}};
+constexpr std::array<DataType, 6> kExecAllTypes = {
+    {DT_INT32, DT_INT64, DT_FLOAT, DT_DOUBLE, DT_COMPLEX64, DT_BOOL}};
 
 class XlaInterpreterDeviceFactory : public DeviceFactory {
  public:
diff --git a/tensorflow/compiler/tests/binary_ops_test.py b/tensorflow/compiler/tests/binary_ops_test.py
index 30a6d3a74d64f90ad33062df6d1e16e3a575bd63..6bcfed7b69fcee838acd5045a3b337809b5a52c8 100644
--- a/tensorflow/compiler/tests/binary_ops_test.py
+++ b/tensorflow/compiler/tests/binary_ops_test.py
@@ -71,7 +71,7 @@ class BinaryOpsTest(XLATestCase):
           expected=np.array([[[[False, True], [True, False]]]], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._real_div,
+          gen_math_ops.real_div,
           np.array([3, 3, -1.5, -8, 44], dtype=dtype),
           np.array([2, -2, 7, -4, 0], dtype=dtype),
           expected=np.array(
@@ -108,57 +108,57 @@ class BinaryOpsTest(XLATestCase):
               [0, np.pi / 4, np.pi / 2, np.pi * 3 / 4, np.pi], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._reciprocal_grad,
+          gen_math_ops.reciprocal_grad,
           np.array([4, -3, -2, 1], dtype=dtype),
           np.array([5, -6, 7, -8], dtype=dtype),
           expected=np.array([-80, 54, -28, 8], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._sigmoid_grad,
+          gen_math_ops.sigmoid_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array([-60, -36, -14, 0], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._rsqrt_grad,
+          gen_math_ops.rsqrt_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array([-160, -81, -28, -4], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._sqrt_grad,
+          gen_math_ops.sqrt_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array([0.625, 1, 1.75, 4], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._softplus_grad,
+          gen_nn_ops.softplus_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array(
               [3.97322869, 2.99258232, 1.99817801, 0.99966466], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._softsign_grad,
+          gen_nn_ops.softsign_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array(
               [0.11111111, 0.06122449, 0.03125, 0.01234568], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._tanh_grad,
+          gen_math_ops.tanh_grad,
           np.array([4, 3, 2, 1], dtype=dtype),
           np.array([5, 6, 7, 8], dtype=dtype),
           expected=np.array([-75, -48, -21, 0], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._elu_grad,
+          gen_nn_ops.elu_grad,
           np.array([1, 2, 3, 4, 5, 6], dtype=dtype),
           np.array([-.6, -.4, -.2, 0, .2, .4], dtype=dtype),
           expected=np.array([0.4, 1.2, 2.4, 4, 5, 6], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._selu_grad,
+          gen_nn_ops.selu_grad,
           np.array([1, 2, 3, 4, 5, 6], dtype=dtype),
           np.array([-.6, -.4, -.2, .2, .4, .6], dtype=dtype),
           expected=np.array(
@@ -166,20 +166,20 @@ class BinaryOpsTest(XLATestCase):
                4.202803949422, 5.2535049367774, 6.30420592413], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._relu_grad,
+          gen_nn_ops.relu_grad,
           np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=dtype),
           np.array([0, 0, 0, 0, 0, 0.1, 0.3, 0.5, 0.7, 0.9], dtype=dtype),
           expected=np.array([0, 0, 0, 0, 0, 6, 7, 8, 9, 10], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._relu6_grad,
+          gen_nn_ops.relu6_grad,
           np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=dtype),
           np.array(
               [0, 0, 0, 0, 0, 0.1, 0.3, 0.5, 0.7, 0.9, 6.1, 10.0], dtype=dtype),
           expected=np.array([0, 0, 0, 0, 0, 6, 7, 8, 9, 10, 0, 0], dtype=dtype))
 
       self._testBinary(
-          gen_nn_ops._softmax_cross_entropy_with_logits,
+          gen_nn_ops.softmax_cross_entropy_with_logits,
           np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=dtype),
           np.array([[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]], dtype=dtype),
           expected=[
@@ -191,7 +191,7 @@ class BinaryOpsTest(XLATestCase):
           equality_test=self.ListsAreClose)
 
       self._testBinary(
-          gen_nn_ops._sparse_softmax_cross_entropy_with_logits,
+          gen_nn_ops.sparse_softmax_cross_entropy_with_logits,
           np.array([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8],
                     [0.9, 1.0, 1.1, 1.2]], dtype=dtype),
           np.array([2, 1, 7], dtype=np.int32),
@@ -207,7 +207,7 @@ class BinaryOpsTest(XLATestCase):
   def testIntOps(self):
     for dtype in self.int_types:
       self._testBinary(
-          gen_math_ops._truncate_div,
+          gen_math_ops.truncate_div,
           np.array([3, 3, -1, -9, -8], dtype=dtype),
           np.array([2, -2, 7, 2, -4], dtype=dtype),
           expected=np.array([1, -1, 0, -4, 2], dtype=dtype))
@@ -369,7 +369,7 @@ class BinaryOpsTest(XLATestCase):
           expected=np.array([[[[False, True], [True, False]]]], dtype=dtype))
 
       self._testBinary(
-          gen_math_ops._real_div,
+          gen_math_ops.real_div,
           np.array([3, 3j, -1.5j, -8, 2 + 3j, 2 + 4j], dtype=dtype),
           np.array([2, -2, 7j, -4j, 4 - 6j, 1 + 2j], dtype=dtype),
           expected=np.array(
@@ -378,7 +378,7 @@ class BinaryOpsTest(XLATestCase):
 
       # Test inf/nan scenarios.
       self._testBinary(
-          gen_math_ops._real_div,
+          gen_math_ops.real_div,
           np.array([4 + 3j, 4, 3j, -4, -4j, 2 - 3j], dtype=dtype),
           np.array([0, 0, 0, 0, 0, 0], dtype=dtype),
           expected=np.array(
@@ -418,19 +418,19 @@ class BinaryOpsTest(XLATestCase):
       lhs = np.array([4 + 2j, -3 - 1j, 2j, 1], dtype=dtype)
       rhs = np.array([5, -6j, 7 - 3j, -8j], dtype=dtype)
       self._testBinary(
-          gen_math_ops._reciprocal_grad, lhs, rhs, expected=-rhs * lhs * lhs)
+          gen_math_ops.reciprocal_grad, lhs, rhs, expected=-rhs * lhs * lhs)
 
       self._testBinary(
-          gen_math_ops._sigmoid_grad, lhs, rhs, expected=rhs * lhs * (1 - lhs))
+          gen_math_ops.sigmoid_grad, lhs, rhs, expected=rhs * lhs * (1 - lhs))
 
       self._testBinary(
-          gen_math_ops._rsqrt_grad, lhs, rhs, expected=lhs**3 * rhs / -2)
+          gen_math_ops.rsqrt_grad, lhs, rhs, expected=lhs**3 * rhs / -2)
 
       self._testBinary(
-          gen_math_ops._sqrt_grad, lhs, rhs, expected=rhs / (2 * lhs))
+          gen_math_ops.sqrt_grad, lhs, rhs, expected=rhs / (2 * lhs))
 
       self._testBinary(
-          gen_math_ops._tanh_grad, lhs, rhs, expected=rhs * (1 - lhs * lhs))
+          gen_math_ops.tanh_grad, lhs, rhs, expected=rhs * (1 - lhs * lhs))
 
   def testComplexMath(self):
     for dtype in self.complex_types:
@@ -538,7 +538,7 @@ class BinaryOpsTest(XLATestCase):
 
     if dtype not in self.complex_types:  # floordiv unsupported for complex.
       self._testBinary(
-          gen_math_ops._floor_div,
+          gen_math_ops.floor_div,
           np.array([3, 3, -1, -9, -8], dtype=dtype),
           np.array([2, -2, 7, 2, -4], dtype=dtype),
           expected=np.array([1, -2, -1, -5, 2], dtype=dtype))
@@ -554,12 +554,12 @@ class BinaryOpsTest(XLATestCase):
   def _testRemainder(self, dtype):
     """Test cases for remainder operators."""
     self._testBinary(
-        gen_math_ops._floor_mod,
+        gen_math_ops.floor_mod,
         np.array([3, 3, -1, -8], dtype=dtype),
         np.array([2, -2, 7, -4], dtype=dtype),
         expected=np.array([1, -1, 6, 0], dtype=dtype))
     self._testBinary(
-        gen_math_ops._truncate_mod,
+        gen_math_ops.truncate_mod,
         np.array([3, 3, -1, -8], dtype=dtype),
         np.array([2, -2, 7, -4], dtype=dtype),
         expected=np.array([1, 1, -1, 0], dtype=dtype))
@@ -1045,6 +1045,20 @@ class BinaryOpsTest(XLATestCase):
             ],
             equality_test=self.ListsAreClose)
 
+      def splitvOp(x, y):  # pylint: disable=invalid-name
+        return array_ops.split(value=y, num_or_size_splits=[2, 3], axis=x)
+      for axis in [1, -1]:
+        self._testBinary(
+            splitvOp,
+            np.int32(axis),
+            np.array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]],
+                     dtype=dtype),
+            expected=[
+                np.array([[0, 1], [5, 6]], dtype=dtype),
+                np.array([[2, 3, 4], [7, 8, 9]], dtype=dtype),
+            ],
+            equality_test=self.ListsAreClose)
+
   def testTile(self):
     for dtype in self.numeric_types:
       self._testBinary(
diff --git a/tensorflow/compiler/tests/concat_ops_test.py b/tensorflow/compiler/tests/concat_ops_test.py
index 81734082d9aab86f8bc763681265ef64ef32bd31..f10973e19f1945515b776cf86349445ed7334629 100644
--- a/tensorflow/compiler/tests/concat_ops_test.py
+++ b/tensorflow/compiler/tests/concat_ops_test.py
@@ -301,7 +301,7 @@ class ConcatOffsetTest(XLATestCase):
         s0 = constant_op.constant([2, 3, 5], dtypes.int32)
         s1 = constant_op.constant([2, 7, 5], dtypes.int32)
         s2 = constant_op.constant([2, 20, 5], dtypes.int32)
-        off = gen_array_ops._concat_offset(cdim, [s0, s1, s2])
+        off = gen_array_ops.concat_offset(cdim, [s0, s1, s2])
         ans = sess.run(off)
         self.assertAllEqual(ans, [[0, 0, 0], [0, 3, 0], [0, 10, 0]])
 
diff --git a/tensorflow/compiler/tests/image_ops_test.py b/tensorflow/compiler/tests/image_ops_test.py
index 538fa8e8e570b83ed681ecc0501285520cabdecb..3bc41b7cfd72bec7572097f8c53eef314a4369f6 100644
--- a/tensorflow/compiler/tests/image_ops_test.py
+++ b/tensorflow/compiler/tests/image_ops_test.py
@@ -426,7 +426,7 @@ class ResizeBilinearTest(XLATestCase):
     with self.test_session() as sess, self.test_scope():
       dtype = dtype or np.float32
       grads = array_ops.placeholder(np.float32)
-      resized = gen_image_ops._resize_bilinear_grad(
+      resized = gen_image_ops.resize_bilinear_grad(
           grads,
           np.zeros([1, input_shape[0], input_shape[1], 1], dtype=dtype),
           align_corners=True)
diff --git a/tensorflow/compiler/tests/lrn_ops_test.py b/tensorflow/compiler/tests/lrn_ops_test.py
index 5d8d89224d4a778d84803811710bb095872e86b2..69bd8f7230d4394c45764d02a88fb0ec097c5756 100644
--- a/tensorflow/compiler/tests/lrn_ops_test.py
+++ b/tensorflow/compiler/tests/lrn_ops_test.py
@@ -115,11 +115,11 @@ class LRNTest(XLATestCase):
       out_image = constant_op.constant(out_image_vals, shape=shape)
       out_grads = constant_op.constant(out_grads_vals, shape=shape)
       with ops.device(CPU_DEVICE):
-        expected = gen_nn_ops._lrn_grad(out_grads, in_image, out_image,
-                                        depth_radius, bias, alpha, beta)
+        expected = gen_nn_ops.lrn_grad(out_grads, in_image, out_image,
+                                       depth_radius, bias, alpha, beta)
       with self.test_scope():
-        actual = gen_nn_ops._lrn_grad(out_grads, in_image, out_image,
-                                      depth_radius, bias, alpha, beta)
+        actual = gen_nn_ops.lrn_grad(out_grads, in_image, out_image,
+                                     depth_radius, bias, alpha, beta)
       expected_val = expected.eval()
       actual_val = actual.eval()
     self.assertAllClose(actual_val, expected_val, rtol=1e-3)
diff --git a/tensorflow/compiler/tests/pooling_ops_3d_test.py b/tensorflow/compiler/tests/pooling_ops_3d_test.py
index eb48fe555a0b182ea7983cbd8c3b217d56350408..4eed903963a34a253ea5c409782d9a89a97a4fdf 100644
--- a/tensorflow/compiler/tests/pooling_ops_3d_test.py
+++ b/tensorflow/compiler/tests/pooling_ops_3d_test.py
@@ -33,7 +33,7 @@ from tensorflow.python.platform import test
 # MaxPoolGrad.
 def _AvgPoolGrad(inputs, outputs, output_gradients, ksize, strides, padding):
   del outputs  # Unused by average-pooling gradients.
-  return gen_nn_ops._avg_pool3d_grad(
+  return gen_nn_ops.avg_pool3d_grad(
       inputs.get_shape().as_list(),
       output_gradients,
       ksize=ksize,
@@ -263,7 +263,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradValidPadding1_1_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[1, 3, 3, 3, 1],
         ksize=[1, 1, 1],
         strides=[1, 1, 1],
@@ -272,7 +272,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradValidPadding2_1_6_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 3, 3, 6, 3],
         ksize=[2, 2, 2],
         strides=[1, 1, 1],
@@ -281,7 +281,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradValidPadding2_1_7_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 3, 5, 7, 3],
         ksize=[2, 2, 2],
         strides=[1, 1, 1],
@@ -290,7 +290,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradValidPadding2_2_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 2, 2, 2, 3],
         ksize=[2, 2, 2],
         strides=[2, 2, 2],
@@ -299,7 +299,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradSamePadding1_1_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 3, 2, 4, 1],
         ksize=[1, 1, 1],
         strides=[1, 1, 1],
@@ -308,7 +308,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradSamePadding2_1_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 3, 2, 4, 1],
         ksize=[2, 2, 2],
         strides=[1, 1, 1],
@@ -317,7 +317,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradSamePadding2_2_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[2, 5, 2, 4, 3],
         ksize=[2, 2, 2],
         strides=[2, 2, 2],
@@ -326,7 +326,7 @@ class Pooling3DTest(XLATestCase):
   def testMaxPoolGradSamePadding3_1_3d(self):
     self._VerifyGradient(
         nn_ops.max_pool3d,
-        gen_nn_ops._max_pool3d_grad,
+        gen_nn_ops.max_pool3d_grad,
         input_sizes=[1, 3, 3, 7, 1],
         ksize=[3, 3, 3],
         strides=[1, 1, 1],
diff --git a/tensorflow/compiler/tests/pooling_ops_test.py b/tensorflow/compiler/tests/pooling_ops_test.py
index 7c19a99c4eb4be3ca34b3ce949216e557b0a681d..e0e85295fecdd4d3d69ebf09860003888633f3da 100644
--- a/tensorflow/compiler/tests/pooling_ops_test.py
+++ b/tensorflow/compiler/tests/pooling_ops_test.py
@@ -459,7 +459,7 @@ class PoolGradTest(XLATestCase):
         padding="SAME")
 
   def testMaxPool(self):
-    self._TestPooling(nn_ops.max_pool, gen_nn_ops._max_pool_grad)
+    self._TestPooling(nn_ops.max_pool, gen_nn_ops.max_pool_grad)
 
   def testAvgPool(self):
     # Wrapper around AvgPoolGrad that ignores extra arguments needed by
@@ -467,7 +467,7 @@ class PoolGradTest(XLATestCase):
     def AvgPoolGrad(inputs, outputs, output_gradients, ksize, strides, padding,
                     data_format):
       del outputs  # Unused by average-pooling gradients.
-      return gen_nn_ops._avg_pool_grad(
+      return gen_nn_ops.avg_pool_grad(
           inputs.get_shape().as_list(),
           output_gradients,
           ksize=ksize,
@@ -483,7 +483,7 @@ class PoolGradTest(XLATestCase):
   def testMaxPoolKernelSmallerThanStrideValid(self):
     self._VerifyValues(
         nn_ops.max_pool,
-        gen_nn_ops._max_pool_grad,
+        gen_nn_ops.max_pool_grad,
         input_sizes=[1, 7, 7, 1],
         ksize=[1, 2, 2, 1],
         strides=[1, 3, 3, 1],
@@ -492,7 +492,7 @@ class PoolGradTest(XLATestCase):
   def testMaxPoolKernelSmallerThanStrideSame(self):
     self._VerifyValues(
         nn_ops.max_pool,
-        gen_nn_ops._max_pool_grad,
+        gen_nn_ops.max_pool_grad,
         input_sizes=[1, 3, 3, 1],
         ksize=[1, 1, 1, 1],
         strides=[1, 2, 2, 1],
@@ -500,7 +500,7 @@ class PoolGradTest(XLATestCase):
 
     self._VerifyValues(
         nn_ops.max_pool,
-        gen_nn_ops._max_pool_grad,
+        gen_nn_ops.max_pool_grad,
         input_sizes=[1, 4, 4, 1],
         ksize=[1, 1, 1, 1],
         strides=[1, 2, 2, 1],
diff --git a/tensorflow/compiler/tests/segment_reduction_ops_test.py b/tensorflow/compiler/tests/segment_reduction_ops_test.py
index 23bc39cf3f7087424719edfb8b6ee35d87295534..4a9c0e7471f9cdb2a47b54705495d2dda9748890 100644
--- a/tensorflow/compiler/tests/segment_reduction_ops_test.py
+++ b/tensorflow/compiler/tests/segment_reduction_ops_test.py
@@ -63,10 +63,10 @@ class SegmentReductionOpsTest(XLATestCase):
   def testUnsortedSegmentSum1DIndices1DDataNegativeIndices(self):
     for dtype in self.numeric_types:
       self.assertAllClose(
-          np.array([0, 3, 2, 5], dtype=dtype),
+          np.array([6, 3, 0, 6], dtype=dtype),
           self.UnsortedSegmentSum(
-              np.array([0, 1, 2, 3, 4, 5], dtype=dtype),
-              np.array([3, -1, 2, 1, -1, 3], dtype=np.int32), 4))
+              np.array([0, 1, 2, 3, 4, 5, 6], dtype=dtype),
+              np.array([3, -1, 0, 1, 0, -1, 3], dtype=np.int32), 4))
 
   def testUnsortedSegmentSum1DIndices2DDataDisjoint(self):
     for dtype in self.numeric_types:
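
Note on the updated expectation in testUnsortedSegmentSum1DIndices1DDataNegativeIndices: the new values can be checked by hand. The following is a minimal pure-Python sketch, assuming (as the test name and expected values suggest) that entries with a negative segment id are simply dropped. `unsorted_segment_sum` is a hypothetical helper written for illustration, not the XLA kernel.

```python
def unsorted_segment_sum(data, segment_ids, num_segments):
    """Sum data[i] into out[segment_ids[i]], skipping negative ids."""
    out = [0] * num_segments
    for value, seg in zip(data, segment_ids):
        if seg < 0:
            continue  # assumption: negative ids contribute nothing
        out[seg] += value
    return out


result = unsorted_segment_sum(
    [0, 1, 2, 3, 4, 5, 6], [3, -1, 0, 1, 0, -1, 3], 4)
print(result)  # [6, 3, 0, 6]
```

Segment 0 now receives two entries (2 and 4) and segment 2 receives none, so the new test data covers both a repeated index and an empty segment alongside the negative ids.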
diff --git a/tensorflow/compiler/tests/slice_ops_test.py b/tensorflow/compiler/tests/slice_ops_test.py
index a7cbfb04003c397212a35e16c6b23d7c2a18f7df..305ca0c6b78d3ef985deb38816f9388e7983906b 100644
--- a/tensorflow/compiler/tests/slice_ops_test.py
+++ b/tensorflow/compiler/tests/slice_ops_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 from tensorflow.compiler.tests.xla_test import XLATestCase
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import googletest
 
@@ -137,6 +138,34 @@ class StridedSliceTest(XLATestCase):
 
         self.assertAllEqual([6, 4], result)
 
+  def test2DDegenerate(self):
+    for dtype in self.numeric_types:
+      with self.test_session():
+        i = array_ops.placeholder(dtype, shape=[2, 3])
+        with self.test_scope():
+          o = array_ops.strided_slice(i, [-1, 0], [0, 3])
+        params = {
+            i: [[0, 1, 2],
+                [3, 4, 5]]
+        }
+        result = o.eval(feed_dict=params)
+
+        self.assertEqual(tensor_shape.TensorShape((0, 3)), result.shape)
+
+  def test2DDegenerateNegativeStride(self):
+    for dtype in self.numeric_types:
+      with self.test_session():
+        i = array_ops.placeholder(dtype, shape=[2, 3])
+        with self.test_scope():
+          o = array_ops.strided_slice(i, [0, 0], [-1, 3], [-1, 1])
+        params = {
+            i: [[0, 1, 2],
+                [3, 4, 5]]
+        }
+        result = o.eval(feed_dict=params)
+
+        self.assertEqual(tensor_shape.TensorShape((0, 3)), result.shape)
+
   def test3D(self):
     for dtype in self.numeric_types:
       with self.test_session():
diff --git a/tensorflow/compiler/tests/spacetobatch_op_test.py b/tensorflow/compiler/tests/spacetobatch_op_test.py
index c013f4b50a4cf95be8028248c52b10b1c3be2bd3..92518aadc4bf5c601cfb4192c093799784b6aa72 100644
--- a/tensorflow/compiler/tests/spacetobatch_op_test.py
+++ b/tensorflow/compiler/tests/spacetobatch_op_test.py
@@ -75,11 +75,11 @@ class SpaceToBatchTest(XLATestCase):
       for dtype in self.float_types:
         # outputs = space_to_batch(inputs)
         placeholder = array_ops.placeholder(dtype)
-        x_tf = gen_array_ops._space_to_batch(
+        x_tf = gen_array_ops.space_to_batch(
             placeholder, paddings, block_size=block_size)
         self.assertAllEqual(sess.run(x_tf, {placeholder: inputs}), outputs)
         # inputs = batch_to_space(outputs)
-        x_tf = gen_array_ops._batch_to_space(
+        x_tf = gen_array_ops.batch_to_space(
             placeholder, paddings, block_size=block_size)
         self.assertAllEqual(sess.run(x_tf, {placeholder: outputs}), inputs)
 
diff --git a/tensorflow/compiler/tests/stack_ops_test.py b/tensorflow/compiler/tests/stack_ops_test.py
index 2b9c2279737ccee531d488d27ccdb0cafa1dc8fc..94342f9567ca71274609e63b0482d55637c98d51 100644
--- a/tensorflow/compiler/tests/stack_ops_test.py
+++ b/tensorflow/compiler/tests/stack_ops_test.py
@@ -34,33 +34,33 @@ class StackOpTest(XLATestCase):
     with self.test_session(), self.test_scope():
       size = array_ops.placeholder(dtypes.int32)
       v = array_ops.placeholder(dtypes.float32)
-      h = gen_data_flow_ops._stack_v2(size, dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, v)
+      h = gen_data_flow_ops.stack_v2(size, dtypes.float32, stack_name="foo")
+      c = gen_data_flow_ops.stack_push_v2(h, v)
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop_v2(h, dtypes.float32)
       self.assertAllClose([[4.0, 5.0]], c1.eval({size: 5, v: [[4.0, 5.0]]}))
 
   def testStackPushPopSwap(self):
     with self.test_session(), self.test_scope():
       a = np.arange(2000)
       x = array_ops.placeholder(dtypes.float32)
-      h = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, x, swap_memory=True)
+      h = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="foo")
+      c = gen_data_flow_ops.stack_push_v2(h, x, swap_memory=True)
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop_v2(h, dtypes.float32)
       self.assertAllClose(a, c1.eval({x: a}))
 
   def testMultiStack(self):
     with self.test_session(), self.test_scope():
       v = array_ops.placeholder(dtypes.float32)
-      h1 = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_push_v2(h1, v)
+      h1 = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="foo")
+      c1 = gen_data_flow_ops.stack_push_v2(h1, v)
       with ops.control_dependencies([c1]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h1, dtypes.float32)
-      h2 = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="bar")
-      c2 = gen_data_flow_ops._stack_push_v2(h2, 5.0)
+        c1 = gen_data_flow_ops.stack_pop_v2(h1, dtypes.float32)
+      h2 = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="bar")
+      c2 = gen_data_flow_ops.stack_push_v2(h2, 5.0)
       with ops.control_dependencies([c2]):
-        c2 = gen_data_flow_ops._stack_pop_v2(h2, dtypes.float32)
+        c2 = gen_data_flow_ops.stack_pop_v2(h2, dtypes.float32)
       r = c1 + c2
       self.assertAllClose(9.0, r.eval({v: 4.0}))
 
@@ -69,15 +69,15 @@ class StackOpTest(XLATestCase):
     with self.test_session() as sess, self.test_scope():
       v1 = array_ops.placeholder(dtypes.float32)
       v2 = array_ops.placeholder(dtypes.float32)
-      h1 = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="foo")
-      h2 = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="foo")
+      h1 = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="foo")
+      h2 = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="foo")
 
-      c1 = gen_data_flow_ops._stack_push_v2(h1, v1)
+      c1 = gen_data_flow_ops.stack_push_v2(h1, v1)
       with ops.control_dependencies([c1]):
-        c2 = gen_data_flow_ops._stack_push_v2(h2, v2)
+        c2 = gen_data_flow_ops.stack_push_v2(h2, v2)
       with ops.control_dependencies([c2]):
-        pop1 = gen_data_flow_ops._stack_pop_v2(h1, dtypes.float32)
-        pop2 = gen_data_flow_ops._stack_pop_v2(h2, dtypes.float32)
+        pop1 = gen_data_flow_ops.stack_pop_v2(h1, dtypes.float32)
+        pop2 = gen_data_flow_ops.stack_pop_v2(h2, dtypes.float32)
 
       out1, out2 = sess.run([pop1, pop2], {v1: 4.0, v2: 5.0})
       self.assertAllClose(out1, 4.0)
@@ -86,17 +86,17 @@ class StackOpTest(XLATestCase):
   def testCloseStack(self):
     with self.test_session() as sess, self.test_scope():
       size = array_ops.placeholder(dtypes.int32)
-      h = gen_data_flow_ops._stack_v2(size, dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_close_v2(h)
+      h = gen_data_flow_ops.stack_v2(size, dtypes.float32, stack_name="foo")
+      c1 = gen_data_flow_ops.stack_close_v2(h)
       sess.run(c1, {size: 5})
 
   def testPushCloseStack(self):
     with self.test_session() as sess, self.test_scope():
       v = array_ops.placeholder(dtypes.float32)
-      h = gen_data_flow_ops._stack_v2(5, dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, v)
+      h = gen_data_flow_ops.stack_v2(5, dtypes.float32, stack_name="foo")
+      c = gen_data_flow_ops.stack_push_v2(h, v)
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_close_v2(h)
+        c1 = gen_data_flow_ops.stack_close_v2(h)
       sess.run(c1, {v: [[4.0, 5.0]]})
 
 
diff --git a/tensorflow/compiler/tests/tensor_array_ops_test.py b/tensorflow/compiler/tests/tensor_array_ops_test.py
index a62925a1818da00cb0a9e82e1281db20fb38b208..7624d6e4b2e2ece6a61155743fc8b866f6903f32 100644
--- a/tensorflow/compiler/tests/tensor_array_ops_test.py
+++ b/tensorflow/compiler/tests/tensor_array_ops_test.py
@@ -338,7 +338,7 @@ class TensorArrayTest(xla_test.XLATestCase):
         w0 = ta.write(0, [[4.0, 5.0]])
 
         # Test reading wrong datatype.
-        r0_bad = gen_data_flow_ops._tensor_array_read_v3(
+        r0_bad = gen_data_flow_ops.tensor_array_read_v3(
             handle=w0.handle, index=0, dtype=dtype2, flow_in=w0.flow)
         with self.assertRaisesOpError("TensorArray dtype is "):
           r0_bad.eval()
diff --git a/tensorflow/compiler/tf2xla/const_analysis.cc b/tensorflow/compiler/tf2xla/const_analysis.cc
index 82923722c54d235716b9138d95a75a441df924ca..6f46532419d3389bafe8c3bf41fa41e8a3e173b7 100644
--- a/tensorflow/compiler/tf2xla/const_analysis.cc
+++ b/tensorflow/compiler/tf2xla/const_analysis.cc
@@ -37,7 +37,7 @@ Status BackwardsConstAnalysis(const Graph& g,
   };
 
   Status status;
-  std::unordered_set<Node*> must_be_const;
+  std::unordered_set<const Node*> must_be_const;
   auto visit = [&status, &metadata_ops, &must_be_const,
                 compile_time_const_args](Node* node) {
     if (!status.ok()) return;
@@ -55,7 +55,7 @@ Status BackwardsConstAnalysis(const Graph& g,
         compile_time_const_args->at(index) = true;
         return;
       }
-      for (Node* pred : node->in_nodes()) {
+      for (const Node* pred : node->in_nodes()) {
         must_be_const.insert(pred);
       }
       return;
diff --git a/tensorflow/compiler/tf2xla/functionalize_control_flow.cc b/tensorflow/compiler/tf2xla/functionalize_control_flow.cc
index f8169795ddfb7fd4e93d3f136c51623385868951..8b7beef83ec2ed0df780d6a9cb2a4bcf737d008b 100644
--- a/tensorflow/compiler/tf2xla/functionalize_control_flow.cc
+++ b/tensorflow/compiler/tf2xla/functionalize_control_flow.cc
@@ -583,13 +583,15 @@ class FunctionalizeCond {
   // CondArgNode represents a input to the conditional and its corresponding
   // switch nodes.
   struct CondArgNode {
-    explicit CondArgNode(Node* input) : input(input) {}
+    explicit CondArgNode(Node* src, int src_output)
+        : src(src), src_output(src_output) {}
     string ToString() const {
-      return strings::StrCat("input=", input->name(),
+      return strings::StrCat("src=", src->name(), ":", src_output,
                              " switches=", NodesToString(switches));
     }
 
-    Node* input;
+    Node* src;
+    int src_output;
     std::vector<Node*> switches;
   };
   using CondArgNodes = std::vector<CondArgNode>;
@@ -606,14 +608,15 @@ class FunctionalizeCond {
 
   // Group of switch nodes that will be part of the same XlaIf.
   struct SwitchCluster {
-    explicit SwitchCluster(Node* predicate) : predicate(predicate) {}
+    explicit SwitchCluster(const Edge* predicate_edge)
+        : predicate_edge(predicate_edge) {}
     string ToString() const {
-      return strings::StrCat(name, " predicate=", predicate->name(),
+      return strings::StrCat(name, " predicate=", predicate_edge->src()->name(),
                              " switches=", NodesToString(switches));
     }
 
     string name;
-    Node* predicate;
+    const Edge* predicate_edge;
     std::vector<Node*> switches;
   };
 
@@ -653,8 +656,8 @@ class FunctionalizeCond {
                      Graph* body);
 
   // Adds all the input edges to `if_node` corresponding to the arguments.
-  Status AddInputEdges(const CondArgNodes& cond_arg_nodes, Node* predicate,
-                       Node* if_node);
+  Status AddInputEdges(const CondArgNodes& cond_arg_nodes,
+                       const Edge* predicate_edge, Node* if_node);
 
   // Adds all output edges from the `if_node`.
   Status AddOutputEdges(const std::vector<Node*>& outputs, Node* if_node);
@@ -756,8 +759,8 @@ Status FunctionalizeCond::Join(const ForwardFlowNode& src_state,
     if (IsMerge(dst)) {
       dst_state->branch = Branch::kBoth;
     } else {
-      return errors::Internal("Illegal merge: ", src_state.ToString(), " with ",
-                              dst_state->ToString(), " for ",
+      return errors::Internal("Illegal merge:\n", src_state.ToString(),
+                              " with ", dst_state->ToString(), " for\n",
                               dst->DebugString());
     }
   }
@@ -861,8 +864,8 @@ FunctionalizeCond::DeterminePredicateSwitchOrder() {
     if (IsSwitch(n)) {
       Node* input;
       TF_CHECK_OK(n->input_node(0, &input));
-      entry_cluster[n->id()] = &clusters[input->id()];
-      UnionFind<Cluster>* cluster = find_output_cluster(input);
+      entry_cluster[n->id()] = find_output_cluster(input);
+      UnionFind<Cluster>* cluster = entry_cluster[n->id()];
       int cluster_depth = switch_depth[cluster->Get().representative];
       // Merge the inputs of the switch node with one another. This results in
       // predicates and control input residing in the same cluster.
@@ -956,16 +959,21 @@ FunctionalizeCond::DeterminePredicateSwitchOrder() {
   // node whose cluster is later in the topological order of clustered
   // switches).
   for (auto it = switch_order.rbegin(); it != switch_order.rend(); ++it) {
-    Node* pred;
-    TF_CHECK_OK((*it)->input_node(1, &pred));
-    auto repr = std::make_pair(pred, clusters[(*it)->id()].Get());
+    const Edge* pred_edge;
+    TF_CHECK_OK((*it)->input_edge(1, &pred_edge));
+    // The predicate can be preceded by an identity node. Look through identity
+    // nodes to the predicate.
+    while (pred_edge->src()->IsIdentity()) {
+      TF_CHECK_OK(pred_edge->src()->input_edge(0, &pred_edge));
+    }
+    auto repr = std::make_pair(pred_edge->src(), clusters[(*it)->id()].Get());
     if (predicate_index.find(repr) == predicate_index.end()) {
       predicate_index[repr] = switch_clusters.size();
-      switch_clusters.emplace_back(pred);
+      switch_clusters.emplace_back(pred_edge);
       // Generate a name by concatenating with the cluster representative as
       // there could be multiple switch clusters with the same predicate.
-      switch_clusters[predicate_index[repr]].name =
-          strings::StrCat(pred->name(), "_", repr.second.representative, "_If");
+      switch_clusters[predicate_index[repr]].name = strings::StrCat(
+          pred_edge->src()->name(), "_", repr.second.representative, "_If");
     }
     switch_clusters[predicate_index[repr]].switches.push_back(*it);
   }
@@ -1044,9 +1052,12 @@ FunctionalizeCond::DetermineBranchMapAndFrontier(
       ForwardFlowNode& ffn = branch_map[out];
       if (IsSwitch(n)) {
         int index = e->IsControlEdge() ? Branch::kNeither : e->src_output();
-        TF_RETURN_IF_ERROR(Join(ForwardFlowNode(Branch(index)), out, &ffn));
+        TF_RETURN_WITH_CONTEXT_IF_ERROR(
+            Join(ForwardFlowNode(Branch(index)), out, &ffn), " when joining ",
+            e->DebugString());
       } else {
-        TF_RETURN_IF_ERROR(Join(branch_map[n], out, &ffn));
+        TF_RETURN_WITH_CONTEXT_IF_ERROR(Join(branch_map[n], out, &ffn),
+                                        " when joining ", e->DebugString());
       }
       if (IsMerge(out)) {
         if (out->in_edges().size() == ffn.count) {
@@ -1083,8 +1094,7 @@ Status FunctionalizeCond::FunctionalizeInternal() {
   for (auto it = predicate_switch_order.rbegin();
        it != predicate_switch_order.rend(); ++it) {
     auto& ps = *it;
-    VLOG(3) << "Flow down from: " << NodesToString(ps.switches) << " ("
-            << ps.predicate->name() << ")";
+    VLOG(3) << "Flow down from: " << ps.ToString();
 
     std::unordered_map<Node*, ForwardFlowNode> branch_map;
     std::unordered_set<Node*> frontier;
@@ -1097,21 +1107,29 @@ Status FunctionalizeCond::FunctionalizeInternal() {
                                                library_);
     TF_RETURN_IF_ERROR(ValidateFrontier(branch_map, frontier));
 
+    struct Hash {
+      size_t operator()(const std::pair<Node*, int>& item) const {
+        return Hash64Combine(hash<Node*>()(item.first),
+                             std::hash<int>()(item.second));
+      }
+    };
+
     // Sort the merge and switch nodes using NodeCmp. The switch-nodes are
     // further grouped (post sorting) by input to the switch node as in the
     // functionalized form each input will be passed in only once. This grouping
     // should retain the sorted order.
     CondArgNodes cond_arg_nodes;
-    std::unordered_map<Node*, int> input_index;
     std::sort(ps.switches.begin(), ps.switches.end(), NodeCmp());
+    std::unordered_map<std::pair<Node*, int>, int, Hash> input_index;
     for (Node* switch_node : ps.switches) {
-      Node* in;
-      TF_RETURN_IF_ERROR(switch_node->input_node(0, &in));
-      if (input_index.find(in) == input_index.end()) {
-        input_index[in] = cond_arg_nodes.size();
-        cond_arg_nodes.emplace_back(in);
+      const Edge* e;
+      TF_RETURN_IF_ERROR(switch_node->input_edge(0, &e));
+      std::pair<Node*, int> key = std::make_pair(e->src(), e->src_output());
+      if (input_index.find(key) == input_index.end()) {
+        input_index[key] = cond_arg_nodes.size();
+        cond_arg_nodes.emplace_back(key.first, key.second);
       }
-      cond_arg_nodes.at(input_index.at(in)).switches.push_back(switch_node);
+      cond_arg_nodes.at(input_index.at(key)).switches.push_back(switch_node);
     }
     std::vector<Node*> merge_nodes(frontier.begin(), frontier.end());
     std::sort(merge_nodes.begin(), merge_nodes.end(), NodeCmp());
@@ -1200,11 +1218,12 @@ StatusOr<Node*> FunctionalizeCond::BuildAndAddXlaIfOp(
   builder.Attr("Tout", out_type);
 
   builder.Attr("Tcond", DT_BOOL);
-  builder.Device(switch_cluster.predicate->assigned_device_name());
+  builder.Device(switch_cluster.predicate_edge->src()->assigned_device_name());
   // Conditional should be the first input ...
-  builder.Input(
-      NodeDefBuilder::NodeOut(switch_cluster.predicate->name(), 0,
-                              switch_cluster.predicate->output_type(0)));
+  builder.Input(NodeDefBuilder::NodeOut(
+      switch_cluster.predicate_edge->src()->name(),
+      switch_cluster.predicate_edge->src_output(),
+      switch_cluster.predicate_edge->src()->output_type(0)));
   // ... followed by the other inputs.
   builder.Input(inputs);
 
@@ -1264,24 +1283,17 @@ Status FunctionalizeCond::ExtractBody(const CondArgNodes& cond_arg_nodes,
 }
 
 Status FunctionalizeCond::AddInputEdges(const CondArgNodes& cond_arg_nodes,
-                                        Node* predicate, Node* if_node) {
+                                        const Edge* predicate_edge,
+                                        Node* if_node) {
   VLOG(3) << "AddInputEdges for " << if_node->name();
   int index = 0;
-  graph_->AddEdge(predicate, 0, if_node, index++);
-  for (auto& kv : cond_arg_nodes) {
-    bool inserted = false;
-    for (const Node* arg : kv.switches) {
-      const Edge* in_edge;
-      TF_RETURN_IF_ERROR(arg->input_edge(0, &in_edge));
-      if (in_edge->IsControlEdge()) {
-        graph_->AddControlEdge(in_edge->src(), if_node);
-      } else {
-        if (!inserted) {
-          graph_->AddEdge(in_edge->src(), in_edge->src_output(), if_node,
-                          index++);
-          inserted = true;
-        }
-      }
+  graph_->AddEdge(predicate_edge->src(), predicate_edge->src_output(), if_node,
+                  index++);
+  for (auto& arg : cond_arg_nodes) {
+    if (arg.src_output == Graph::kControlSlot) {
+      graph_->AddControlEdge(arg.src, if_node);
+    } else {
+      graph_->AddEdge(arg.src, arg.src_output, if_node, index++);
     }
   }
   return Status::OK();
@@ -1302,10 +1314,10 @@ Status FunctionalizeCond::AddOutputEdges(const std::vector<Node*>& outputs,
         return errors::Unimplemented("Output of index (", edge->src_output(),
                                      ") of merge node ", node->name());
       }
-      graph_->RemoveEdge(edge);
 
       int src_output =
           dst_input == Graph::kControlSlot ? Graph::kControlSlot : i;
+      graph_->RemoveEdge(edge);
       graph_->AddEdge(if_node, src_output, dst, dst_input);
     }
   }
@@ -1323,7 +1335,7 @@ StatusOr<Node*> FunctionalizeCond::ConvertToXlaIf(
       Node * if_node,
       BuildAndAddXlaIfOp(cond_arg_nodes, switch_cluster, merge_nodes));
   TF_RETURN_IF_ERROR(
-      AddInputEdges(cond_arg_nodes, switch_cluster.predicate, if_node));
+      AddInputEdges(cond_arg_nodes, switch_cluster.predicate_edge, if_node));
   TF_RETURN_IF_ERROR(AddOutputEdges(merge_nodes, if_node));
 
   return if_node;
@@ -1345,6 +1357,7 @@ Status FunctionalizeControlFlow(Graph* graph,
   VLOG(2) << "FunctionalizeControlFlow (initial): "
           << dump_graph::DumpGraphToFile("functionalize_initial", *graph,
                                          library);
+
   // Note: BuildControlFlowInfo() requires that the graph's source node is
   // connected to all source nodes in the graph. Many graphs violate this
   // invariant.
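
Note on the input grouping in FunctionalizeInternal: switches are now grouped by the pair (source node, source output) rather than by source node alone. The sketch below illustrates why that matters, using plain tuples in place of graph nodes. The function name and the tuple layout are hypothetical and exist only for this illustration.

```python
def group_switches_by_input(switches):
    """Group switches by (src, src_output).

    switches: list of (switch_name, src_name, src_output) tuples.
    Returns one entry per distinct input, in first-seen order.
    """
    input_index = {}
    cond_args = []
    for name, src, src_output in switches:
        key = (src, src_output)
        if key not in input_index:
            input_index[key] = len(cond_args)
            cond_args.append(
                {"src": src, "src_output": src_output, "switches": []})
        cond_args[input_index[key]]["switches"].append(name)
    return cond_args


# Two switches read output 0 of "split", one reads output 1.
args = group_switches_by_input(
    [("sw0", "split", 0), ("sw1", "split", 1), ("sw2", "split", 0)])
for arg in args:
    print(arg)
```

Keying on the node name alone would merge all three switches into a single argument, even though `sw1` consumes a different tensor.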
diff --git a/tensorflow/compiler/tf2xla/graph_compiler.cc b/tensorflow/compiler/tf2xla/graph_compiler.cc
index 058a1f2621c64a735bd9d9c9d0ae007f93aa4dea..b20c1ffc7d8956f3f5530ee63e9b711a26439be5 100644
--- a/tensorflow/compiler/tf2xla/graph_compiler.cc
+++ b/tensorflow/compiler/tf2xla/graph_compiler.cc
@@ -130,7 +130,7 @@ Status GraphCompiler::Compile() {
     // Set up inputs from outputs of previous nodes.
     for (auto* e : n->in_edges()) {
       if (e->IsControlEdge()) continue;
-      Node* src = e->src();
+      const Node* src = e->src();
       TF_RET_CHECK(src->id() < output_registry.size());
       const NodeOutputs& src_outputs = output_registry[src->id()];
 
diff --git a/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc b/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc
index 344a2ab2b6835c518c41de6f7a30fb2a34d130d2..cbade79e85eed10ecb5ead7151ee778c86a0de37 100644
--- a/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/batchtospace_op.cc
@@ -159,7 +159,9 @@ class BatchToSpaceNDOp : public XlaOpKernel {
                  block_shape, crops);
   }
 };
-REGISTER_XLA_OP(Name("BatchToSpaceND").CompileTimeConstInput("crops"),
+REGISTER_XLA_OP(Name("BatchToSpaceND")
+                    .CompileTimeConstInput("block_shape")
+                    .CompileTimeConstInput("crops"),
                 BatchToSpaceNDOp);
 
 class BatchToSpaceOp : public XlaOpKernel {
diff --git a/tensorflow/compiler/tf2xla/kernels/split_op.cc b/tensorflow/compiler/tf2xla/kernels/split_op.cc
index 79c435c90a1f57250be90c2c2523bf3d7d231461..43c15e753805352875034dfd2c70a2a1ed9a4114 100644
--- a/tensorflow/compiler/tf2xla/kernels/split_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/split_op.cc
@@ -111,27 +111,24 @@ class SplitVOp : public XlaOpKernel {
 
   void Compile(XlaOpKernelContext* ctx) override {
     const int32 num_split = num_outputs();
+    const TensorShape input_shape = ctx->InputShape(0);
     const TensorShape index_shape = ctx->InputShape(2);
-    xla::Literal literal_index;
-    OP_REQUIRES_OK(ctx, ctx->ConstantInput(2, &literal_index));
 
-    int32 split_dim;
-    OP_REQUIRES(ctx, index_shape.dims() == 0,
-                errors::InvalidArgument("split_dim input to Split Op must be a "
-                                        "scalar"));
-    split_dim = literal_index.Get<int>({});
+    int64 split_dim_orig;
+    OP_REQUIRES_OK(ctx, ctx->ConstantInputAsIntScalar(2, &split_dim_orig));
+    int64 split_dim = split_dim_orig < 0 ? split_dim_orig + input_shape.dims()
+                                         : split_dim_orig;
+    OP_REQUIRES(ctx, 0 <= split_dim && split_dim < input_shape.dims(),
+                errors::InvalidArgument("-input rank(-", input_shape.dims(),
+                                        ") <= split_dim < input rank (",
+                                        input_shape.dims(), "), but got ",
+                                        split_dim_orig));
 
     xla::ComputationDataHandle input = ctx->Input(0);
-    const TensorShape input_shape = ctx->InputShape(0);
 
     OP_REQUIRES(ctx, input_shape.dims() > 0,
                 errors::InvalidArgument("Can't split a 0 dimensional input"));
 
-    OP_REQUIRES(
-        ctx, 0 <= split_dim && split_dim < input_shape.dims(),
-        errors::InvalidArgument("0 <= split_dim < number of input dimensions (",
-                                input_shape.dims(), "), but got ", split_dim));
-
     OP_REQUIRES(
         ctx, num_split > 0,
         errors::InvalidArgument(
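
Note on the split_dim handling in SplitVOp::Compile: a negative split_dim is now shifted by the input rank before the range check. A minimal sketch of the same arithmetic follows. `normalize_split_dim` is a hypothetical name, and the error text only paraphrases the one in the kernel.

```python
def normalize_split_dim(split_dim, rank):
    """Map a possibly negative axis into [0, rank), or raise ValueError."""
    normalized = split_dim + rank if split_dim < 0 else split_dim
    if not 0 <= normalized < rank:
        raise ValueError(
            "expected -%d <= split_dim < %d, but got %d"
            % (rank, rank, split_dim))
    return normalized


print(normalize_split_dim(-1, 3))  # 2
print(normalize_split_dim(1, 3))   # 1
```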
diff --git a/tensorflow/compiler/tf2xla/kernels/strided_slice_op.cc b/tensorflow/compiler/tf2xla/kernels/strided_slice_op.cc
index 91c169428c7a88a8d107a97445aeea999946e3e9..6204aa4e27000fddec7f5b82b2198d37956f6aba 100644
--- a/tensorflow/compiler/tf2xla/kernels/strided_slice_op.cc
+++ b/tensorflow/compiler/tf2xla/kernels/strided_slice_op.cc
@@ -77,13 +77,14 @@ class StridedSliceOp : public XlaOpKernel {
     for (int i = 0; i < begin.size(); ++i) {
       if (strides[i] > 0) {
         slice_begin.push_back(begin[i]);
-        slice_end.push_back(end[i]);
+        slice_end.push_back(std::max(end[i], begin[i]));
         slice_strides.push_back(strides[i]);
       } else {
         // Negative stride: swap begin and end, add 1 because the interval
         // is semi-open, and mark the dimension to be reversed.
         slice_begin.push_back(input_shape.dim_size(i) - begin[i] - 1);
-        slice_end.push_back(input_shape.dim_size(i) - end[i] - 1);
+        slice_end.push_back(std::max(input_shape.dim_size(i) - end[i] - 1,
+                                     input_shape.dim_size(i) - begin[i] - 1));
         slice_strides.push_back(-strides[i]);
         dimensions_to_reverse.push_back(i);
       }
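
Note on the std::max clamps above: they ensure a degenerate range yields an empty slice instead of an end that precedes the begin. The sketch below reproduces the bound computation for one dimension. It assumes begin and end have already been canonicalized to non-negative offsets before this loop runs, which this hunk does not show. `slice_bounds` and `extent` are hypothetical helpers.

```python
def slice_bounds(begin, end, stride, dim_size):
    """Return (slice_begin, slice_end, abs_stride, reversed) for one dim."""
    if stride > 0:
        return begin, max(end, begin), stride, False
    # Negative stride: mirror the interval and mark the dim as reversed.
    rev_begin = dim_size - begin - 1
    rev_end = dim_size - end - 1
    return rev_begin, max(rev_end, rev_begin), -stride, True


def extent(slice_begin, slice_end, stride):
    """Number of elements in [slice_begin, slice_end) with the given stride."""
    return -(-(slice_end - slice_begin) // stride)


# test2DDegenerate, dim 0 of size 2: begin -1 canonicalizes to 1, end is 0.
b, e, s, _ = slice_bounds(1, 0, 1, 2)
print(extent(b, e, s))  # 0

# test2DDegenerateNegativeStride, dim 0: begin 0, end -1 canonicalizes to 1.
b, e, s, _ = slice_bounds(0, 1, -1, 2)
print(extent(b, e, s))  # 0
```

Without the clamp, the first case would produce the interval [1, 0) and the second would also produce [1, 0) after mirroring, each with a negative extent.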
diff --git a/tensorflow/compiler/tf2xla/kernels/unary_ops.cc b/tensorflow/compiler/tf2xla/kernels/unary_ops.cc
index 0c5ad9e5255ffc3dfcfb83335060ae833937b3ce..7cb47f908d4ff43f455f1e77c53cd3cc956579ee 100644
--- a/tensorflow/compiler/tf2xla/kernels/unary_ops.cc
+++ b/tensorflow/compiler/tf2xla/kernels/unary_ops.cc
@@ -60,11 +60,13 @@ XLAJIT_MAKE_UNARY(
                     b->Add(XlaHelpers::One(b, input_type(0)), x))));
 
 // acosh(x) = log(x + sqrt(x^2 - 1))
+//          = log(x + sqrt((x+1)*(x-1)))
 XLAJIT_MAKE_UNARY(
     Acosh,
-    b->Log(b->Add(x, b->Pow(b->Sub(b->Mul(x, x),
-                                   XlaHelpers::One(b, input_type(0))),
-                            XlaHelpers::FloatLiteral(b, input_type(0), 0.5)))));
+    b->Log(b->Add(x,
+                  b->Pow(b->Mul(b->Add(x, XlaHelpers::One(b, input_type(0))),
+                                b->Sub(x, XlaHelpers::One(b, input_type(0)))),
+                         XlaHelpers::FloatLiteral(b, input_type(0), 0.5)))));
 
 // asin(x) = 2 * atan(x / (1 + sqrt(1 - x^2)))
 XLAJIT_MAKE_UNARY(
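
Note on the Acosh rewrite: the diff does not state why x^2 - 1 was replaced by (x+1)*(x-1). One plausible reason is accuracy for x close to 1, where x*x rounds before the subtraction and cancellation then exposes that rounding error. The sketch below compares both forms against an exact rational reference in Python double precision. It is an illustration only and does not exercise XLA.

```python
from fractions import Fraction

# x is exactly representable as a double and sits just above 1.
x = 1.0 + 2.0 ** -30

# Exact value of x^2 - 1, computed with rational arithmetic.
exact = Fraction(x) * Fraction(x) - 1

naive = x * x - 1.0                # rounds x*x, then cancels
factored = (x + 1.0) * (x - 1.0)   # x - 1 is exact for x near 1

naive_rel_err = abs(Fraction(naive) - exact) / exact
factored_rel_err = abs(Fraction(factored) - exact) / exact

print(float(naive_rel_err), float(factored_rel_err))
```

For this input the factored form is exact while the naive form carries a small relative error. The gap would be larger in lower-precision types such as float32 or half.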
diff --git a/tensorflow/compiler/tf2xla/lib/scatter.cc b/tensorflow/compiler/tf2xla/lib/scatter.cc
index 6009243f9774eea24e8049e2bd50fe32f291132f..45699233ea8b2a75e3850098250307b95546cc28 100644
--- a/tensorflow/compiler/tf2xla/lib/scatter.cc
+++ b/tensorflow/compiler/tf2xla/lib/scatter.cc
@@ -141,6 +141,8 @@ xla::StatusOr<xla::ComputationDataHandle> XlaScatter(
                                 body_builder->ConstantR0<bool>(true),
                                 xla::CreateScalarAndComputation(body_builder));
 
+    // Make the index in bounds to prevent implementation-defined behavior.
+    index = body_builder->Max(index, zero_index);
     index = body_builder->Pad(
         index, zero_index,
         xla::MakeEdgePaddingConfig({{0, buffer_shape_post_axes.size()}}));
@@ -157,8 +159,8 @@ xla::StatusOr<xla::ComputationDataHandle> XlaScatter(
     auto update = body_builder->DynamicSlice(updates, updates_offset,
                                              flat_updates_slice_shape);
 
-    // Unflatten the major (iteration) dimensions of the slice to their original
-    // shape.
+    // Unflatten the major (iteration) dimensions of the slice to their
+    // original shape.
     std::vector<int64> updates_slice_shape(num_index_dims, 1);
     updates_slice_shape.insert(updates_slice_shape.end(),
                                buffer_shape_post_axes.begin(),
@@ -167,15 +169,16 @@ xla::StatusOr<xla::ComputationDataHandle> XlaScatter(
 
     // Apply the update to the buffer. If there is a combiner, use it to merge
     // the current values with the update.
+    auto current_value =
+        body_builder->DynamicSlice(buffer, index, updates_slice_shape);
     if (combiner) {
-      auto current_value =
-          body_builder->DynamicSlice(buffer, index, updates_slice_shape);
       update = combiner(current_value, update, body_builder);
     }
-    // Apply the update if it is in range.
-    buffer = body_builder->Select(
-        index_in_range, body_builder->DynamicUpdateSlice(buffer, update, index),
-        buffer);
+    // Use the current value instead of the update if the index is out of
+    // bounds.
+    update = body_builder->Select(index_in_range, update, current_value);
+    // Apply the update.
+    buffer = body_builder->DynamicUpdateSlice(buffer, update, index);
 
     return std::vector<xla::ComputationDataHandle>{indices, updates, buffer};
   };
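
Note on the restructured update step in XlaScatter: the update is now always written, and an out-of-range index is handled by writing back the current value instead. A one-dimensional sketch of that pattern follows. `apply_update` is a hypothetical helper. It clamps the index at both ends so that it runs in plain Python, whereas the diff itself only applies `Max(index, zero_index)`.

```python
def apply_update(buffer, index, update, combiner=None):
    """Write `update` at `index`, leaving `buffer` unchanged if out of range."""
    in_range = 0 <= index < len(buffer)
    # Assumption for this sketch: clamp both ends so the read below is valid.
    safe_index = min(max(index, 0), len(buffer) - 1)
    current = buffer[safe_index]
    if combiner is not None:
        update = combiner(current, update)
    # Select the current value when the index is out of bounds, so the
    # unconditional write below becomes a no-op.
    update = update if in_range else current
    result = list(buffer)
    result[safe_index] = update
    return result


print(apply_update([1, 2, 3], 1, 10))                       # [1, 10, 3]
print(apply_update([1, 2, 3], -1, 10))                      # [1, 2, 3]
print(apply_update([1, 2, 3], 2, 10, lambda a, b: a + b))   # [1, 2, 13]
```

Selecting between two small slices rather than between two full buffers keeps the conditional cheap, and the write never sees an out-of-bounds index.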
diff --git a/tensorflow/compiler/tf2xla/xla_compiler.cc b/tensorflow/compiler/tf2xla/xla_compiler.cc
index 15bba46ac62a97592656942afc767a303c9b97f3..5ec05c4121e059ad2b1307376766a41916fe61ae 100644
--- a/tensorflow/compiler/tf2xla/xla_compiler.cc
+++ b/tensorflow/compiler/tf2xla/xla_compiler.cc
@@ -365,6 +365,13 @@ Status BuildComputation(
               return a->arg_num() < b->arg_num();
             });
 
+  // Attach a common operator name as metadata. This has no semantic effect — it
+  // merely makes the HLO graph more readable when visualized via TensorBoard,
+  // since TensorBoard forms groups out of operators with similar names.
+  xla::OpMetadata retval_metadata;
+  retval_metadata.set_op_name("XLA_Retvals");
+  builder->SetOpMetadata(retval_metadata);
+
   for (const XlaResource* resource : arg_resources) {
     const XlaCompiler::Argument& arg = args[resource->arg_num()];
     const int core = arg_cores[resource->arg_num()];
@@ -412,6 +419,8 @@ Status BuildComputation(
 
   // Builds the XLA computation.
   builder->Tuple(elems);
+  builder->ClearOpMetadata();
+
   xla::StatusOr<xla::Computation> computation_status = builder->Build();
   if (!computation_status.ok()) {
     return computation_status.status();
@@ -514,6 +523,13 @@ Status XlaCompiler::BuildArguments(
     }
   }
 
+  // Attach a common operator name as metadata. This has no semantic effect — it
+  // merely makes the HLO graph more readable when visualized via TensorBoard,
+  // since TensorBoard forms groups out of operators with similar names.
+  xla::OpMetadata arg_metadata;
+  arg_metadata.set_op_name("XLA_Args");
+  builder->SetOpMetadata(arg_metadata);
+
   // Build parameter handles for non-constant arguments.
   std::vector<xla::ComputationDataHandle> arg_handles(input_mapping->size());
   if (use_tuple_arg) {
@@ -552,6 +568,8 @@ Status XlaCompiler::BuildArguments(
     }
   }
 
+  builder->ClearOpMetadata();
+
   // Fill in the handles in non-constant arguments.
   VLOG(2) << "XLA computation inputs:";
   for (std::vector<int>::size_type i = 0; i < input_mapping->size(); ++i) {
diff --git a/tensorflow/compiler/xla/BUILD b/tensorflow/compiler/xla/BUILD
index 34e733bc8d80b364cec1783006eba0a5468b55ea..c7cb69215fb051b7f87c3be3b0b419b9c1b8998c 100644
--- a/tensorflow/compiler/xla/BUILD
+++ b/tensorflow/compiler/xla/BUILD
@@ -372,7 +372,6 @@ tf_cc_test(
 
 cc_library(
     name = "array2d",
-    srcs = ["array2d.cc"],
     hdrs = ["array2d.h"],
     visibility = ["//visibility:public"],
     deps = [
diff --git a/tensorflow/compiler/xla/array.h b/tensorflow/compiler/xla/array.h
index 46ee4e64c9ae7ca111d9d04bedcb74ff02a42386..24b58bec11bd8d8b5c79ac84c5f43c509644b51d 100644
--- a/tensorflow/compiler/xla/array.h
+++ b/tensorflow/compiler/xla/array.h
@@ -121,10 +121,31 @@ class Array {
     CHECK(idx == num_elements());
   }
 
-  // Creates a 2D array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates a 1D array of a floating-point type (half, bfloat16, float,
+  // or double) from an initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
+                             std::is_same<T2, float>::value>::type>
+  Array(std::initializer_list<T2> values)
+      : Array(ToInt64Vector({values.size()})) {
+    int64 idx = 0;
+    for (const auto& it1 : values) {
+      values_[idx] = static_cast<T>(it1);
+      ++idx;
+    }
+    CHECK(idx == num_elements());
+  }
+
+  // Creates a 2D array of a floating-point type (half, bfloat16, float,
+  // or double) from a nested initializer list of float values.
+  template <typename T2, typename = typename std::enable_if<
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array(std::initializer_list<std::initializer_list<T2>> values)
       : Array(ToInt64Vector({values.size(), values.begin()->size()})) {
@@ -155,10 +176,13 @@ class Array {
     CHECK(idx == num_elements());
   }
 
-  // Creates a 3D array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates a 3D array of a floating-point type (half, bfloat16, float,
+  // or double) from a nested initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array(std::initializer_list<std::initializer_list<std::initializer_list<T2>>>
             values)
@@ -196,10 +220,13 @@ class Array {
     CHECK(idx == num_elements());
   }
 
-  // Creates a 4D array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates a 4D array of a floating-point type (half, bfloat16, float,
+  // or double) from a nested initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array(std::initializer_list<
         std::initializer_list<std::initializer_list<std::initializer_list<T2>>>>
diff --git a/tensorflow/compiler/xla/array2d.h b/tensorflow/compiler/xla/array2d.h
index 41f563486d21e42e88dcf6c751ce4a64da5e3213..a17e81f44832f272fd93dce9f854042b4a84fde4 100644
--- a/tensorflow/compiler/xla/array2d.h
+++ b/tensorflow/compiler/xla/array2d.h
@@ -25,6 +25,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/compiler/xla/array.h"
+#include "tensorflow/compiler/xla/ptr_util.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/core/lib/core/bits.h"
 #include "tensorflow/core/lib/strings/str_util.h"
@@ -52,10 +53,13 @@ class Array2D : public Array<T> {
   Array2D(std::initializer_list<std::initializer_list<T>> values)
       : Array<T>(values) {}
 
-  // Creates an array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates an array of a floating-point type (half, bfloat16, float,
+  // or double) from the given nested initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array2D(std::initializer_list<std::initializer_list<T2>> values)
       : Array<T>(values) {}
@@ -94,9 +98,23 @@ class Array2D : public Array<T> {
 
 // Returns a linspace-populated Array2D in the range [from, to] (inclusive)
 // with dimensions n1 x n2.
-std::unique_ptr<Array2D<float>> MakeLinspaceArray2D(float from, float to,
-                                                    int64 n1, int64 n2);
-
+template <typename NativeT = float>
+std::unique_ptr<Array2D<NativeT>> MakeLinspaceArray2D(double from, double to,
+                                                      int64 n1, int64 n2) {
+  auto array = MakeUnique<Array2D<NativeT>>(n1, n2);
+  int64 count = n1 * n2;
+  NativeT step =
+      static_cast<NativeT>((count > 1) ? (to - from) / (count - 1) : 0);
+  auto set = [&array, n1, n2](int64 index, NativeT value) {
+    (*array)(index / n2, index % n2) = value;
+  };
+  for (int64 i = 0; i < count - 1; ++i) {
+    set(i, (static_cast<NativeT>(from) +
+            static_cast<NativeT>(i) * static_cast<NativeT>(step)));
+  }
+  set(count - 1, static_cast<NativeT>(to));
+  return array;
+}
 }  // namespace xla
 
 #endif  // TENSORFLOW_COMPILER_XLA_ARRAY2D_H_
diff --git a/tensorflow/compiler/xla/array3d.h b/tensorflow/compiler/xla/array3d.h
index e5eb235d45d160d486d1499db665ed14a8509043..0e9a0722ae43e1dc6ecddde9cbc3daf1db058840 100644
--- a/tensorflow/compiler/xla/array3d.h
+++ b/tensorflow/compiler/xla/array3d.h
@@ -57,10 +57,13 @@ class Array3D : public Array<T> {
               values)
       : Array<T>(values) {}
 
-  // Creates an array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates an array of a floating-point type (half, bfloat16, float,
+  // or double) from the given nested initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array3D(
       std::initializer_list<std::initializer_list<std::initializer_list<T2>>>
diff --git a/tensorflow/compiler/xla/array4d.h b/tensorflow/compiler/xla/array4d.h
index cff70e54bad0116bdd08674b626b3bf99dc89e1f..a75fffc605aa0df3e1e2eeb6d3129718cbbba0e4 100644
--- a/tensorflow/compiler/xla/array4d.h
+++ b/tensorflow/compiler/xla/array4d.h
@@ -82,10 +82,13 @@ class Array4D : public Array<T> {
               values)
       : Array<T>(values) {}
 
-  // Creates an array of Eigen::half from the given nested initializer list of
-  // float values.
+  // Creates an array of a floating-point type (half, bfloat16, float,
+  // or double) from the given nested initializer list of float values.
   template <typename T2, typename = typename std::enable_if<
-                             std::is_same<T, Eigen::half>::value &&
+                             (std::is_same<T, Eigen::half>::value ||
+                              std::is_same<T, bfloat16>::value ||
+                              std::is_same<T, float>::value ||
+                              std::is_same<T, double>::value) &&
                              std::is_same<T2, float>::value>::type>
   Array4D(std::initializer_list<std::initializer_list<
               std::initializer_list<std::initializer_list<T2>>>>
diff --git a/tensorflow/compiler/xla/client/compile_only_client.cc b/tensorflow/compiler/xla/client/compile_only_client.cc
index c7e2c4367b89ca2112022fa40449ae3ebe28463e..59662c95ac15e7c23790c5b5ff5d75a694613aeb 100644
--- a/tensorflow/compiler/xla/client/compile_only_client.cc
+++ b/tensorflow/compiler/xla/client/compile_only_client.cc
@@ -39,16 +39,15 @@ CompileOnlyClient::CompileAheadOfTime(
   return compiler_service_->CompileAheadOfTime(service_instances, options);
 }
 
-int64 CompileOnlyClient::PointerSizeForTriple(
-    tensorflow::StringPiece target_triple) {
-  llvm::Triple triple(llvm::Triple::normalize(
-      llvm::StringRef(target_triple.data(), target_triple.size())));
-  if (triple.isArch64Bit()) {
+int64 CompileOnlyClient::PointerSizeForTriple(tensorflow::StringPiece triple) {
+  llvm::Triple llvm_triple(
+      llvm::Triple::normalize(llvm::StringRef(triple.data(), triple.size())));
+  if (llvm_triple.isArch64Bit()) {
     return 8;
-  } else if (triple.isArch32Bit()) {
+  } else if (llvm_triple.isArch32Bit()) {
     return 4;
   } else {
-    CHECK(triple.isArch16Bit());
+    CHECK(llvm_triple.isArch16Bit());
     return 2;
   }
 }
diff --git a/tensorflow/compiler/xla/client/computation_builder.cc b/tensorflow/compiler/xla/client/computation_builder.cc
index b1dcad6a49a270935b07e26de2d3945b912359d1..4afef6e448ed154b64b0aa71f0a93a4cba4e8dec 100644
--- a/tensorflow/compiler/xla/client/computation_builder.cc
+++ b/tensorflow/compiler/xla/client/computation_builder.cc
@@ -408,7 +408,7 @@ ComputationDataHandle ComputationBuilder::Reshape(
 
 ComputationDataHandle ComputationBuilder::Collapse(
     const ComputationDataHandle& operand,
-    tensorflow::gtl::ArraySlice<int64> dims_to_collapse) {
+    tensorflow::gtl::ArraySlice<int64> dimensions) {
   if (!first_error_.ok()) {
     return ComputationDataHandle();
   }
@@ -416,8 +416,8 @@ ComputationDataHandle ComputationBuilder::Collapse(
   // Don't support out-of-order collapse here.
   // Checks that the collapsed dimensions are in order and consecutive.
   for (tensorflow::gtl::ArraySlice<int64>::size_type i = 1;
-       i < dims_to_collapse.size(); ++i) {
-    if (dims_to_collapse[i] - 1 != dims_to_collapse[i - 1]) {
+       i < dimensions.size(); ++i) {
+    if (dimensions[i] - 1 != dimensions[i - 1]) {
       NoteError(InvalidArgument(
           "Collapsed dimensions are not in order and consecutive."));
       return ComputationDataHandle();
@@ -434,9 +434,9 @@ ComputationDataHandle ComputationBuilder::Collapse(
 
   VLOG(3) << "original shape: " << ShapeUtil::HumanString(*original_shape);
   VLOG(3) << "dims to collapse: "
-          << tensorflow::str_util::Join(dims_to_collapse, ",");
+          << tensorflow::str_util::Join(dimensions, ",");
 
-  if (dims_to_collapse.size() <= 1) {
+  if (dimensions.size() <= 1) {
     // Not collapsing anything, trivially we can return the operand versus
     // enqueueing a trivial reshape.
     return operand;
@@ -444,7 +444,7 @@ ComputationDataHandle ComputationBuilder::Collapse(
 
   std::vector<int64> new_sizes;
   for (int i = 0; i < ShapeUtil::Rank(*original_shape); ++i) {
-    if (i <= dims_to_collapse.front() || i > dims_to_collapse.back()) {
+    if (i <= dimensions.front() || i > dimensions.back()) {
       new_sizes.push_back(original_shape->dimensions(i));
     } else {
       new_sizes.back() *= original_shape->dimensions(i);
@@ -753,13 +753,13 @@ ComputationDataHandle ComputationBuilder::Infeed(const Shape& shape,
 }
 
 void ComputationBuilder::Outfeed(const ComputationDataHandle& operand,
-                                 const Shape& shape,
+                                 const Shape& shape_with_layout,
                                  const string& outfeed_config) {
   OpRequest op_request;
   OutfeedRequest* request = op_request.mutable_outfeed_request();
   request->set_outfeed_config(outfeed_config);
   *request->mutable_operand() = operand;
-  *request->mutable_shape() = shape;
+  *request->mutable_shape() = shape_with_layout;
   RunOpAndNoteError(&op_request);
 }
 
@@ -789,6 +789,20 @@ ComputationDataHandle ComputationBuilder::CustomCall(
   return RunOpAndParseResponse(&op_request);
 }
 
+ComputationDataHandle ComputationBuilder::HostCompute(
+    tensorflow::gtl::ArraySlice<ComputationDataHandle> operands,
+    const string& channel_name, int64 cost_estimate_ns, const Shape& shape) {
+  OpRequest op_request;
+  HostComputeRequest* request = op_request.mutable_host_compute_request();
+  for (const ComputationDataHandle& operand : operands) {
+    *request->add_operands() = operand;
+  }
+  *request->mutable_shape() = shape;
+  request->set_channel_name(channel_name);
+  request->set_cost_estimate_ns(cost_estimate_ns);
+  return RunOpAndParseResponse(&op_request);
+}
+
 ComputationDataHandle ComputationBuilder::Complex(
     const ComputationDataHandle& real, const ComputationDataHandle& imag,
     tensorflow::gtl::ArraySlice<int64> broadcast_dimensions) {
@@ -1220,6 +1234,22 @@ ComputationDataHandle ComputationBuilder::While(
   return RunOpAndParseResponse(&op_request);
 }
 
+ComputationDataHandle ComputationBuilder::Gather(
+    const ComputationDataHandle& input,
+    const ComputationDataHandle& gather_indices,
+    const GatherDimensionNumbers& dimension_numbers,
+    tensorflow::gtl::ArraySlice<int64> window_bounds) {
+  OpRequest op_request;
+  GatherRequest* gather_request = op_request.mutable_gather_request();
+  *gather_request->mutable_input() = input;
+  *gather_request->mutable_gather_indices() = gather_indices;
+  *gather_request->mutable_dimension_numbers() = dimension_numbers;
+  for (int64 window_bound : window_bounds) {
+    gather_request->add_window_bounds(window_bound);
+  }
+  return RunOpAndParseResponse(&op_request);
+}
+
 ComputationDataHandle ComputationBuilder::Conditional(
     const ComputationDataHandle& predicate,
     const ComputationDataHandle& true_operand,
@@ -1352,15 +1382,16 @@ ComputationDataHandle ComputationBuilder::BatchNormInference(
 
 ComputationDataHandle ComputationBuilder::BatchNormGrad(
     const ComputationDataHandle& operand, const ComputationDataHandle& scale,
-    const ComputationDataHandle& mean, const ComputationDataHandle& var,
+    const ComputationDataHandle& batch_mean,
+    const ComputationDataHandle& batch_var,
     const ComputationDataHandle& grad_output, float epsilon,
     int64 feature_index) {
   OpRequest op_request;
   BatchNormGradRequest* request = op_request.mutable_batch_norm_grad_request();
   *request->mutable_operand() = operand;
   *request->mutable_scale() = scale;
-  *request->mutable_mean() = mean;
-  *request->mutable_variance() = var;
+  *request->mutable_mean() = batch_mean;
+  *request->mutable_variance() = batch_var;
   *request->mutable_grad_output() = grad_output;
   request->set_epsilon(epsilon);
   request->set_feature_index(feature_index);
diff --git a/tensorflow/compiler/xla/client/computation_builder.h b/tensorflow/compiler/xla/client/computation_builder.h
index 7cae91e9e04bba8f28f2348c552a941e4f7a36b4..e085fcb3b1894a54c9563513a00b783cc3eb1ef9 100644
--- a/tensorflow/compiler/xla/client/computation_builder.h
+++ b/tensorflow/compiler/xla/client/computation_builder.h
@@ -198,9 +198,8 @@ class ComputationBuilder {
                                 tensorflow::gtl::ArraySlice<int64> new_sizes);
 
   // Enqueues an operation onto the computation that collapses the operand, from
-  // minor to major order, then reshapes it into the shape with the given
-  // dimension sizes, also from major to minor. Conceptually, this is a limited
-  // form of "shape casting".
+  // first to last dimension (C order), then reshapes it to the given dimension
+  // sizes. Conceptually, this is a limited form of "shape casting".
   ComputationDataHandle Reshape(const ComputationDataHandle& operand,
                                 tensorflow::gtl::ArraySlice<int64> new_sizes);
 
@@ -446,6 +445,16 @@ class ComputationBuilder {
       tensorflow::gtl::ArraySlice<ComputationDataHandle> operands,
       const Shape& shape);
 
+  // Enqueues a pseudo-op to represent host-side computation data-dependencies.
+  // During code generation, host send and receive operations will be generated
+  // to transfer |operands| to the host and a single result of |shape| back to
+  // the device.  Host send/recv operations are emitted using |channel_name|.
+  // Dataflow dependencies and the |cost_estimate_ns| field may be used in HLO
+  // instruction scheduling.
+  ComputationDataHandle HostCompute(
+      tensorflow::gtl::ArraySlice<ComputationDataHandle> operands,
+      const string& channel_name, int64 cost_estimate_ns, const Shape& shape);
+
   // The following methods enqueue element-wise binary arithmetic operations
   // onto the computation. The shapes of the operands have to match unless one
   // of the operands is a scalar, or an explicit broadcast dimension is given
@@ -708,6 +717,13 @@ class ComputationBuilder {
                                         const int exponent_bits,
                                         const int mantissa_bits);
 
+  // Enqueues a Gather node onto the computation.
+  ComputationDataHandle Gather(
+      const ComputationDataHandle& input,
+      const ComputationDataHandle& gather_indices,
+      const GatherDimensionNumbers& dimension_numbers,
+      tensorflow::gtl::ArraySlice<int64> window_bounds);
+
   // Enqueues a Send node onto the computation, to send the given operand to
   // a Recv instruction that shares the same channel handle.
   void Send(const ComputationDataHandle& operand, const ChannelHandle& handle);
@@ -856,7 +872,7 @@ class ComputationBuilder {
                   Window* window);
 
   // Internal helper method that does the building for an arbitrary unary op.
-  ComputationDataHandle UnaryOp(UnaryOperation binop,
+  ComputationDataHandle UnaryOp(UnaryOperation unop,
                                 const ComputationDataHandle& operand);
 
   // Internal helper method that does the building for an arbitrary binary op.
diff --git a/tensorflow/compiler/xla/client/local_client.h b/tensorflow/compiler/xla/client/local_client.h
index b52a30f5a0b92e0094e6b0de3241c10a5a909cad..de0ed13c43f87966c272102b2e9af9ff3be63aea 100644
--- a/tensorflow/compiler/xla/client/local_client.h
+++ b/tensorflow/compiler/xla/client/local_client.h
@@ -69,7 +69,7 @@ class LocalExecutable {
   // of the computation.
   tensorflow::Status ValidateExecutionOptions(
       const tensorflow::gtl::ArraySlice<const ShapedBuffer*> arguments,
-      const ExecutableRunOptions& options, const Backend& backend);
+      const ExecutableRunOptions& run_options, const Backend& backend);
 
   // Records the computation in a SessionModule proto with the arguments used to
   // invoke it, and the result. Enabled by flag: --tla_dump_executions_to.
diff --git a/tensorflow/compiler/xla/literal_util.cc b/tensorflow/compiler/xla/literal_util.cc
index e0a9b148b443e90a0c4f3e19660b6234d49eef84..1d1418fc2f7d2f47641bbe5806fc06dfbcb7ebd0 100644
--- a/tensorflow/compiler/xla/literal_util.cc
+++ b/tensorflow/compiler/xla/literal_util.cc
@@ -223,7 +223,7 @@ Status Literal::CopySliceFromInternal(
     Literal::StrideConfig stride_config(src_literal.shape(), shape(),
                                         copy_size);
 
-    auto copy_proc = [&](const std::vector<int64>& indexes) {
+    auto copy_proc = [&](tensorflow::gtl::ArraySlice<int64> indexes) {
       // Map from multi-dimensional index, to source index.
       std::transform(indexes.begin(), indexes.end(), src_base.begin(),
                      src_indexes.begin(), std::plus<int64>());
@@ -343,7 +343,7 @@ Status Literal::Piece::CopyFrom(const Literal::Piece& src) {
 #undef COPY_ELEMENTS
       default:
         return Unimplemented(
-            "Unhandled primitive type %s",
+            "Copying a Literal object with element type %s is not implemented.",
             PrimitiveType_Name(subshape().element_type()).c_str());
     }
   }
@@ -491,7 +491,10 @@ Status Literal::CopySliceFrom(const Literal& src_literal,
     default:
       break;
   }
-  return Unimplemented("Unhandled primitive type %d", shape().element_type());
+  return Unimplemented(
+      "Copying a slice from a Literal object with element type %d is not "
+      "implemented.",
+      shape().element_type());
 }
 
 /* static */ Literal Literal::Zero(PrimitiveType primitive_type) {
@@ -810,7 +813,7 @@ std::unique_ptr<Literal> Literal::Slice(
     CHECK_GE(start_indices[dnum], 0);
     CHECK_LE(limit_indices[dnum], shape().dimensions(dnum));
     int64 dimension = limit_indices[dnum] - start_indices[dnum];
-    CHECK_GT(dimension, 0);
+    CHECK_GE(dimension, 0);
     result_dimensions.push_back(dimension);
   }
   const auto result_shape =
@@ -1009,6 +1012,49 @@ void Literal::SortSparseElements(const ShapeIndex& shape_index) {
   piece(shape_index).SortSparseElements();
 }
 
+Literal Literal::GetFirstScalarLiteral() const {
+  CHECK(ShapeUtil::IsArray(shape_));
+  CHECK_GT(ShapeUtil::ElementsIn(shape_), 0);
+  switch (shape_.element_type()) {
+    case PRED:
+      return std::move(*Literal::CreateR0<bool>(GetFirstElement<bool>()));
+    // 8 bit types.
+    case S8:
+      return std::move(*Literal::CreateR0<int8>(GetFirstElement<int8>()));
+    case U8:
+      return std::move(*Literal::CreateR0<uint8>(GetFirstElement<uint8>()));
+    // 16 bit types.
+    case BF16:
+      return std::move(
+          *Literal::CreateR0<bfloat16>(GetFirstElement<bfloat16>()));
+    case F16:
+      return std::move(*Literal::CreateR0<half>(GetFirstElement<half>()));
+    case S16:
+      return std::move(*Literal::CreateR0<int16>(GetFirstElement<int16>()));
+    case U16:
+      return std::move(*Literal::CreateR0<uint16>(GetFirstElement<uint16>()));
+    // 32 bit types.
+    case F32:
+      return std::move(*Literal::CreateR0<float>(GetFirstElement<float>()));
+    case S32:
+      return std::move(*Literal::CreateR0<int32>(GetFirstElement<int32>()));
+    case U32:
+      return std::move(*Literal::CreateR0<uint32>(GetFirstElement<uint32>()));
+    // 64 bit types.
+    case C64:
+      return std::move(
+          *Literal::CreateR0<complex64>(GetFirstElement<complex64>()));
+    case F64:
+      return std::move(*Literal::CreateR0<double>(GetFirstElement<double>()));
+    case S64:
+      return std::move(*Literal::CreateR0<int64>(GetFirstElement<int64>()));
+    case U64:
+      return std::move(*Literal::CreateR0<uint64>(GetFirstElement<uint64>()));
+    default:
+      LOG(FATAL) << "Unhandled primitive type " << shape_.element_type();
+  }
+}
+
 void Literal::Piece::SortSparseElements() {
   switch (subshape().element_type()) {
     case PRED:
@@ -1351,8 +1397,8 @@ StatusOr<std::unique_ptr<Literal>> ConvertIfDestTypeMatches(
       return ConvertToC64<primitive_src_type>(src_literal);
     // Other types are not yet supported.
     default:
-      return InvalidArgument(
-          "Unimplemented: Convert from type %s to type %s",
+      return Unimplemented(
+          "Converting from type %s to type %s is not implemented.",
           PrimitiveType_Name(src_literal.shape().element_type()).c_str(),
           PrimitiveType_Name(primitive_dest_type).c_str());
   }
@@ -1381,12 +1427,31 @@ StatusOr<std::unique_ptr<Literal>> Literal::Convert(
 #undef CONVERT_IF_DEST_TYPE_MATCHES
       // Other types are not yet supported.
     default:
-      return InvalidArgument("Unimplemented: Convert from type %s to type %s",
-                             PrimitiveType_Name(shape().element_type()).c_str(),
-                             PrimitiveType_Name(primitive_dest_type).c_str());
+      return Unimplemented(
+          "Converting from type %s to type %s is not implemented.",
+          PrimitiveType_Name(shape().element_type()).c_str(),
+          PrimitiveType_Name(primitive_dest_type).c_str());
   }
 }
 
+StatusOr<std::unique_ptr<Literal>> Literal::ConvertToShape(
+    const Shape& dest_shape) const {
+  if (!ShapeUtil::IsTuple(dest_shape)) {
+    return Convert(dest_shape.element_type());
+  }
+  std::vector<Literal> elements;
+  for (int i = 0; i < ShapeUtil::TupleElementCount(shape()); ++i) {
+    auto element = LiteralView::Create(*this, {i});
+    TF_ASSIGN_OR_RETURN(
+        auto new_element,
+        element.ConvertToShape(ShapeUtil::GetSubshape(dest_shape, {i})));
+    elements.push_back(std::move(*new_element));
+  }
+  auto converted = MakeUnique<Literal>();
+  *converted = Literal::MoveIntoTuple(&elements);
+  return std::move(converted);
+}
+
 template <typename NativeT>
 bool Literal::Piece::EqualElementsInternal(
     const Literal::Piece& other, std::vector<int64>* multi_index) const {
@@ -1571,6 +1636,92 @@ bool Literal::IsAllComplex(complex64 value) const {
   }
 }
 
+bool Literal::IsAllFirst() const {
+  for (const auto& pair : pieces_) {
+    const Piece& piece = pair.second;
+    if (!ShapeUtil::IsArray(piece.subshape())) {
+      continue;
+    }
+
+    // Empty shapes are not all the first element since there is no first
+    // element.
+    if (ShapeUtil::HasZeroElements(piece.subshape())) {
+      return false;
+    }
+    auto piece_is_all = [&]() {
+      switch (piece.subshape().element_type()) {
+        case PRED: {
+          auto data = piece.data<bool>();
+          return AllElementsEqualValue<bool>(data, data[0]);
+        }
+        // 8 bit types
+        case S8: {
+          auto data = piece.data<int8>();
+          return AllElementsEqualValue<int8>(data, data[0]);
+        }
+        case U8: {
+          auto data = piece.data<uint8>();
+          return AllElementsEqualValue<uint8>(data, data[0]);
+        }
+        // 16 bit types
+        case BF16: {
+          auto data = piece.data<bfloat16>();
+          return AllElementsEqualValue<bfloat16>(data, data[0]);
+        }
+        case F16: {
+          auto data = piece.data<half>();
+          return AllElementsEqualValue<half>(data, data[0]);
+        }
+        case S16: {
+          auto data = piece.data<int16>();
+          return AllElementsEqualValue<int16>(data, data[0]);
+        }
+        case U16: {
+          auto data = piece.data<uint16>();
+          return AllElementsEqualValue<uint16>(data, data[0]);
+        }
+        // 32 bit types
+        case F32: {
+          auto data = piece.data<float>();
+          return AllElementsEqualValue<float>(data, data[0]);
+        }
+        case U32: {
+          auto data = piece.data<uint32>();
+          return AllElementsEqualValue<uint32>(data, data[0]);
+        }
+        case S32: {
+          auto data = piece.data<int32>();
+          return AllElementsEqualValue<int32>(data, data[0]);
+        }
+        // 64 bit types
+        case C64: {
+          auto data = piece.data<complex64>();
+          return AllElementsEqualValue<complex64>(data, data[0]);
+        }
+        case F64: {
+          auto data = piece.data<double>();
+          return AllElementsEqualValue<double>(data, data[0]);
+        }
+        case S64: {
+          auto data = piece.data<int64>();
+          return AllElementsEqualValue<int64>(data, data[0]);
+        }
+        case U64: {
+          auto data = piece.data<uint64>();
+          return AllElementsEqualValue<uint64>(data, data[0]);
+        }
+        default:
+          return false;
+      }
+    };
+
+    if (!piece_is_all()) {
+      return false;
+    }
+  }
+  return true;
+}
+
 bool Literal::IsZero(tensorflow::gtl::ArraySlice<int64> indices) const {
   CHECK(ShapeUtil::IsArray(shape()));
   switch (shape().element_type()) {
diff --git a/tensorflow/compiler/xla/literal_util.h b/tensorflow/compiler/xla/literal_util.h
index d996004888ab521790b4c5a10da2a93f0d98d12f..cdc5d807e09e09663f3e03d6556a4b832d9420e5 100644
--- a/tensorflow/compiler/xla/literal_util.h
+++ b/tensorflow/compiler/xla/literal_util.h
@@ -333,6 +333,11 @@ class Literal {
   StatusOr<std::unique_ptr<Literal>> Convert(
       PrimitiveType primitive_dest_type) const;
 
+  // Converts this literal to the given shape. Returns an error if the
+  // conversion is not possible.
+  StatusOr<std::unique_ptr<Literal>> ConvertToShape(
+      const Shape& dest_shape) const;
+
   // Creates a scalar literal value zero of the given primitive type.
   static Literal Zero(PrimitiveType primitive_type);
 
@@ -451,6 +456,9 @@ class Literal {
   template <typename NativeT>
   NativeT GetFirstElement() const;
 
+  // Returns a literal scalar representing the first element.
+  Literal GetFirstScalarLiteral() const;
+
   // As Get(), but determines the correct type and converts the value
   // into text.
   string GetAsString(tensorflow::gtl::ArraySlice<int64> multi_index,
@@ -602,6 +610,9 @@ class Literal {
   // This literal must have a dense layout.
   bool IsAllComplex(complex64 value) const;
 
+  // Returns whether every element of this literal equals its first element.
+  bool IsAllFirst() const;
+
   // Returns whether this literal is zero at the specified index. This literal
   // must be an array with a dense layout.
   bool IsZero(tensorflow::gtl::ArraySlice<int64> indices) const;
@@ -1263,7 +1274,7 @@ Status Literal::Populate(const FnType& generator) {
     int64 minor_dimension_size =
         ShapeUtil::GetDimension(this_shape, stride_config.minor_dimension);
 
-    auto init_function = [&](const std::vector<int64>& indexes) {
+    auto init_function = [&](tensorflow::gtl::ArraySlice<int64> indexes) {
       const int64 index =
           IndexUtil::MultidimensionalIndexToLinearIndex(shape(), indexes);
       std::copy(indexes.begin(), indexes.end(), minor_scan_indexes.begin());
diff --git a/tensorflow/compiler/xla/literal_util_test.cc b/tensorflow/compiler/xla/literal_util_test.cc
index b3583c2eb75de8297d5e7507430491f119bd4462..04e45f00491b0bef94f3c0af1c875b2d007194fd 100644
--- a/tensorflow/compiler/xla/literal_util_test.cc
+++ b/tensorflow/compiler/xla/literal_util_test.cc
@@ -30,6 +30,7 @@ limitations under the License.
 namespace xla {
 namespace {
 
+using tensorflow::gtl::ArraySlice;
 using ::testing::ElementsAre;
 using ::testing::HasSubstr;
 
@@ -214,11 +215,11 @@ TEST_F(LiteralUtilTest, CreateSparse) {
   std::vector<int64> expected_values = {8, 9, 7, 10};
 
   EXPECT_EQ(literal->sparse_indices()->data(),
-            tensorflow::gtl::ArraySlice<int64>(
-                expected_indices.data(), expected_indices.num_elements()));
-  EXPECT_EQ(tensorflow::gtl::ArraySlice<int64>(literal->data<int64>().data(),
-                                               expected_values.size()),
-            tensorflow::gtl::ArraySlice<int64>(expected_values));
+            ArraySlice<int64>(expected_indices.data(),
+                              expected_indices.num_elements()));
+  EXPECT_EQ(
+      ArraySlice<int64>(literal->data<int64>().data(), expected_values.size()),
+      ArraySlice<int64>(expected_values));
 }
 
 TEST_F(LiteralUtilTest, LiteralR4F32ProjectedStringifies) {
@@ -290,7 +291,7 @@ TEST_F(LiteralUtilTest, EachCellR2F32) {
   // clang-format on
   std::vector<std::tuple<int64, int64, string>> seen;
   literal->EachCellAsString(
-      [&seen](tensorflow::gtl::ArraySlice<int64> indices, const string& value) {
+      [&seen](ArraySlice<int64> indices, const string& value) {
         seen.emplace_back(indices[0], indices[1], value);
       });
 
@@ -501,6 +502,24 @@ TEST_F(LiteralUtilTest, IsAllComplex) {
                    ->IsAllComplex({8.0f, 9.0f}));
 }
 
+TEST_F(LiteralUtilTest, IsAllFirst) {
+  // IsAllFirst is true only when every element equals the first element.
+  EXPECT_FALSE(Literal::CreateR1<bool>({false, true})->IsAllFirst());
+  EXPECT_TRUE(Literal::CreateR1<bool>({false, false})->IsAllFirst());
+  EXPECT_FALSE(Literal::CreateR1<int8>({1, 1, 2})->IsAllFirst());
+  EXPECT_TRUE(Literal::CreateR1<int8>({5, 5, 5, 5})->IsAllFirst());
+  EXPECT_FALSE(Literal::CreateR1<uint8>({1, 1, 2})->IsAllFirst());
+  EXPECT_TRUE(Literal::CreateR1<int32>({5, 5, 5, 5})->IsAllFirst());
+  EXPECT_FALSE(Literal::CreateR1<int32>({1, 1, 2})->IsAllFirst());
+  EXPECT_TRUE(Literal::CreateR1<uint32>({5, 5, 5, 5})->IsAllFirst());
+  EXPECT_FALSE(Literal::CreateR1<uint32>({1, 1, 2})->IsAllFirst());
+
+  complex64 c8_9 = {8, 9};
+  complex64 c7_9 = {7, 9};
+  EXPECT_TRUE(Literal::CreateR2<complex64>({{c8_9}, {c8_9}})->IsAllFirst());
+  EXPECT_FALSE(Literal::CreateR2<complex64>({{c7_9}, {c8_9}})->IsAllFirst());
+}
+
 TEST_F(LiteralUtilTest, IsZero) {
   auto scalar_zero = Literal::CreateR0<float>(0.0f);
   auto scalar_one = Literal::CreateR0<float>(1.0f);
@@ -604,11 +623,10 @@ TEST_F(LiteralUtilTest, TransposeR4) {
   // clang-format on
   auto reshape = original->Transpose(/*permutation=*/{2, 3, 0, 1});
 
-  reshape->EachCell<float>(
-      [&](tensorflow::gtl::ArraySlice<int64> indices, float value) {
-        EXPECT_EQ(value, original->Get<float>(
-                             {indices[2], indices[3], indices[0], indices[1]}));
-      });
+  reshape->EachCell<float>([&](ArraySlice<int64> indices, float value) {
+    EXPECT_EQ(value, original->Get<float>(
+                         {indices[2], indices[3], indices[0], indices[1]}));
+  });
 }
 
 TEST_F(LiteralUtilTest, TestR4RelayoutEquivalence) {
@@ -845,7 +863,7 @@ TEST_F(LiteralUtilTest, CopySliceFrom) {
     const int64 zero_base[] = {0, 0, 0, 0};
     const int64 step[] = {1, 1, 1, 1};
     uint32 seqnr = 0;
-    auto init_proc = [&](const std::vector<int64>& indexes) {
+    auto init_proc = [&](ArraySlice<int64> indexes) {
       source->Set(indexes, ++seqnr);
       return true;
     };
@@ -861,7 +879,7 @@ TEST_F(LiteralUtilTest, CopySliceFrom) {
     std::vector<int64> source_indexes(TF_ARRAYSIZE(dimensions), 0);
     std::vector<int64> blank_indexes(TF_ARRAYSIZE(dimensions), 0);
     bool matched = true;
-    auto check_proc = [&](const std::vector<int64>& indexes) {
+    auto check_proc = [&](ArraySlice<int64> indexes) {
       std::copy(indexes.begin(), indexes.end(), source_indexes.begin());
       std::transform(source_indexes.begin(), source_indexes.end(), src_base,
                      source_indexes.begin(), std::plus<int64>());
@@ -1049,7 +1067,7 @@ TEST_F(LiteralUtilTest, Populate) {
         primitive_util::NativeToPrimitiveType<uint32>(), data.dimensions,
         data.layout);
     auto literal = Literal::CreateFromShape(shape);
-    auto generator = [&](tensorflow::gtl::ArraySlice<int64> indexes) -> uint32 {
+    auto generator = [&](ArraySlice<int64> indexes) -> uint32 {
       // Offsets from linear index just to avoid R0 literals to be initialized
       // with zero.
       return IndexUtil::MultidimensionalIndexToLinearIndex(literal->shape(),
@@ -1061,7 +1079,7 @@ TEST_F(LiteralUtilTest, Populate) {
     std::vector<int64> zero_base(data.dimensions.size(), 0);
     std::vector<int64> step(data.dimensions.size(), 1);
     bool matched = true;
-    auto check_function = [&](const std::vector<int64>& indexes) {
+    auto check_function = [&](ArraySlice<int64> indexes) {
       auto value = literal->Get<uint32>(indexes);
       matched = matched && (value == generator(indexes));
       return matched;
@@ -1214,15 +1232,15 @@ TEST_F(LiteralUtilTest, ConvertIfTypesMatch) {
   EXPECT_EQ(*conv, *c64);
 
   EXPECT_EQ(s32->Convert(TUPLE).status().code(),
-            tensorflow::error::INVALID_ARGUMENT);
+            tensorflow::error::UNIMPLEMENTED);
   EXPECT_EQ(s32->Convert(S16).status().code(),
-            tensorflow::error::INVALID_ARGUMENT);
+            tensorflow::error::UNIMPLEMENTED);
   EXPECT_EQ(s32->Convert(U16).status().code(),
-            tensorflow::error::INVALID_ARGUMENT);
+            tensorflow::error::UNIMPLEMENTED);
   EXPECT_EQ(c64->Convert(F32).status().code(),
-            tensorflow::error::INVALID_ARGUMENT);
+            tensorflow::error::UNIMPLEMENTED);
   EXPECT_EQ(c64->Convert(S32).status().code(),
-            tensorflow::error::INVALID_ARGUMENT);
+            tensorflow::error::UNIMPLEMENTED);
 }
 
 TEST_F(LiteralUtilTest, CopyFromProto_Bool) {
diff --git a/tensorflow/compiler/xla/python/local_computation_builder.cc b/tensorflow/compiler/xla/python/local_computation_builder.cc
index cb7bb21e092c80d7360c23f3d6b00409a75dce23..b21ab3044fae7136071f50bdba6e74b799a309d5 100644
--- a/tensorflow/compiler/xla/python/local_computation_builder.cc
+++ b/tensorflow/compiler/xla/python/local_computation_builder.cc
@@ -368,6 +368,12 @@ ComputationDataHandle LocalComputationBuilder::Slice(
   return builder_.Slice(operand, start_indices, limit_indices, strides);
 }
 
+ComputationDataHandle LocalComputationBuilder::SliceInDim(
+    const ComputationDataHandle& operand, int64 start_index, int64 limit_index,
+    int64 stride, int64 dimno) {
+  return builder_.SliceInDim(operand, start_index, limit_index, stride, dimno);
+}
+
 ComputationDataHandle LocalComputationBuilder::DynamicSlice(
     const ComputationDataHandle& operand,
     const ComputationDataHandle& start_indices,
diff --git a/tensorflow/compiler/xla/python/local_computation_builder.h b/tensorflow/compiler/xla/python/local_computation_builder.h
index d3e9503ea10b011520ec5148a756ef4d421f244c..a7375c8965e9041226ffee08dab6ffafa25312af 100644
--- a/tensorflow/compiler/xla/python/local_computation_builder.h
+++ b/tensorflow/compiler/xla/python/local_computation_builder.h
@@ -170,6 +170,10 @@ class LocalComputationBuilder {
                               tensorflow::gtl::ArraySlice<int64> limit_indices,
                               tensorflow::gtl::ArraySlice<int64> strides);
 
+  ComputationDataHandle SliceInDim(const ComputationDataHandle& operand,
+                                   int64 start_index, int64 limit_index,
+                                   int64 stride, int64 dimno);
+
   ComputationDataHandle DynamicSlice(
       const ComputationDataHandle& operand,
       const ComputationDataHandle& start_indices,
diff --git a/tensorflow/compiler/xla/python/local_computation_builder.i b/tensorflow/compiler/xla/python/local_computation_builder.i
index 456e341f877e529f7fc5ebc81d85862bfa291943..b5354131c94930b75ea66036ddb61ecd3993414f 100644
--- a/tensorflow/compiler/xla/python/local_computation_builder.i
+++ b/tensorflow/compiler/xla/python/local_computation_builder.i
@@ -886,6 +886,7 @@ tensorflow::ImportNumpy();
 %unignore xla::swig::LocalComputationBuilder::Collapse;
 %unignore xla::swig::LocalComputationBuilder::CrossReplicaSum;
 %unignore xla::swig::LocalComputationBuilder::Slice;
+%unignore xla::swig::LocalComputationBuilder::SliceInDim;
 %unignore xla::swig::LocalComputationBuilder::DynamicSlice;
 %unignore xla::swig::LocalComputationBuilder::DynamicUpdateSlice;
 %unignore xla::swig::LocalComputationBuilder::ConcatInDim;
diff --git a/tensorflow/compiler/xla/python/xla_client.py b/tensorflow/compiler/xla/python/xla_client.py
index 9bda9d09294bc75acaa35d8e4a512820046e8920..90cda42f3227c80826ffbf4e5473647c2795544d 100644
--- a/tensorflow/compiler/xla/python/xla_client.py
+++ b/tensorflow/compiler/xla/python/xla_client.py
@@ -30,9 +30,9 @@ from tensorflow.compiler.xla import xla_data_pb2
 from tensorflow.compiler.xla.python import pywrap_xla as c_api
 
 
-# Most functions are snake_case for consistency with other modules,
-# whereas method names of ComputationBuilder and LocalComputation are
-# CamelCase for consistency with XLA.
+# Most functions are snake_case for consistency with other modules, whereas
+# method names of ComputationBuilder and LocalComputation are CamelCase for
+# consistency with XLA.
 # pylint: disable=invalid-name
 
 
@@ -123,24 +123,34 @@ _BINARY_OPS = [
     'Pow',
 ]
 
+
 XLA_ELEMENT_TYPE_TO_DTYPE = {
-    xla_data_pb2.F32: np.dtype(np.float32),
-    xla_data_pb2.F64: np.dtype(np.float64),
-    xla_data_pb2.S32: np.dtype(np.int32),
-    xla_data_pb2.S64: np.dtype(np.int64),
-    xla_data_pb2.U32: np.dtype(np.uint32),
-    xla_data_pb2.U64: np.dtype(np.uint64),
-    xla_data_pb2.PRED: np.dtype(np.bool),
+    xla_data_pb2.PRED: np.dtype('bool'),
+    xla_data_pb2.S8: np.dtype('int8'),
+    xla_data_pb2.S16: np.dtype('int16'),
+    xla_data_pb2.S32: np.dtype('int32'),
+    xla_data_pb2.S64: np.dtype('int64'),
+    xla_data_pb2.U8: np.dtype('uint8'),
+    xla_data_pb2.U16: np.dtype('uint16'),
+    xla_data_pb2.U32: np.dtype('uint32'),
+    xla_data_pb2.U64: np.dtype('uint64'),
+    xla_data_pb2.F16: np.dtype('float16'),
+    xla_data_pb2.F32: np.dtype('float32'),
+    xla_data_pb2.F64: np.dtype('float64'),
+    xla_data_pb2.C64: np.dtype('complex64'),
     xla_data_pb2.TUPLE: np.dtype(np.object),
 }
 
 # Note the conversion on the key. Numpy has a known issue wherein dtype hashing
 # doesn't work as expected (https://github.com/numpy/numpy/issues/7242). Thus,
 # when keying by dtype in this dict, we use the string form of dtypes.
-DTYPE_TO_XLA_ELEMENT_TYPE = {
-    str(v): k
-    for k, v in XLA_ELEMENT_TYPE_TO_DTYPE.items()
-}
+DTYPE_TO_XLA_ELEMENT_TYPE = {str(dt): et
+                             for et, dt in XLA_ELEMENT_TYPE_TO_DTYPE.items()}
+
+
+def dtype_to_etype(dtype):
+  """Convenience function for reading DTYPE_TO_XLA_ELEMENT_TYPE."""
+  return DTYPE_TO_XLA_ELEMENT_TYPE[str(np.dtype(dtype))]
 
 
 class LocalBuffer(object):
@@ -656,7 +666,7 @@ class ComputationBuilder(object):
         representing the configuration of the padding operation.
 
     Returns:
-      A ComputationDataHandle representing the added pad op.
+      A ComputationDataHandle representing the added Pad op.
     """
     if not isinstance(padding_config, xla_data_pb2.PaddingConfig):
       padding_config = GetPaddingConfigFromTriples(padding_config)
@@ -666,7 +676,20 @@ class ComputationBuilder(object):
                          padding_config))
 
   def Reshape(self, operand, dimensions, new_sizes):
-    """Reshape op."""
+    """Enqueues a reshape op onto the computation.
+
+    Args:
+      operand: ComputationDataHandle representing the array to be reshaped.
+      dimensions: sequence of integers encoding the order in which dimensions
+        are collapsed or None, in which case dimensions are flattened in order.
+      new_sizes: sequence of integers encoding the new dimension sizes (shape).
+
+    Returns:
+      A ComputationDataHandle representing the added Reshape op.
+    """
+    if dimensions is None:
+      ndim = len(self.GetShape(operand).dimensions())
+      dimensions = tuple(range(ndim))
     return _wrap_data_handle(
         self._client.Reshape(
             _unwrap_data_handle(operand), dimensions, new_sizes))
@@ -772,11 +795,27 @@ class ComputationBuilder(object):
       strides = [1] * len(start_indices)
     return _wrap_data_handle(
         self._client.Slice(
-            _unwrap_data_handle(operand),
-            start_indices,
-            limit_indices,
+            _unwrap_data_handle(operand), start_indices, limit_indices,
             strides))
 
+  def SliceInDim(self, operand, start_index, limit_index, stride, dimno):
+    """Enqueues a slice-in-dimension operation onto the computation.
+
+    Args:
+      operand: ComputationDataHandle for the N dimensional array to be sliced.
+      start_index: an integer containing the start index of the slice.
+      limit_index: an integer containing the end index of the slice.
+      stride: an integer containing the stride size for the slice.
+      dimno: an integer indicating the dimension along which to slice.
+
+    Returns:
+      A ComputationDataHandle representing the added Slice op.
+    """
+    return _wrap_data_handle(
+        self._client.SliceInDim(
+            _unwrap_data_handle(operand), start_index, limit_index, stride,
+            dimno))
+
   def DynamicSlice(self, operand, start_indices, slice_sizes):
     """Enqueues a slice op with dynamic start indices onto the computation.
 
diff --git a/tensorflow/compiler/xla/python/xla_client_test.py b/tensorflow/compiler/xla/python/xla_client_test.py
index c9d09cd5d57e001fd48d2dba9f2b0ee18374231b..4c16c1f8b07a28d8098e92e27f81a126ed9bdf0c 100644
--- a/tensorflow/compiler/xla/python/xla_client_test.py
+++ b/tensorflow/compiler/xla/python/xla_client_test.py
@@ -762,6 +762,23 @@ class SingleOpTest(LocalComputationTest):
         [3, 2])
     self._ExecuteAndCompareExact(c, expected=[[4, 5], [7, 8]])
 
+  def testSliceInDim(self):
+    c = self._NewComputation()
+    c.SliceInDim(
+        c.Constant(NumpyArrayS32([[1, 2, 3], [4, 5, 6], [7, 8, 9]])),
+        start_index=1,
+        limit_index=2,
+        stride=1,
+        dimno=1)
+    self._ExecuteAndCompareExact(c, expected=[[2], [5], [8]])
+    c.SliceInDim(
+        c.Constant(NumpyArrayS32([[1, 2, 3], [4, 5, 6], [7, 8, 9]])),
+        start_index=0,
+        limit_index=3,
+        stride=2,
+        dimno=0)
+    self._ExecuteAndCompareExact(c, expected=[[1, 2, 3], [7, 8, 9]])
+
   def testDynamicSlice(self):
     c = self._NewComputation()
     c.DynamicSlice(
diff --git a/tensorflow/compiler/xla/reference_util.cc b/tensorflow/compiler/xla/reference_util.cc
index a9acdae380af5b7f9efb3d08302fc717108f5e40..8711b8aa2ef47103f0ec5972f790843273c54f8c 100644
--- a/tensorflow/compiler/xla/reference_util.cc
+++ b/tensorflow/compiler/xla/reference_util.cc
@@ -30,29 +30,23 @@ limitations under the License.
 
 namespace xla {
 
-/* static */ std::unique_ptr<Array2D<float>> ReferenceUtil::TransposeArray2D(
-    const Array2D<float>& operand) {
-  auto result = MakeUnique<Array2D<float>>(operand.width(), operand.height());
-  for (int64 w = 0; w < operand.width(); ++w) {
-    for (int64 h = 0; h < operand.height(); ++h) {
-      (*result)(w, h) = operand(h, w);
-    }
-  }
-
-  return result;
-}
-
-/* static */ std::unique_ptr<Array2D<float>> ReferenceUtil::MatmulArray2D(
-    const Array2D<float>& lhs, const Array2D<float>& rhs) {
+namespace {
+
+template <typename T>
+std::unique_ptr<Array2D<T>> MatmulArray2DImpl(
+    const Array2D<T>& lhs, const Array2D<T>& rhs,
+    const std::function<void(
+        const void* run_options_ptr, T* out, T* lhs, T* rhs, int64 m, int64 n,
+        int64 k, int32 transpose_lhs, int32 transpose_rhs)>& impl_fn) {
   CHECK_EQ(lhs.width(), rhs.height());
   int m = lhs.height();
   int n = rhs.width();
   int k = lhs.width();
-  auto result = MakeUnique<Array2D<float>>(m, n);
+  auto result = MakeUnique<Array2D<T>>(m, n);
   // Because Eigen is a header-oriented library, make sure that the Eigen code
   // is the same as the code used by the CPU backend (otherwise the linker will
   // randomly pick *some* definition).
-  __xla_cpu_runtime_EigenSingleThreadedMatMulF32(
+  impl_fn(
       /*run_options_ptr=*/nullptr, result->data(), rhs.data(), lhs.data(), n, m,
       k,
       /*transpose_lhs=*/0,
@@ -60,22 +54,24 @@ namespace xla {
   return result;
 }
 
+}  // namespace
+
+/* static */ std::unique_ptr<Array2D<Eigen::half>> ReferenceUtil::MatmulArray2D(
+    const Array2D<Eigen::half>& lhs, const Array2D<Eigen::half>& rhs) {
+  return MatmulArray2DImpl<Eigen::half>(
+      lhs, rhs, __xla_cpu_runtime_EigenSingleThreadedMatMulF16);
+}
+
+/* static */ std::unique_ptr<Array2D<float>> ReferenceUtil::MatmulArray2D(
+    const Array2D<float>& lhs, const Array2D<float>& rhs) {
+  return MatmulArray2DImpl<float>(
+      lhs, rhs, __xla_cpu_runtime_EigenSingleThreadedMatMulF32);
+}
+
 /* static */ std::unique_ptr<Array2D<double>> ReferenceUtil::MatmulArray2D(
     const Array2D<double>& lhs, const Array2D<double>& rhs) {
-  CHECK_EQ(lhs.width(), rhs.height());
-  int m = lhs.height();
-  int n = rhs.width();
-  int k = lhs.width();
-  auto result = MakeUnique<Array2D<double>>(m, n);
-  // Because Eigen is a header-oriented library, make sure that the Eigen code
-  // is the same as the code used by the CPU backend (otherwise the linker will
-  // randomly pick *some* definition).
-  __xla_cpu_runtime_EigenSingleThreadedMatMulF64(
-      /*run_options_ptr=*/nullptr, result->data(), rhs.data(), lhs.data(), n, m,
-      k,
-      /*transpose_lhs=*/0,
-      /*transpose_rhs=*/0);
-  return result;
+  return MatmulArray2DImpl<double>(
+      lhs, rhs, __xla_cpu_runtime_EigenSingleThreadedMatMulF64);
 }
 
 /* static */ std::unique_ptr<Array2D<double>> ReferenceUtil::Array2DF32ToF64(
diff --git a/tensorflow/compiler/xla/reference_util.h b/tensorflow/compiler/xla/reference_util.h
index 3ec96f2f38b8f91e1549419b60481327fa9bbd5f..57b0218882d818db8d21bb60f93a14235a944bbe 100644
--- a/tensorflow/compiler/xla/reference_util.h
+++ b/tensorflow/compiler/xla/reference_util.h
@@ -39,10 +39,22 @@ namespace xla {
 class ReferenceUtil {
  public:
   // Returns the result of a transpose operation on the input matrix.
-  static std::unique_ptr<Array2D<float>> TransposeArray2D(
-      const Array2D<float>& operand);
+  template <typename T>
+  static std::unique_ptr<Array2D<T>> TransposeArray2D(
+      const Array2D<T>& operand) {
+    auto result = MakeUnique<Array2D<T>>(operand.width(), operand.height());
+    for (int64 w = 0; w < operand.width(); ++w) {
+      for (int64 h = 0; h < operand.height(); ++h) {
+        (*result)(w, h) = operand(h, w);
+      }
+    }
+
+    return result;
+  }
 
   // Returns the result of a matrix multiply `lhs x rhs`.
+  static std::unique_ptr<Array2D<Eigen::half>> MatmulArray2D(
+      const Array2D<Eigen::half>& lhs, const Array2D<Eigen::half>& rhs);
   static std::unique_ptr<Array2D<float>> MatmulArray2D(
       const Array2D<float>& lhs, const Array2D<float>& rhs);
   static std::unique_ptr<Array2D<double>> MatmulArray2D(
diff --git a/tensorflow/compiler/xla/service/BUILD b/tensorflow/compiler/xla/service/BUILD
index 83c67ed9368bc617a90c528f200b566ee8754edd..d71790fb2d188c2100d317cd6bfdcd3be26dfea4 100644
--- a/tensorflow/compiler/xla/service/BUILD
+++ b/tensorflow/compiler/xla/service/BUILD
@@ -118,6 +118,42 @@ tf_cc_test(
     ],
 )
 
+cc_library(
+    name = "bfloat16_propagation",
+    srcs = ["bfloat16_propagation.cc"],
+    hdrs = ["bfloat16_propagation.h"],
+    deps = [
+        ":bfloat16_support",
+        ":hlo",
+        ":hlo_dataflow_analysis",
+        ":hlo_dce",
+        ":hlo_pass",
+        ":tuple_simplifier",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:shape_tree",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/core:lib",
+    ],
+)
+
+tf_cc_test(
+    name = "bfloat16_propagation_test",
+    srcs = ["bfloat16_propagation_test.cc"],
+    deps = [
+        ":bfloat16_propagation",
+        ":bfloat16_support",
+        ":hlo",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla:test_helpers",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/tests:hlo_test_base",
+        "//tensorflow/compiler/xla/tests:literal_test_util",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",  # fixdeps: keep
+    ],
+)
+
 cc_library(
     name = "shape_inference",
     srcs = ["shape_inference.cc"],
@@ -145,7 +181,8 @@ tf_cc_test(
         "//tensorflow/compiler/xla:test_helpers",
         "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
-        "//tensorflow/compiler/xla/tests:xla_internal_test_main",
+        "//tensorflow/compiler/xla/tests:xla_internal_test_main",  # fixdeps: keep
+        "//tensorflow/core:lib",
     ],
 )
 
@@ -718,6 +755,7 @@ cc_library(
     hdrs = ["llvm_compiler.h"],
     deps = [
         ":compiler",
+        "//tensorflow/core:lib_internal",
         "@llvm//:core",
     ],
 )
@@ -1177,6 +1215,41 @@ tf_cc_test(
     ],
 )
 
+cc_library(
+    name = "conditional_simplifier",
+    srcs = ["conditional_simplifier.cc"],
+    hdrs = ["conditional_simplifier.h"],
+    deps = [
+        ":call_inliner",
+        ":hlo",
+        ":hlo_pass",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:status_macros",
+        "//tensorflow/compiler/xla:statusor",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:util",
+        "//tensorflow/core:lib",
+    ],
+)
+
+tf_cc_test(
+    name = "conditional_simplifier_test",
+    srcs = ["conditional_simplifier_test.cc"],
+    deps = [
+        ":conditional_simplifier",
+        ":hlo",
+        ":hlo_matchers",
+        "//tensorflow/compiler/xla:literal_util",
+        "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:test",
+        "//tensorflow/compiler/xla:types",
+        "//tensorflow/compiler/xla:xla_data_proto",
+        "//tensorflow/compiler/xla/tests:hlo_verified_test_base",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+    ],
+)
+
 cc_library(
     name = "while_loop_simplifier",
     srcs = ["while_loop_simplifier.cc"],
diff --git a/tensorflow/compiler/xla/service/algebraic_simplifier.cc b/tensorflow/compiler/xla/service/algebraic_simplifier.cc
index fb857559f972a220a19b108baa4c441e09b90e1f..ecaa474336850c0ea3d2636826a7c62ecc5fe17e 100644
--- a/tensorflow/compiler/xla/service/algebraic_simplifier.cc
+++ b/tensorflow/compiler/xla/service/algebraic_simplifier.cc
@@ -122,6 +122,8 @@ class AlgebraicSimplifierVisitor : public DfsHloVisitorWithDefault {
 
   Status HandleBitcast(HloInstruction* bitcast) override;
 
+  Status HandleBitcastConvert(HloInstruction* bitcast) override;
+
   Status HandleBroadcast(HloInstruction* broadcast) override;
 
   Status HandleConcatenate(HloInstruction* concatenate) override;
@@ -411,6 +413,13 @@ Status AlgebraicSimplifierVisitor::HandleBitcast(HloInstruction* bitcast) {
   return Status::OK();
 }
 
+Status AlgebraicSimplifierVisitor::HandleBitcastConvert(
+    HloInstruction* bitcast) {
+  // Eliminate bitcast converts between same shape.
+  ReplaceInstructionIfSameShape(bitcast, bitcast->mutable_operand(0));
+  return Status::OK();
+}
+
 Status AlgebraicSimplifierVisitor::HandleCopy(HloInstruction* copy) {
   // If a copy feeds a copy, make it a single copy.
   if (copy->operand(0)->opcode() == HloOpcode::kCopy) {
@@ -516,6 +525,18 @@ Status AlgebraicSimplifierVisitor::HandleConstant(HloInstruction* constant) {
     return ReplaceInstruction(
         constant, BuildTupleConstant(computation_, constant->literal()));
   }
+
+  // If a literal is all the same element, replace it with a scalar broadcast.
+  if (ShapeUtil::ElementsIn(constant->shape()) > 1 &&
+      constant->literal().IsAllFirst()) {
+    std::unique_ptr<Literal> unique_scalar =
+        MakeUnique<Literal>(constant->literal().GetFirstScalarLiteral());
+    HloInstruction* scalar = computation_->AddInstruction(
+        HloInstruction::CreateConstant(std::move(unique_scalar)));
+    return ReplaceWithNewInstruction(
+        constant,
+        HloInstruction::CreateBroadcast(constant->shape(), scalar, {}));
+  }
   return Status::OK();
 }
 
@@ -1604,6 +1625,14 @@ Status AlgebraicSimplifierVisitor::HandleDynamicUpdateSlice(
   if (IsAll(start_indices, 0) && SameShape(dynamic_update_slice, update)) {
     return ReplaceInstruction(dynamic_update_slice, update);
   }
+
+  // If any dimension of update is 0, elide the DynamicUpdateSlice.  This
+  // optimization becomes invalid should we later prefer to warn about out of
+  // bound indices.
+  if (ShapeUtil::HasZeroElements(update->shape())) {
+    return ReplaceInstruction(dynamic_update_slice,
+                              dynamic_update_slice->mutable_operand(0));
+  }
   return Status::OK();
 }
 
diff --git a/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc b/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
index 0f08eb3a3267c4b7b04958270a5788fc48d3fa04..451294ef5d8367686d7fc22b7f5ebfde89d14d42 100644
--- a/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
+++ b/tensorflow/compiler/xla/service/algebraic_simplifier_test.cc
@@ -162,6 +162,37 @@ TEST_F(AlgebraicSimplifierTest, AddBroadcastZeroR1Operand) {
   EXPECT_EQ(root, param0);
 }
 
+TEST_F(AlgebraicSimplifierTest, ConstantToBroadcast) {
+  HloComputation::Builder builder(TestName());
+  builder.AddInstruction(HloInstruction::CreateConstant(
+      Literal::CreateR1<float>({3.14f, 3.14f, 3.14f})));
+
+  auto computation = module().AddEntryComputation(builder.Build());
+  HloInstruction* root = computation->root_instruction();
+  EXPECT_THAT(root, op::Constant());
+  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
+                                 non_bitcasting_callback());
+  ASSERT_TRUE(simplifier.Run(&module()).ValueOrDie());
+  root = computation->root_instruction();
+  EXPECT_THAT(root, op::Broadcast(op::Constant()));
+  EXPECT_EQ(3.14f, root->operand(0)->literal().GetFirstElement<float>());
+}
+
+TEST_F(AlgebraicSimplifierTest, ConstantNotToBroadcast) {
+  HloComputation::Builder builder(TestName());
+  builder.AddInstruction(HloInstruction::CreateConstant(
+      Literal::CreateR1<float>({3.14, 3.14, 4})));
+
+  auto computation = module().AddEntryComputation(builder.Build());
+  HloInstruction* root = computation->root_instruction();
+  EXPECT_THAT(root, op::Constant());
+  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
+                                 non_bitcasting_callback());
+  ASSERT_FALSE(simplifier.Run(&module()).ValueOrDie());
+  root = computation->root_instruction();
+  EXPECT_THAT(root, op::Constant());
+}
+
 // Test that A - 0 is simplified to A
 TEST_F(AlgebraicSimplifierTest, SubZero) {
   Shape r0f32 = ShapeUtil::MakeShape(F32, {});
@@ -2769,6 +2800,29 @@ DotOfConcatTestSpec kDotOfConcatTestSpecs[] = {
     {/*m=*/1, /*k=*/16, /*n=*/1},   //
 };
 
+// Test that DynamicUpdateSlice update param with any dimension equal to zero
+// gets removed.
+TEST_F(AlgebraicSimplifierTest, DynamicUpdateSliceZeroUpdate) {
+  HloComputation::Builder builder(TestName());
+  const Shape dslice_shape = ShapeUtil::MakeShape(F32, {10});
+  HloInstruction* const operand = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, dslice_shape, "operand"));
+  const Shape update_shape = ShapeUtil::MakeShape(F32, {0});
+  HloInstruction* const update = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, update_shape, "update"));
+  HloInstruction* const start_indices = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR1<int>({0})));
+  builder.AddInstruction(HloInstruction::CreateDynamicUpdateSlice(
+      dslice_shape, operand, update, start_indices));
+  const HloComputation* const computation =
+      module().AddEntryComputation(builder.Build());
+
+  AlgebraicSimplifier simplifier(/*is_layout_sensitive=*/false,
+                                 non_bitcasting_callback());
+  ASSERT_TRUE(simplifier.Run(&module()).ValueOrDie());
+  EXPECT_THAT(computation->root_instruction(), operand);
+}
+
 INSTANTIATE_TEST_CASE_P(DotOfConcatSimplificationTestInstantiation,
                         DotOfConcatSimplificationTest,
                         ::testing::ValuesIn(kDotOfConcatTestSpecs));
diff --git a/tensorflow/compiler/xla/service/allocation_tracker.cc b/tensorflow/compiler/xla/service/allocation_tracker.cc
index 4e80679c11dfdf7fdf8077a9f354139a4cab6803..7a75c025315f4c3473af94cb297348c9532f300b 100644
--- a/tensorflow/compiler/xla/service/allocation_tracker.cc
+++ b/tensorflow/compiler/xla/service/allocation_tracker.cc
@@ -109,7 +109,7 @@ StatusOr<std::vector<GlobalDataHandle>> AllocationTracker::DeconstructTuple(
   TF_RET_CHECK(ShapeUtil::IsTuple(shaped_buffer->on_device_shape()));
 
   if (ShapeUtil::IsNestedTuple(shaped_buffer->on_device_shape())) {
-    return Unimplemented("deconstructing nested tuples not yet supported");
+    return Unimplemented("Deconstructing nested tuples is not implemented.");
   }
 
   std::vector<GlobalDataHandle> element_handles;
diff --git a/tensorflow/compiler/xla/service/batchnorm_expander.cc b/tensorflow/compiler/xla/service/batchnorm_expander.cc
index 27ddfd47aa3096afd3e245af1ac3cedd9b48ce4a..84c9db32932becd9b701929b392efa4998d03067 100644
--- a/tensorflow/compiler/xla/service/batchnorm_expander.cc
+++ b/tensorflow/compiler/xla/service/batchnorm_expander.cc
@@ -153,6 +153,7 @@ Status BatchNormExpanderVisitor::HandleBatchNormTraining(
   std::vector<HloInstruction*> added_instructions;
   auto add = [&](std::unique_ptr<HloInstruction> inst) {
     HloInstruction* added_inst = computation_->AddInstruction(std::move(inst));
+    added_inst->set_metadata(batch_norm->metadata());
     added_instructions.push_back(added_inst);
     return added_inst;
   };
@@ -334,6 +335,7 @@ Status BatchNormExpanderVisitor::HandleBatchNormInference(
   std::vector<HloInstruction*> added_instructions;
   auto add = [&](std::unique_ptr<HloInstruction> inst) {
     HloInstruction* added_inst = computation_->AddInstruction(std::move(inst));
+    added_inst->set_metadata(batch_norm->metadata());
     added_instructions.push_back(added_inst);
     return added_inst;
   };
@@ -419,6 +421,7 @@ Status BatchNormExpanderVisitor::HandleBatchNormGrad(
   std::vector<HloInstruction*> added_instructions;
   auto add = [&](std::unique_ptr<HloInstruction> inst) {
     HloInstruction* added_inst = computation_->AddInstruction(std::move(inst));
+    added_inst->set_metadata(batch_norm->metadata());
     added_instructions.push_back(added_inst);
     return added_inst;
   };
diff --git a/tensorflow/compiler/xla/service/bfloat16_normalization.cc b/tensorflow/compiler/xla/service/bfloat16_normalization.cc
index b032c040e8aff49f9e0fc1ff9a1c1e79ea4bb77f..6176f5d20958453a2edfe464a46c3fa6e5d54add 100644
--- a/tensorflow/compiler/xla/service/bfloat16_normalization.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_normalization.cc
@@ -221,41 +221,37 @@ Status BFloat16NormalizationVisitor::HandleCrossReplicaSum(
 }
 
 Status BFloat16NormalizationVisitor::HandleInstruction(HloInstruction* hlo) {
-  std::vector<int64> bf16_operands;
-  std::vector<int64> f32_operands;
-  bool has_f32 = false;
-  bool has_bf16 = false;
+  int f32_count = 0;
+  int bf16_count = 0;
 
   for (int64 i = 0; i < hlo->operand_count(); ++i) {
     if (hlo->operand(i)->shape().element_type() == F32) {
-      f32_operands.push_back(i);
-      has_f32 = true;
+      f32_count += 1;
     } else if (hlo->operand(i)->shape().element_type() == BF16) {
-      bf16_operands.push_back(i);
-      has_bf16 = true;
+      bf16_count += 1;
     }
   }
 
   if (hlo->shape().element_type() == F32) {
-    has_f32 = true;
+    f32_count += 1;
   } else if (hlo->shape().element_type() == BF16) {
-    has_bf16 = true;
+    bf16_count += 1;
   }
 
   std::vector<HloComputation*> bf16_called_comps;
   for (auto* comp : hlo->called_computations()) {
     bool comp_has_bf16 = false;
     if (comp->root_instruction()->shape().element_type() == F32) {
-      has_f32 = true;
+      f32_count += 1;
     } else if (comp->root_instruction()->shape().element_type() == BF16) {
-      has_bf16 = true;
+      bf16_count += 1;
       comp_has_bf16 = true;
     }
     for (auto* param : comp->parameter_instructions()) {
       if (param->shape().element_type() == F32) {
-        has_f32 = true;
+        f32_count += 1;
       } else if (param->shape().element_type() == BF16) {
-        has_bf16 = true;
+        bf16_count += 1;
         comp_has_bf16 = true;
       }
     }
@@ -264,54 +260,69 @@ Status BFloat16NormalizationVisitor::HandleInstruction(HloInstruction* hlo) {
     }
   }
 
-  if (!bfloat16_support_->SupportsMixedPrecisions(*hlo) && has_bf16 &&
-      has_f32) {
-    // Resolve unsupported mixed precision.
-    //
-    // See if we can change everything to BF16.
-    if (hlo->called_computations().empty() &&
-        hlo->shape().element_type() == BF16) {
-      bool can_use_bf16 = true;
-      for (int i : f32_operands) {
-        if (bfloat16_support_->EffectiveOperandPrecisionIsOutputPrecision(*hlo,
-                                                                          i) &&
-            bfloat16_support_->SupportsBF16Operand(*hlo, i)) {
-          continue;
-        }
-        can_use_bf16 = false;
-        break;
-      }
-      if (can_use_bf16) {
-        for (int i : f32_operands) {
-          TF_RETURN_IF_ERROR(
-              InsertConvertBeforeOperand(hlo, i, BF16, computation_));
-        }
-        return Status::OK();
-      }
-    }
-    if (hlo->shape().element_type() == BF16) {
-      TF_RETURN_IF_ERROR(
-          ChangeOutputTypeThenInsertConvertBack(hlo, F32, computation_));
-    }
-    for (int i : bf16_operands) {
-      TF_RETURN_IF_ERROR(InsertConvertBeforeOperand(hlo, i, F32, computation_));
-    }
-    return ConvertCalledComputations(hlo, bf16_called_comps);
-  }
-
-  for (int i : bf16_operands) {
-    if (!bfloat16_support_->SupportsBF16Operand(*hlo, i)) {
+  // Resolve unsupported BF16 operands.
+  for (int i = 0; i < hlo->operand_count(); ++i) {
+    if (hlo->operand(i)->shape().element_type() == BF16 &&
+        !bfloat16_support_->SupportsBF16Operand(*hlo, i)) {
       TF_RETURN_IF_ERROR(InsertConvertBeforeOperand(hlo, i, F32, computation_));
+      bf16_count -= 1;
+      f32_count += 1;
     }
   }
 
+  // Resolve unsupported BF16 output.
   if (hlo->shape().element_type() == BF16 &&
       !bfloat16_support_->SupportsBF16Output(*hlo)) {
     TF_RETURN_IF_ERROR(
         ChangeOutputTypeThenInsertConvertBack(hlo, F32, computation_));
+    bf16_count -= 1;
+    f32_count += 1;
   }
 
-  return Status::OK();
+  // Resolve unsupported mixed precision after resolving unsupported BF16
+  // operands and output, because the numbers of BF16 operands/output and F32
+  // operands/output may have changed.
+  if (bfloat16_support_->SupportsMixedPrecisions(*hlo) || bf16_count == 0 ||
+      f32_count == 0) {
+    return Status::OK();
+  }
+  // See if we can change everything to BF16.
+  if (hlo->called_computations().empty() &&
+      hlo->shape().element_type() == BF16) {
+    bool can_use_bf16 = true;
+    for (int i = 0; i < hlo->operand_count(); ++i) {
+      if (hlo->operand(i)->shape().element_type() == BF16) {
+        continue;
+      }
+      if ((bfloat16_support_->EffectiveOperandPrecisionIsBF16(*hlo, i) ||
+           bfloat16_support_->EffectiveOperandPrecisionIsOutputPrecision(*hlo,
+                                                                         i)) &&
+          bfloat16_support_->SupportsBF16Operand(*hlo, i)) {
+        continue;
+      }
+      can_use_bf16 = false;
+      break;
+    }
+    if (can_use_bf16) {
+      for (int i = 0; i < hlo->operand_count(); ++i) {
+        if (hlo->operand(i)->shape().element_type() == F32) {
+          TF_RETURN_IF_ERROR(
+              InsertConvertBeforeOperand(hlo, i, BF16, computation_));
+        }
+      }
+      return Status::OK();
+    }
+  }
+  if (hlo->shape().element_type() == BF16) {
+    TF_RETURN_IF_ERROR(
+        ChangeOutputTypeThenInsertConvertBack(hlo, F32, computation_));
+  }
+  for (int i = 0; i < hlo->operand_count(); ++i) {
+    if (hlo->operand(i)->shape().element_type() == BF16) {
+      TF_RETURN_IF_ERROR(InsertConvertBeforeOperand(hlo, i, F32, computation_));
+    }
+  }
+  return ConvertCalledComputations(hlo, bf16_called_comps);
 }
 
 Status BFloat16NormalizationVisitor::DefaultAction(HloInstruction* hlo) {
diff --git a/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc b/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
index 66c3085842c4afe7ffc4d5891883e4cce9389d45..fc0f6f1948d835d18f86658d2b25d387bf5d5354 100644
--- a/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_normalization_test.cc
@@ -41,13 +41,17 @@ class TestBFloat16Support : public BFloat16Support {
         hlo.opcode() == HloOpcode::kGetTupleElement) {
       return true;
     }
+    if (hlo.opcode() == HloOpcode::kDot) {
+      // Test that only the first operand of kDot supports BF16.
+      return operand_index == 0;
+    }
     return false;
   }
 
   bool SupportsBF16Output(const HloInstruction& hlo) const override {
     if (hlo.opcode() == HloOpcode::kAdd || hlo.opcode() == HloOpcode::kReduce ||
         hlo.opcode() == HloOpcode::kSubtract ||
-        hlo.opcode() == HloOpcode::kTuple ||
+        hlo.opcode() == HloOpcode::kDot || hlo.opcode() == HloOpcode::kTuple ||
         hlo.opcode() == HloOpcode::kGetTupleElement) {
       return true;
     }
@@ -245,4 +249,31 @@ TEST_F(BFloat16NormalizationTest, ResolveMixedPrecisionTupleCrossReplicaSum) {
   EXPECT_EQ(ShapeUtil::GetSubshape(crs->shape(), {1}).element_type(), F32);
 }
 
+// Tests that the normalization should not cause unsupported mixed precision due
+// to resolving unsupported BF16 operand.
+TEST_F(BFloat16NormalizationTest, DoNotAddUnsupportedMixedPrecision) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape bf16_shape = ShapeUtil::MakeShape(BF16, {4, 4});
+
+  HloInstruction* a = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, bf16_shape, "a"));
+  HloInstruction* b = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, bf16_shape, "b"));
+
+  HloInstruction* dot = builder.AddInstruction(
+      HloInstruction::CreateBinary(bf16_shape, HloOpcode::kDot, a, b));
+
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(Normalize(module.get()));
+
+  EXPECT_EQ(computation->root_instruction()->opcode(), HloOpcode::kConvert);
+  EXPECT_EQ(dot->shape().element_type(), F32);
+  EXPECT_EQ(dot->operand(0)->shape().element_type(), F32);
+  EXPECT_EQ(dot->operand(0)->opcode(), HloOpcode::kConvert);
+  EXPECT_EQ(dot->operand(1)->shape().element_type(), F32);
+  EXPECT_EQ(dot->operand(1)->opcode(), HloOpcode::kConvert);
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/bfloat16_propagation.cc b/tensorflow/compiler/xla/service/bfloat16_propagation.cc
new file mode 100644
index 0000000000000000000000000000000000000000..7708504dc998f6da35ac5b180cb043d1e83d808a
--- /dev/null
+++ b/tensorflow/compiler/xla/service/bfloat16_propagation.cc
@@ -0,0 +1,679 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/bfloat16_propagation.h"
+
+#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/map_util.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_dce.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/tuple_simplifier.h"
+#include "tensorflow/compiler/xla/shape_tree.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/core/lib/gtl/cleanup.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace xla {
+
+BFloat16Propagation::BFloat16Propagation(
+    const BFloat16Support* bfloat16_support)
+    : bfloat16_support_(bfloat16_support) {}
+
+void BFloat16Propagation::DetermineAndMutateFusionComputationPrecision(
+    HloInstruction* fusion) {
+  CHECK_EQ(fusion->opcode(), HloOpcode::kFusion);
+  if (!bfloat16_support_->SupportsMixedPrecisions(*fusion)) {
+    return;
+  }
+
+  // We are depending on the fusion node itself having already been analyzed
+  // for whether it can output BF16 and this has been adjusted in the output
+  // shape, and now we're looking to update the interior of the fusion node to
+  // match the new output shape, as well as recursively process the whole fusion
+  // node even if the output shape was not modified.
+  auto root = fusion->fused_instructions_computation()->root_instruction();
+
+  // Adjust root's element types according to the fusion's output shape.
+  ShapeUtil::ForEachMutableSubshape(
+      root->mutable_shape(), [&](Shape* subshape, const ShapeIndex& index) {
+        if (subshape->element_type() != F32) {
+          return;
+        }
+        if (ShapeUtil::GetSubshape(fusion->shape(), index).element_type() ==
+            BF16) {
+          subshape->set_element_type(BF16);
+          changed_ = true;
+          VLOG(2) << "Fused root " << root->ToString() << " at shape index "
+                  << index << " changed to BF16 precision for fusion "
+                  << fusion->ToString();
+        }
+      });
+
+  // Propagate BF16 in the fusion computation.
+  auto insts =
+      fusion->fused_instructions_computation()->MakeInstructionPostOrder();
+  for (auto inst_it = insts.rbegin(); inst_it != insts.rend(); ++inst_it) {
+    DetermineAndMutateInstructionPrecision(*inst_it, /*skip_parameters=*/false);
+  }
+  computations_visited_in_mutation_pass_.insert(
+      fusion->fused_instructions_computation());
+}
+
+void BFloat16Propagation::DetermineAndMutateWhileComputationsPrecision(
+    HloInstruction* while_hlo) {
+  CHECK_EQ(while_hlo->opcode(), HloOpcode::kWhile);
+
+  // We are depending on the while node itself having already been analyzed for
+  // whether it can output BF16 and this has been adjusted in the output shape,
+  // and now we're looking to update the body and condition computations to
+  // match the new output shape, as well as recursively process the whole while
+  // node even if the output shape was not modified.
+  HloComputation* body = while_hlo->while_body();
+  auto body_root = body->root_instruction();
+  HloComputation* condition = while_hlo->while_condition();
+
+  ShapeUtil::ForEachMutableSubshape(
+      body_root->mutable_shape(),
+      [this, while_hlo, body_root](Shape* subshape, const ShapeIndex& index) {
+        if (subshape->element_type() != F32) {
+          return;
+        }
+        if (ShapeUtil::GetSubshape(while_hlo->shape(), index).element_type() ==
+            BF16) {
+          subshape->set_element_type(BF16);
+          changed_ = true;
+          VLOG(2) << "While body root " << body_root->ToString()
+                  << " at shape index " << index
+                  << " changed to BF16 precision for while "
+                  << while_hlo->ToString();
+        }
+      });
+
+  auto body_insts = body->MakeInstructionPostOrder();
+  for (auto inst_it = body_insts.rbegin(); inst_it != body_insts.rend();
+       ++inst_it) {
+    DetermineAndMutateInstructionPrecision(*inst_it, /*skip_parameters=*/false);
+  }
+  computations_visited_in_mutation_pass_.insert(body);
+
+  auto condition_insts = condition->MakeInstructionPostOrder();
+  for (auto inst_it = condition_insts.rbegin();
+       inst_it != condition_insts.rend(); ++inst_it) {
+    DetermineAndMutateInstructionPrecision(*inst_it, /*skip_parameters=*/false);
+  }
+  computations_visited_in_mutation_pass_.insert(condition);
+}
+
+bool BFloat16Propagation::AllUsersConsumeBF16(const HloInstruction& hlo,
+                                              const ShapeIndex& index) const {
+  auto value_set = dataflow_->GetValueSet(&hlo, index);
+  for (const HloValue* value : value_set.values()) {
+    if (ContainsKey(values_that_must_be_kept_as_f32_, value)) {
+      return false;
+    }
+    if (value->shape().element_type() == BF16) {
+      continue;
+    }
+    for (const HloUse& use : value->uses()) {
+      if (!ContainsKey(instructions_visited_in_mutation_pass_,
+                       use.instruction)) {
+        // We don't know yet whether use.instruction will consume BF16 since it
+        // hasn't been visited. Although we visit instructions in reverse
+        // topological order, this is still possible because there may be
+        // unvisited instructions that alias the same buffer. In this case, we
+        // aggressively skip this use, and if this causes inconsistency (e.g.,
+        // one use is in BF16 but another use is in F32), it will be resolved at
+        // the end of the BFloat16Propagation pass.
+        continue;
+      }
+      // Any visited user that can accept BF16 has already been updated if
+      // necessary, e.g., the output has been changed to BF16 if it propagates
+      // precision, or a called computation's parameters have been changed to
+      // BF16 for fusions or whiles.
+      if (use.instruction->opcode() == HloOpcode::kFusion) {
+        const auto* fused_parameter =
+            use.instruction->fused_parameter(use.operand_number);
+        if (ShapeUtil::GetSubshape(fused_parameter->shape(), use.operand_index)
+                .element_type() != BF16) {
+          return false;
+        }
+        continue;
+      } else if (use.instruction->opcode() == HloOpcode::kWhile) {
+        const auto* cond_parameter =
+            use.instruction->while_condition()->parameter_instruction(
+                use.operand_number);
+        if (ShapeUtil::GetSubshape(cond_parameter->shape(), use.operand_index)
+                .element_type() != BF16) {
+          return false;
+        }
+        const auto* body_parameter =
+            use.instruction->while_body()->parameter_instruction(
+                use.operand_number);
+        if (ShapeUtil::GetSubshape(body_parameter->shape(), use.operand_index)
+                .element_type() != BF16) {
+          return false;
+        }
+        continue;
+      }
+      if (bfloat16_support_->EffectiveOperandPrecisionIsBF16(
+              *use.instruction, use.operand_number)) {
+        continue;
+      }
+      // If the op propagates precision and it outputs a BF16, then it's OK to
+      // supply BF16 also as the input. In the backward mutation pass, the
+      // users' shapes should have already been processed.
+      PrimitiveType user_output_type = PRIMITIVE_TYPE_INVALID;
+      if (use.instruction->opcode() == HloOpcode::kTuple ||
+          (use.instruction->opcode() == HloOpcode::kCrossReplicaSum &&
+           ShapeUtil::IsTuple(use.instruction->shape()))) {
+        user_output_type = ShapeUtil::GetSubshape(
+                               ShapeUtil::GetSubshape(use.instruction->shape(),
+                                                      {use.operand_number}),
+                               use.operand_index)
+                               .element_type();
+      } else {
+        user_output_type = use.instruction->shape().element_type();
+      }
+      if (bfloat16_support_->EffectiveOperandPrecisionIsOutputPrecision(
+              *use.instruction, use.operand_number) &&
+          user_output_type == BF16) {
+        continue;
+      }
+      return false;
+    }
+  }
+  return true;
+}
+
+void BFloat16Propagation::DetermineAndMutateInstructionPrecision(
+    HloInstruction* hlo, bool skip_parameters) {
+  // We handle any fusion computation or while body/condition after the
+  // instruction is handled, because we need to know the output shape of a
+  // fusion or while before propagating inside its computations.
+  bool postpone_processing_called_computations = false;
+  auto cleaner = tensorflow::gtl::MakeCleanup(
+      [this, hlo, &postpone_processing_called_computations] {
+        if (!postpone_processing_called_computations) {
+          if (hlo->opcode() == HloOpcode::kFusion) {
+            DetermineAndMutateFusionComputationPrecision(hlo);
+          } else if (hlo->opcode() == HloOpcode::kWhile) {
+            DetermineAndMutateWhileComputationsPrecision(hlo);
+          }
+        }
+        instructions_visited_in_mutation_pass_.insert(hlo);
+      });
+
+  if (hlo->opcode() == HloOpcode::kWhile &&
+      (caller_counts_[hlo->while_condition()] > 1 ||
+       caller_counts_[hlo->while_body()] > 1)) {
+    postpone_processing_called_computations = true;
+    return;
+  }
+
+  // Do not change precision for instructions related to entry and exit of a
+  // computation, and control flow, because this pass might break the interfaces
+  // or assumptions for them.
+  if (hlo->opcode() == HloOpcode::kInfeed ||       //
+      hlo->opcode() == HloOpcode::kOutfeed ||      //
+      hlo->opcode() == HloOpcode::kCustomCall ||   //
+      hlo->opcode() == HloOpcode::kCall ||         //
+      hlo->opcode() == HloOpcode::kConditional ||  //
+      (hlo->opcode() == HloOpcode::kParameter && skip_parameters)) {
+    return;
+  }
+
+  // Prevent root instructions from having their output modified by recording
+  // all F32 output values as needing to stay as F32.
+  CHECK(hlo->parent() != nullptr);
+  if (hlo == hlo->parent()->root_instruction()) {
+    if (!hlo->parent()->IsFusionComputation()) {
+      ShapeUtil::ForEachSubshape(hlo->shape(), [&](const Shape& subshape,
+                                                   const ShapeIndex& index) {
+        if (subshape.element_type() != F32) {
+          return;
+        }
+        for (const auto* value : dataflow_->GetValueSet(hlo, index).values()) {
+          // Since we use HloValues from the dataflow analysis, this can also
+          // affect HLO instructions beyond the root, e.g., if the root is a
+          // Tuple HLO, then its operands are also affected.
+          values_that_must_be_kept_as_f32_.insert(value);
+        }
+      });
+    }
+    return;
+  }
+
+  if (!ContainsKey(consider_using_bfloat16_, hlo)) {
+    return;
+  }
+
+  if (!bfloat16_support_->SupportsBF16Output(*hlo)) {
+    return;
+  }
+
+  ShapeUtil::ForEachMutableSubshape(
+      hlo->mutable_shape(),
+      [hlo, this](Shape* subshape, const ShapeIndex& index) {
+        if (subshape->element_type() == F32 &&
+            AllUsersConsumeBF16(*hlo, index)) {
+          subshape->set_element_type(BF16);
+          changed_ = true;
+          VLOG(2) << "HloInstruction output at shape index " << index
+                  << " changed to BF16 precision: " << hlo->ToString();
+        }
+      });
+}
+
+bool BFloat16Propagation::InstructionIsCandidateForBF16Output(
+    HloInstruction* hlo) {
+  if (!bfloat16_support_->SupportsMixedPrecisions(*hlo) &&
+      hlo->opcode() != HloOpcode::kTuple &&
+      hlo->opcode() != HloOpcode::kGetTupleElement &&
+      hlo->shape().element_type() != BF16) {
+    for (int64 i = 0; i < hlo->operand_count(); ++i) {
+      if (!bfloat16_support_->EffectiveOperandPrecisionIsOutputPrecision(*hlo,
+                                                                         i) ||
+          !ContainsKey(consider_using_bfloat16_, hlo->operand(i))) {
+        return false;
+      }
+    }
+  }
+  return true;
+}
+
+void BFloat16Propagation::AdjustCalledComputationParameters(
+    HloInstruction* hlo) {
+  auto adjust_computation =
+      [this, hlo](HloComputation* computation,
+                  tensorflow::gtl::ArraySlice<HloInstruction*> operands) {
+        // Adjust parameters.
+        CHECK_EQ(operands.size(), computation->num_parameters());
+        for (int64 i = 0; i < operands.size(); ++i) {
+          auto parameter = computation->parameter_instruction(i);
+          ShapeUtil::ForEachMutableSubshape(
+              parameter->mutable_shape(),
+              [this, i, hlo, &operands, parameter](Shape* subshape,
+                                                   const ShapeIndex& index) {
+                if (!ShapeUtil::IsLeafIndex(parameter->shape(), index)) {
+                  return;
+                }
+                PrimitiveType operand_type =
+                    ShapeUtil::GetSubshape(operands[i]->shape(), index)
+                        .element_type();
+                if (subshape->element_type() == operand_type) {
+                  return;
+                }
+                CHECK(operand_type == F32 || operand_type == BF16);
+                subshape->set_element_type(operand_type);
+                changed_ = true;
+                VLOG(2) << "Called computation parameter "
+                        << parameter->ToString() << " at shape index " << index
+                        << " adjusted to match operand in HLO "
+                        << hlo->ToString();
+              });
+        }
+      };
+
+  switch (hlo->opcode()) {
+    case HloOpcode::kFusion:
+      adjust_computation(hlo->fused_instructions_computation(),
+                         hlo->operands());
+      break;
+    case HloOpcode::kWhile:
+      adjust_computation(hlo->while_condition(), hlo->operands());
+      adjust_computation(hlo->while_body(), hlo->operands());
+      break;
+    default:
+      break;
+  }
+}
+
+void BFloat16Propagation::AdjustCalledComputationRoot(HloInstruction* hlo) {
+  auto adjust_computation = [this, hlo](HloComputation* computation,
+                                        const Shape& output_shape) {
+    // Adjust root.
+    HloInstruction* root = computation->root_instruction();
+    ShapeUtil::ForEachMutableSubshape(
+        root->mutable_shape(), [this, hlo, root, &output_shape](
+                                   Shape* subshape, const ShapeIndex& index) {
+          if (!ShapeUtil::IsLeafIndex(hlo->shape(), index)) {
+            return;
+          }
+          const PrimitiveType output_type =
+              ShapeUtil::GetSubshape(output_shape, index).element_type();
+          if (subshape->element_type() == output_type) {
+            return;
+          }
+          CHECK(output_type == F32 || output_type == BF16);
+          subshape->set_element_type(output_type);
+          // It's possible that output_type is F32, but the root instruction's
+          // type is BF16; e.g., a fusion node's output was changed to BF16
+          // initially but then adjusted back to F32, and the fusion computation
+          // is now being adjusted after the fusion node.
+          if (output_type == F32) {
+            for (const auto* value :
+                 dataflow_->GetValueSet(root, index).values()) {
+              // We rely on the fact that this adjustment works in reverse
+              // topological order so that called computation will be
+              // processed later. Adding the value to
+              // values_that_must_be_kept_as_f32_ will ensure the
+              // correctness of the adjustment for HLOs that will be
+              // processed later.
+              values_that_must_be_kept_as_f32_.insert(value);
+            }
+          }
+          changed_ = true;
+          VLOG(2) << "Called computation root " << root->ToString()
+                  << " at shape index " << index
+                  << " adjusted to match output shape of " << hlo->ToString();
+        });
+  };
+
+  switch (hlo->opcode()) {
+    case HloOpcode::kFusion:
+      adjust_computation(hlo->fused_instructions_computation(), hlo->shape());
+      break;
+    case HloOpcode::kWhile:
+      adjust_computation(hlo->while_body(), hlo->shape());
+      break;
+    default:
+      break;
+  }
+}
+
+bool BFloat16Propagation::ResolveInconsistencyOfAliasingBuffersHelper(
+    HloComputation* computation,
+    tensorflow::gtl::FlatSet<const HloComputation*>* visited_computations) {
+  bool parameter_changed = false;
+  auto insts = computation->MakeInstructionPostOrder();
+  // Do the adjustment on each instruction in the computation in reverse
+  // topological order.
+  for (auto inst_it = insts.rbegin(); inst_it != insts.rend(); ++inst_it) {
+    auto hlo = *inst_it;
+    auto adjust_hlo_output = [this, hlo, &parameter_changed](
+                                 Shape* subshape, const ShapeIndex& index) {
+      if (subshape->element_type() != F32 && subshape->element_type() != BF16) {
+        return;
+      }
+      PrimitiveType type = BF16;
+      for (const auto* value : dataflow_->GetValueSet(hlo, index).values()) {
+        if (value->shape().element_type() == BF16) {
+          continue;
+        }
+        CHECK_EQ(value->shape().element_type(), F32);
+        type = F32;
+        break;
+      }
+      // It's possible that a user has been changed from BF16 to F32
+      // during this final adjustment pass, so we need to check
+      // AllUsersConsumeBF16() again.
+      if (type == BF16 && !AllUsersConsumeBF16(*hlo, index)) {
+        type = F32;
+      }
+      if (type == F32) {
+        for (const auto* value : dataflow_->GetValueSet(hlo, index).values()) {
+          // We rely on the fact that this adjustment works in reverse
+          // topological order. Adding the value to
+          // values_that_must_be_kept_as_f32_ will ensure the correctness
+          // of the adjustment for HLOs that will be processed later.
+          values_that_must_be_kept_as_f32_.insert(value);
+        }
+      }
+      if (type != subshape->element_type()) {
+        subshape->set_element_type(type);
+        VLOG(2) << "HloInstruction output at shape index " << index
+                << " adjusted to " << *subshape << ": " << hlo->ToString();
+        if (hlo->opcode() == HloOpcode::kParameter) {
+          parameter_changed = true;
+        }
+      }
+    };
+    ShapeUtil::ForEachMutableSubshape(hlo->mutable_shape(), adjust_hlo_output);
+    AdjustCalledComputationRoot(hlo);
+    if (hlo->opcode() == HloOpcode::kWhile) {
+      // We need to run on the while body and condition repeatedly until a fixed
+      // point is reached, i.e., the parameters do not change any more. We may
+      // need more than one iteration because the while input and output alias
+      // each other, so changing one input parameter requires changing the
+      // corresponding output element and thus may transitively require changing
+      // another input parameter. A fixed point will be reached because the
+      // parameters can only be changed from BF16 to F32, not the other way
+      // around.
+      tensorflow::gtl::FlatSet<const HloComputation*> visited_in_while;
+      while (ResolveInconsistencyOfAliasingBuffersHelper(hlo->while_condition(),
+                                                         &visited_in_while) ||
+             ResolveInconsistencyOfAliasingBuffersHelper(hlo->while_body(),
+                                                         &visited_in_while)) {
+        visited_in_while.clear();
+        ShapeUtil::ForEachMutableSubshape(hlo->mutable_shape(),
+                                          adjust_hlo_output);
+        AdjustCalledComputationRoot(hlo);
+      }
+      visited_computations->insert(visited_in_while.begin(),
+                                   visited_in_while.end());
+    }
+  }
+  // Now adjust parameters of called computations.
+  for (auto inst_it = insts.rbegin(); inst_it != insts.rend(); ++inst_it) {
+    AdjustCalledComputationParameters(*inst_it);
+  }
+  return parameter_changed;
+}
+
+Status BFloat16Propagation::ResolveInconsistencyOfAliasingBuffers(
+    HloModule* module) {
+  std::list<HloComputation*> computations_topological_order =
+      module->MakeComputationPostOrder();
+  tensorflow::gtl::FlatSet<const HloComputation*> resolved;
+  for (auto comp_it = computations_topological_order.rbegin();
+       comp_it != computations_topological_order.rend(); ++comp_it) {
+    if (ContainsKey(resolved, *comp_it)) {
+      continue;
+    }
+    ResolveInconsistencyOfAliasingBuffersHelper(*comp_it, &resolved);
+  }
+
+  // We could have changed a fusion computation's root shape to have a different
+  // precision than the fusion node's output, if the fusion root does not
+  // define a buffer (e.g., a tuple). Now we add conversions after such fusion
+  // roots to make them match the fusion output. If the fusion output is a
+  // (possibly nested) tuple, we first create get-tuple-elements, then convert
+  // the unmatching leaf nodes, and finally create a new tuple as the fusion
+  // computation's root. If tuples and get-tuple-elements are created, we will
+  // run tuple simplifier and dead code elimination at the end (dead code is not
+  // allowed in fusion computation). E.g.,
+  //
+  // (1)             (2)             (3)
+  // a  b            a  b            a  b
+  // |\ |            |\ |            |\ |
+  // \ add   ->      |add    ->      | add
+  //  \ |            \ |        convert |
+  //  tuple         tuple             \ |
+  //                 / \              tuple
+  //               gte gte
+  //                |   |
+  //           convert  |
+  //                 \  /
+  //                 tuple
+  // (1) a is F32 but tuple is BF16
+  // (2) after adding conversion
+  // (3) after tuple simplifier and DCE.
+  bool needs_tuple_simplifier = false;
+  for (auto computation : computations_topological_order) {
+    auto insts = computation->MakeInstructionPostOrder();
+    for (auto inst_it = insts.rbegin(); inst_it != insts.rend(); ++inst_it) {
+      auto hlo = *inst_it;
+      if (hlo->opcode() != HloOpcode::kFusion) {
+        continue;
+      }
+      auto fusion_computation = hlo->fused_instructions_computation();
+      auto fusion_root = fusion_computation->root_instruction();
+      if (ShapeUtil::Compatible(fusion_root->shape(), hlo->shape())) {
+        continue;
+      }
+      ShapeTree<HloInstruction*> converted_outputs(hlo->shape());
+      // Iterate through nodes in the shape tree in pre-order and initialize
+      // each non-root node with a corresponding get-tuple-element. For a leaf
+      // node, if its shape does not match the fusion output, create a
+      // conversion node to overwrite the node value.
+      for (auto it = converted_outputs.begin(); it != converted_outputs.end();
+           ++it) {
+        ShapeIndex output_index = it->first;
+        HloInstruction*& output = it->second;
+        const Shape subshape =
+            ShapeUtil::GetSubshape(hlo->shape(), output_index);
+        if (output_index.empty()) {
+          output = fusion_root;
+        } else {
+          ShapeIndex parent_index = output_index;
+          parent_index.pop_back();
+          output = fusion_computation->AddInstruction(
+              HloInstruction::CreateGetTupleElement(
+                  subshape, converted_outputs.element(parent_index),
+                  output_index.back()));
+        }
+        if (ShapeUtil::IsTuple(subshape)) {
+          continue;
+        }
+        if (!ShapeUtil::Compatible(
+                subshape,
+                ShapeUtil::GetSubshape(fusion_root->shape(), output_index))) {
+          output = fusion_computation->AddInstruction(
+              HloInstruction::CreateConvert(subshape, output));
+        }
+      }
+      // Iterate through nodes in the shape tree in reverse pre-order and create
+      // a tuple instruction for each non-leaf node where the elements are the
+      // values of its child nodes.
+      for (auto it = converted_outputs.rbegin(); it != converted_outputs.rend();
+           ++it) {
+        ShapeIndex output_index = it->first;
+        HloInstruction*& output = it->second;
+        const Shape& subshape =
+            ShapeUtil::GetSubshape(hlo->shape(), output_index);
+        if (!ShapeUtil::IsTuple(subshape)) {
+          continue;
+        }
+        std::vector<HloInstruction*> elements(
+            ShapeUtil::TupleElementCount(subshape));
+        ShapeIndex child_index = output_index;
+        for (int64 i = 0; i < elements.size(); ++i) {
+          child_index.push_back(i);
+          elements[i] = converted_outputs.element(child_index);
+          child_index.pop_back();
+        }
+        output = fusion_computation->AddInstruction(
+            HloInstruction::CreateTuple(elements));
+      }
+      fusion_computation->set_root_instruction(converted_outputs.element({}));
+      needs_tuple_simplifier |= ShapeUtil::IsTuple(hlo->shape());
+    }
+  }
+
+  // We may have converted some constants from F32 to BF16, so adjust the
+  // constant literals in such cases. We do this here instead of when the
+  // constant node's shape is changed because 1) the HloInstruction interface
+  // does not allow resetting the literal so we have to create a new kConstant
+  // instruction to replace the old one, which invalidates dataflow analysis,
+  // and 2) it's possible that a kConstant's output gets changed to BF16 at the
+  // beginning but later on adjusted back to F32, so converting literals here
+  // can avoid repeated conversions.
+  //
+  // TODO(b/73833576): Consider resetting literal in HloInstruction.
+  bool needs_dce = needs_tuple_simplifier;
+  for (auto computation : computations_topological_order) {
+    for (auto hlo : computation->MakeInstructionPostOrder()) {
+      if (hlo->opcode() != HloOpcode::kConstant) {
+        continue;
+      }
+      if (!ShapeUtil::Equal(hlo->literal().shape(), hlo->shape())) {
+        TF_ASSIGN_OR_RETURN(auto converted_literal,
+                            hlo->literal().ConvertToShape(hlo->shape()));
+        auto new_constant = computation->AddInstruction(
+            HloInstruction::CreateConstant(std::move(converted_literal)));
+        TF_RETURN_IF_ERROR(hlo->ReplaceAllUsesWith(new_constant));
+        needs_dce = true;
+      }
+    }
+  }
+
+  if (needs_tuple_simplifier) {
+    TupleSimplifier tuple_simplifier;
+    TF_RETURN_IF_ERROR(tuple_simplifier.Run(module).status());
+  }
+  if (needs_dce) {
+    HloDCE dce;
+    TF_RETURN_IF_ERROR(dce.Run(module).status());
+  }
+  return Status::OK();
+}
+
+// The algorithm first does a forward pass (parameters to root) to determine a
+// set of instructions to consider using bfloat16, then does a backward pass to
+// determine the precisions of those instructions according to the needs of
+// their users.
+StatusOr<bool> BFloat16Propagation::Run(HloModule* module) {
+  TF_ASSIGN_OR_RETURN(dataflow_, HloDataflowAnalysis::Run(*module));
+
+  std::list<HloComputation*> computations_topological_order =
+      module->MakeComputationPostOrder();
+  // The first step is a forward pass (parameters to root), where we determine
+  // the candidate instructions whose outputs could use bfloat16 without likely
+  // incurring overhead from extra explicit conversions. The pass runs forward
+  // because whether an HLO is a candidate depends partially on whether its
+  // operands are candidates.
+  for (auto computation : computations_topological_order) {
+    for (auto inst : computation->MakeInstructionPostOrder()) {
+      if (InstructionIsCandidateForBF16Output(inst)) {
+        consider_using_bfloat16_.insert(inst);
+      }
+    }
+  }
+
+  // The second step is a backward pass (root to parameters), where we modify
+  // the precisions of the instructions identified in the first step when
+  // feasible. The pass runs backward because the precision of an HLO's output
+  // is determined by how it is later used.
+  //
+  // The precision of an instruction is determined by its users, so we do the
+  // propagation in reverse topological order.
+  for (auto comp_it = computations_topological_order.rbegin();
+       comp_it != computations_topological_order.rend(); ++comp_it) {
+    if ((*comp_it)->IsFusionComputation()) {
+      // Fusion computations are handled when visiting the fusion instruction.
+      continue;
+    }
+    auto insts = (*comp_it)->MakeInstructionPostOrder();
+    for (auto inst_it = insts.rbegin(); inst_it != insts.rend(); ++inst_it) {
+      DetermineAndMutateInstructionPrecision(*inst_it,
+                                             /*skip_parameters=*/true);
+    }
+  }
+
+  if (!changed_) {
+    return false;
+  }
+
+  // It's possible that an instruction does not define a buffer, but the
+  // defining instruction's shape has changed. So we need to adjust the output
+  // shapes of instructions according to the HLO values they refer to.
+  TF_RETURN_IF_ERROR(ResolveInconsistencyOfAliasingBuffers(module));
+  return true;
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/bfloat16_propagation.h b/tensorflow/compiler/xla/service/bfloat16_propagation.h
new file mode 100644
index 0000000000000000000000000000000000000000..89a5ac5db1549877a135182ae8df57fa6bf9d579
--- /dev/null
+++ b/tensorflow/compiler/xla/service/bfloat16_propagation.h
@@ -0,0 +1,159 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_BFLOAT16_PROPAGATION_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_BFLOAT16_PROPAGATION_H_
+
+#include <memory>
+#include <unordered_map>
+#include <unordered_set>
+#include <vector>
+
+#include "tensorflow/compiler/xla/service/bfloat16_support.h"
+#include "tensorflow/compiler/xla/service/hlo_dataflow_analysis.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/hlo_pass_interface.h"
+
+namespace xla {
+
+// HLO pass which reduces the precision of some HLO instructions to BF16
+// according to the backend-specific BFloat16Support rule provided by the
+// caller.
+//
+// This pass can be used to reduce instruction precision without affecting the
+// numerical accuracy of the module, i.e., the final output of the module would
+// be bitwise identical to that without this pass; this is possible if the
+// backend already reduces precision to BF16 on some HLO instructions.
+//
+// This pass will not modify the signature of a computation, unless it is a
+// fusion computation or its only caller is a while.
+//
+// !!! WARNING !!! This pass can introduce mixed precision in individual HLOs,
+// which has two issues:
+//
+// 1) It is not guaranteed to respect the passed-in BFloat16Support
+// specification in terms of mixed precision, so the backend may not support an
+// HLO that has mixed precision produced by this pass. To address this issue,
+// run BFloat16Normalization with the same BFloat16Support after this pass.
+//
+// 2) In general, mixed precision may break the assumptions of some other HLO
+// passes even if the specific backend supports the individual HLOs. Such
+// assumptions include that there are no HLOs using mixed precision, or that the
+// precision of an HLO's output is determined by its inputs. This pass should
+// be used at the end of the HLO optimization pipeline but before
+// BFloat16ConversionFolding. If other passes are needed after this pass, run
+// BFloat16MixedPrecisionRemoval first to undo some of the changes made by this
+// pass.
+class BFloat16Propagation : public HloPassInterface {
+ public:
+  explicit BFloat16Propagation(const BFloat16Support* bfloat16_support);
+
+  ~BFloat16Propagation() override = default;
+
+  tensorflow::StringPiece name() const override {
+    return "bfloat16-propagation";
+  }
+
+  // Runs the pass on the given module. Returns whether the module was changed
+  // (precision reductions were added).
+  StatusOr<bool> Run(HloModule* module) override;
+
+ private:
+  // ***************************
+  // Function called and state produced by the forward analysis pass (from
+  // parameters to root) that determines the candidate HLOs to use BF16 outputs.
+
+  // Determines whether we should consider changing the precision of the given
+  // instruction in the forward pass.
+  bool InstructionIsCandidateForBF16Output(HloInstruction* hlo);
+
+  // The set of instructions to consider using bfloat16, computed in the forward
+  // pass.
+  tensorflow::gtl::FlatSet<const HloInstruction*> consider_using_bfloat16_;
+
+  // ***************************
+  // Functions called and state produced by the backward mutation pass (from
+  // root to parameters).
+
+  // Determines the precision for the given instruction in the mutation pass.
+  void DetermineAndMutateInstructionPrecision(HloInstruction* hlo,
+                                              bool skip_parameters);
+
+  // Special handling in the mutation pass for fusion computations.
+  //
+  // Precondition: fusion->opcode() == kFusion
+  void DetermineAndMutateFusionComputationPrecision(HloInstruction* fusion);
+
+  // Special handling in the mutation pass for while computations.
+  //
+  // Precondition: while_hlo->opcode() == kWhile
+  void DetermineAndMutateWhileComputationsPrecision(HloInstruction* while_hlo);
+
+  // The set of HloInstructions that have been visited in the mutation pass.
+  tensorflow::gtl::FlatSet<const HloInstruction*>
+      instructions_visited_in_mutation_pass_;
+
+  // The set of HloComputations that have been visited in the mutation pass.
+  tensorflow::gtl::FlatSet<const HloComputation*>
+      computations_visited_in_mutation_pass_;
+
+  // ***************************
+  // Functions called by the final inconsistency resolving pass.
+
+  // Adjusts the output shapes of HloInstructions such that if two
+  // HloInstructions have aliasing buffers in their outputs, they must have the
+  // same precision.
+  Status ResolveInconsistencyOfAliasingBuffers(HloModule* module);
+
+  // Resolves inconsistency of aliasing buffers for the given computation, and
+  // recursively runs on a while instruction's condition and body until a fixed
+  // point is reached.
+  bool ResolveInconsistencyOfAliasingBuffersHelper(
+      HloComputation* computation,
+      tensorflow::gtl::FlatSet<const HloComputation*>* visited_computations);
+
+  // Makes the parameters of called computations match how they are called by
+  // the given HLO.
+  void AdjustCalledComputationParameters(HloInstruction* hlo);
+
+  // Makes the root instructions of called computations match how they are used
+  // by the given HLO.
+  void AdjustCalledComputationRoot(HloInstruction* hlo);
+
+  // ***************************
+  // Functions called and state used by two or more passes.
+
+  // Returns whether all uses of the given HloInstruction's output at the given
+  // shape index can consume BF16 input.
+  bool AllUsersConsumeBF16(const HloInstruction& hlo,
+                           const ShapeIndex& index) const;
+
+  // The set of F32 HLO values that must be kept in F32.
+  tensorflow::gtl::FlatSet<const HloValue*> values_that_must_be_kept_as_f32_;
+
+  // Mapping from each HloComputation to the number of callers to it in the
+  // module. Populated at the beginning of this pass.
+  tensorflow::gtl::FlatMap<const HloComputation*, int64> caller_counts_;
+
+  const BFloat16Support* bfloat16_support_;
+  std::unique_ptr<HloDataflowAnalysis> dataflow_;
+
+  bool changed_ = false;
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_BFLOAT16_PROPAGATION_H_
diff --git a/tensorflow/compiler/xla/service/bfloat16_propagation_test.cc b/tensorflow/compiler/xla/service/bfloat16_propagation_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..5950b004b3da439c442eec6e5e09ea2307fcb018
--- /dev/null
+++ b/tensorflow/compiler/xla/service/bfloat16_propagation_test.cc
@@ -0,0 +1,620 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/bfloat16_propagation.h"
+#include "tensorflow/compiler/xla/service/bfloat16_support.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/test.h"
+#include "tensorflow/compiler/xla/test_helpers.h"
+#include "tensorflow/compiler/xla/tests/hlo_test_base.h"
+#include "tensorflow/compiler/xla/tests/literal_test_util.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+
+namespace xla {
+
+// A class specifying the BF16 support used to test the propagation pass. It
+// specifies that BF16 and mixed precision are supported in all HloInstructions,
+// and that kDot reduces its operands' precision to BF16.
+class TestBFloat16Support : public BFloat16Support {
+ public:
+  TestBFloat16Support() {}
+  ~TestBFloat16Support() override {}
+
+  bool SupportsBF16Operand(const HloInstruction& hlo,
+                           int64 operand_index) const override {
+    return true;
+  }
+
+  bool SupportsBF16Output(const HloInstruction& hlo) const override {
+    return true;
+  }
+
+  bool SupportsMixedPrecisions(const HloInstruction& hlo) const override {
+    return true;
+  }
+
+  bool EffectiveOperandPrecisionIsBF16(const HloInstruction& hlo,
+                                       int64 operand_index) const override {
+    return hlo.opcode() == HloOpcode::kDot;
+  }
+};
+
+class BFloat16PropagationTest : public HloTestBase {
+ protected:
+  // Runs the propagation pass on the given module, and returns whether the
+  // module is changed after this pass.
+  bool PropagatePrecision(HloModule* module) {
+    TestBFloat16Support bfloat16_support;
+    BFloat16Propagation propagation(&bfloat16_support);
+    StatusOr<bool> result = propagation.Run(module);
+    EXPECT_IS_OK(result.status());
+    return result.ValueOrDie();
+  }
+
+  // Returns whether the given HloInstruction's output element type is BF16 or
+  // the only use of it is converting to BF16.
+  bool OutputsBF16(const HloInstruction* inst) {
+    if (inst->shape().element_type() == BF16) {
+      return true;
+    }
+    return inst->user_count() == 1 &&
+           inst->users()[0]->opcode() == HloOpcode::kConvert &&
+           inst->users()[0]->shape().element_type() == BF16;
+  }
+};
+
+// Tests that BF16 can propagate through select over non-tuple buffers, but not
+// through add where reducing operand precision can affect the result.
+TEST_F(BFloat16PropagationTest, PropagateThroughSelectButNotAdd) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {2, 4});
+
+  HloInstruction* a =
+      builder.AddInstruction(HloInstruction::CreateParameter(0, shape, "a"));
+  HloInstruction* b =
+      builder.AddInstruction(HloInstruction::CreateParameter(1, shape, "b"));
+  HloInstruction* c =
+      builder.AddInstruction(HloInstruction::CreateParameter(2, shape, "c"));
+  HloInstruction* add0 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, a, b));
+  HloInstruction* add1 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, add0, b));
+  HloInstruction* pred = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kEq, a, b));
+  HloInstruction* sel = builder.AddInstruction(
+      HloInstruction::CreateTernary(shape, HloOpcode::kSelect, pred, c, add1));
+  HloInstruction* xpose =
+      builder.AddInstruction(HloInstruction::CreateTranspose(
+          ShapeUtil::MakeShape(F32, {4, 2}), sel, {1, 0}));
+  HloInstruction* dot = builder.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(F32, {4, 4}), HloOpcode::kDot, xpose, a));
+  HloInstruction* root = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, dot, dot));
+
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), root);
+  EXPECT_TRUE(OutputsBF16(xpose));
+  EXPECT_TRUE(OutputsBF16(sel));
+  EXPECT_TRUE(OutputsBF16(add1));
+  EXPECT_FALSE(OutputsBF16(add0));
+  EXPECT_FALSE(OutputsBF16(a));
+  EXPECT_FALSE(OutputsBF16(b));
+  EXPECT_FALSE(OutputsBF16(c));
+}
+
+// Tests that if a constant is converted to BF16 then its literal must also be
+// converted.
+TEST_F(BFloat16PropagationTest, ConvertConstantLiteral) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {4, 4});
+  Array2D<float> array_a(4, 4);
+  array_a.FillUnique(1.0f);
+  Array2D<float> array_b(4, 4);
+  array_b.FillUnique(10.0f);
+
+  HloInstruction* a = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateFromArray(array_a)));
+  HloInstruction* b = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateFromArray(array_b)));
+  HloInstruction* dot = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, a, b));
+
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), dot);
+  EXPECT_TRUE(OutputsBF16(dot->operand(0)));
+  EXPECT_TRUE(OutputsBF16(dot->operand(1)));
+  EXPECT_EQ(dot->operand(0)->opcode(), HloOpcode::kConstant);
+  EXPECT_EQ(dot->operand(1)->opcode(), HloOpcode::kConstant);
+  LiteralTestUtil::ExpectEqual(
+      dot->operand(0)->literal(),
+      *LiteralTestUtil::ConvertF32ToBF16(*Literal::CreateFromArray(array_a)));
+  LiteralTestUtil::ExpectEqual(
+      dot->operand(1)->literal(),
+      *LiteralTestUtil::ConvertF32ToBF16(*Literal::CreateFromArray(array_b)));
+}
+
+// Tests that BF16 can be propagated through nested tuples.
+TEST_F(BFloat16PropagationTest, PropagateThroughTuples) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {2, 4});
+
+  HloInstruction* a =
+      builder.AddInstruction(HloInstruction::CreateParameter(0, shape, "a"));
+  HloInstruction* b =
+      builder.AddInstruction(HloInstruction::CreateParameter(1, shape, "b"));
+  HloInstruction* add0 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, a, b));
+  HloInstruction* add1 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, a, a));
+  HloInstruction* add2 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, b, b));
+  HloInstruction* xpose =
+      builder.AddInstruction(HloInstruction::CreateTranspose(
+          ShapeUtil::MakeShape(F32, {4, 2}), add1, {1, 0}));
+
+  HloInstruction* tuple0 =
+      builder.AddInstruction(HloInstruction::CreateTuple({add0, add1, add2}));
+  HloInstruction* tuple1 =
+      builder.AddInstruction(HloInstruction::CreateTuple({tuple0, xpose}));
+
+  HloInstruction* lhs = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(xpose->shape(), tuple1, 1));
+  HloInstruction* rhs =
+      builder.AddInstruction(HloInstruction::CreateGetTupleElement(
+          add0->shape(),
+          builder.AddInstruction(HloInstruction::CreateGetTupleElement(
+              tuple0->shape(), tuple1, 0)),
+          0));
+  HloInstruction* dot = builder.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(F32, {4, 4}), HloOpcode::kDot, lhs, rhs));
+
+  HloInstruction* output_tuple =
+      builder.AddInstruction(HloInstruction::CreateTuple({dot, add2}));
+
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), output_tuple);
+  EXPECT_TRUE(OutputsBF16(xpose));
+  EXPECT_TRUE(OutputsBF16(add0));
+  EXPECT_TRUE(OutputsBF16(add1));
+  EXPECT_FALSE(OutputsBF16(add2));
+}
+
+// Tests that even if an instruction does not define a buffer in its output, its
+// shape must match the defining instruction.
+TEST_F(BFloat16PropagationTest, SameValueReferencedTwice) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {2, 4});
+
+  HloInstruction* a =
+      builder.AddInstruction(HloInstruction::CreateParameter(0, shape, "a"));
+  HloInstruction* b =
+      builder.AddInstruction(HloInstruction::CreateParameter(1, shape, "b"));
+  HloInstruction* add0 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, a, b));
+  HloInstruction* add1 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, a, a));
+
+  HloInstruction* lhs = builder.AddInstruction(HloInstruction::CreateTranspose(
+      ShapeUtil::MakeShape(F32, {4, 2}), add1, {1, 0}));
+
+  HloInstruction* tuple =
+      builder.AddInstruction(HloInstruction::CreateTuple({add0, add1}));
+  HloInstruction* rhs = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(add1->shape(), tuple, 1));
+
+  // lhs is the transpose of add1, and rhs is a get-tuple-element aliasing add1.
+  HloInstruction* dot = builder.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(F32, {4, 4}), HloOpcode::kDot, lhs, rhs));
+
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), dot);
+  EXPECT_TRUE(OutputsBF16(add0));
+  EXPECT_TRUE(OutputsBF16(add1));
+  EXPECT_TRUE(OutputsBF16(lhs));
+  // rhs is a get-tuple-element, which does not define a buffer, but its shape
+  // should also be adjusted accordingly.
+  EXPECT_TRUE(OutputsBF16(rhs));
+}
+
+// Tests that a non-fusion computation's root is not changed.
+TEST_F(BFloat16PropagationTest, DoNotChangeComputationRoot) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {2, 4});
+
+  HloInstruction* a =
+      builder.AddInstruction(HloInstruction::CreateParameter(0, shape, "a"));
+  HloInstruction* b =
+      builder.AddInstruction(HloInstruction::CreateParameter(1, shape, "b"));
+  HloInstruction* add = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, a, b));
+
+  HloInstruction* dot = builder.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(F32, {4, 4}), HloOpcode::kDot, add, add));
+
+  HloInstruction* tuple =
+      builder.AddInstruction(HloInstruction::CreateTuple({add, dot}));
+
+  auto module = CreateNewModule();
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_FALSE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), tuple);
+  EXPECT_FALSE(OutputsBF16(add));
+}
+
+// Tests that BF16 is propagated properly through fused computations.
+TEST_F(BFloat16PropagationTest, PropagateThroughFusion) {
+  auto module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {2, 4});
+
+  HloInstruction* param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, shape, "param"));
+  HloInstruction* add = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param, param));
+
+  auto builder_f0 = HloComputation::Builder("fusion0");
+  HloInstruction* a_f0 =
+      builder_f0.AddInstruction(HloInstruction::CreateParameter(0, shape, "a"));
+  HloInstruction* b_f0 =
+      builder_f0.AddInstruction(HloInstruction::CreateParameter(1, shape, "b"));
+  HloInstruction* tuple_f0 =
+      builder_f0.AddInstruction(HloInstruction::CreateTuple({a_f0, b_f0}));
+  auto comp_f0 = module->AddEmbeddedComputation(builder_f0.Build());
+  auto fusion0 = builder.AddInstruction(HloInstruction::CreateFusion(
+      tuple_f0->shape(), HloInstruction::FusionKind::kCustom, {add, add},
+      comp_f0));
+
+  auto builder_f1 = HloComputation::Builder("fusion1");
+  HloInstruction* p_f1 = builder_f1.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple_f0->shape(), "param"));
+  HloInstruction* a_f1 = builder_f1.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, p_f1, 0));
+  HloInstruction* b_f1 = builder_f1.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, p_f1, 1));
+  HloInstruction* dot = builder_f1.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(F32, {4, 4}), HloOpcode::kDot, a_f1, b_f1));
+  auto comp_f1 = module->AddEmbeddedComputation(builder_f1.Build());
+  auto fusion1 = builder.AddInstruction(HloInstruction::CreateFusion(
+      dot->shape(), HloInstruction::FusionKind::kCustom, {fusion0}, comp_f1));
+
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), fusion1);
+  EXPECT_TRUE(OutputsBF16(add));
+  EXPECT_TRUE(OutputsBF16(a_f0));
+  EXPECT_TRUE(OutputsBF16(b_f0));
+  EXPECT_TRUE(OutputsBF16(a_f1));
+  EXPECT_TRUE(OutputsBF16(b_f1));
+}
+
+// Tests that if 1) the root instruction of a fusion is a tuple, 2) the fusion
+// outputs are only used by a dot, and 3) one element of the tuple is used by
+// an add in the fusion computation, then the propagation pass should create a
+// convert in the fusion computation to keep the add's operand in F32 but change
+// the fusion output to BF16. E.g., the following fusion computation
+//   (F32, F32) fusion_computation(F32 a, F32 b)
+//     = tuple(F32 a, F32 add(F32 a, F32 b))
+// will be changed to
+//   (BF16, BF16) fusion_computation(F32 a, F32 b)
+//     = tuple(BF16 convert(a), BF16 add(F32 a, F32 b))
+TEST_F(BFloat16PropagationTest, ConvertTupleFusionElementIfUsedByAdd) {
+  auto module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {4, 4});
+
+  HloInstruction* param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, shape, "param"));
+  HloInstruction* add = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param, param));
+
+  auto builder_f = HloComputation::Builder("fusion0");
+  HloInstruction* a_f =
+      builder_f.AddInstruction(HloInstruction::CreateParameter(0, shape, "a"));
+  HloInstruction* b_f =
+      builder_f.AddInstruction(HloInstruction::CreateParameter(1, shape, "b"));
+  HloInstruction* add_f = builder_f.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, a_f, b_f));
+  HloInstruction* tuple_f =
+      builder_f.AddInstruction(HloInstruction::CreateTuple({a_f, add_f}));
+  auto comp_f = module->AddEmbeddedComputation(builder_f.Build());
+  auto fusion = builder.AddInstruction(HloInstruction::CreateFusion(
+      tuple_f->shape(), HloInstruction::FusionKind::kCustom, {add, add},
+      comp_f));
+
+  HloInstruction* gte0 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, fusion, 0));
+  HloInstruction* gte1 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, fusion, 1));
+  HloInstruction* dot = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, gte0, gte1));
+
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), dot);
+  EXPECT_TRUE(OutputsBF16(gte0));
+  EXPECT_TRUE(OutputsBF16(gte1));
+  EXPECT_FALSE(OutputsBF16(a_f));
+  EXPECT_FALSE(OutputsBF16(b_f));
+  EXPECT_TRUE(OutputsBF16(add_f));
+  auto new_fusion_root = comp_f->root_instruction();
+  EXPECT_EQ(new_fusion_root->opcode(), HloOpcode::kTuple);
+  EXPECT_EQ(new_fusion_root->operand(1), add_f);
+  EXPECT_EQ(new_fusion_root->operand(0)->opcode(), HloOpcode::kConvert);
+  EXPECT_TRUE(OutputsBF16(new_fusion_root->operand(0)));
+}
+
+// A select over tuples does not define the leaf buffers, so the types in
+// on_true and on_false must match: if one of them is F32, the other must be
+// F32 as well.
+TEST_F(BFloat16PropagationTest, SelectOverTuples) {
+  auto module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {2, 4});
+
+  HloInstruction* param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, shape, "param"));
+  HloInstruction* pred = builder.AddInstruction(HloInstruction::CreateParameter(
+      1, ShapeUtil::MakeShape(PRED, {}), "pred"));
+
+  HloInstruction* add0 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param, param));
+  HloInstruction* add1 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, add0, param));
+  HloInstruction* tuple0 =
+      builder.AddInstruction(HloInstruction::CreateTuple({param, add0}));
+  HloInstruction* tuple1 =
+      builder.AddInstruction(HloInstruction::CreateTuple({param, add1}));
+  HloInstruction* sel = builder.AddInstruction(HloInstruction::CreateTernary(
+      tuple0->shape(), HloOpcode::kSelect, pred, tuple0, tuple1));
+  HloInstruction* gte0 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, sel, 0));
+  HloInstruction* gte1 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, sel, 1));
+  HloInstruction* xpose =
+      builder.AddInstruction(HloInstruction::CreateTranspose(
+          ShapeUtil::MakeShape(F32, {4, 2}), gte0, {1, 0}));
+  HloInstruction* dot = builder.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(F32, {4, 4}), HloOpcode::kDot, xpose, gte1));
+
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), dot);
+  EXPECT_FALSE(OutputsBF16(add0));
+  EXPECT_FALSE(OutputsBF16(add1));
+  EXPECT_FALSE(OutputsBF16(gte0));
+  EXPECT_FALSE(OutputsBF16(gte1));
+  EXPECT_TRUE(OutputsBF16(xpose));
+}
+
+// Tests that BF16 is propagated properly through while computations.
+TEST_F(BFloat16PropagationTest, PropagateThroughWhile) {
+  auto module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {4, 4});
+
+  HloInstruction* param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, shape, "param0"));
+  HloInstruction* param1 = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, shape, "param1"));
+  HloInstruction* add0 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* add1 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* tuple =
+      builder.AddInstruction(HloInstruction::CreateTuple({add0, add1}));
+
+  auto builder_cond = HloComputation::Builder("cond");
+  auto cond_param = builder_cond.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple->shape(), "cond_param"));
+  auto cond_lhs = builder_cond.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond_param, 0));
+  auto cond_rhs = builder_cond.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond_param, 1));
+  // This add should prevent RHS from using BF16
+  auto cond_add_rhs = builder_cond.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, cond_rhs, cond_rhs));
+  auto cond_dot = builder_cond.AddInstruction(HloInstruction::CreateBinary(
+      shape, HloOpcode::kDot, cond_lhs, cond_add_rhs));
+  builder_cond.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(PRED, {}), HloOpcode::kGt,
+      builder_cond.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond_dot, {0, 0}, {1, 1}, {1, 1})),
+      builder_cond.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond_dot, {1, 1}, {2, 2}, {1, 1}))));
+  auto cond = module->AddEmbeddedComputation(builder_cond.Build());
+
+  auto builder_body = HloComputation::Builder("body");
+  auto body_param = builder_body.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple->shape(), "body_param"));
+  auto body_lhs = builder_body.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, body_param, 0));
+  auto body_rhs = builder_body.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, body_param, 1));
+  auto body_dot = builder_body.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, body_lhs, body_rhs));
+  builder_body.AddInstruction(
+      HloInstruction::CreateTuple({body_dot, body_rhs}));
+  auto body = module->AddEmbeddedComputation(builder_body.Build());
+
+  auto while_hlo = builder.AddInstruction(
+      HloInstruction::CreateWhile(tuple->shape(), cond, body, tuple));
+
+  auto lhs = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, while_hlo, 0));
+  auto rhs = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, while_hlo, 1));
+  auto dot = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, lhs, rhs));
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+
+  EXPECT_EQ(computation->root_instruction(), dot);
+  EXPECT_TRUE(OutputsBF16(lhs));
+  EXPECT_FALSE(OutputsBF16(rhs));
+  EXPECT_TRUE(OutputsBF16(body_dot));
+  EXPECT_TRUE(OutputsBF16(body_lhs));
+  EXPECT_FALSE(OutputsBF16(body_rhs));
+  EXPECT_TRUE(OutputsBF16(cond_lhs));
+  EXPECT_FALSE(OutputsBF16(cond_rhs));
+  EXPECT_TRUE(OutputsBF16(add0));
+  EXPECT_FALSE(OutputsBF16(add1));
+}
+
+// Tests that BF16 is not propagated through multiple whiles that invoke the
+// same computation as long as one while prevents the propagation.
+TEST_F(BFloat16PropagationTest, DoNotPropagateWhilesCallingSameComputation) {
+  auto module = CreateNewModule();
+  auto builder = HloComputation::Builder(TestName());
+  Shape shape = ShapeUtil::MakeShape(F32, {4, 4});
+
+  HloInstruction* param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, shape, "param0"));
+  HloInstruction* param1 = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, shape, "param1"));
+  HloInstruction* add0 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* add1 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* add2 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* add3 = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kAdd, param0, param1));
+  HloInstruction* tuple0 =
+      builder.AddInstruction(HloInstruction::CreateTuple({add0, add1}));
+  HloInstruction* tuple1 =
+      builder.AddInstruction(HloInstruction::CreateTuple({add2, add3}));
+
+  // Condition computation for the first while.
+  auto builder_cond0 = HloComputation::Builder("cond0");
+  auto cond0_param = builder_cond0.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple0->shape(), "cond0_param"));
+  auto cond0_lhs = builder_cond0.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond0_param, 0));
+  auto cond0_rhs = builder_cond0.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond0_param, 1));
+  // This add should prevent RHS from using BF16
+  auto cond0_add_rhs =
+      builder_cond0.AddInstruction(HloInstruction::CreateBinary(
+          shape, HloOpcode::kAdd, cond0_rhs, cond0_rhs));
+  auto cond0_dot = builder_cond0.AddInstruction(HloInstruction::CreateBinary(
+      shape, HloOpcode::kDot, cond0_lhs, cond0_add_rhs));
+  builder_cond0.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(PRED, {}), HloOpcode::kGt,
+      builder_cond0.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond0_dot, {0, 0}, {1, 1}, {1, 1})),
+      builder_cond0.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond0_dot, {1, 1}, {2, 2}, {1, 1}))));
+  auto cond0 = module->AddEmbeddedComputation(builder_cond0.Build());
+
+  // Condition computation for the second while.
+  auto builder_cond1 = HloComputation::Builder("cond1");
+  auto cond1_param = builder_cond1.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple1->shape(), "cond1_param"));
+  auto cond1_lhs = builder_cond1.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond1_param, 0));
+  auto cond1_rhs = builder_cond1.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, cond1_param, 1));
+  // This add should prevent LHS from using BF16
+  auto cond1_add_lhs =
+      builder_cond1.AddInstruction(HloInstruction::CreateBinary(
+          shape, HloOpcode::kAdd, cond1_lhs, cond1_lhs));
+  auto cond1_dot = builder_cond1.AddInstruction(HloInstruction::CreateBinary(
+      shape, HloOpcode::kDot, cond1_add_lhs, cond1_rhs));
+  builder_cond1.AddInstruction(HloInstruction::CreateBinary(
+      ShapeUtil::MakeShape(PRED, {}), HloOpcode::kGt,
+      builder_cond1.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond1_dot, {0, 0}, {1, 1}, {1, 1})),
+      builder_cond1.AddInstruction(HloInstruction::CreateSlice(
+          ShapeUtil::MakeShape(F32, {}), cond1_dot, {1, 1}, {2, 2}, {1, 1}))));
+  auto cond1 = module->AddEmbeddedComputation(builder_cond1.Build());
+
+  // Body computation shared by both whiles.
+  auto builder_body = HloComputation::Builder("body");
+  auto body_param = builder_body.AddInstruction(
+      HloInstruction::CreateParameter(0, tuple0->shape(), "body_param"));
+  auto body_lhs = builder_body.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, body_param, 0));
+  auto body_rhs = builder_body.AddInstruction(
+      HloInstruction::CreateGetTupleElement(shape, body_param, 1));
+  auto body_dot = builder_body.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, body_lhs, body_rhs));
+  builder_body.AddInstruction(
+      HloInstruction::CreateTuple({body_dot, body_rhs}));
+  auto body = module->AddEmbeddedComputation(builder_body.Build());
+
+  auto while0 = builder.AddInstruction(
+      HloInstruction::CreateWhile(tuple0->shape(), cond0, body, tuple0));
+  auto while1 = builder.AddInstruction(
+      HloInstruction::CreateWhile(tuple1->shape(), cond1, body, tuple1));
+
+  auto lhs = builder.AddInstruction(HloInstruction::CreateBinary(
+      shape, HloOpcode::kDot,
+      builder.AddInstruction(
+          HloInstruction::CreateGetTupleElement(shape, while0, 0)),
+      builder.AddInstruction(
+          HloInstruction::CreateGetTupleElement(shape, while0, 1))));
+  auto rhs = builder.AddInstruction(HloInstruction::CreateBinary(
+      shape, HloOpcode::kDot,
+      builder.AddInstruction(
+          HloInstruction::CreateGetTupleElement(shape, while1, 0)),
+      builder.AddInstruction(
+          HloInstruction::CreateGetTupleElement(shape, while1, 1))));
+  auto dot = builder.AddInstruction(
+      HloInstruction::CreateBinary(shape, HloOpcode::kDot, lhs, rhs));
+  auto computation = module->AddEntryComputation(builder.Build());
+
+  EXPECT_TRUE(PropagatePrecision(module.get()));
+  EXPECT_FALSE(OutputsBF16(body_dot));
+  EXPECT_FALSE(OutputsBF16(body_rhs));
+  EXPECT_FALSE(OutputsBF16(body_lhs));
+  EXPECT_FALSE(OutputsBF16(cond0_lhs));
+  EXPECT_FALSE(OutputsBF16(cond0_rhs));
+  EXPECT_FALSE(OutputsBF16(cond1_lhs));
+  EXPECT_FALSE(OutputsBF16(cond1_rhs));
+  EXPECT_TRUE(OutputsBF16(cond0_add_rhs));
+  EXPECT_TRUE(OutputsBF16(cond1_add_lhs));
+  EXPECT_EQ(computation->root_instruction(), dot);
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/bfloat16_support.cc b/tensorflow/compiler/xla/service/bfloat16_support.cc
index 3fd9e24601f27633c8063e4574c7c4f91f30dcff..07b4b14b5ec1bdbc01345091105df69368b0b2fb 100644
--- a/tensorflow/compiler/xla/service/bfloat16_support.cc
+++ b/tensorflow/compiler/xla/service/bfloat16_support.cc
@@ -79,6 +79,7 @@ bool BFloat16Support::EffectiveOperandPrecisionIsOutputPrecision(
     case HloOpcode::kBroadcast:
     case HloOpcode::kClamp:
     case HloOpcode::kConcatenate:
+    case HloOpcode::kConvert:
     case HloOpcode::kCopy:
     case HloOpcode::kGetTupleElement:
     case HloOpcode::kMaximum:
diff --git a/tensorflow/compiler/xla/service/bfloat16_support.h b/tensorflow/compiler/xla/service/bfloat16_support.h
index 29f662d22b4e5486662a1387407d41e0fd2ed1b3..82c2745f444e4f9c544c78cb36dafc11f678518a 100644
--- a/tensorflow/compiler/xla/service/bfloat16_support.h
+++ b/tensorflow/compiler/xla/service/bfloat16_support.h
@@ -39,7 +39,7 @@ class BFloat16Support {
   // precisions (BF16 and F32).
   virtual bool SupportsMixedPrecisions(const HloInstruction& hlo) const;
 
-  // Returns whether the given HLO inherits its BF16 operand precision at the
+  // Returns whether the given HLO preserves its BF16 operand precision at the
   // given index, so even if the output is F32, elements in the output that
   // depend on the BF16 operand will still have BF16 effective precision even if
   // they have F32 format. Similarly, this also means if the output is BF16 then
diff --git a/tensorflow/compiler/xla/service/buffer_assignment.cc b/tensorflow/compiler/xla/service/buffer_assignment.cc
index b1e693da9d5af4babe619b8796007f2da318f6a8..d44d3d71d9f28de0fd38f0c1c3aac3cf7418255e 100644
--- a/tensorflow/compiler/xla/service/buffer_assignment.cc
+++ b/tensorflow/compiler/xla/service/buffer_assignment.cc
@@ -48,6 +48,183 @@ using ::tensorflow::strings::HumanReadableNumBytes;
 using ::tensorflow::strings::Printf;
 using ::tensorflow::strings::StrAppend;
 
+namespace {
+
+template <typename T>
+string ColocatedBufferSetsToString(const T& container, const char* title) {
+  string result;
+  StrAppend(&result, title, "\n");
+  for (const auto& it : container) {
+    StrAppend(&result, "\t", it->ToString(), "\n");
+  }
+  return result;
+}
+
+// Walk the call graph of the HLO module and place each computation into either
+// thread_local_computations or global_computations depending upon whether the
+// computation requires thread-local allocations or global allocations. The
+// elements in thread_local_computations and global_computations are in post
+// order (if computation A has an instruction which calls computation B, then A
+// will appear after B in the vector).
+Status GatherComputationsByAllocationType(
+    const HloModule* module,
+    std::vector<const HloComputation*>* thread_local_computations,
+    std::vector<const HloComputation*>* global_computations) {
+  // Create a worklist of computations paired with whether the allocation must
+  // be thread-local.
+  std::deque<std::pair<const HloComputation*, bool>> worklist;
+  worklist.push_back(std::make_pair(module->entry_computation(),
+                                    /*is_thread_local*/ false));
+
+  // Sets for quickly checking membership. Computations are returned in vectors
+  // for stable iteration.
+  FlatSet<const HloComputation*> thread_local_set;
+  FlatSet<const HloComputation*> global_set;
+
+  while (!worklist.empty()) {
+    auto worklist_front = worklist.front();
+    worklist.pop_front();
+    const HloComputation* computation = worklist_front.first;
+    bool is_thread_local = worklist_front.second;
+    bool in_thread_local_set = thread_local_set.count(computation) > 0;
+    bool in_global_set = global_set.count(computation) > 0;
+
+    // If the computation has already been added to the respective set, then
+    // nothing to do.
+    if ((is_thread_local && in_thread_local_set) ||
+        (!is_thread_local && in_global_set)) {
+      continue;
+    }
+
+    // If the computation has already been added to the other set this is an
+    // error condition because the global call to the computation (e.g.,
+    // while/call) may return a reference to one of the thread-local buffers to
+    // the calling computation which will become a dangling reference when the
+    // thread-local is deallocated with the call return.
+    if ((is_thread_local && in_global_set) ||
+        (!is_thread_local && in_thread_local_set)) {
+      return InvalidArgument(
+          "computation %s has conflicting allocation requirements (global "
+          "and thread-local)",
+          computation->name().c_str());
+    }
+
+    if (is_thread_local) {
+      thread_local_set.insert(computation);
+    } else {
+      global_set.insert(computation);
+    }
+
+    for (auto* instruction : computation->instructions()) {
+      for (HloComputation* subcomputation :
+           instruction->called_computations()) {
+        switch (instruction->opcode()) {
+          case HloOpcode::kCall:
+          case HloOpcode::kConditional:
+          case HloOpcode::kWhile:
+            // Call, conditional, and while must be called from a computation
+            // with global allocations as they may return references to buffers
+            // inside the called computation which cannot be thread-local.
+            if (is_thread_local) {
+              return InvalidArgument(
+                  "computation %s cannot contain call/while op because it "
+                  "requires thread-local buffer allocations",
+                  computation->name().c_str());
+            }
+            worklist.push_back(std::make_pair(subcomputation,
+                                              false));  // Not thread local.
+            break;
+          case HloOpcode::kMap:
+          case HloOpcode::kReduce:
+          case HloOpcode::kReduceWindow:
+          case HloOpcode::kSelectAndScatter:
+          case HloOpcode::kFusion:
+            // Map/reduce etc computations are always thread-local.
+            worklist.push_back(std::make_pair(subcomputation,
+                                              true));  // Thread local.
+            break;
+          default:
+            return InternalError(
+                "Unexpected calling opcode: %s",
+                HloOpcodeString(instruction->opcode()).c_str());
+        }
+      }
+    }
+  }
+
+  // Add the computations to the vectors in post order.
+  for (auto* computation : module->MakeComputationPostOrder()) {
+    if (thread_local_set.count(computation) > 0) {
+      thread_local_computations->push_back(computation);
+    } else if (global_set.count(computation) > 0) {
+      global_computations->push_back(computation);
+    }
+    // If the computation is not reachable from the entry computation, then it
+    // will not appear in either thread_local_set or global_set. We don't bother
+    // assigning buffers for these.
+  }
+  return Status::OK();
+}
+
+// Checks that points-to set of 'instruction' is unambiguous and distinct
+// (ensured by CopyInsertion), then adds the buffer from the points-to set at
+// 'index' to 'colocated_set'.
+const LogicalBuffer* AddBufferToColocatedSet(
+    const HloInstruction* instruction, const ShapeIndex& index,
+    const TuplePointsToAnalysis& points_to_analysis,
+    std::vector<const LogicalBuffer*>* colocated_set) {
+  // CopyInsertion ensures root points-to set is unambiguous and distinct.
+  const auto& points_to = points_to_analysis.GetPointsToSet(instruction);
+  DCHECK(!points_to.IsAmbiguous());
+  colocated_set->push_back(points_to.element(index)[0]);
+  return colocated_set->back();
+}
+
+// Given the interference map of a graph (the list of interfering node indices
+// for each node), perform graph coloring such that interfering nodes are
+// assigned to different colors. Returns the assigned color of the nodes, where
+// the colors are represented as integer values [0, color_count).
+std::vector<int64> ColorInterferenceGraph(
+    const std::vector<std::vector<int64>>& interference_map) {
+  const int64 node_count = interference_map.size();
+
+  // Sort the nodes such that we assign nodes with more interference first. This
+  // relies on the common heuristic of assigning the most constrained node
+  // first, but it would be good to investigate other ordering heuristics too.
+  std::vector<int64> nodes(node_count);
+  std::iota(nodes.begin(), nodes.end(), 0);
+  std::sort(nodes.begin(), nodes.end(),
+            [&interference_map](const int64 i, const int64 j) {
+              return interference_map[i].size() > interference_map[j].size();
+            });
+
+  const int64 kColorUnassigned = -1;
+  std::vector<int64> assigned_colors(node_count, kColorUnassigned);
+  for (int64 node : nodes) {
+    // Mark the colors that are already assigned to the neighbors.
+    std::vector<bool> available_colors(node_count, true);
+    for (int64 neighbor : interference_map[node]) {
+      int64 color = assigned_colors[neighbor];
+      if (color != kColorUnassigned) {
+        available_colors[color] = false;
+      }
+    }
+
+    // Find the color that is not yet assigned to the neighbors.
+    int64 color = kColorUnassigned;
+    for (color = 0; color < available_colors.size(); ++color) {
+      if (available_colors[color]) {
+        break;
+      }
+    }
+    CHECK_NE(color, kColorUnassigned);
+    assigned_colors[node] = color;
+  }
+  return assigned_colors;
+}
+
+}  // namespace
+
 size_t BufferAllocation::Slice::Hasher::operator()(Slice s) const {
   uint64 h = std::hash<int64>()(s.index());
   h = tensorflow::Hash64Combine(h, std::hash<int64>()(s.offset()));
@@ -523,116 +700,6 @@ BufferAssignmentProto BufferAssignment::ToProto() const {
   return proto;
 }
 
-namespace {
-
-// Walk the call graph of the HLO module and place each computation into either
-// thread_local_computations or global_computations depending upon whether the
-// computation requires thread-local allocations or global allocations. The
-// elements in thread_local_computations and global_computations are in post
-// order (if computation A has an instruction which calls computation B, then A
-// will appear after B in the vector).
-Status GatherComputationsByAllocationType(
-    const HloModule* module,
-    std::vector<const HloComputation*>* thread_local_computations,
-    std::vector<const HloComputation*>* global_computations) {
-  // Create a worklist of computations paired with whether the allocation must
-  // be thread-local.
-  std::deque<std::pair<const HloComputation*, bool>> worklist;
-  worklist.push_back(std::make_pair(module->entry_computation(),
-                                    /*is_thread_local*/ false));
-
-  // Sets for quickly checking membership. Computations are returned in vectors
-  // for stable iteration.
-  FlatSet<const HloComputation*> thread_local_set;
-  FlatSet<const HloComputation*> global_set;
-
-  while (!worklist.empty()) {
-    auto worklist_front = worklist.front();
-    worklist.pop_front();
-    const HloComputation* computation = worklist_front.first;
-    bool is_thread_local = worklist_front.second;
-    bool in_thread_local_set = thread_local_set.count(computation) > 0;
-    bool in_global_set = global_set.count(computation) > 0;
-
-    // If the computation has already been added to the respective set, then
-    // nothing to do.
-    if ((is_thread_local && in_thread_local_set) ||
-        (!is_thread_local && in_global_set)) {
-      continue;
-    }
-
-    // If the computation has already been added to the other set this is an
-    // error condition because the global call to the computation (eg,
-    // while/call) may return a reference to one of the thread-local buffers to
-    // the calling computation which will become a dangling reference when the
-    // thread-local is deallocated with the call return.
-    if ((is_thread_local && in_global_set) ||
-        (!is_thread_local && in_thread_local_set)) {
-      return InvalidArgument(
-          "computation %s has conflicting allocation requirements (global "
-          "and thread-local)",
-          computation->name().c_str());
-    }
-
-    if (is_thread_local) {
-      thread_local_set.insert(computation);
-    } else {
-      global_set.insert(computation);
-    }
-
-    for (auto* instruction : computation->instructions()) {
-      for (HloComputation* subcomputation :
-           instruction->called_computations()) {
-        switch (instruction->opcode()) {
-          case HloOpcode::kCall:
-          case HloOpcode::kConditional:
-          case HloOpcode::kWhile:
-            // Call and while must be called from a computation with global
-            // allocations as they may return references to buffers inside the
-            // called computation which cannot be thread-local.
-            if (is_thread_local) {
-              return InvalidArgument(
-                  "computation %s cannot contain call/while op because it "
-                  "requires thread-local buffer allocations",
-                  computation->name().c_str());
-            }
-            worklist.push_back(std::make_pair(subcomputation,
-                                              false));  // Not thread local.
-            break;
-          case HloOpcode::kMap:
-          case HloOpcode::kReduce:
-          case HloOpcode::kReduceWindow:
-          case HloOpcode::kSelectAndScatter:
-          case HloOpcode::kFusion:
-            // Map/reduce etc computations are always thread-local.
-            worklist.push_back(std::make_pair(subcomputation,
-                                              true));  // Thread local.
-            break;
-          default:
-            return InternalError(
-                "Unexpected calling opcode: %s",
-                HloOpcodeString(instruction->opcode()).c_str());
-        }
-      }
-    }
-  }
-
-  // Add the computations to the vectors in post order.
-  for (auto* computation : module->MakeComputationPostOrder()) {
-    if (thread_local_set.count(computation) > 0) {
-      thread_local_computations->push_back(computation);
-    } else if (global_set.count(computation) > 0) {
-      global_computations->push_back(computation);
-    }
-    // If the computation is not reachable from the entry computation, then it
-    // will not appear in either thread_local_set or global_set. We don't bother
-    // assigning buffers for these.
-  }
-  return Status::OK();
-}
-
-}  // namespace
-
 /* static */
 StatusOr<std::unique_ptr<BufferAssignment>> BufferAssigner::Run(
     const HloModule* module, std::unique_ptr<HloOrdering> hlo_ordering,
@@ -1085,7 +1152,8 @@ void BufferAssigner::AddSetToColocatedBufferSets(
   if (colocated_set.empty()) {
     return;
   }
-
+  VLOG(5) << ColocatedBufferSetsToString(colocated_set,
+                                         "Adding colocated buffer set");
   // Find existing sets that overlap with at least one buffer from the
   // colocated_set. The resulting 'overlap_set_indices' will have at most
   // colocated_buffer_sets->size() entries, and will be in increasing order.
@@ -1093,6 +1161,10 @@ void BufferAssigner::AddSetToColocatedBufferSets(
   for (size_t index = 0; index < colocated_buffer_sets->size(); ++index) {
     for (const LogicalBuffer* buffer : colocated_set) {
       if ((*colocated_buffer_sets)[index].count(buffer) > 0) {
+        VLOG(5) << "Found overlap with existing set on buffer "
+                << buffer->ToString() << "\n"
+                << ColocatedBufferSetsToString((*colocated_buffer_sets)[index],
+                                               "Overlapping set");
         overlap_set_indices.push_back(index);
         break;
       }
@@ -1104,6 +1176,7 @@ void BufferAssigner::AddSetToColocatedBufferSets(
     colocated_buffer_sets->emplace_back();
     colocated_buffer_sets->back().insert(colocated_set.begin(),
                                          colocated_set.end());
+    VLOG(5) << "No overlap found, new group created";
     return;
   }
 
@@ -1115,6 +1188,8 @@ void BufferAssigner::AddSetToColocatedBufferSets(
     first->insert(overlap_set.begin(), overlap_set.end());
   }
   first->insert(colocated_set.begin(), colocated_set.end());
+  VLOG(5) << ColocatedBufferSetsToString(
+      *first, "Result of the colocated buffer set merging");
 
   // Remove overlap sets that we just merged. The offset accounts for the fact
   // that as elements are erased, the indices need to be adjusted. Keep in mind
@@ -1125,67 +1200,6 @@ void BufferAssigner::AddSetToColocatedBufferSets(
   }
 }
 
-namespace {
-
-// Checks that points-to set of 'instruction' is unambiguous and distinct
-// (ensured by CopyInsertion), then adds the buffer from the points-to set at
-// 'index' to 'colocated_set'.
-const LogicalBuffer* AddBufferToColocatedSet(
-    const HloInstruction* instruction, const ShapeIndex& index,
-    const TuplePointsToAnalysis& points_to_analysis,
-    std::vector<const LogicalBuffer*>* colocated_set) {
-  // CopyInsertion ensures root points-to set is unambiguous and distinct.
-  const auto& points_to = points_to_analysis.GetPointsToSet(instruction);
-  DCHECK(!points_to.IsAmbiguous());
-  colocated_set->push_back(points_to.element(index)[0]);
-  return colocated_set->back();
-}
-
-// Given the interference map of a graph (the list of interfering node indices
-// for each node), perform graph coloring such that interfering nodes are
-// assigned to different colors. Returns the assigned color of the nodes, where
-// the colors are represented as integer values [0, color_count).
-std::vector<int64> ColorInterferenceGraph(
-    const std::vector<std::vector<int64>>& interference_map) {
-  const int64 node_count = interference_map.size();
-
-  // Sort the nodes such that we assign nodes with more interference first. This
-  // relies on the common heuristic of assigning the most constrained node
-  // first, but it would be good to investigate other ordering heuristics too.
-  std::vector<int64> nodes(node_count);
-  std::iota(nodes.begin(), nodes.end(), 0);
-  std::sort(nodes.begin(), nodes.end(),
-            [&interference_map](const int64 i, const int64 j) {
-              return interference_map[i].size() > interference_map[j].size();
-            });
-
-  const int64 kColorUnassigned = -1;
-  std::vector<int64> assigned_colors(node_count, kColorUnassigned);
-  for (int64 node : nodes) {
-    // Mark the colors that are already assigned to the neighbors.
-    std::vector<bool> available_colors(node_count, true);
-    for (int64 neighbor : interference_map[node]) {
-      int64 color = assigned_colors[neighbor];
-      if (color != kColorUnassigned) {
-        available_colors[color] = false;
-      }
-    }
-
-    // Find the color that is not yet assigned to the neighbors.
-    int64 color = kColorUnassigned;
-    for (color = 0; color < available_colors.size(); ++color) {
-      if (available_colors[color]) {
-        break;
-      }
-    }
-    CHECK_NE(color, kColorUnassigned);
-    assigned_colors[node] = color;
-  }
-  return assigned_colors;
-}
-
-}  // namespace
-
 std::vector<BufferAssigner::ColocatedBufferSet>
 BufferAssigner::MergeColocatedBufferSets(
     const std::vector<ColocatedBufferSet>& colocated_buffer_sets,
diff --git a/tensorflow/compiler/xla/service/compile_only_service.cc b/tensorflow/compiler/xla/service/compile_only_service.cc
index dab73596e1639eed62151197048ee8d29570b20a..6664496ab6c603c35c7dce923fcf94c54d1ce714 100644
--- a/tensorflow/compiler/xla/service/compile_only_service.cc
+++ b/tensorflow/compiler/xla/service/compile_only_service.cc
@@ -72,8 +72,7 @@ CompileOnlyService::CompileAheadOfTime(
     VersionedComputationHandle versioned_handle =
         user_computation->GetVersionedHandle();
 
-    // TODO(b/63773457): Track DebugOptions in AotCompilationOptions.
-    DebugOptions debug_options = legacy_flags::GetDebugOptionsFromFlags();
+    const DebugOptions& debug_options = options.debug_options();
 
     // Dump computation proto state if flag is set.
     const string& directory_path = debug_options.xla_dump_computations_to();
diff --git a/tensorflow/compiler/xla/service/compiler.cc b/tensorflow/compiler/xla/service/compiler.cc
index e2e9d2a0c048fec6c6ffbeef1223ae0e6aef50d1..0392d4af48a040c4a648f7bf9bf21a62ce03a990 100644
--- a/tensorflow/compiler/xla/service/compiler.cc
+++ b/tensorflow/compiler/xla/service/compiler.cc
@@ -86,4 +86,7 @@ Compiler::GetPlatformCompilers() {
   return compilers->at(platform->id()).get();
 }
 
+AotCompilationOptions::AotCompilationOptions()
+    : debug_options_(legacy_flags::GetDebugOptionsFromFlags()) {}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/compiler.h b/tensorflow/compiler/xla/service/compiler.h
index 74fd24edf88d44b2dfdc87556b0af43987e69e08..33e19efc72c6d30ccd7e0b3a13f664a4f42208bf 100644
--- a/tensorflow/compiler/xla/service/compiler.h
+++ b/tensorflow/compiler/xla/service/compiler.h
@@ -79,11 +79,15 @@ class AotCompilationOptions {
     device_allocator_ = device_allocator;
   }
 
+  const DebugOptions& debug_options() const { return debug_options_; }
+  DebugOptions* mutable_debug_options() { return &debug_options_; }
+
  protected:
-  AotCompilationOptions() = default;
+  AotCompilationOptions();
 
  private:
   DeviceMemoryAllocator* device_allocator_ = nullptr;
+  DebugOptions debug_options_;
 };
 
 // Abstract compiler interface that is subclassed for compilation on a
diff --git a/tensorflow/compiler/xla/service/conditional_simplifier.cc b/tensorflow/compiler/xla/service/conditional_simplifier.cc
new file mode 100644
index 0000000000000000000000000000000000000000..f35de080853f7ec986565cb2df1050946ac3f244
--- /dev/null
+++ b/tensorflow/compiler/xla/service/conditional_simplifier.cc
@@ -0,0 +1,106 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/conditional_simplifier.h"
+
+#include <string>
+#include <utility>
+#include <vector>
+
+#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/call_inliner.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/status_macros.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/lib/strings/strcat.h"
+
+namespace xla {
+
+// Tries to replace a conditional with a call operation of the corresponding
+// computation. If the given conditional has a constant predicate, tries to
+// replace it with a call to its true/false computation as appropriate and then
+// inline that computation.
+//
+// Returns true if it made a change to the graph.
+static StatusOr<bool> TryRemoveConditional(HloInstruction* conditional) {
+  CHECK_EQ(conditional->opcode(), HloOpcode::kConditional);
+  // Do not remove conditionals that contain side-effecting instructions or
+  // have control predecessors/successors in either true/false computation.
+  if (!conditional->parent()->IsRemovable(conditional) ||
+      conditional->HasSideEffect()) {
+    VLOG(2) << "Not attempting to remove conditional as it is not removable or "
+               "has side effect: "
+            << conditional->ToShortString();
+    return false;
+  }
+
+  if (conditional->operand(0)->opcode() != HloOpcode::kConstant) {
+    VLOG(2) << "Not attempting to remove conditional as its predicate is not a "
+               "compile-time constant: "
+            << conditional->ToShortString();
+    return false;
+  }
+
+  auto computation = conditional->parent();
+  HloInstruction* call_op;
+  if (conditional->operand(0)->literal().Get<bool>({})) {
+    call_op = computation->AddInstruction(HloInstruction::CreateCall(
+        conditional->shape(), {conditional->mutable_operand(1)},
+        conditional->true_computation()));
+  } else {
+    call_op = computation->AddInstruction(HloInstruction::CreateCall(
+        conditional->shape(), {conditional->mutable_operand(2)},
+        conditional->false_computation()));
+  }
+
+  TF_RETURN_IF_ERROR(computation->ReplaceInstruction(conditional, call_op));
+  TF_RETURN_IF_ERROR(CallInliner::Inline(call_op).status());
+
+  return true;
+}
+
+StatusOr<bool> ConditionalSimplifier::Run(HloModule* module) {
+  XLA_VLOG_LINES(
+      3, "ConditionalSimplifier::Run(), before:\n" + module->ToString());
+  bool changed = false;
+
+  // Gather all the conditional ops in our module. We do this ahead of time so
+  // we don't have to worry about mutating the lists of computations or
+  // instructions as we iterate.
+  std::vector<HloInstruction*> conditional_ops;
+  for (auto* comp : module->computations()) {
+    for (auto* instr : comp->instructions()) {
+      if (instr->opcode() == HloOpcode::kConditional) {
+        conditional_ops.push_back(instr);
+      }
+    }
+  }
+
+  for (HloInstruction* conditional_op : conditional_ops) {
+    TF_ASSIGN_OR_RETURN(bool result, TryRemoveConditional(conditional_op));
+    changed |= result;
+  }
+
+  XLA_VLOG_LINES(3,
+                 "ConditionalSimplifier::Run(), after:\n" + module->ToString());
+  return changed;
+}
+
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/conditional_simplifier.h b/tensorflow/compiler/xla/service/conditional_simplifier.h
new file mode 100644
index 0000000000000000000000000000000000000000..063261e26d06e21a297e8e3c405898a17221b7ca
--- /dev/null
+++ b/tensorflow/compiler/xla/service/conditional_simplifier.h
@@ -0,0 +1,38 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_CONDITIONAL_SIMPLIFIER_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_CONDITIONAL_SIMPLIFIER_H_
+
+#include "tensorflow/compiler/xla/service/hlo_module.h"
+#include "tensorflow/compiler/xla/service/hlo_pass_interface.h"
+#include "tensorflow/compiler/xla/statusor.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+
+namespace xla {
+
+// HLO pass that removes kConditionals with a constant predicate, replacing them
+// with their true or false computation as appropriate.
+class ConditionalSimplifier : public HloPassInterface {
+ public:
+  tensorflow::StringPiece name() const override {
+    return "simplify-conditional";
+  }
+  StatusOr<bool> Run(HloModule* module) override;
+};
+
+}  // namespace xla
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_CONDITIONAL_SIMPLIFIER_H_
diff --git a/tensorflow/compiler/xla/service/conditional_simplifier_test.cc b/tensorflow/compiler/xla/service/conditional_simplifier_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..868348547d9f5cbdc7576c7fc0697d72c3a3e557
--- /dev/null
+++ b/tensorflow/compiler/xla/service/conditional_simplifier_test.cc
@@ -0,0 +1,153 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/compiler/xla/service/conditional_simplifier.h"
+
+#include <string>
+#include <utility>
+
+#include "tensorflow/compiler/xla/literal_util.h"
+#include "tensorflow/compiler/xla/service/hlo_computation.h"
+#include "tensorflow/compiler/xla/service/hlo_instruction.h"
+#include "tensorflow/compiler/xla/service/hlo_matchers.h"
+#include "tensorflow/compiler/xla/service/hlo_opcode.h"
+#include "tensorflow/compiler/xla/shape_util.h"
+#include "tensorflow/compiler/xla/test.h"
+#include "tensorflow/compiler/xla/tests/hlo_verified_test_base.h"
+#include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace xla {
+namespace {
+
+namespace op = xla::testing::opcode_matchers;
+
+class ConditionalSimplifierTest : public HloVerifiedTestBase {
+ public:
+  // Makes a computation that contains a conditional with constant predicate.
+  HloComputation* MakeConditional(HloModule* module);
+};
+
+HloComputation* ConditionalSimplifierTest::MakeConditional(HloModule* module) {
+  HloComputation::Builder builder(TestName());
+
+  // true_computation returns param+1.
+  HloComputation* true_computation;
+  {
+    HloComputation::Builder true_computation_builder(TestName() +
+                                                     ".true_computation");
+    auto param =
+        true_computation_builder.AddInstruction(HloInstruction::CreateParameter(
+            0, ShapeUtil::MakeShape(S32, {}), "param"));
+    auto one = true_computation_builder.AddInstruction(
+        HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+
+    true_computation_builder.AddInstruction(HloInstruction::CreateBinary(
+        ShapeUtil::MakeShape(S32, {}), HloOpcode::kAdd, param, one));
+
+    true_computation =
+        module->AddEmbeddedComputation(true_computation_builder.Build());
+  }
+
+  // false_computation returns param+42.
+  HloComputation* false_computation;
+  {
+    HloComputation::Builder false_computation_builder(TestName() +
+                                                      ".false_computation");
+    auto param = false_computation_builder.AddInstruction(
+        HloInstruction::CreateParameter(0, ShapeUtil::MakeShape(S32, {}),
+                                        "param"));
+    auto forty_two = false_computation_builder.AddInstruction(
+        HloInstruction::CreateConstant(Literal::CreateR0<int32>(42)));
+
+    false_computation_builder.AddInstruction(HloInstruction::CreateBinary(
+        ShapeUtil::MakeShape(S32, {}), HloOpcode::kAdd, param, forty_two));
+    false_computation =
+        module->AddEmbeddedComputation(false_computation_builder.Build());
+  }
+
+  auto false_instrn = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<bool>(false)));
+  auto false_param = builder.AddInstruction(HloInstruction::CreateParameter(
+      0, ShapeUtil::MakeShape(S32, {}), "false_param"));
+  auto one = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<int32>(1)));
+
+  builder.AddInstruction(HloInstruction::CreateConditional(
+      ShapeUtil::MakeShape(S32, {}), false_instrn, one, true_computation,
+      false_param, false_computation));
+
+  return module->AddEntryComputation(builder.Build());
+}
+
+TEST_F(ConditionalSimplifierTest, ConditionalGetsInlined) {
+  HloComputation* computation = MakeConditional(&module());
+  ASSERT_TRUE(ConditionalSimplifier().Run(&module()).ValueOrDie());
+  EXPECT_THAT(computation->root_instruction(),
+              op::Add(op::Parameter(), op::Constant()));
+}
+
+TEST_F(ConditionalSimplifierTest, ConditionalWithControlDependency) {
+  HloComputation* computation = MakeConditional(&module());
+
+  auto* true_op = computation->AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR0<bool>(true)));
+  TF_ASSERT_OK(
+      true_op->AddControlDependencyTo(computation->root_instruction()));
+
+  EXPECT_FALSE(ConditionalSimplifier().Run(&module()).ValueOrDie());
+}
+
+TEST_F(ConditionalSimplifierTest, NotRemovedIfContainsSend) {
+  HloComputation* computation = MakeConditional(&module());
+  auto* conditional = computation->root_instruction();
+  ASSERT_EQ(conditional->opcode(), HloOpcode::kConditional);
+
+  auto* true_computation = conditional->true_computation();
+  auto* send = true_computation->AddInstruction(HloInstruction::CreateSend(
+      true_computation->AddInstruction(
+          HloInstruction::CreateConstant(Literal::CreateR0<bool>(true))),
+      /*channel_id=*/0));
+  true_computation->AddInstruction(HloInstruction::CreateSendDone(send));
+  EXPECT_FALSE(ConditionalSimplifier().Run(&module()).ValueOrDie());
+}
+
+TEST_F(ConditionalSimplifierTest, NotRemovedIfContainsRecv) {
+  HloComputation* computation = MakeConditional(&module());
+  auto* conditional = computation->root_instruction();
+  ASSERT_EQ(conditional->opcode(), HloOpcode::kConditional);
+
+  auto* true_computation = conditional->true_computation();
+  auto* recv = true_computation->AddInstruction(HloInstruction::CreateRecv(
+      ShapeUtil::MakeShape(F32, {1}), /*channel_id=*/0));
+  true_computation->AddInstruction(HloInstruction::CreateRecvDone(recv));
+  EXPECT_FALSE(ConditionalSimplifier().Run(&module()).ValueOrDie());
+}
+
+TEST_F(ConditionalSimplifierTest, NotRemovedIfContainsNonRemovableInstruction) {
+  HloComputation* computation = MakeConditional(&module());
+  auto* conditional = computation->root_instruction();
+  ASSERT_EQ(conditional->opcode(), HloOpcode::kConditional);
+  auto* false_computation = conditional->false_computation();
+  false_computation->AddInstruction(
+      HloInstruction::CreateInfeed(ShapeUtil::MakeShape(F32, {1}), "config"));
+  EXPECT_FALSE(ConditionalSimplifier().Run(&module()).ValueOrDie());
+}
+
+}  // namespace
+}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/copy_insertion.cc b/tensorflow/compiler/xla/service/copy_insertion.cc
index cd983bc03e993caed883916de01d75dffdbc4bab..df73c285971e237b6f5492f8a7c587f23646ec1e 100644
--- a/tensorflow/compiler/xla/service/copy_insertion.cc
+++ b/tensorflow/compiler/xla/service/copy_insertion.cc
@@ -58,6 +58,45 @@ bool ValueIsReadOnly(const HloValue& value) {
   return IsConstantValue(value) || IsEntryParameterValue(value);
 }
 
+// Data structure describing the action which should be taken on parts of a
+// computation's buffers, with respect to adding special-case copies.
+struct SpecialCaseCopyPolicy {
+  // Insert a copy if the same buffer is found at multiple indices within the
+  // output tuple.
+  bool copy_root_replicated_buffers = false;
+  // If true, insert a copy if a buffer coming from a constant or a parameter
+  // is found within the output tuple.
+  bool copy_parameters_and_constants = false;
+};
+
+SpecialCaseCopyPolicy GetSpecialCaseCopyPolicy(const CallGraphNode& node,
+                                               HloModule* module,
+                                               HloComputation* computation) {
+  SpecialCaseCopyPolicy policy;
+  if (computation == module->entry_computation()) {
+    policy.copy_parameters_and_constants = true;
+    policy.copy_root_replicated_buffers = true;
+  }
+  for (const CallSite& site : node.caller_callsites()) {
+    // The kWhile instruction has no special handling here, as the
+    // AddCopiesForWhile() API takes care of adding its own copies.
+    if (site.instruction()->opcode() == HloOpcode::kConditional) {
+      policy.copy_parameters_and_constants = true;
+      policy.copy_root_replicated_buffers = true;
+    }
+  }
+  return policy;
+}
+
+bool ShouldCopyRootValue(const HloValue& value,
+                         const SpecialCaseCopyPolicy& policy) {
+  if (policy.copy_parameters_and_constants) {
+    return IsConstantValue(value) ||
+           value.defining_instruction()->opcode() == HloOpcode::kParameter;
+  }
+  return false;
+}
+
 // Deep copy the given instructions 'from' and 'to' at the ShapeIndexes given in
 // 'indices_to_copy'. Add control edges from the respective kCopy instructions
 // in deep copy of 'from' to the respective kCopy instruction in the deep copy
@@ -729,7 +768,8 @@ class CopyRemover {
       // has a different operand (the operand of the elided copy).
       for (const HloUse* copy_use : copy_value_node->uses) {
         operand_node->uses.push_back(copy_use);
-        if (copy_use->instruction->opcode() == HloOpcode::kCopy) {
+        if (copy_use->instruction->opcode() == HloOpcode::kCopy &&
+            ContainsKey(copy_map_, copy_use->instruction)) {
           copy_map_.at(copy_use->instruction).src = operand_node;
         }
       }
@@ -956,7 +996,8 @@ Status AddSpecialCaseCopies(const CallGraph& call_graph, HloModule* module) {
     }
     TF_RET_CHECK(node.context() == CallContext::kSequential);
 
-    const bool is_entry = computation == module->entry_computation();
+    SpecialCaseCopyPolicy policy =
+        GetSpecialCaseCopyPolicy(node, module, computation);
     HloInstruction* root = computation->root_instruction();
 
     // Mark nondistinct/ambiguous indices.
@@ -969,27 +1010,26 @@ Status AddSpecialCaseCopies(const CallGraph& call_graph, HloModule* module) {
           for (const HloBuffer* buffer : buffers_at_index) {
             buffer_seen_before |= !seen.insert(buffer).second;
           }
-          if (buffers_at_index.size() > 1 || (buffer_seen_before && is_entry)) {
-            VLOG(2) << "Index " << index << " of root of computation "
+          if (buffers_at_index.size() > 1 ||
+              (buffer_seen_before && policy.copy_root_replicated_buffers)) {
+            VLOG(2) << "Index " << index << " of computation "
                     << computation->name() << " (" << root->name()
                     << ") has ambiguous or non-distinct buffer. Copying.";
             add_index_to_copy(root, index);
           }
         });
 
-    // For entry instructions, mark any parameter or constant values.
-    if (is_entry) {
-      for (const auto& pair :
-           alias_analysis->dataflow_analysis().GetInstructionValueSet(root)) {
-        const ShapeIndex& index = pair.first;
-        const HloValueSet& value_set = pair.second;
-        for (const HloValue* value : value_set.values()) {
-          if (ValueIsReadOnly(*value)) {
-            VLOG(2) << "Root of entry computation (" << root->name()
-                    << ") has constant or entry parameter value at index "
-                    << index << ". Copying.";
-            add_index_to_copy(root, index);
-          }
+    for (const auto& pair :
+         alias_analysis->dataflow_analysis().GetInstructionValueSet(root)) {
+      const ShapeIndex& index = pair.first;
+      const HloValueSet& value_set = pair.second;
+      for (const HloValue* value : value_set.values()) {
+        if (ShouldCopyRootValue(*value, policy)) {
+          VLOG(2) << "Root (" << root->name() << ") of computation ("
+                  << computation->name()
+                  << ") has constant or parameter value at index " << index
+                  << ". Copying.";
+          add_index_to_copy(root, index);
         }
       }
     }
@@ -1011,7 +1051,6 @@ Status AddSpecialCaseCopies(const CallGraph& call_graph, HloModule* module) {
       instruction->parent()->set_root_instruction(deep_copy);
     }
   }
-
   return Status::OK();
 }
 
@@ -1155,7 +1194,7 @@ bool IsWhileBody(const HloComputation* computation,
     HloModule* module) {
   std::unique_ptr<CallGraph> call_graph = CallGraph::Build(module);
   TF_ASSIGN_OR_RETURN(std::unique_ptr<HloDataflowAnalysis> dataflow,
-                      HloDataflowAnalysis::Run(module));
+                      HloDataflowAnalysis::Run(*module));
 
   bool changed = false;
 
diff --git a/tensorflow/compiler/xla/service/cpu/BUILD b/tensorflow/compiler/xla/service/cpu/BUILD
index c13a0b1cdf0b5be0b69db98b2b9587f30ca4c304..38a54fcb644f26355916abcff0ec18f7094856dc 100644
--- a/tensorflow/compiler/xla/service/cpu/BUILD
+++ b/tensorflow/compiler/xla/service/cpu/BUILD
@@ -105,6 +105,7 @@ cc_library(
         "//tensorflow/compiler/xla/service:buffer_assignment",
         "//tensorflow/compiler/xla/service:buffer_liveness",
         "//tensorflow/compiler/xla/service:call_inliner",
+        "//tensorflow/compiler/xla/service:conditional_simplifier",
         "//tensorflow/compiler/xla/service:dot_decomposer",
         "//tensorflow/compiler/xla/service:executable",
         "//tensorflow/compiler/xla/service:flatten_call_graph",
@@ -163,6 +164,7 @@ cc_library(
         ":disassembler",
         ":external_constant_pool",
         ":orc_jit_memory_mapper",
+        ":runtime_fp16",
         ":runtime_conv2d",
         ":runtime_fft",
         ":runtime_fork_join",
@@ -182,6 +184,20 @@ cc_library(
     ] + ORC_JIT_MEMORY_MAPPER_TARGETS,
 )
 
+cc_library(
+    name = "runtime_fp16",
+    srcs = [
+        "runtime_fp16.cc",
+    ],
+    hdrs = [
+        "runtime_fp16.h",
+    ],
+    copts = runtime_copts(),
+    deps = [
+        "//tensorflow/core:framework_lite",
+    ],
+)
+
 cc_library(
     name = "cpu_executable",
     srcs = ["cpu_executable.cc"],
@@ -499,7 +515,6 @@ cc_library(
 
 cc_library(
     name = "runtime_matvec",
-    srcs = ["runtime_matvec.cc"],
     hdrs = ["runtime_matvec.h"],
     copts = runtime_copts(),
     deps = [
diff --git a/tensorflow/compiler/xla/service/cpu/compiler_functor.cc b/tensorflow/compiler/xla/service/cpu/compiler_functor.cc
index ed290fcdf8bb69f1bbad57fa5a0926376bc9405a..61b2da7a7dce7f6fba46a23cc8e5462a3899a18c 100644
--- a/tensorflow/compiler/xla/service/cpu/compiler_functor.cc
+++ b/tensorflow/compiler/xla/service/cpu/compiler_functor.cc
@@ -93,8 +93,8 @@ class FilteredPassManager : public llvm::legacy::PassManager {
 };
 }  // anonymous namespace
 
-llvm::object::OwningBinary<llvm::object::ObjectFile> CompilerFunctor::
-operator()(llvm::Module& module) const {
+std::unique_ptr<llvm::MemoryBuffer> CompilerFunctor::operator()(
+    llvm::Module& module) const {
   FilteredPassManager module_passes(disable_expensive_passes_);
   FilteredFunctionPassManager function_passes(&module,
                                               disable_expensive_passes_);
@@ -157,27 +157,8 @@ operator()(llvm::Module& module) const {
   codegen_passes.run(module);
 
   // Construct ObjectFile from machine code buffer.
-  std::unique_ptr<llvm::MemoryBuffer> memory_buffer(
+  return std::unique_ptr<llvm::MemoryBuffer>(
       new llvm::ObjectMemoryBuffer(std::move(stream_buffer)));
-  llvm::Expected<std::unique_ptr<llvm::object::ObjectFile>>
-      object_file_or_error = llvm::object::ObjectFile::createObjectFile(
-          memory_buffer->getMemBufferRef());
-  CHECK(object_file_or_error);
-
-  std::unique_ptr<llvm::object::ObjectFile> object_file =
-      std::move(object_file_or_error.get());
-  if (VLOG_IS_ON(2)) {
-    StatusOr<DisassemblerResult> disassembly_status =
-        disassembler_->DisassembleObjectFile(*object_file);
-    if (disassembly_status.ok()) {
-      auto result = disassembly_status.ValueOrDie();
-      XLA_VLOG_LINES(2, result.text);
-      VLOG(2) << "compiled code size: " << result.code_size_bytes << " bytes";
-    }
-  }
-
-  return llvm::object::OwningBinary<llvm::object::ObjectFile>(
-      std::move(object_file), std::move(memory_buffer));
 }
 
 static std::vector<llvm::VecDesc> VectorFunctionsForTargetLibraryInfoImpl() {
diff --git a/tensorflow/compiler/xla/service/cpu/compiler_functor.h b/tensorflow/compiler/xla/service/cpu/compiler_functor.h
index 1a8283a702223a7414c1ffcd99c1ac42c04ac068..c38b896c5019b48fd2a16a51abd59e12ebdb29eb 100644
--- a/tensorflow/compiler/xla/service/cpu/compiler_functor.h
+++ b/tensorflow/compiler/xla/service/cpu/compiler_functor.h
@@ -47,7 +47,7 @@ class CompilerFunctor {
         post_optimization_hook_(post_optimization_hook) {}
 
   // Compile a Module to an ObjectFile.
-  llvm::object::OwningBinary<llvm::object::ObjectFile> operator()(
+  std::unique_ptr<llvm::MemoryBuffer> operator()(
       llvm::Module& module) const;  // NOLINT
 
  private:
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc b/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
index f9cc9651846cca7bd6ab7e9e61590cec4e2400da..0d15be5a23ec6e3ced551d6f9d05078d17b2612c 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_compiler.cc
@@ -47,6 +47,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/buffer_assignment.h"
 #include "tensorflow/compiler/xla/service/buffer_liveness.h"
 #include "tensorflow/compiler/xla/service/call_inliner.h"
+#include "tensorflow/compiler/xla/service/conditional_simplifier.h"
 #include "tensorflow/compiler/xla/service/cpu/compiler_functor.h"
 #include "tensorflow/compiler/xla/service/cpu/conv_canonicalization.h"
 #include "tensorflow/compiler/xla/service/cpu/cpu_copy_insertion.h"
@@ -275,6 +276,7 @@ Status CpuCompiler::RunHloPasses(HloModule* module, bool is_aot_compile) {
     pass.AddPass<HloDCE>();
     pass.AddPass<ReshapeMover>();
     pass.AddPass<HloConstantFolding>();
+    pass.AddPass<ConditionalSimplifier>();
   }
   pipeline.AddPass<TransposeFolding>(
       [](const HloInstruction& dot,
@@ -889,11 +891,10 @@ CpuCompiler::CompileAheadOfTime(std::vector<std::unique_ptr<HloModule>> modules,
         module->config().debug_options().xla_enable_fast_math(),
         module->config().debug_options().xla_llvm_disable_expensive_passes(),
         pre_optimization_ir_dump_hook, post_optimization_ir_dump_hook);
-    llvm::object::OwningBinary<llvm::object::ObjectFile> object_file =
+    std::unique_ptr<llvm::MemoryBuffer> object_file =
         compiler_functor(llvm_module);
-    llvm::StringRef object_file_data_ref = object_file.getBinary()->getData();
-    ObjectFileData object_file_data(object_file_data_ref.begin(),
-                                    object_file_data_ref.end());
+    ObjectFileData object_file_data(object_file->getBufferStart(),
+                                    object_file->getBufferEnd());
 
     BufferSizes buffer_sizes;
     for (const BufferAllocation& allocation : assignment->Allocations()) {
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_executable.cc b/tensorflow/compiler/xla/service/cpu/cpu_executable.cc
index 802d0a6fb46890b31d14b1fbf3b2e7d6520caccc..c053703c3524a47ee1de9681c1b986edbf109430 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_executable.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_executable.cc
@@ -63,7 +63,7 @@ CpuExecutable::CpuExecutable(
       assignment_(std::move(assignment)) {
   // Resolve symbols in the constructor rather than at execution time to avoid
   // races because FindSymbol is not thread safe.
-  llvm::JITSymbol sym = jit_->FindSymbol(entry_function_name);
+  llvm::JITSymbol sym = jit_->FindCompiledSymbol(entry_function_name);
   // We expect to find the symbol provided with entry_function_name; otherwise
   // this is an internal error.
   CHECK(sym) << "Symbol " << entry_function_name << " not found.";
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion.cc b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion.cc
index 482e04052d5a914eab0e5bff2c7a83f3b698052f..0fc5a746bbbc7685ff5d4647111a750e7d7b1c19 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion.cc
@@ -30,7 +30,6 @@ bool CanBeLoopFused(const HloInstruction& hlo) {
   // These are the only ones we fuse since we rely on effective elemental IR
   // generation.
   return hlo.IsElementwise() ||  //
-         hlo.opcode() == HloOpcode::kBitcast ||
          hlo.opcode() == HloOpcode::kBroadcast ||
          hlo.opcode() == HloOpcode::kConcatenate ||
          hlo.opcode() == HloOpcode::kDynamicSlice ||
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
index 595c3f55b321f47e2312b93e0c238c7637495d77..6ed1cd31b18f6360bdd7fd41bd5be2e657b310a5 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_instruction_fusion_test.cc
@@ -77,7 +77,7 @@ TEST_F(InstructionFusionTest, DotOperationFusion_Basic_1) {
   EXPECT_THAT(computation->root_instruction(), op::Fusion());
 }
 
-TEST_F(InstructionFusionTest, DotOperationFusion_Bitcast) {
+TEST_F(InstructionFusionTest, DotOperationNoFusion_Bitcast) {
   HloComputation::Builder builder(TestName());
   HloInstruction* arg0 = builder.AddInstruction(HloInstruction::CreateParameter(
       0, ShapeUtil::MakeShape(F32, {2, 512, 2, 128}), "arg0"));
@@ -94,8 +94,7 @@ TEST_F(InstructionFusionTest, DotOperationFusion_Bitcast) {
   auto module = CreateNewModule();
   auto computation = module->AddEntryComputation(builder.Build());
   EXPECT_EQ(dot, computation->root_instruction());
-  EXPECT_TRUE(CpuInstructionFusion().Run(module.get()).ValueOrDie());
-  EXPECT_THAT(computation->root_instruction(), op::Fusion());
+  EXPECT_FALSE(CpuInstructionFusion().Run(module.get()).ValueOrDie());
 }
 
 TEST_F(InstructionFusionTest, DotOperationFusion_Reshape) {
@@ -244,35 +243,33 @@ class OpcodeFusionTest : public InstructionFusionTest {
   }
 };
 
-TEST_F(OpcodeFusionTest, Exponential_Bitcast_Negate) {
+TEST_F(OpcodeFusionTest, Exponential_Reshape_Negate) {
   HloComputation::Builder builder(TestName());
   Shape param_shape = ShapeUtil::MakeShape(F32, {1, 4});
   Shape result_shape = ShapeUtil::MakeShape(F32, {4});
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, param_shape, "param"));
-  // InstructionFusion::ShouldFuse() precludes fusing a bitcast whose operand
-  // is a parameter, so create an operand between the parameter and bitcast.
   HloInstruction* exp1 = builder.AddInstruction(
       HloInstruction::CreateUnary(param_shape, HloOpcode::kExp, param0));
-  HloInstruction* bitcast2 = builder.AddInstruction(
-      HloInstruction::CreateUnary(result_shape, HloOpcode::kBitcast, exp1));
+  HloInstruction* reshape2 =
+      builder.AddInstruction(HloInstruction::CreateReshape(result_shape, exp1));
   builder.AddInstruction(
-      HloInstruction::CreateUnary(result_shape, HloOpcode::kNegate, bitcast2));
+      HloInstruction::CreateUnary(result_shape, HloOpcode::kNegate, reshape2));
 
   auto module = CreateNewModule();
   module->AddEntryComputation(builder.Build());
 
   RunFusionAndCheckOpcodesWereFused(
-      module.get(), {HloOpcode::kNegate, HloOpcode::kBitcast, HloOpcode::kExp,
+      module.get(), {HloOpcode::kNegate, HloOpcode::kReshape, HloOpcode::kExp,
                      HloOpcode::kParameter});
 }
 
-TEST_F(OpcodeFusionTest, Broadcast_Bitcast_DynamicSlice_Tanh) {
+TEST_F(OpcodeFusionTest, Broadcast_Reshape_DynamicSlice_Tanh) {
   HloComputation::Builder builder(TestName());
   Shape param_shape = ShapeUtil::MakeShape(F32, {8});
   Shape starts_shape = ShapeUtil::MakeShape(F32, {2});
   Shape broadcast_shape = ShapeUtil::MakeShape(F32, {1, 8, 8});
-  Shape bitcast_shape = ShapeUtil::MakeShape(F32, {8, 8});
+  Shape reshape_shape = ShapeUtil::MakeShape(F32, {8, 8});
   Shape dynamic_slice_shape = ShapeUtil::MakeShape(F32, {4, 4});
   HloInstruction* param0 = builder.AddInstruction(
       HloInstruction::CreateParameter(0, param_shape, "param"));
@@ -280,11 +277,11 @@ TEST_F(OpcodeFusionTest, Broadcast_Bitcast_DynamicSlice_Tanh) {
       HloInstruction::CreateParameter(1, starts_shape, "starts"));
   HloInstruction* broadcast2 = builder.AddInstruction(
       HloInstruction::CreateBroadcast(broadcast_shape, param0, {1}));
-  HloInstruction* bitcast3 = builder.AddInstruction(HloInstruction::CreateUnary(
-      bitcast_shape, HloOpcode::kBitcast, broadcast2));
+  HloInstruction* reshape3 = builder.AddInstruction(
+      HloInstruction::CreateReshape(reshape_shape, broadcast2));
   HloInstruction* dynamic_slice4 =
       builder.AddInstruction(HloInstruction::CreateDynamicSlice(
-          dynamic_slice_shape, bitcast3, param1, {4, 4}));
+          dynamic_slice_shape, reshape3, param1, {4, 4}));
   builder.AddInstruction(HloInstruction::CreateUnary(
       dynamic_slice_shape, HloOpcode::kTanh, dynamic_slice4));
 
@@ -293,7 +290,7 @@ TEST_F(OpcodeFusionTest, Broadcast_Bitcast_DynamicSlice_Tanh) {
 
   RunFusionAndCheckOpcodesWereFused(
       module.get(),
-      {HloOpcode::kTanh, HloOpcode::kDynamicSlice, HloOpcode::kBitcast,
+      {HloOpcode::kTanh, HloOpcode::kDynamicSlice, HloOpcode::kReshape,
        HloOpcode::kBroadcast, HloOpcode::kParameter, HloOpcode::kParameter});
 }
 
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc b/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc
index 40ace963270e8cead47cc731cc326351178dff7d..9a3bd68c80c6e8bcdb231c63ba025d1f73619eb7 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc
+++ b/tensorflow/compiler/xla/service/cpu/cpu_runtime.cc
@@ -31,6 +31,8 @@ XfeedManager* GetXfeedManager() {
   return manager;
 }
 
+extern const char* const kEigenMatMulF16SymbolName =
+    "__xla_cpu_runtime_EigenMatMulF16";
 extern const char* const kEigenMatMulF32SymbolName =
     "__xla_cpu_runtime_EigenMatMulF32";
 extern const char* const kEigenMatMulF64SymbolName =
@@ -40,6 +42,8 @@ extern const char* const kEigenConvF16SymbolName =
 extern const char* const kEigenConvF32SymbolName =
     "__xla_cpu_runtime_EigenConvF32";
 extern const char* const kEigenFftSymbolName = "__xla_cpu_runtime_EigenFft";
+extern const char* const kEigenSingleThreadedMatMulF16SymbolName =
+    "__xla_cpu_runtime_EigenSingleThreadedMatMulF16";
 extern const char* const kEigenSingleThreadedMatMulF32SymbolName =
     "__xla_cpu_runtime_EigenSingleThreadedMatMulF32";
 extern const char* const kEigenSingleThreadedMatMulF64SymbolName =
diff --git a/tensorflow/compiler/xla/service/cpu/cpu_runtime.h b/tensorflow/compiler/xla/service/cpu/cpu_runtime.h
index 2141dfe1cedd6f9674acc348152574b4fd30895b..e61d6ea28b633398863357541e056ee887582f9c 100644
--- a/tensorflow/compiler/xla/service/cpu/cpu_runtime.h
+++ b/tensorflow/compiler/xla/service/cpu/cpu_runtime.h
@@ -41,11 +41,13 @@ namespace runtime {
 //    the actual symbol.
 // 2. When using ahead-of-time compilation, the linker can resolve the name
 //    because it is a symbol in the cpu_runtime library.
+extern const char* const kEigenMatMulF16SymbolName;
 extern const char* const kEigenMatMulF32SymbolName;
 extern const char* const kEigenMatMulF64SymbolName;
 extern const char* const kEigenConvF16SymbolName;
 extern const char* const kEigenConvF32SymbolName;
 extern const char* const kEigenFftSymbolName;
+extern const char* const kEigenSingleThreadedMatMulF16SymbolName;
 extern const char* const kEigenSingleThreadedMatMulF32SymbolName;
 extern const char* const kEigenSingleThreadedMatMulF64SymbolName;
 extern const char* const kEigenSingleThreadedConvF16SymbolName;
diff --git a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
index cfe7c9c3af0be109ac8a86753e880e2bcbceba41..6f06256e08e8e3342e77c7c79a2a47465b89eca3 100644
--- a/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
+++ b/tensorflow/compiler/xla/service/cpu/dot_op_emitter.cc
@@ -919,6 +919,12 @@ tensorflow::Status DotOpEmitter::EmitCallToRuntime() {
   llvm::Type* float_type;
   const char* fn_name;
   switch (type) {
+    case F16:
+      fn_name = multi_threaded_eigen
+                    ? runtime::kEigenMatMulF16SymbolName
+                    : runtime::kEigenSingleThreadedMatMulF16SymbolName;
+      float_type = ir_builder_->getHalfTy();
+      break;
     case F32:
       fn_name = multi_threaded_eigen
                     ? runtime::kEigenMatMulF32SymbolName
@@ -1051,7 +1057,8 @@ static bool AreValidGemmShapes(const Shape& lhs_shape, const Shape& rhs_shape,
   // The inputs and the output must
   // 1) be matrices with no padding, and
   // 2) have an allowed element type.
-  return output_shape.element_type() == F32 &&
+  PrimitiveType output_primitive_type = output_shape.element_type();
+  return (output_primitive_type == F32 || output_primitive_type == F16) &&
          IsRank2WithNoPadding(lhs_shape) && IsRank2WithNoPadding(rhs_shape) &&
          IsRank2WithNoPadding(output_shape);
 }
diff --git a/tensorflow/compiler/xla/service/cpu/ir_emitter.cc b/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
index 4dffaee87f6b33933b58c8c58478eec918569197..3b8056d50500cac381a1c5ad6b05028476504a47 100644
--- a/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/cpu/ir_emitter.cc
@@ -2074,7 +2074,7 @@ Status IrEmitter::HandleFusion(HloInstruction* fusion) {
 
     TF_RETURN_IF_ERROR(ElementTypesSameAndSupported(
         /*instruction=*/*root, /*operands=*/{lhs, rhs},
-        /*supported_types=*/{F32}));
+        /*supported_types=*/{F16, F32}));
 
     llvm_ir::IrArray lhs_array(GetIrArrayFor(lhs));
     llvm_ir::IrArray rhs_array(GetIrArrayFor(rhs));
diff --git a/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc b/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
index cd997f07890cdc1d9a546ede58cc1d992b6416ae..07a9f0efcb64db4b2ff0c6518d4b48eee9a505e0 100644
--- a/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
+++ b/tensorflow/compiler/xla/service/cpu/parallel_cpu_executable.cc
@@ -394,7 +394,7 @@ Status ParallelCpuExecutable::ExecuteComputeFunctions(
   for (auto& entry : *function_names_) {
     tensorflow::mutex_lock lock(jit_mutex_);
     HloInstruction* instruction = entry.first;
-    llvm::JITSymbol sym = jit_->FindSymbol(entry.second);
+    llvm::JITSymbol sym = jit_->FindCompiledSymbol(entry.second);
     TF_RET_CHECK(sym);
     InsertOrDie(
         &functions, instruction,
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_fp16.cc b/tensorflow/compiler/xla/service/cpu/runtime_fp16.cc
new file mode 100644
index 0000000000000000000000000000000000000000..af0275c8bd00c82220fbe116eb90d2692393713b
--- /dev/null
+++ b/tensorflow/compiler/xla/service/cpu/runtime_fp16.cc
@@ -0,0 +1,133 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <cstring>
+
+#include "tensorflow/compiler/xla/service/cpu/runtime_fp16.h"
+#include "tensorflow/core/platform/macros.h"
+
+namespace {
+using tensorflow::uint16;
+using tensorflow::uint32;
+
+// Helper class that lets us access the underlying bit representation
+// of a float without breaking C++ strict aliasing.
+class AliasedFloatInt {
+ public:
+  static_assert(sizeof(float) == sizeof(uint32), "");
+
+  static AliasedFloatInt FromFloat(float f) {
+    AliasedFloatInt value;
+    value.set_float(f);
+    return value;
+  }
+
+  static AliasedFloatInt FromUInt(uint32 u) {
+    AliasedFloatInt value;
+    value.set_uint(u);
+    return value;
+  }
+
+  void set_float(float f) { memcpy(&value_, &f, sizeof(f)); }
+  float as_float() const {
+    float f;
+    memcpy(&f, &value_, sizeof(f));
+    return f;
+  }
+
+  void set_uint(uint32 u) { value_ = u; }
+  uint32 as_uint() const { return value_; }
+
+ private:
+  uint32 value_;
+};
+}  // namespace
+
+// __gnu_f2h_ieee and __gnu_h2f_ieee are marked as weak symbols so if XLA is
+// built with compiler-rt (that also defines these symbols) we don't get a
+// duplicate definition linker error.  Making these symbols weak also ensures
+// that the compiler-rt definitions "win", but that isn't essential.
+
+// Algorithm copied from Eigen.
+uint16 TF_ATTRIBUTE_WEAK __gnu_f2h_ieee(float float_value) {
+  AliasedFloatInt f = AliasedFloatInt::FromFloat(float_value);
+
+  const AliasedFloatInt f32infty = AliasedFloatInt::FromUInt(255 << 23);
+  const AliasedFloatInt f16max = AliasedFloatInt::FromUInt((127 + 16) << 23);
+  const AliasedFloatInt denorm_magic =
+      AliasedFloatInt::FromUInt(((127 - 15) + (23 - 10) + 1) << 23);
+  unsigned int sign_mask = 0x80000000u;
+  uint32 o = static_cast<uint16>(0x0u);
+
+  unsigned int sign = f.as_uint() & sign_mask;
+  f.set_uint(f.as_uint() ^ sign);
+
+  // NOTE all the integer compares in this function can be safely
+  // compiled into signed compares since all operands are below
+  // 0x80000000. Important if you want fast straight SSE2 code
+  // (since there's no unsigned PCMPGTD).
+
+  if (f.as_uint() >=
+      f16max.as_uint()) {  // result is Inf or NaN (all exponent bits set)
+    o = (f.as_uint() > f32infty.as_uint()) ? 0x7e00
+                                           : 0x7c00;  // NaN->qNaN and Inf->Inf
+  } else {                            // (De)normalized number or zero
+    if (f.as_uint() < (113 << 23)) {  // resulting FP16 is subnormal or zero
+      // use a magic value to align our 10 mantissa bits at the bottom of
+      // the float. as long as FP addition is round-to-nearest-even this
+      // just works.
+      f.set_float(f.as_float() + denorm_magic.as_float());
+
+      // and one integer subtract of the bias later, we have our final half!
+      o = static_cast<uint16>(f.as_uint() - denorm_magic.as_uint());
+    } else {
+      unsigned int mant_odd =
+          (f.as_uint() >> 13) & 1;  // resulting mantissa is odd
+
+      // update exponent, rounding bias part 1
+      f.set_uint(f.as_uint() + (static_cast<unsigned int>(15 - 127) << 23) +
+                 0xfff);
+      // rounding bias part 2
+      f.set_uint(f.as_uint() + mant_odd);
+      // take the bits!
+      o = static_cast<uint16>(f.as_uint() >> 13);
+    }
+  }
+
+  o |= static_cast<uint16>(sign >> 16);
+  return o;
+}
+
+// Algorithm copied from Eigen.
+float TF_ATTRIBUTE_WEAK __gnu_h2f_ieee(uint16 h) {
+  const AliasedFloatInt magic = AliasedFloatInt::FromUInt(113 << 23);
+  const unsigned int shifted_exp = 0x7c00 << 13;  // exponent mask after shift
+  AliasedFloatInt o;
+
+  o.set_uint((h & 0x7fff) << 13);                // exponent/mantissa bits
+  unsigned int exp = shifted_exp & o.as_uint();  // just the exponent
+  o.set_uint(o.as_uint() + ((127 - 15) << 23));  // exponent adjust
+
+  // handle exponent special cases
+  if (exp == shifted_exp) {                        // Inf/NaN?
+    o.set_uint(o.as_uint() + ((128 - 16) << 23));  // extra exp adjust
+  } else if (exp == 0) {                           // Zero/Denormal?
+    o.set_uint(o.as_uint() + (1 << 23));           // extra exp adjust
+    o.set_float(o.as_float() - magic.as_float());  // renormalize
+  }
+
+  o.set_uint(o.as_uint() | (h & 0x8000) << 16);  // sign bit
+  return o.as_float();
+}
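Review note on runtime_fp16.cc: the two conversion routines are easier to check against known binary16 bit patterns than by reading the bit arithmetic. The following is a minimal standalone sketch of the same arithmetic, not the code in the patch. The names `FloatBits`, `BitsToFloat`, `FloatToHalfBits` and `HalfBitsToFloat` are hypothetical and stand in for `AliasedFloatInt`, `__gnu_f2h_ieee` and `__gnu_h2f_ieee`. It assumes a 32-bit IEEE 754 `float` and the default round-to-nearest-even floating-point mode.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers (not in the patch): read and write the bits of a float
// through memcpy, which avoids strict-aliasing problems.
static std::uint32_t FloatBits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return u;
}

static float BitsToFloat(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// Sketch of the float -> half direction, following the steps in the patch.
std::uint16_t FloatToHalfBits(float value) {
  const std::uint32_t f32infty = 255u << 23;
  const std::uint32_t f16max = (127u + 16u) << 23;
  const std::uint32_t denorm_magic = ((127u - 15u) + (23u - 10u) + 1u) << 23;

  std::uint32_t f = FloatBits(value);
  const std::uint32_t sign = f & 0x80000000u;
  f ^= sign;  // work on the magnitude only

  std::uint32_t o;
  if (f >= f16max) {
    // Too large for half: NaN maps to a quiet NaN, everything else to Inf.
    o = (f > f32infty) ? 0x7e00u : 0x7c00u;
  } else if (f < (113u << 23)) {
    // Result is a half subnormal or zero: the float add does the rounding.
    f = FloatBits(BitsToFloat(f) + BitsToFloat(denorm_magic));
    o = f - denorm_magic;
  } else {
    // Normal range: rebias the exponent and round to nearest even.
    const std::uint32_t mant_odd = (f >> 13) & 1u;
    f += (static_cast<std::uint32_t>(15 - 127) << 23) + 0xfffu;
    f += mant_odd;
    o = f >> 13;
  }
  return static_cast<std::uint16_t>(o | (sign >> 16));
}

// Sketch of the half -> float direction, following the steps in the patch.
float HalfBitsToFloat(std::uint16_t h) {
  const std::uint32_t shifted_exp = 0x7c00u << 13;  // exponent mask after shift
  std::uint32_t o = (h & 0x7fffu) << 13;            // exponent/mantissa bits
  const std::uint32_t exp = o & shifted_exp;
  o += (127u - 15u) << 23;                          // exponent rebias

  if (exp == shifted_exp) {  // Inf/NaN
    o += (128u - 16u) << 23;
  } else if (exp == 0) {     // zero/subnormal: renormalize with a float subtract
    o += 1u << 23;
    o = FloatBits(BitsToFloat(o) - BitsToFloat(113u << 23));
  }
  o |= static_cast<std::uint32_t>(h & 0x8000u) << 16;  // sign bit
  return BitsToFloat(o);
}
```

Useful spot checks are the patterns `0x3c00` (1.0), `0xc000` (-2.0), `0x7bff` (65504, the largest finite half), `0x0001` (the smallest subnormal, 2^-24) and `0x7c00` (infinity). Every finite half is exactly representable as a float, so a half-to-float-to-half round trip should return the original bits.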
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_fp16.h b/tensorflow/compiler/xla/service/cpu/runtime_fp16.h
new file mode 100644
index 0000000000000000000000000000000000000000..01d92d031904af99884c2583a8c7b5086b289d44
--- /dev/null
+++ b/tensorflow/compiler/xla/service/cpu/runtime_fp16.h
@@ -0,0 +1,27 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_FP16_H_
+#define TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_FP16_H_
+
+#include "tensorflow/core/platform/types.h"
+
+// Converts an F32 value to an F16.
+extern "C" tensorflow::uint16 __gnu_f2h_ieee(float);
+
+// Converts an F16 value to an F32.
+extern "C" float __gnu_h2f_ieee(tensorflow::uint16);
+
+#endif  // TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_FP16_H_
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc b/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc
index bff57d33ae23fbba8c664cbd18df77e4c35eb592..39b13183ff093611a42b3931d45f64eadb420622 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc
+++ b/tensorflow/compiler/xla/service/cpu/runtime_matmul.cc
@@ -63,30 +63,41 @@ void MatMul(const void* run_options_ptr, T* out, T* lhs, T* rhs, int64 m,
   C.device(*run_options->intra_op_thread_pool()) = A.contract(B, dims);
 }
 
+template <typename T>
+void MatMulImpl(const void* run_options_ptr, T* out, T* lhs, T* rhs, int64 m,
+                int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
+  if (m == 1 || n == 1) {
+    // Despite being single threaded, this version of matrix * vector is faster.
+    xla::EigenMatVec<T>(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
+  } else {
+    MatMul<T>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
+              transpose_rhs);
+  }
+}
+
 }  // namespace
 
+void __xla_cpu_runtime_EigenMatMulF16(const void* run_options_ptr,
+                                      Eigen::half* out, Eigen::half* lhs,
+                                      Eigen::half* rhs, int64 m, int64 n,
+                                      int64 k, int32 transpose_lhs,
+                                      int32 transpose_rhs) {
+  MatMulImpl<Eigen::half>(run_options_ptr, out, lhs, rhs, m, n, k,
+                          transpose_lhs, transpose_rhs);
+}
+
 void __xla_cpu_runtime_EigenMatMulF32(const void* run_options_ptr, float* out,
                                       float* lhs, float* rhs, int64 m, int64 n,
                                       int64 k, int32 transpose_lhs,
                                       int32 transpose_rhs) {
-  if (m == 1 || n == 1) {
-    // Despite being single threaded, this version of matrix * vector is faster.
-    xla::EigenMatVecF32(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-  } else {
-    MatMul<float>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
-                  transpose_rhs);
-  }
+  MatMulImpl<float>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
+                    transpose_rhs);
 }
 
 void __xla_cpu_runtime_EigenMatMulF64(const void* run_options_ptr, double* out,
                                       double* lhs, double* rhs, int64 m,
                                       int64 n, int64 k, int32 transpose_lhs,
                                       int32 transpose_rhs) {
-  if (m == 1 || n == 1) {
-    // Despite being single threaded, this version of matrix * vector is faster.
-    xla::EigenMatVecF64(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-  } else {
-    MatMul<double>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
-                   transpose_rhs);
-  }
+  MatMulImpl<double>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
+                     transpose_rhs);
 }
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_matmul.h b/tensorflow/compiler/xla/service/cpu/runtime_matmul.h
index fdb644651dd5d0fa0345580f52ed0fb051672285..b5156434f6d568012b8f51ba9b14d64ce418cec7 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_matmul.h
+++ b/tensorflow/compiler/xla/service/cpu/runtime_matmul.h
@@ -25,6 +25,12 @@ extern "C" {
 // order. 'out' is a pointer to a buffer sufficiently large to hold the result
 // of the operation. Following standard nomenclature: lhs is m x k,
 // rhs is k x n, and out is m x n.
+extern void __xla_cpu_runtime_EigenMatMulF16(
+    const void* /* xla::ExecutableRunOptions* */ run_options_ptr,
+    Eigen::half* out, Eigen::half* lhs, Eigen::half* rhs, tensorflow::int64 m,
+    tensorflow::int64 n, tensorflow::int64 k, tensorflow::int32 transpose_lhs,
+    tensorflow::int32 transpose_rhs);
+
 extern void __xla_cpu_runtime_EigenMatMulF32(
     const void* /* xla::ExecutableRunOptions* */ run_options_ptr, float* out,
     float* lhs, float* rhs, tensorflow::int64 m, tensorflow::int64 n,
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_matvec.cc b/tensorflow/compiler/xla/service/cpu/runtime_matvec.cc
deleted file mode 100644
index 435820cdd36e2a906d9dfbe2555f4c0df623c729..0000000000000000000000000000000000000000
--- a/tensorflow/compiler/xla/service/cpu/runtime_matvec.cc
+++ /dev/null
@@ -1,110 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#include <algorithm>
-#include <cassert>
-
-#include "third_party/eigen3/Eigen/Core"
-#include "tensorflow/compiler/xla/service/cpu/runtime_matvec.h"
-
-using tensorflow::int32;
-using tensorflow::int64;
-
-namespace {
-
-// Does mat * x or mat^T * x.
-template <typename T>
-void MatVec(T* out_buf, T* mat_buf, T* x_buf, int64 rows, int64 cols,
-            int32 transpose) {
-  // Use an Eigen Matrix instead of a Tensor, as the GEMV from Matrix seems to
-  // be faster (b/30223679).  See also: the matmul op kernel in TensorFlow,
-  // which implements the same optimization.
-  using Matrix = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>;
-  using MatrixMap = Eigen::Map<Matrix>;
-
-  using Vector = Eigen::Matrix<T, Eigen::Dynamic, 1>;
-  using VectorMap = Eigen::Map<Vector>;
-
-  auto x = VectorMap(x_buf, cols);
-  auto out = VectorMap(out_buf, rows);
-
-  int64 mat_rows = rows;
-  int64 mat_cols = cols;
-
-  if (transpose) {
-    std::swap(mat_rows, mat_cols);
-  }
-
-  auto mat = MatrixMap(mat_buf, mat_rows, mat_cols);
-
-  if (transpose) {
-    out = mat.transpose() * x;
-  } else {
-    out = mat * x;
-  }
-}
-
-// Converts matmul-style args to matvec.
-template <typename T>
-void DispatchMatVec(T* out, T* lhs, T* rhs, int64 m, int64 n, int64 k,
-                    int32 transpose_lhs, int32 transpose_rhs) {
-  // If the input is in the form x * A, where x is the vector, then bring A back
-  // over to the left hand side.  We make use of the identity
-  //
-  //   (x * A)^T = A^T * x^T
-  //
-  // We do not need to take the transpose of x or of the result since taking
-  // the transpose of a vector does not change the memory layout.
-  const int64 cols = k;
-
-  T* mat;
-  T* vec;
-  int64 rows;
-  bool transpose_mat;
-
-  bool is_mat_vec = (n == 1);
-
-  if (is_mat_vec) {
-    mat = lhs;
-    vec = rhs;
-    rows = m;
-    transpose_mat = transpose_lhs;
-  } else {
-    mat = rhs;
-    vec = lhs;
-    rows = n;
-    transpose_mat = !transpose_rhs;
-  }
-
-  MatVec<T>(out, mat, vec, rows, cols, transpose_mat);
-}
-
-}  // namespace
-
-namespace xla {
-
-void EigenMatVecF32(float* out, float* lhs, float* rhs, int64 m, int64 n,
-                    int64 k, int32 transpose_lhs, int32 transpose_rhs) {
-  assert((m == 1 || n == 1) && "not a matrix-vector multiply");
-  DispatchMatVec<float>(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-}
-
-void EigenMatVecF64(double* out, double* lhs, double* rhs, int64 m, int64 n,
-                    int64 k, int32 transpose_lhs, int32 transpose_rhs) {
-  assert((m == 1 || n == 1) && "not a matrix-vector multiply");
-  DispatchMatVec<double>(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-}
-
-}  // namespace xla
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_matvec.h b/tensorflow/compiler/xla/service/cpu/runtime_matvec.h
index 1bd8dfb377acc1f7cfbe9a92773f87f0ef25de3a..70eb98c54169824e220d9287753c0849362eade6 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_matvec.h
+++ b/tensorflow/compiler/xla/service/cpu/runtime_matvec.h
@@ -16,10 +16,86 @@ limitations under the License.
 #ifndef TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_MATVEC_H_
 #define TENSORFLOW_COMPILER_XLA_SERVICE_CPU_RUNTIME_MATVEC_H_
 
+#include "third_party/eigen3/Eigen/Core"
+
 #include "tensorflow/core/platform/types.h"
 
 namespace xla {
 
+namespace detail {
+
+using tensorflow::int32;
+using tensorflow::int64;
+
+// Does mat * x or mat^T * x.
+template <typename T>
+void MatVec(T* out_buf, T* mat_buf, T* x_buf, int64 rows, int64 cols,
+            int32 transpose) {
+  // Use an Eigen Matrix instead of a Tensor, as the GEMV from Matrix seems to
+  // be faster (b/30223679).  See also: the matmul op kernel in TensorFlow,
+  // which implements the same optimization.
+  using Matrix = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>;
+  using MatrixMap = Eigen::Map<Matrix>;
+
+  using Vector = Eigen::Matrix<T, Eigen::Dynamic, 1>;
+  using VectorMap = Eigen::Map<Vector>;
+
+  auto x = VectorMap(x_buf, cols);
+  auto out = VectorMap(out_buf, rows);
+
+  int64 mat_rows = rows;
+  int64 mat_cols = cols;
+
+  if (transpose) {
+    std::swap(mat_rows, mat_cols);
+  }
+
+  auto mat = MatrixMap(mat_buf, mat_rows, mat_cols);
+
+  if (transpose) {
+    out = mat.transpose() * x;
+  } else {
+    out = mat * x;
+  }
+}
+
+// Converts matmul-style args to matvec.
+template <typename T>
+void DispatchMatVec(T* out, T* lhs, T* rhs, int64 m, int64 n, int64 k,
+                    int32 transpose_lhs, int32 transpose_rhs) {
+  // If the input is in the form x * A, where x is the vector, then bring A back
+  // over to the left hand side.  We make use of the identity
+  //
+  //   (x * A)^T = A^T * x^T
+  //
+  // We do not need to take the transpose of x or of the result since taking
+  // the transpose of a vector does not change the memory layout.
+  const int64 cols = k;
+
+  T* mat;
+  T* vec;
+  int64 rows;
+  bool transpose_mat;
+
+  bool is_mat_vec = (n == 1);
+
+  if (is_mat_vec) {
+    mat = lhs;
+    vec = rhs;
+    rows = m;
+    transpose_mat = transpose_lhs;
+  } else {
+    mat = rhs;
+    vec = lhs;
+    rows = n;
+    transpose_mat = !transpose_rhs;
+  }
+
+  MatVec<T>(out, mat, vec, rows, cols, transpose_mat);
+}
+
+}  // namespace detail
+
 // Performs a matrix-vector multiplication using Eigen. 'lhs' and 'rhs' are
 // pointers to buffers containing input matrices in column-major order. 'out' is
 // a pointer to a buffer sufficiently large to hold the result of the
@@ -30,15 +106,15 @@ namespace xla {
 //
 // TODO(b/64684907): Compare runtime performance of these functions with dot
 // simplification.
-void EigenMatVecF32(float* out, float* lhs, float* rhs, tensorflow::int64 m,
-                    tensorflow::int64 n, tensorflow::int64 k,
-                    tensorflow::int32 transpose_lhs,
-                    tensorflow::int32 transpose_rhs);
-
-void EigenMatVecF64(double* out, double* lhs, double* rhs, tensorflow::int64 m,
-                    tensorflow::int64 n, tensorflow::int64 k,
-                    tensorflow::int32 transpose_lhs,
-                    tensorflow::int32 transpose_rhs);
+template <typename T>
+void EigenMatVec(T* out, T* lhs, T* rhs, tensorflow::int64 m,
+                 tensorflow::int64 n, tensorflow::int64 k,
+                 tensorflow::int32 transpose_lhs,
+                 tensorflow::int32 transpose_rhs) {
+  assert((m == 1 || n == 1) && "not a matrix-vector multiply");
+  detail::DispatchMatVec<T>(out, lhs, rhs, m, n, k, transpose_lhs,
+                            transpose_rhs);
+}
 
 }  // namespace xla
 
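Review note on runtime_matvec.h: the dispatch logic relies on the identity (x * A)^T = A^T * x^T, so a row-vector-times-matrix product can reuse the matrix-times-vector kernel. The sketch below illustrates that idea with plain loops instead of Eigen. `MatVecSketch` and `DispatchMatVecSketch` are hypothetical names, the buffers are assumed to be column-major as the header comment states, and the reading of the transpose flags (a set flag means the buffer holds the transposed matrix) is an assumption taken from how `MatVec` swaps its dimensions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Computes mat * x, or mat^T * x when `transpose` is set. `rows` and `cols`
// describe the matrix that is applied to x; `mat` is stored column-major.
std::vector<float> MatVecSketch(const std::vector<float>& mat,
                                const std::vector<float>& x,
                                std::int64_t rows, std::int64_t cols,
                                bool transpose) {
  std::vector<float> out(rows, 0.0f);
  for (std::int64_t r = 0; r < rows; ++r) {
    for (std::int64_t c = 0; c < cols; ++c) {
      // Not transposed: buffer is rows x cols, element (r, c) at c * rows + r.
      // Transposed: buffer is cols x rows, element (c, r) at r * cols + c.
      const float m = transpose ? mat[r * cols + c] : mat[c * rows + r];
      out[r] += m * x[c];
    }
  }
  return out;
}

// Converts matmul-style arguments (lhs is m x k, rhs is k x n) to a matvec.
std::vector<float> DispatchMatVecSketch(const std::vector<float>& lhs,
                                        const std::vector<float>& rhs,
                                        std::int64_t m, std::int64_t n,
                                        std::int64_t k, bool transpose_lhs,
                                        bool transpose_rhs) {
  assert((m == 1 || n == 1) && "not a matrix-vector multiply");
  if (n == 1) {
    // A * x: lhs is the matrix and rhs is the vector.
    return MatVecSketch(lhs, rhs, m, k, transpose_lhs);
  }
  // x * A: compute A^T * x instead. Transposing a vector does not change its
  // memory layout, so only the matrix flag is flipped.
  return MatVecSketch(rhs, lhs, n, k, !transpose_rhs);
}
```

For example, with x = [1, 10] and the 2 x 3 matrix A = [[1, 2, 3], [4, 5, 6]] stored column-major as {1, 4, 2, 5, 3, 6}, the call `DispatchMatVecSketch({1, 10}, {1, 4, 2, 5, 3, 6}, 1, 3, 2, false, false)` takes the x * A branch. The m == 1 && n == 1 case goes through the first branch, as in the patch.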
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc
index ee8eb081556d60fcf6537b1036a4a5825c4c7bf6..17303e2f0d34e531a3a56aa147608b949e0f43ae 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc
+++ b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.cc
@@ -57,26 +57,38 @@ void MatMul(const void* run_options_ptr, T* out, T* lhs, T* rhs, int64 m,
   C = A.contract(B, dims);
 }
 
+template <typename T>
+void SingleThreadedMatMul(const void* run_options_ptr, T* out, T* lhs, T* rhs,
+                          int64 m, int64 n, int64 k, int32 transpose_lhs,
+                          int32 transpose_rhs) {
+  if (m == 1 || n == 1) {
+    xla::EigenMatVec<T>(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
+  } else {
+    MatMul<T>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
+              transpose_rhs);
+  }
+}
+
 }  // namespace
 
+void __xla_cpu_runtime_EigenSingleThreadedMatMulF16(
+    const void* run_options_ptr, Eigen::half* out, Eigen::half* lhs,
+    Eigen::half* rhs, int64 m, int64 n, int64 k, int32 transpose_lhs,
+    int32 transpose_rhs) {
+  SingleThreadedMatMul<Eigen::half>(run_options_ptr, out, lhs, rhs, m, n, k,
+                                    transpose_lhs, transpose_rhs);
+}
+
 void __xla_cpu_runtime_EigenSingleThreadedMatMulF32(
     const void* run_options_ptr, float* out, float* lhs, float* rhs, int64 m,
     int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
-  if (m == 1 || n == 1) {
-    xla::EigenMatVecF32(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-  } else {
-    MatMul<float>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
-                  transpose_rhs);
-  }
+  SingleThreadedMatMul<float>(run_options_ptr, out, lhs, rhs, m, n, k,
+                              transpose_lhs, transpose_rhs);
 }
 
 void __xla_cpu_runtime_EigenSingleThreadedMatMulF64(
     const void* run_options_ptr, double* out, double* lhs, double* rhs, int64 m,
     int64 n, int64 k, int32 transpose_lhs, int32 transpose_rhs) {
-  if (m == 1 || n == 1) {
-    xla::EigenMatVecF64(out, lhs, rhs, m, n, k, transpose_lhs, transpose_rhs);
-  } else {
-    MatMul<double>(run_options_ptr, out, lhs, rhs, m, n, k, transpose_lhs,
-                   transpose_rhs);
-  }
+  SingleThreadedMatMul<double>(run_options_ptr, out, lhs, rhs, m, n, k,
+                               transpose_lhs, transpose_rhs);
 }
diff --git a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.h b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.h
index 029eb9514287d8c69cde2cfb06e0d56e78d6f165..9371a62242328a67618321e2b1d112956c06ee4b 100644
--- a/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.h
+++ b/tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.h
@@ -25,6 +25,12 @@ extern "C" {
 // 'out' is a pointer to a buffer sufficiently large to hold the result of the
 // operation. Following standard nomenclature: lhs is m x k, rhs is k x n, and
 // out is m x n.
+extern void __xla_cpu_runtime_EigenSingleThreadedMatMulF16(
+    const void* /* xla::ExecutableRunOptions* */ run_options_ptr,
+    Eigen::half* out, Eigen::half* lhs, Eigen::half* rhs, tensorflow::int64 m,
+    tensorflow::int64 n, tensorflow::int64 k, tensorflow::int32 transpose_lhs,
+    tensorflow::int32 transpose_rhs);
+
 extern void __xla_cpu_runtime_EigenSingleThreadedMatMulF32(
     const void* /* xla::ExecutableRunOptions* */ run_options_ptr, float* out,
     float* lhs, float* rhs, tensorflow::int64 m, tensorflow::int64 n,
diff --git a/tensorflow/compiler/xla/service/cpu/simple_orc_jit.cc b/tensorflow/compiler/xla/service/cpu/simple_orc_jit.cc
index cfed551eedf9c429752c52bb4411ce1200845d11..80c24eaccfc2a83f8f3f311d60860715668d0c08 100644
--- a/tensorflow/compiler/xla/service/cpu/simple_orc_jit.cc
+++ b/tensorflow/compiler/xla/service/cpu/simple_orc_jit.cc
@@ -33,6 +33,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/cpu/runtime_conv2d.h"
 #include "tensorflow/compiler/xla/service/cpu/runtime_fft.h"
 #include "tensorflow/compiler/xla/service/cpu/runtime_fork_join.h"
+#include "tensorflow/compiler/xla/service/cpu/runtime_fp16.h"
 #include "tensorflow/compiler/xla/service/cpu/runtime_matmul.h"
 #include "tensorflow/compiler/xla/service/cpu/runtime_single_threaded_conv2d.h"
 #include "tensorflow/compiler/xla/service/cpu/runtime_single_threaded_matmul.h"
@@ -44,36 +45,6 @@ namespace xla {
 namespace cpu {
 namespace {
 
-// A simple SymbolResolver that delegates to the host dynamic linker.
-class SimpleResolver : public llvm::LegacyJITSymbolResolver {
- public:
-  explicit SimpleResolver(ExternalConstantPool* external_constant_pool)
-      : external_constant_pool_(external_constant_pool) {}
-
-  llvm::JITSymbol findSymbol(const std::string& name) override {
-    if (const uint8* from_constant_pool =
-            external_constant_pool_->Find(string(name))) {
-      return llvm::JITEvaluatedSymbol(
-          reinterpret_cast<uint64_t>(from_constant_pool),
-          llvm::JITSymbolFlags::None);
-    }
-
-    void* func_addr = CustomCallTargetRegistry::Global()->Lookup(name);
-    if (func_addr == nullptr) {
-      return nullptr;
-    }
-    llvm::JITEvaluatedSymbol symbol_info(reinterpret_cast<uint64_t>(func_addr),
-                                         llvm::JITSymbolFlags::None);
-    return symbol_info;
-  }
-  llvm::JITSymbol findSymbolInLogicalDylib(const std::string& name) override {
-    return nullptr;
-  }
-
- private:
-  ExternalConstantPool* external_constant_pool_;
-};
-
 llvm::SmallVector<std::string, 0> DetectMachineAttributes() {
   llvm::SmallVector<std::string, 0> result;
   llvm::StringMap<bool> host_features;
@@ -119,21 +90,7 @@ SimpleOrcJIT::SimpleOrcJIT(const llvm::TargetOptions& target_options,
       execution_session_(string_pool_),
       symbol_resolver_(llvm::orc::createLegacyLookupResolver(
           [this](const std::string& name) -> llvm::JITSymbol {
-            if (const uint8* from_constant_pool =
-                    external_constant_pool_.Find(string(name))) {
-              return llvm::JITEvaluatedSymbol(
-                  reinterpret_cast<uint64_t>(from_constant_pool),
-                  llvm::JITSymbolFlags::None);
-            }
-
-            void* func_addr = CustomCallTargetRegistry::Global()->Lookup(name);
-            if (func_addr == nullptr) {
-              return nullptr;
-            }
-            llvm::JITEvaluatedSymbol symbol_info(
-                reinterpret_cast<uint64_t>(func_addr),
-                llvm::JITSymbolFlags::None);
-            return symbol_info;
+            return this->ResolveRuntimeSymbol(name);
           },
           [](llvm::Error Err) {
             cantFail(std::move(Err), "lookupFlags failed");
@@ -157,6 +114,23 @@ SimpleOrcJIT::SimpleOrcJIT(const llvm::TargetOptions& target_options,
           << " features: " << target_machine_->getTargetFeatureString().str();
 }
 
+llvm::JITSymbol SimpleOrcJIT::ResolveRuntimeSymbol(const std::string& name) {
+  if (const uint8* from_constant_pool =
+          external_constant_pool_.Find(string(name))) {
+    return llvm::JITEvaluatedSymbol(
+        reinterpret_cast<uint64_t>(from_constant_pool),
+        llvm::JITSymbolFlags::None);
+  }
+
+  void* func_addr = CustomCallTargetRegistry::Global()->Lookup(name);
+  if (func_addr == nullptr) {
+    return nullptr;
+  }
+  llvm::JITEvaluatedSymbol symbol_info(reinterpret_cast<uint64_t>(func_addr),
+                                       llvm::JITSymbolFlags::None);
+  return symbol_info;
+}
+
 SimpleOrcJIT::VModuleKeyT SimpleOrcJIT::AddModule(
     std::unique_ptr<llvm::Module> module) {
   auto key = execution_session_.allocateVModule();
@@ -171,19 +145,13 @@ void SimpleOrcJIT::RemoveModule(SimpleOrcJIT::VModuleKeyT key) {
   cantFail(compile_layer_.removeModule(key));
 }
 
-llvm::JITSymbol SimpleOrcJIT::FindSymbol(const std::string& name) {
-  std::string mangled_name;
-  {
-    llvm::raw_string_ostream mangled_name_stream(mangled_name);
-    llvm::Mangler::getNameWithPrefix(mangled_name_stream, name, data_layout_);
-  }
-
+llvm::JITSymbol SimpleOrcJIT::FindCompiledSymbol(const std::string& name) {
   // Resolve symbol from last module to first, allowing later redefinitions of
   // symbols shadow earlier ones.
   for (auto& key :
        llvm::make_range(module_keys_.rbegin(), module_keys_.rend())) {
     if (auto symbol =
-            compile_layer_.findSymbolIn(key, mangled_name,
+            compile_layer_.findSymbolIn(key, name,
                                         /*ExportedSymbolsOnly=*/true)) {
       return symbol;
     }
@@ -213,16 +181,21 @@ bool RegisterKnownJITSymbols() {
   REGISTER_CPU_RUNTIME_SYMBOL(EigenConvF16);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenConvF32);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenFft);
+  REGISTER_CPU_RUNTIME_SYMBOL(EigenMatMulF16);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenMatMulF32);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenMatMulF64);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenSingleThreadedConvF16);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenSingleThreadedConvF32);
+  REGISTER_CPU_RUNTIME_SYMBOL(EigenSingleThreadedMatMulF16);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenSingleThreadedMatMulF32);
   REGISTER_CPU_RUNTIME_SYMBOL(EigenSingleThreadedMatMulF64);
   REGISTER_CPU_RUNTIME_SYMBOL(ParallelForkJoin);
   REGISTER_CPU_RUNTIME_SYMBOL(ReleaseInfeedBufferAfterDequeue);
   REGISTER_CPU_RUNTIME_SYMBOL(ReleaseOutfeedBufferAfterPopulation);
 
+  registry->Register("__gnu_f2h_ieee", reinterpret_cast<void*>(__gnu_f2h_ieee));
+  registry->Register("__gnu_h2f_ieee", reinterpret_cast<void*>(__gnu_h2f_ieee));
+
 #undef REGISTER_CPU_RUNTIME_SYMBOL
 
 // Register both the f32 (float) and f64 (double) versions of a libm symbol.
diff --git a/tensorflow/compiler/xla/service/cpu/simple_orc_jit.h b/tensorflow/compiler/xla/service/cpu/simple_orc_jit.h
index 50993afc8f73617a2c65310ae73b3ab00519f550..aaeff2de8785b99d271f13b261c63118bcf7bd4a 100644
--- a/tensorflow/compiler/xla/service/cpu/simple_orc_jit.h
+++ b/tensorflow/compiler/xla/service/cpu/simple_orc_jit.h
@@ -46,9 +46,7 @@ namespace cpu {
 class SimpleOrcJIT {
  public:
   using ObjLayerT = llvm::orc::RTDyldObjectLinkingLayer;
-  using CompileFtor =
-      std::function<llvm::object::OwningBinary<llvm::object::ObjectFile>(
-          llvm::Module&)>;
+  using CompileFtor = std::function<ObjLayerT::ObjectPtr(llvm::Module&)>;
   using CompileLayerT = llvm::orc::IRCompileLayer<ObjLayerT, CompileFtor>;
   using VModuleKeyT = llvm::orc::VModuleKey;
 
@@ -89,7 +87,7 @@ class SimpleOrcJIT {
 
   // Get the runtime address of the compiled symbol whose name is given. Returns
   // nullptr if the symbol cannot be found.
-  llvm::JITSymbol FindSymbol(const std::string& name);
+  llvm::JITSymbol FindCompiledSymbol(const std::string& name);
 
   llvm::TargetMachine* target_machine() const { return target_machine_.get(); }
 
@@ -98,6 +96,8 @@ class SimpleOrcJIT {
   }
 
  private:
+  llvm::JITSymbol ResolveRuntimeSymbol(const std::string& name);
+
   std::vector<VModuleKeyT> module_keys_;
   std::unique_ptr<llvm::TargetMachine> target_machine_;
   const Disassembler disassembler_;
diff --git a/tensorflow/compiler/xla/service/cpu/vector_support_library.cc b/tensorflow/compiler/xla/service/cpu/vector_support_library.cc
index 150db1cb6edec1af6724a8bca6a5f6272f1a7416..cd1165e23812861ba9951546b7dd744529232196 100644
--- a/tensorflow/compiler/xla/service/cpu/vector_support_library.cc
+++ b/tensorflow/compiler/xla/service/cpu/vector_support_library.cc
@@ -370,6 +370,9 @@ std::vector<llvm::Value*> VectorSupportLibrary::ComputeHorizontalSums(
 std::vector<llvm::Value*>
 VectorSupportLibrary::ComputeAvxOptimizedHorizontalSums(
     std::vector<llvm::Value*> vectors, llvm::Value* init_values) {
+  // vectors are N llvm vector values, each with N elements.
+  int64 lane_width = vectors.size();
+
   while (vectors.size() != 2) {
     std::vector<llvm::Value*> new_vectors;
     for (int i = 0; i < vectors.size(); i += 2) {
@@ -390,10 +393,14 @@ VectorSupportLibrary::ComputeAvxOptimizedHorizontalSums(
     high = AddInternal(ExtractHighHalf(init_values), high);
   }
 
+  // `low` has the first `lane_width / 2` horizontal reductions, and `high` has
+  // the next `lane_width / 2` horizontal reductions.
+
   std::vector<llvm::Value*> results;
-  for (int i = 0; i < 8; i++) {
+  for (int i = 0; i < lane_width; i++) {
     llvm::Value* scalar_result = ir_builder()->CreateExtractElement(
-        i < 4 ? low : high, ir_builder()->getInt32(i % 4), name());
+        i < (lane_width / 2) ? low : high,
+        ir_builder()->getInt32(i % (lane_width / 2)), name());
     results.push_back(scalar_result);
   }
 
diff --git a/tensorflow/compiler/xla/service/dfs_hlo_visitor.h b/tensorflow/compiler/xla/service/dfs_hlo_visitor.h
index a803b3171f9afa6297553c5507c4f9aa45e420ab..56723e765048698baedc50ae7b189d0287ee56b8 100644
--- a/tensorflow/compiler/xla/service/dfs_hlo_visitor.h
+++ b/tensorflow/compiler/xla/service/dfs_hlo_visitor.h
@@ -190,6 +190,7 @@ class DfsHloVisitorBase {
 
   virtual Status HandleInfeed(HloInstructionPtr hlo) = 0;
   virtual Status HandleOutfeed(HloInstructionPtr hlo) = 0;
+  virtual Status HandleHostCompute(HloInstructionPtr hlo) = 0;
   virtual Status HandleRng(HloInstructionPtr hlo) = 0;
   virtual Status HandleReverse(HloInstructionPtr hlo) = 0;
   virtual Status HandleSort(HloInstructionPtr hlo) = 0;
@@ -213,6 +214,7 @@ class DfsHloVisitorBase {
   virtual Status HandleSelectAndScatter(HloInstructionPtr hlo) = 0;
   virtual Status HandleWhile(HloInstructionPtr hlo) = 0;
   virtual Status HandleConditional(HloInstructionPtr hlo) = 0;
+  virtual Status HandleGather(HloInstructionPtr hlo) = 0;
 
   virtual Status HandlePad(HloInstructionPtr hlo) = 0;
 
diff --git a/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h b/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h
index 170adb3d241b3648bc53f96dde9866f0b794f80a..ecda5288ee17a3856ce95f0caa327c3524fd180b 100644
--- a/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h
+++ b/tensorflow/compiler/xla/service/dfs_hlo_visitor_with_default.h
@@ -103,6 +103,9 @@ class DfsHloVisitorWithDefaultBase
   Status HandleOutfeed(HloInstructionPtr outfeed) override {
     return DefaultAction(outfeed);
   }
+  Status HandleHostCompute(HloInstructionPtr host_compute) override {
+    return DefaultAction(host_compute);
+  }
   Status HandleReverse(HloInstructionPtr reverse) override {
     return DefaultAction(reverse);
   }
@@ -185,6 +188,9 @@ class DfsHloVisitorWithDefaultBase
   Status HandleSendDone(HloInstructionPtr send_done) override {
     return DefaultAction(send_done);
   }
+  Status HandleGather(HloInstructionPtr gather) override {
+    return DefaultAction(gather);
+  }
 
   // Invoked to inform the visitor that the traversal has completed, and that
   // the root was "root".
diff --git a/tensorflow/compiler/xla/service/elemental_ir_emitter.cc b/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
index 4468adbadbf823f1420a8b665a26f66cb7d36b43..c732974995f70d9ba1b46e18aa4cc2c6ab467182 100644
--- a/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/elemental_ir_emitter.cc
@@ -226,7 +226,7 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitIntegerUnaryOp(
       if (primitive_util::IsIntegralType(to_type)) {
         return ir_builder_->CreateIntCast(
             operand_value, llvm_ir::PrimitiveTypeToIrType(to_type, module_),
-            primitive_util::IsSignedIntegralType(to_type));
+            primitive_util::IsSignedIntegralType(from_type));
       }
       if (primitive_util::IsFloatingPointType(to_type)) {
         if (to_type == BF16) {
@@ -1003,6 +1003,30 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitReducePrecision(
                                   ir_builder_);
 }
 
+static llvm::Value* SaturateShiftIfNecessary(llvm::IRBuilder<>* ir_builder,
+                                             llvm::Value* lhs, llvm::Value* rhs,
+                                             llvm::Value* shift_result,
+                                             bool saturate_to_sign_bit) {
+  llvm::IntegerType* integer_type =
+      llvm::cast<llvm::IntegerType>(lhs->getType());
+  unsigned integer_bitsize = integer_type->getBitWidth();
+  llvm::ConstantInt* integer_bitsize_constant =
+      llvm::ConstantInt::get(integer_type, integer_bitsize);
+  llvm::ConstantInt* zero = llvm::ConstantInt::get(integer_type, 0);
+  llvm::ConstantInt* minus_one = llvm::ConstantInt::get(integer_type, -1);
+  llvm::Value* saturated_value;
+  if (saturate_to_sign_bit) {
+    saturated_value = ir_builder->CreateSelect(
+        ir_builder->CreateICmpSLT(lhs, zero), minus_one, zero);
+  } else {
+    saturated_value = zero;
+  }
+  llvm::Value* shift_amt_in_range =
+      ir_builder->CreateICmpULT(rhs, integer_bitsize_constant, "shft.chk");
+  return ir_builder->CreateSelect(shift_amt_in_range, shift_result,
+                                  saturated_value);
+}
+
 StatusOr<llvm::Value*> ElementalIrEmitter::EmitIntegerBinaryOp(
     const HloInstruction* op, llvm::Value* lhs_value, llvm::Value* rhs_value,
     bool is_signed) const {
@@ -1050,12 +1074,27 @@ StatusOr<llvm::Value*> ElementalIrEmitter::EmitIntegerBinaryOp(
       return ir_builder_->CreateAnd(lhs_value, rhs_value);
     case HloOpcode::kOr:
       return ir_builder_->CreateOr(lhs_value, rhs_value);
-    case HloOpcode::kShiftLeft:
-      return ir_builder_->CreateShl(lhs_value, rhs_value);
+
+    // Shifting out bits >= the number of bits in the type being shifted
+    // produces a poison value in LLVM which is basically "deferred undefined
+    // behavior" -- doing something observable with such a value precipitates
+    // UB.  We replace the poison value with a constant to avoid this deferred
+    // UB.
     case HloOpcode::kShiftRightArithmetic:
-      return ir_builder_->CreateAShr(lhs_value, rhs_value);
+      return SaturateShiftIfNecessary(
+          ir_builder_, lhs_value, rhs_value,
+          ir_builder_->CreateAShr(lhs_value, rhs_value),
+          /*saturate_to_sign_bit=*/true);
+    case HloOpcode::kShiftLeft:
+      return SaturateShiftIfNecessary(
+          ir_builder_, lhs_value, rhs_value,
+          ir_builder_->CreateShl(lhs_value, rhs_value),
+          /*saturate_to_sign_bit=*/false);
     case HloOpcode::kShiftRightLogical:
-      return ir_builder_->CreateLShr(lhs_value, rhs_value);
+      return SaturateShiftIfNecessary(
+          ir_builder_, lhs_value, rhs_value,
+          ir_builder_->CreateLShr(lhs_value, rhs_value),
+          /*saturate_to_sign_bit=*/false);
     default:
       return Unimplemented("binary integer op '%s'",
                            HloOpcodeString(op->opcode()).c_str());
diff --git a/tensorflow/compiler/xla/service/gpu/BUILD b/tensorflow/compiler/xla/service/gpu/BUILD
index 9da4fb97fa27a238fead74985cb481a9be1f4a65..334efff1e61e1c89b857433df2dfd03855bad201 100644
--- a/tensorflow/compiler/xla/service/gpu/BUILD
+++ b/tensorflow/compiler/xla/service/gpu/BUILD
@@ -510,6 +510,7 @@ cc_library(
         "//tensorflow/compiler/xla/service:buffer_assignment",
         "//tensorflow/compiler/xla/service:buffer_liveness",
         "//tensorflow/compiler/xla/service:call_inliner",
+        "//tensorflow/compiler/xla/service:conditional_simplifier",
         "//tensorflow/compiler/xla/service:dot_decomposer",
         "//tensorflow/compiler/xla/service:executable",
         "//tensorflow/compiler/xla/service:flatten_call_graph",
diff --git a/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc b/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
index 8e3aebbc12b5e6d746700956b9743bc94db50167..ca54b2eed8a3f7839f88107180be30448d979b97 100644
--- a/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
+++ b/tensorflow/compiler/xla/service/gpu/gemm_thunk.cc
@@ -108,11 +108,13 @@ bool DoGemmWithAlgorithm(MatrixDescriptor lhs_matrix,
   return stream
       ->ThenBlasGemmWithAlgorithm(
           lhs_transpose, rhs_transpose, output_matrix.num_rows,
-          output_matrix.num_cols, /*size of reduce dim=*/k, /*alpha=*/1.0,
-          lhs_data, /*leading dim of LHS=*/lhs_matrix.num_rows, rhs_data,
-          /*leading dim of RHS=*/rhs_matrix.num_rows, /*beta=*/0.0,
-          &output_data, /*leading dim of output=*/output_matrix.num_rows,
-          computation_type, algorithm, output_profile_result)
+          output_matrix.num_cols, /*size of reduce dim=*/k,
+          /*alpha=*/static_cast<Element>(1.0f), lhs_data,
+          /*leading dim of LHS=*/lhs_matrix.num_rows, rhs_data,
+          /*leading dim of RHS=*/rhs_matrix.num_rows,
+          /*beta=*/static_cast<Element>(0.0f), &output_data,
+          /*leading dim of output=*/output_matrix.num_rows, computation_type,
+          algorithm, output_profile_result)
       .ok();
 }
 
@@ -137,9 +139,9 @@ StatusOr<se::blas::AlgorithmType> DoGemmAutotune(
     // for all algorithms if we're targeting < sm_50.  But because we pass a
     // non-null ProfileResult, DoGemmWithAlgorithm should always return true,
     // and the actual success-ness is returned in ProfileResult::is_valid.
-    DCHECK(DoGemmWithAlgorithm<Element>(lhs_matrix, rhs_matrix, output_matrix,
-                                        computation_type, algorithm, stream,
-                                        &profile_result));
+    CHECK(DoGemmWithAlgorithm<Element>(lhs_matrix, rhs_matrix, output_matrix,
+                                       computation_type, algorithm, stream,
+                                       &profile_result));
 
     if (profile_result.is_valid() && profile_result.elapsed_time_in_ms() <
                                          best_result.elapsed_time_in_ms()) {
@@ -161,6 +163,8 @@ StatusOr<se::blas::AlgorithmType> DoGemmAutotune(
 // DoGemm/DoGemmWithAlgorithm/DoGemmAutotune.
 auto GetGemmFn(PrimitiveType type) -> decltype(&DoGemm<float>) {
   switch (type) {
+    case F16:
+      return &DoGemm<Eigen::half>;
     case F32:
       return &DoGemm<float>;
     case F64:
@@ -172,6 +176,8 @@ auto GetGemmFn(PrimitiveType type) -> decltype(&DoGemm<float>) {
 auto GetGemmWithAlgorithmFn(PrimitiveType type)
     -> decltype(&DoGemmWithAlgorithm<float>) {
   switch (type) {
+    case F16:
+      return &DoGemmWithAlgorithm<Eigen::half>;
     case F32:
       return &DoGemmWithAlgorithm<float>;
     case F64:
@@ -182,6 +188,8 @@ auto GetGemmWithAlgorithmFn(PrimitiveType type)
 }
 auto GetGemmAutotuneFn(PrimitiveType type) -> decltype(&DoGemmAutotune<float>) {
   switch (type) {
+    case F16:
+      return &DoGemmAutotune<Eigen::half>;
     case F32:
       return &DoGemmAutotune<float>;
     case F64:
@@ -196,6 +204,10 @@ auto GetGemmAutotuneFn(PrimitiveType type) -> decltype(&DoGemmAutotune<float>) {
 // separately from the precision of the inputs and result.
 se::blas::ComputationType GetBlasComputationType(PrimitiveType type) {
   switch (type) {
+    case F16:
+      // Use F32 as computation type for F16 as we currently only implement the
+      // cuDNN pseudo half configuration for half precision.
+      return se::blas::ComputationType::kF32;
     case F32:
       return se::blas::ComputationType::kF32;
     case F64:
@@ -315,6 +327,9 @@ tensorflow::Status GemmThunk::ExecuteOnStream(
           stream,
           /*output_profile_result=*/nullptr);
     }
+
+    // Autotune will fail when CUDA 8 and GPU sm_50 or older are used.
+    // Use the older Gemm API in this case.
     return GetGemmFn(element_type)(lhs_matrix, rhs_matrix, output_matrix,
                                    stream);
   };
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc b/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc
index 28ebd034ee0c89137f4e6eb417d8a37f4a00af7a..9e37acdf31b4f583f41a410929df0964cb799c80 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_compiler.cc
@@ -33,6 +33,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/buffer_assignment.h"
 #include "tensorflow/compiler/xla/service/buffer_liveness.h"
 #include "tensorflow/compiler/xla/service/call_inliner.h"
+#include "tensorflow/compiler/xla/service/conditional_simplifier.h"
 #include "tensorflow/compiler/xla/service/dot_decomposer.h"
 #include "tensorflow/compiler/xla/service/flatten_call_graph.h"
 #include "tensorflow/compiler/xla/service/gpu/cudnn_batchnorm_rewriter.h"
@@ -176,6 +177,7 @@ tensorflow::Status OptimizeHloModule(HloModule* hlo_module,
       pass.AddPass<HloDCE>();
       pass.AddPass<ReshapeMover>();
       pass.AddPass<HloConstantFolding>();
+      pass.AddPass<ConditionalSimplifier>();
     }
 
     pipeline.AddPass<TransposeFolding>(
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc b/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc
index 916b556fd43a453a4da2c96217e74c367f8c7653..9db85bc788bde46c890a46ce9b0902ddce3f5675 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_copy_insertion.cc
@@ -49,7 +49,7 @@ StatusOr<bool> GpuCopyInsertion::Run(HloModule* module) {
   TF_ASSIGN_OR_RETURN(bool changed, generic_copy_insertion.Run(module));
 
   TF_ASSIGN_OR_RETURN(std::unique_ptr<HloDataflowAnalysis> dataflow,
-                      HloDataflowAnalysis::Run(module));
+                      HloDataflowAnalysis::Run(*module));
 
   // Make sure all operands of a library call are in memory instead of constants
   // in IR.
diff --git a/tensorflow/compiler/xla/service/gpu/gpu_executable.cc b/tensorflow/compiler/xla/service/gpu/gpu_executable.cc
index 623d6714de501000e38b7698620925f66425f157..04b37d913e0bc8f8226057f107da05fd1e675010 100644
--- a/tensorflow/compiler/xla/service/gpu/gpu_executable.cc
+++ b/tensorflow/compiler/xla/service/gpu/gpu_executable.cc
@@ -46,12 +46,14 @@ namespace {
 class HloExecutionProfiler {
  public:
   // If profiling is enabled, start an execution timer running.
-  explicit HloExecutionProfiler(bool do_profile, HloExecutionProfile* profile,
-                                se::Stream* stream,
-                                const HloComputation* computation)
+  explicit HloExecutionProfiler(
+      bool do_profile, HloExecutionProfile* profile, se::Stream* stream,
+      const std::vector<Pool<se::Stream>::SmartPtr>& sub_streams,
+      const HloComputation* computation)
       : do_profile_(do_profile),
         profile_(profile),
         stream_(stream),
+        sub_streams_(sub_streams),
         computation_(computation) {
     if (do_profile_) {
       clock_rate_ghz_ =
@@ -70,6 +72,7 @@ class HloExecutionProfiler {
     CHECK(!finished_execution_) << "Call FinishExecution only once!";
     finished_execution_ = true;
     if (do_profile_) {
+      stream_->ThenWaitFor(&sub_streams_);
       stream_->ThenStopTimer(execution_timer_.get());
       stream_->BlockHostUntilDone().IgnoreError();
       profile_->set_total_cycles_executed(
@@ -88,6 +91,7 @@ class HloExecutionProfiler {
   // that the hlo_instruction took to execute in the profile.
   void FinishOperation(const HloInstruction* hlo_instruction) {
     if (do_profile_) {
+      stream_->ThenWaitFor(&sub_streams_);
       stream_->ThenStopTimer(per_op_timer_.get());
       stream_->BlockHostUntilDone().IgnoreError();
       profile_->SetCyclesTakenBy(
@@ -100,6 +104,7 @@ class HloExecutionProfiler {
   double clock_rate_ghz_;
   HloExecutionProfile* profile_;
   se::Stream* stream_;
+  const std::vector<Pool<se::Stream>::SmartPtr>& sub_streams_;
   const HloComputation* computation_;
   std::unique_ptr<se::Timer> execution_timer_;
   std::unique_ptr<se::Timer> per_op_timer_;
@@ -147,13 +152,9 @@ Status GpuExecutable::ExecuteThunks(
     LOG(WARNING) << "PROFILING: profiling is enabled";
   }
 
-  HloExecutionProfiler profiler(do_profile, hlo_execution_profile, main_stream,
-                                hlo_module_->entry_computation());
-
-  uint64 start_micros = tensorflow::Env::Default()->NowMicros();
-
   // Stream 0 indicates `main_stream` and substreams start from stream 1.
   std::vector<Pool<se::Stream>::SmartPtr> sub_streams;
+  sub_streams.reserve(thunk_schedule_->StreamCount() - 1);
   while (sub_streams.size() + 1 < thunk_schedule_->StreamCount()) {
     sub_streams.emplace_back();
     TF_ASSIGN_OR_RETURN(
@@ -161,6 +162,10 @@ Status GpuExecutable::ExecuteThunks(
         run_options->BorrowStream(main_stream->parent()->device_ordinal()));
   }
 
+  HloExecutionProfiler profiler(do_profile, hlo_execution_profile, main_stream,
+                                sub_streams, hlo_module_->entry_computation());
+  uint64 start_micros = tensorflow::Env::Default()->NowMicros();
+
   // The next event enqueued on stream N must not run until the thunk at
   // last_blocking_thunk_for_stream[N] completes.
   std::map<int32, const Thunk*> last_blocking_thunk_for_stream;
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc b/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
index 2f65edffea81db7dba1f8545f92b27ea622044e7..1b89dfa7ae40d45d10c0838308a8e5e4bc037244 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emission_utils.cc
@@ -49,8 +49,10 @@ bool AreValidGemmShapes(const Shape& lhs_shape, const Shape& rhs_shape,
   // The inputs and the output must
   // 1) be matrices with no padding and a non-zero number of elements,
   // 2) have an allowed element type.
-  bool type_is_allowed = (output_shape.element_type() == F32 ||
-                          output_shape.element_type() == F64);
+  PrimitiveType output_primitive_type = output_shape.element_type();
+  bool type_is_allowed =
+      (output_primitive_type == F16 || output_primitive_type == F32 ||
+       output_primitive_type == F64);
   return type_is_allowed && IsRank2WithNoPadding(lhs_shape) &&
          IsRank2WithNoPadding(rhs_shape) &&
          IsRank2WithNoPadding(output_shape) &&
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter.cc b/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
index a3df67a87344d6ece2ea9047321ad9542c13f8cf..1e0db2821a2c212d0f212ae94ab69231bc6053ea 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include <string>
 #include <unordered_map>
+#include <utility>
 
 #include "tensorflow/core/platform/logging.h"
 // IWYU pragma: no_include "llvm/IR/Intrinsics.gen.inc"
@@ -438,6 +439,32 @@ Status IrEmitter::HandleSelect(HloInstruction* select) {
   return IrEmitter::DefaultAction(select);
 }
 
+namespace {
+llvm::Value* Real(llvm::Value* x, llvm::IRBuilder<>* ir_builder) {
+  return ir_builder->CreateExtractValue(x, {0});
+}
+
+llvm::Value* Imag(llvm::Value* x, llvm::IRBuilder<>* ir_builder) {
+  return ir_builder->CreateExtractValue(x, {1});
+}
+
+std::pair<llvm::Value*, llvm::Value*> MultiplyComplex(
+    llvm::Value* lhs_value, llvm::Value* rhs_value,
+    llvm::IRBuilder<>* ir_builder) {
+  llvm::Value* lhs_real = Real(lhs_value, ir_builder);
+  llvm::Value* lhs_imag = Imag(lhs_value, ir_builder);
+  llvm::Value* rhs_real = Real(rhs_value, ir_builder);
+  llvm::Value* rhs_imag = Imag(rhs_value, ir_builder);
+  llvm::Value* real_result1 = ir_builder->CreateFMul(lhs_real, rhs_real);
+  llvm::Value* real_result2 = ir_builder->CreateFMul(lhs_imag, rhs_imag);
+  llvm::Value* real_result = ir_builder->CreateFSub(real_result1, real_result2);
+  llvm::Value* imag_result1 = ir_builder->CreateFMul(lhs_real, rhs_imag);
+  llvm::Value* imag_result2 = ir_builder->CreateFMul(lhs_imag, rhs_real);
+  llvm::Value* imag_result = ir_builder->CreateFAdd(imag_result1, imag_result2);
+  return {real_result, imag_result};
+}
+}  // namespace
+
 Status IrEmitter::HandleDot(HloInstruction* dot) {
   auto lhs_instruction = dot->operand(0);
   auto rhs_instruction = dot->operand(1);
@@ -456,21 +483,10 @@ Status IrEmitter::HandleDot(HloInstruction* dot) {
         rhs_array.EmitReadArrayElement(/*index=*/{}, &ir_builder_);
     llvm::Value* result;
     if (ShapeUtil::ElementIsComplex(lhs_shape)) {
-      auto real = [&](llvm::Value* x) {
-        return ir_builder_.CreateExtractValue(x, {0});
-      };
-      auto imag = [&](llvm::Value* x) {
-        return ir_builder_.CreateExtractValue(x, {1});
-      };
-      llvm::Value* real_result = ir_builder_.CreateFSub(
-          ir_builder_.CreateFMul(real(lhs_value), real(rhs_value)),
-          ir_builder_.CreateFMul(imag(lhs_value), imag(rhs_value)));
-      llvm::Value* imag_result = ir_builder_.CreateFAdd(
-          ir_builder_.CreateFMul(real(lhs_value), imag(rhs_value)),
-          ir_builder_.CreateFMul(imag(lhs_value), real(rhs_value)));
+      auto value = MultiplyComplex(lhs_value, rhs_value, &ir_builder_);
       result = llvm::ConstantAggregateZero::get(lhs_array.GetElementLlvmType());
-      result = ir_builder_.CreateInsertValue(result, real_result, {0});
-      result = ir_builder_.CreateInsertValue(result, imag_result, {1});
+      result = ir_builder_.CreateInsertValue(result, value.first, {0});
+      result = ir_builder_.CreateInsertValue(result, value.second, {1});
     } else {
       result = ir_builder_.CreateFMul(lhs_value, rhs_value);
     }
@@ -548,20 +564,13 @@ Status IrEmitter::HandleDot(HloInstruction* dot) {
   llvm::Value* accum = ir_builder_.CreateLoad(accum_address);
   llvm::Value* updated_accum;
   if (ShapeUtil::ElementIsComplex(lhs_shape)) {
-#define REAL(x) ir_builder_.CreateExtractValue(x, {0})
-#define IMAG(x) ir_builder_.CreateExtractValue(x, {1})
-    llvm::Value* product_real = ir_builder_.CreateFSub(
-        ir_builder_.CreateFMul(REAL(lhs_element), REAL(rhs_element)),
-        ir_builder_.CreateFMul(IMAG(lhs_element), IMAG(rhs_element)));
-    llvm::Value* product_imag = ir_builder_.CreateFAdd(
-        ir_builder_.CreateFMul(REAL(lhs_element), IMAG(rhs_element)),
-        ir_builder_.CreateFMul(IMAG(lhs_element), REAL(rhs_element)));
-    updated_accum = ir_builder_.CreateInsertValue(
-        accum, ir_builder_.CreateFAdd(REAL(accum), product_real), {0});
-    updated_accum = ir_builder_.CreateInsertValue(
-        updated_accum, ir_builder_.CreateFAdd(IMAG(accum), product_imag), {1});
-#undef IMAG
-#undef REAL
+    auto value = MultiplyComplex(lhs_element, rhs_element, &ir_builder_);
+    llvm::Value* accum_real = Real(accum, &ir_builder_);
+    llvm::Value* real_sum = ir_builder_.CreateFAdd(accum_real, value.first);
+    updated_accum = ir_builder_.CreateInsertValue(accum, real_sum, {0});
+    llvm::Value* accum_imag = Imag(accum, &ir_builder_);
+    llvm::Value* imag_sum = ir_builder_.CreateFAdd(accum_imag, value.second);
+    updated_accum = ir_builder_.CreateInsertValue(updated_accum, imag_sum, {1});
   } else {
     llvm::Value* product = ir_builder_.CreateFMul(lhs_element, rhs_element);
     updated_accum = ir_builder_.CreateFAdd(accum, product);
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
index aa2a0a9800bab142481e1def785c9052526fcd8c..30c88c0a5d38f6ea3f94d3b47b7b69c7122bf6ac 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
@@ -2064,6 +2064,11 @@ GetHloBufferSlices(const HloInstruction* hlo,
   return slices;
 }
 
+Status IrEmitterUnnested::HandleGather(HloInstruction* gather) {
+  // TODO(b/72710576): Gather is not implemented on GPUs
+  return Unimplemented("Gather is not implemented on GPUs.");
+}
+
 std::unique_ptr<Thunk> IrEmitterUnnested::BuildKernelThunk(
     const HloInstruction* inst) {
   const BufferAssignment& buffer_assn =
diff --git a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
index 688760efbd2c725a4bf48e45eb6f2734b63d25e1..b83a2337e2decd9d4fba3d40fcf33f131fca8a3c 100644
--- a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.h
@@ -67,6 +67,7 @@ class IrEmitterUnnested : public IrEmitter {
   Status HandleDot(HloInstruction* dot) override;
   Status HandleFft(HloInstruction* fft) override;
   Status HandleFusion(HloInstruction* fusion) override;
+  Status HandleGather(HloInstruction* gather) override;
   Status HandleGetTupleElement(HloInstruction* get_tuple_element) override;
   Status HandleReduce(HloInstruction* reduce) override;
   Status HandleSelectAndScatter(HloInstruction* instruction) override;
diff --git a/tensorflow/compiler/xla/service/heap_simulator.cc b/tensorflow/compiler/xla/service/heap_simulator.cc
index cde5877e29f36abc61c5417ce960e2c7699e2749..3dd4c4a0794e5c41b877078c4e69c6c9584ce6c0 100644
--- a/tensorflow/compiler/xla/service/heap_simulator.cc
+++ b/tensorflow/compiler/xla/service/heap_simulator.cc
@@ -27,38 +27,6 @@ namespace xla {
 using tensorflow::gtl::FlatMap;
 using tensorflow::gtl::FlatSet;
 
-namespace {
-
-// Returns the set of buffers that may be sources of all operands of the given
-// instruction.  The returned buffers are guaranteed to have no duplicates, and
-// to be sorted in a deterministic order.
-std::vector<const LogicalBuffer*> UniqueOperandSourceBuffers(
-    const HloInstruction* instruction,
-    const TuplePointsToAnalysis& points_to_analysis) {
-  std::vector<const LogicalBuffer*> buffers;
-  for (const HloInstruction* operand : instruction->operands()) {
-    points_to_analysis.GetPointsToSet(operand).ForEachElement(
-        [&](const ShapeIndex& /*index*/,
-            const PointsToSet::BufferList& points_to) {
-          buffers.insert(buffers.end(), points_to.begin(), points_to.end());
-        });
-  }
-
-  // Sort and then remove duplicates from buffers.
-  std::sort(buffers.begin(), buffers.end(),
-            [](const LogicalBuffer* a, const LogicalBuffer* b) {
-              return a->id() < b->id();
-            });
-  buffers.erase(std::unique(buffers.begin(), buffers.end(),
-                            [](const LogicalBuffer* a, const LogicalBuffer* b) {
-                              return a->id() == b->id();
-                            }),
-                buffers.end());
-  return buffers;
-}
-
-}  // namespace
-
 /*static*/
 StatusOr<HeapSimulator::Result> HeapSimulator::Run(
     std::unique_ptr<HeapAlgorithm> algorithm, const HloModule& module,
@@ -93,6 +61,7 @@ Status HeapSimulator::RunComputation(
     const HloComputation& computation,
     const std::vector<const HloInstruction*>& instruction_sequence,
     const TuplePointsToAnalysis& points_to_analysis) {
+  VLOG(3) << "Computation:\n" << computation.ToString();
   // The goal here is to minimize memory usage, assuming the given sequential
   // ordering of instructions.  The strategy is to walk through the instruction
   // sequence, calling Alloc and Free on the underlying heap algorithm.  The
@@ -101,7 +70,51 @@ Status HeapSimulator::RunComputation(
   // 'live_buffers' tracks the liveness of each buffer that we assign, by
   // associating it with a set of HloInstructions that need to be visited.  When
   // the set becomes empty, the buffer is no longer used, and can be freed.
+  // 'used_buffers' is the reverse map - it tracks which buffers were used by an
+  // instruction, so that we can remove the instructions from a buffer's live
+  // set after they are visited.
   FlatMap<const LogicalBuffer*, FlatSet<const HloInstruction*>> live_buffers;
+  FlatMap<const HloInstruction*, FlatSet<const LogicalBuffer*>> used_buffers;
+  auto add_user_to_buffer = [this, &live_buffers, &used_buffers](
+                                const HloInstruction* user,
+                                const LogicalBuffer* buffer) {
+    if (!IgnoreBuffer(buffer)) {
+      VLOG(4) << "  Adding user " << user->name() << " to buffer "
+              << buffer->ToString();
+      live_buffers[buffer].insert(user);
+      used_buffers[user].insert(buffer);
+    }
+  };
+
+  // Initialize live_buffers for each buffer that we're going to assign.  The
+  // set of instructions that need to be visited contains all users of all
+  // aliases, that is, all users of all instructions that have the buffer
+  // contained in their points-to set.
+  for (const HloInstruction* instruction : instruction_sequence) {
+    const PointsToSet& points_to =
+        points_to_analysis.GetPointsToSet(instruction);
+    const PointsToSet::BufferSet& buffer_set = points_to.CreateFlattenedSet();
+    for (const HloInstruction* user : instruction->users()) {
+      if (user->opcode() != HloOpcode::kGetTupleElement) {
+        for (const LogicalBuffer* buffer : buffer_set) {
+          add_user_to_buffer(user, buffer);
+        }
+      } else {
+        // A GetTupleElement doesn't need to keep all of its operand's buffers
+        // alive. It only needs the buffers that relate to the element it's
+        // extracting, and the tuple it's extracting from, but not the buffers
+        // for the other elements.
+        for (const LogicalBuffer* buffer : points_to.element({})) {
+          add_user_to_buffer(user, buffer);
+        }
+        const PointsToSet& gte_points_to =
+            points_to_analysis.GetPointsToSet(user);
+        for (const LogicalBuffer* buffer : gte_points_to.CreateFlattenedSet()) {
+          add_user_to_buffer(user, buffer);
+        }
+      }
+    }
+  }
 
   const HloInstruction* root = computation.root_instruction();
   auto output_source_buffers =
@@ -114,34 +127,17 @@ Status HeapSimulator::RunComputation(
         buffers_defined_by_instruction =
             points_to_analysis.GetBuffersDefinedByInstruction(instruction);
 
-    // Initialize live_buffers for each buffer that we're going to assign.  The
-    // set of instructions that need to be visited contains all users of all
-    // aliases.  The alias itself is not necessary; if it has users, the users
-    // are necessarily scheduled after the alias.  And if it has no users, it is
-    // either a dead value or an output, both of which are handled below.
-    //
-    // We ignore control dependencies here. The reasoning is that the control
-    // dependencies have already been accounted for in the ordering of the given
-    // 'instruction_sequence', and should not otherwise artificially extend the
-    // lifetime of buffers that aren't already connected by a data dependency.
+    VLOG(3) << "Instruction: " << instruction->ToString();
+    for (const LogicalBuffer* buffer : buffers_defined_by_instruction) {
+      VLOG(4) << "  Defines: " << buffer->ToString()
+              << (IgnoreBuffer(buffer) ? " (Ignored)" : "");
+    }
+
     dead_buffers_to_free.clear();
     for (const LogicalBuffer* buffer : buffers_defined_by_instruction) {
       if (IgnoreBuffer(buffer)) {
         continue;
       }
-      FlatSet<const HloInstruction*>* live_set = nullptr;
-      for (const BufferAlias& alias :
-           points_to_analysis.GetBufferAliases(*buffer)) {
-        const std::vector<HloInstruction*>& users =
-            alias.instruction()->users();
-        if (!users.empty()) {
-          if (live_set == nullptr) {
-            live_set = &live_buffers[buffer];
-          }
-          live_set->insert(users.begin(), users.end());
-        }
-      }
-
       // Add a nullptr sentry to ensure entry parameters and output source
       // buffers are not freed until the very end.
       const bool entry_parameter =
@@ -165,11 +161,12 @@ Status HeapSimulator::RunComputation(
     // have no instructions left to visit are moved from live_buffers to
     // operand_buffers_to_free.
     operand_buffers_to_free.clear();
-    for (const LogicalBuffer* operand_buffer :
-         UniqueOperandSourceBuffers(instruction, points_to_analysis)) {
+    for (const LogicalBuffer* operand_buffer : used_buffers[instruction]) {
       if (IgnoreBuffer(operand_buffer)) {
         continue;
       }
+      VLOG(4) << "  Removing user " << instruction->name() << " from buffer "
+              << operand_buffer->ToString();
       auto it = live_buffers.find(operand_buffer);
       FlatSet<const HloInstruction*>* live_set = &it->second;
       live_set->erase(instruction);
@@ -178,6 +175,11 @@ Status HeapSimulator::RunComputation(
         operand_buffers_to_free.push_back(operand_buffer);
       }
     }
+    // Sort to get a deterministic iteration order.
+    std::sort(operand_buffers_to_free.begin(), operand_buffers_to_free.end(),
+              [](const LogicalBuffer* x, const LogicalBuffer* y) {
+                return x->id() < y->id();
+              });
 
     // Allocate buffers defined by this instruction.  This is the latest point
     // that we can allocate; right before the buffer is first used.  This must
@@ -203,6 +205,8 @@ Status HeapSimulator::RunComputation(
               CanShareOperandBufferWithUser(
                   operand_buffer->instruction(), operand_buffer->index(),
                   buffer->instruction(), buffer->index(), points_to_analysis)) {
+            VLOG(3) << "  Sharing: " << buffer->ToString() << " with "
+                    << operand_buffer->ToString();
             ShareBuffer(buffer, operand_buffer, instruction);
             shared = true;
             break;
@@ -211,6 +215,7 @@ Status HeapSimulator::RunComputation(
       }
 
       if (!shared) {
+        VLOG(3) << "  Allocating: " << buffer->ToString();
         Alloc(buffer, instruction);
       }
     }
@@ -225,6 +230,7 @@ Status HeapSimulator::RunComputation(
     // sub-computations will never be run concurrently.
     if (module_sequence_ != nullptr) {
       if (instruction->opcode() == HloOpcode::kCall ||
+          instruction->opcode() == HloOpcode::kConditional ||
           instruction->opcode() == HloOpcode::kWhile) {
         for (const HloComputation* called_computation :
              instruction->called_computations()) {
@@ -243,20 +249,34 @@ Status HeapSimulator::RunComputation(
     // Free buffers that are no longer live.  This is the earliest point that we
     // can de-allocate; right after the last use of the buffer.
     for (const LogicalBuffer* buffer : dead_buffers_to_free) {
+      VLOG(3) << "  Freeing dead: " << buffer->ToString();
       Free(buffer, instruction);
     }
     for (const LogicalBuffer* buffer : operand_buffers_to_free) {
+      VLOG(3) << "  Freeing operand: " << buffer->ToString();
       Free(buffer, instruction);
     }
   }
 
   // Any remaining live buffers must be entry parameters or output source
-  // buffers, which had a nullptr sentry added.  Free them now.
+  // buffers, which had a nullptr sentry added.  Free them now, in a
+  // deterministic order.
+  std::vector<const LogicalBuffer*> to_free;
+  to_free.reserve(live_buffers.size());
   for (const auto& buffer_pending : live_buffers) {
     const LogicalBuffer* buffer = buffer_pending.first;
     const FlatSet<const HloInstruction*>& pending = buffer_pending.second;
     CHECK_EQ(pending.size(), 1) << *buffer;
     CHECK(*pending.begin() == nullptr) << *buffer;
+    to_free.push_back(buffer);
+  }
+
+  std::sort(to_free.begin(), to_free.end(),
+            [](const LogicalBuffer* x, const LogicalBuffer* y) {
+              return x->id() < y->id();
+            });
+  for (const LogicalBuffer* buffer : to_free) {
+    VLOG(3) << "Freeing pending: " << buffer->ToString();
     Free(buffer, root);
   }
 
diff --git a/tensorflow/compiler/xla/service/heap_simulator_test.cc b/tensorflow/compiler/xla/service/heap_simulator_test.cc
index 387b649a731ebcbfd8307807469f39f22d192b06..688a271712ac243666ba4ff02932aa4f7f7ed21c 100644
--- a/tensorflow/compiler/xla/service/heap_simulator_test.cc
+++ b/tensorflow/compiler/xla/service/heap_simulator_test.cc
@@ -410,6 +410,56 @@ TEST_F(HeapSimulatorTest, MultiplyDotDotTuple) {
   });
 }
 
+TEST_F(HeapSimulatorTest, IndependentTupleElements) {
+  auto builder = HloComputation::Builder(TestName());
+  auto paramA = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, f32scalar_, "paramA"));
+  auto paramB = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, f32scalar_, "paramB"));
+  auto mul = builder.AddInstruction(HloInstruction::CreateBinary(
+      f32scalar_, HloOpcode::kMultiply, paramA, paramB));
+  auto add = builder.AddInstruction(HloInstruction::CreateBinary(
+      f32scalar_, HloOpcode::kAdd, paramA, paramB));
+  auto tuple = builder.AddInstruction(HloInstruction::CreateTuple({mul, add}));
+  auto element0 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(f32scalar_, tuple, 0));
+  auto broadcast = builder.AddInstruction(
+      HloInstruction::CreateBroadcast(f32vec4_, element0, {0}));
+  auto sub = builder.AddInstruction(HloInstruction::CreateBinary(
+      f32scalar_, HloOpcode::kSubtract, paramA, paramB));
+  auto element1 = builder.AddInstruction(
+      HloInstruction::CreateGetTupleElement(f32scalar_, tuple, 1));
+  auto output = builder.AddInstruction(
+      HloInstruction::CreateTuple({broadcast, sub, element1}));
+
+  HeapSimulatorTracker tracker(TestName(), builder.Build(),
+                               {paramA, paramB, mul, add, tuple, element0,
+                                broadcast, sub, element1, output});
+  tracker.ExpectCallSequence({
+      {kAlloc, tracker.BufferAt(paramA, {})},
+      {kAlloc, tracker.BufferAt(paramB, {})},
+      {kAlloc, tracker.BufferAt(mul, {})},
+      {kAlloc, tracker.BufferAt(add, {})},
+      {kAlloc, tracker.BufferAt(tuple, {})},
+      {kAlloc, tracker.BufferAt(broadcast, {})},
+      // The mul can be freed right after the broadcast happens, even though
+      // the other GetTupleElement is still alive.
+      {kFree, tracker.BufferAt(mul, {})},
+      {kAlloc, tracker.BufferAt(sub, {})},
+      // The temporary tuple is now dead.
+      {kFree, tracker.BufferAt(tuple, {})},
+      {kAlloc, tracker.BufferAt(output, {})},
+      // All params and outputs are freed at the end.
+      {kFree, tracker.BufferAt(paramA, {})},
+      {kFree, tracker.BufferAt(paramB, {})},
+      {kFree, tracker.BufferAt(add, {})},
+      {kFree, tracker.BufferAt(broadcast, {})},
+      {kFree, tracker.BufferAt(sub, {})},
+      {kFree, tracker.BufferAt(output, {})},
+      {kFinish, nullptr},
+  });
+}
+
 TEST_F(HeapSimulatorTest, WholeModule) {
   HeapSimulatorTracker tracker(TestName());
 
diff --git a/tensorflow/compiler/xla/service/hlo.proto b/tensorflow/compiler/xla/service/hlo.proto
index 36db711c6c3570efdf678261ad38bbdb08cf94aa..a43785b4a9701369ae315f67d4d64d03dc6c081d 100644
--- a/tensorflow/compiler/xla/service/hlo.proto
+++ b/tensorflow/compiler/xla/service/hlo.proto
@@ -129,6 +129,10 @@ message HloInstructionProto {
 
   // FFT length.
   repeated int64 fft_length = 32;
+
+  // Gather dimension numbers.
+  xla.GatherDimensionNumbers gather_dimension_numbers = 33;
+  repeated int64 gather_window_bounds = 34;
 }
 
 // Serialization of HloComputation.
diff --git a/tensorflow/compiler/xla/service/hlo_alias_analysis.cc b/tensorflow/compiler/xla/service/hlo_alias_analysis.cc
index 6d2a3aa5b531650a658502531e050702ffbd3760..30e32a46d7dd0923f738939c33407ac7484b5bbe 100644
--- a/tensorflow/compiler/xla/service/hlo_alias_analysis.cc
+++ b/tensorflow/compiler/xla/service/hlo_alias_analysis.cc
@@ -419,7 +419,7 @@ StatusOr<std::unique_ptr<HloAliasAnalysis>> HloAliasAnalysis::Run(
   auto alias_analysis = WrapUnique(new HloAliasAnalysis(module));
   TF_ASSIGN_OR_RETURN(
       alias_analysis->dataflow_analysis_,
-      HloDataflowAnalysis::Run(module, /*ssa_form=*/true,
+      HloDataflowAnalysis::Run(*module, /*ssa_form=*/true,
                                /*bitcast_defines_value=*/false));
 
   BufferValueMap buffer_map(alias_analysis->dataflow_analysis());
diff --git a/tensorflow/compiler/xla/service/hlo_computation.cc b/tensorflow/compiler/xla/service/hlo_computation.cc
index 5432419e4a2dd2916da32ac6566851bf52fd68ca..21e6b2ca730f6347af902097e6496826b861e8a3 100644
--- a/tensorflow/compiler/xla/service/hlo_computation.cc
+++ b/tensorflow/compiler/xla/service/hlo_computation.cc
@@ -509,13 +509,14 @@ StatusOr<HloInstruction*> HloComputation::DeepCopyInstruction(
         "Can't deep copy instruction %s: instruction is not in computation %s",
         instruction->name().c_str(), name().c_str());
   }
-
   if (indices_to_copy != nullptr &&
       !ShapeUtil::Compatible(instruction->shape(), indices_to_copy->shape())) {
     return FailedPrecondition(
         "Can't deep copy instruction %s: given shape tree of indices to copy "
-        "has incompatible shape",
-        instruction->name().c_str());
+        "has incompatible shapes: %s vs. %s",
+        instruction->name().c_str(),
+        ShapeUtil::HumanString(instruction->shape()).c_str(),
+        ShapeUtil::HumanString(indices_to_copy->shape()).c_str());
   }
 
   ShapeIndex index;
diff --git a/tensorflow/compiler/xla/service/hlo_computation.h b/tensorflow/compiler/xla/service/hlo_computation.h
index 061c59abe5e315917161ed737f89de53d71bb1b6..39d864efcb70382b6f8e631d7e6e452ea6410104 100644
--- a/tensorflow/compiler/xla/service/hlo_computation.h
+++ b/tensorflow/compiler/xla/service/hlo_computation.h
@@ -77,6 +77,14 @@ class HloComputation {
       return last_added_instruction_;
     }
 
+    Status ForEachInstruction(
+        const std::function<Status(const HloInstruction*)>& func) const {
+      for (const auto& instruction : instructions_) {
+        TF_RETURN_IF_ERROR(func(instruction.get()));
+      }
+      return Status::OK();
+    }
+
    private:
     const string name_;
     HloInstruction* last_added_instruction_;
diff --git a/tensorflow/compiler/xla/service/hlo_constant_folding.cc b/tensorflow/compiler/xla/service/hlo_constant_folding.cc
index 53450991b6fad5b9651d9d23b55c908e6b68e5dd..35ecd4428d0dfde2de445ea34472d2c78148c6c9 100644
--- a/tensorflow/compiler/xla/service/hlo_constant_folding.cc
+++ b/tensorflow/compiler/xla/service/hlo_constant_folding.cc
@@ -35,7 +35,10 @@ limitations under the License.
 namespace xla {
 
 StatusOr<bool> HloConstantFolding::Run(HloModule* module) {
-  auto evaluator = MakeUnique<HloEvaluator>();
+  // Limit the constant folding to 0 iterations to skip folding loops. This
+  // retains the behavior from before while-loop support was added to
+  // HloEvaluator and may be revised.
+  auto evaluator = MakeUnique<HloEvaluator>(/*max_loop_iterations=*/0);
 
   XLA_VLOG_LINES(2,
                  "HloConstantFolding::Run(), before:\n" + module->ToString());
diff --git a/tensorflow/compiler/xla/service/hlo_cost_analysis.cc b/tensorflow/compiler/xla/service/hlo_cost_analysis.cc
index 9cd5a1e2b71a7aa768e478289e8e4cc13030fcc3..4ec2ef27bf59b0c877ec38e55ef5c12debeec227 100644
--- a/tensorflow/compiler/xla/service/hlo_cost_analysis.cc
+++ b/tensorflow/compiler/xla/service/hlo_cost_analysis.cc
@@ -229,6 +229,10 @@ Status HloCostAnalysis::HandleOutfeed(const HloInstruction*) {
   return Status::OK();
 }
 
+Status HloCostAnalysis::HandleHostCompute(const HloInstruction*) {
+  return Status::OK();
+}
+
 Status HloCostAnalysis::HandleMap(const HloInstruction* map) {
   // Compute properties of the mapped function.
   TF_ASSIGN_OR_RETURN(const Properties sub_properties,
@@ -529,6 +533,11 @@ Status HloCostAnalysis::HandleConditional(const HloInstruction* conditional) {
   return Status::OK();
 }
 
+Status HloCostAnalysis::HandleGather(const HloInstruction* gather) {
+  // Gather does not issue any flops.
+  return Status::OK();
+}
+
 Status HloCostAnalysis::FinishVisit(const HloInstruction*) {
   return Status::OK();
 }
diff --git a/tensorflow/compiler/xla/service/hlo_cost_analysis.h b/tensorflow/compiler/xla/service/hlo_cost_analysis.h
index e5783539e5436f09fa58bf7889118380ee90fea0..d17678d20f2a23fd98d18b77d5fb25853901a789 100644
--- a/tensorflow/compiler/xla/service/hlo_cost_analysis.h
+++ b/tensorflow/compiler/xla/service/hlo_cost_analysis.h
@@ -71,6 +71,7 @@ class HloCostAnalysis : public ConstDfsHloVisitor {
   Status HandleCrossReplicaSum(const HloInstruction* crs) override;
   Status HandleInfeed(const HloInstruction* infeed) override;
   Status HandleOutfeed(const HloInstruction* outfeed) override;
+  Status HandleHostCompute(const HloInstruction* host_compute) override;
   Status HandleRng(const HloInstruction* random) override;
   Status HandleReverse(const HloInstruction* reverse) override;
   Status HandleSort(const HloInstruction* sort) override;
@@ -99,6 +100,7 @@ class HloCostAnalysis : public ConstDfsHloVisitor {
   Status HandleTranspose(const HloInstruction* transpose) override;
   Status HandleWhile(const HloInstruction* xla_while) override;
   Status HandleConditional(const HloInstruction* conditional) override;
+  Status HandleGather(const HloInstruction* gather) override;
   Status FinishVisit(const HloInstruction* root) override;
 
   Status Preprocess(const HloInstruction* hlo) override;
diff --git a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
index ccbbe8f1966d59b4ab2904dcc6ea724aaf4a7603..934e43ba4879628362009267c671ec4cb0d79c52 100644
--- a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
+++ b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.cc
@@ -38,12 +38,12 @@ namespace xla {
 using ::tensorflow::strings::StrAppend;
 using ::tensorflow::strings::StrCat;
 
-HloDataflowAnalysis::HloDataflowAnalysis(HloModule* module, bool ssa_form,
+HloDataflowAnalysis::HloDataflowAnalysis(const HloModule& module, bool ssa_form,
                                          bool bitcast_defines_value)
     : module_(module),
       ssa_form_(ssa_form),
       bitcast_defines_value_(bitcast_defines_value),
-      call_graph_(CallGraph::Build(module)) {}
+      call_graph_(CallGraph::Build(&module)) {}
 
 bool HloDataflowAnalysis::ValueIsDefinedAt(const HloInstruction* instruction,
                                            const ShapeIndex& index) const {
@@ -115,9 +115,9 @@ void HloDataflowAnalysis::DeleteMarkedValues() {
 }
 
 string HloDataflowAnalysis::ToString() const {
-  string out = StrCat("HloDataflowAnalysis, module ", module_->name(), "\n");
+  string out = StrCat("HloDataflowAnalysis, module ", module_.name(), "\n");
   StrAppend(&out, "  Instruction value sets:\n");
-  for (const HloComputation* computation : module_->computations()) {
+  for (const HloComputation* computation : module_.computations()) {
     for (const HloInstruction* instruction : computation->instructions()) {
       StrAppend(&out, "    ", instruction->name(), ":\n");
       if (ShapeUtil::IsTuple(instruction->shape())) {
@@ -592,7 +592,7 @@ void HloDataflowAnalysis::Propagate() {
     }
   };
 
-  for (HloComputation* computation : module_->computations()) {
+  for (HloComputation* computation : module_.computations()) {
     for (HloInstruction* instruction : computation->instructions()) {
       add_to_worklist(instruction);
     }
@@ -686,7 +686,7 @@ InstructionValueSet& HloDataflowAnalysis::GetInstructionValueSet(
 }
 
 Status HloDataflowAnalysis::InitializeInstructionValueSets() {
-  for (const HloComputation* computation : module_->computations()) {
+  for (const HloComputation* computation : module_.computations()) {
     const CallGraphNode& call_graph_node = call_graph_->GetNode(computation);
     for (HloInstruction* instruction : computation->instructions()) {
       // Create an empty shape tree.
@@ -787,9 +787,9 @@ Status HloDataflowAnalysis::InitializeInstructionValueSets() {
 
 /* static */
 StatusOr<std::unique_ptr<HloDataflowAnalysis>> HloDataflowAnalysis::Run(
-    HloModule* module, bool ssa_form, bool bitcast_defines_value) {
-  VLOG(1) << "HloDataflowAnalysis::Run on module " << module->name();
-  XLA_VLOG_LINES(2, module->ToString());
+    const HloModule& module, bool ssa_form, bool bitcast_defines_value) {
+  VLOG(1) << "HloDataflowAnalysis::Run on module " << module.name();
+  XLA_VLOG_LINES(2, module.ToString());
 
   auto dataflow_analysis = WrapUnique(
       new HloDataflowAnalysis(module, ssa_form, bitcast_defines_value));
@@ -806,7 +806,7 @@ StatusOr<std::unique_ptr<HloDataflowAnalysis>> HloDataflowAnalysis::Run(
   // lookup is faster.
   std::vector<std::vector<HloPosition>> value_positions(
       dataflow_analysis->next_value_id_);
-  for (const HloComputation* computation : module->computations()) {
+  for (const HloComputation* computation : module.computations()) {
     for (HloInstruction* instruction : computation->instructions()) {
       for (const auto& pair :
            dataflow_analysis->GetInstructionValueSet(instruction)) {
@@ -858,7 +858,7 @@ Status HloDataflowAnalysis::Verify() const {
 
   // For each value in each value set, verify that the value set's position
   // appears in the value's positions().
-  for (const auto& computation : module_->computations()) {
+  for (const auto& computation : module_.computations()) {
     for (const auto& instruction : computation->instructions()) {
       for (const auto& pair : GetInstructionValueSet(instruction)) {
         const ShapeIndex& index = pair.first;
diff --git a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.h b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.h
index 89d318188f0855c7924836a51cfe98d531e08cb4..7b8a74b096ff48733717e78ada5bb56a28caed72 100644
--- a/tensorflow/compiler/xla/service/hlo_dataflow_analysis.h
+++ b/tensorflow/compiler/xla/service/hlo_dataflow_analysis.h
@@ -60,7 +60,7 @@ class HloDataflowAnalysis {
   //     a new HLO value in the analysis. If false then Bitcast forwards the
   //     value of its operand.
   static StatusOr<std::unique_ptr<HloDataflowAnalysis>> Run(
-      HloModule* module, bool ssa_form = false,
+      const HloModule& module, bool ssa_form = false,
       bool bitcast_defines_value = false);
 
   // Returns true if 'instruction' defines an HLO value at the given shape index
@@ -119,7 +119,7 @@ class HloDataflowAnalysis {
   string ToString() const;
 
  protected:
-  HloDataflowAnalysis(HloModule* module, bool ssa_form,
+  HloDataflowAnalysis(const HloModule& module, bool ssa_form,
                       bool bitcast_defines_value = false);
 
   // Returns a new HloValue defined at the given instruction and shape index.
@@ -180,7 +180,7 @@ class HloDataflowAnalysis {
   // Verify various invariants of the dataflow analysis.
   Status Verify() const;
 
-  HloModule* const module_;
+  const HloModule& module_;
   const bool ssa_form_;
   const bool bitcast_defines_value_;
 
diff --git a/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc b/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
index e714b2567fd1b3eab607a19f0bb7e3288150dc64..7bf3a1a06045c79621d75b653bf42220705a69d4 100644
--- a/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_dataflow_analysis_test.cc
@@ -50,7 +50,7 @@ class HloDataflowAnalysisTest : public HloTestBase,
                                          bool bitcast_defines_value = false) {
     hlo_graph_dumper::MaybeDumpHloModule(*module_, "Before dataflow analysis");
     analysis_ =
-        HloDataflowAnalysis::Run(module_.get(), ssa_form, bitcast_defines_value)
+        HloDataflowAnalysis::Run(*module_, ssa_form, bitcast_defines_value)
             .ConsumeValueOrDie();
     return *analysis_;
   }
diff --git a/tensorflow/compiler/xla/service/hlo_dce.cc b/tensorflow/compiler/xla/service/hlo_dce.cc
index 1e5f0f797a13fd7e7ce1cc934387a274a74153bc..fcd723af146e2227b8661b1a4993f1338f7de389 100644
--- a/tensorflow/compiler/xla/service/hlo_dce.cc
+++ b/tensorflow/compiler/xla/service/hlo_dce.cc
@@ -40,7 +40,7 @@ StatusOr<bool> HloDCE::Run(HloModule* module) {
   VLOG(2) << "Before dce:";
   XLA_VLOG_LINES(2, module->ToString());
 
-  for (auto* computation : module->MakeNonfusionComputations()) {
+  for (auto* computation : module->MakeComputationPostOrder()) {
     std::unordered_set<HloInstruction*> live_instructions;
     TF_RETURN_IF_ERROR(computation->root_instruction()->Accept(
         [&live_instructions](HloInstruction* instruction) {
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator.cc b/tensorflow/compiler/xla/service/hlo_evaluator.cc
index 8016b38d15330842c0e11f192587b6035a0dae01..42de7ada61aae7028231f457450c404cb9e19ed8 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator.cc
+++ b/tensorflow/compiler/xla/service/hlo_evaluator.cc
@@ -34,8 +34,6 @@ limitations under the License.
 #include "tensorflow/compiler/xla/service/hlo_query.h"
 #include "tensorflow/compiler/xla/service/shape_inference.h"
 #include "tensorflow/compiler/xla/shape_util.h"
-#include "tensorflow/compiler/xla/status.h"
-#include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/util.h"
 #include "tensorflow/compiler/xla/window_util.h"
@@ -53,12 +51,22 @@ namespace xla {
 
 namespace {
 
+using tensorflow::gtl::ArraySlice;
+using tensorflow::gtl::FlatSet;
+using tensorflow::gtl::optional;
+
 template <typename T>
 struct is_complex_t : public std::false_type {};
 
 template <>
 struct is_complex_t<complex64> : public std::true_type {};
 
+template <typename T>
+struct is_complex64_t : public std::false_type {};
+
+template <>
+struct is_complex64_t<complex64> : public std::true_type {};
+
 template <typename OperandT>
 StatusOr<std::unique_ptr<Literal>> Compare(const Shape& shape, HloOpcode opcode,
                                            const Literal& lhs_literal,
@@ -101,11 +109,10 @@ StatusOr<std::unique_ptr<Literal>> Compare(const Shape& shape, HloOpcode opcode,
   }
 
   auto result = Literal::CreateFromShape(shape);
-  TF_RETURN_IF_ERROR(result->Populate<bool>(
-      [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
-        return compare_op(lhs_literal.Get<OperandT>(multi_index),
-                          rhs_literal.Get<OperandT>(multi_index));
-      }));
+  TF_RETURN_IF_ERROR(result->Populate<bool>([&](ArraySlice<int64> multi_index) {
+    return compare_op(lhs_literal.Get<OperandT>(multi_index),
+                      rhs_literal.Get<OperandT>(multi_index));
+  }));
 
   return std::move(result);
 }
@@ -132,11 +139,10 @@ StatusOr<std::unique_ptr<Literal>> Compare<complex64>(
   }
 
   auto result = Literal::CreateFromShape(shape);
-  TF_RETURN_IF_ERROR(result->Populate<bool>(
-      [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
-        return compare_op(lhs_literal.Get<complex64>(multi_index),
-                          rhs_literal.Get<complex64>(multi_index));
-      }));
+  TF_RETURN_IF_ERROR(result->Populate<bool>([&](ArraySlice<int64> multi_index) {
+    return compare_op(lhs_literal.Get<complex64>(multi_index),
+                      rhs_literal.Get<complex64>(multi_index));
+  }));
 
   return std::move(result);
 }
@@ -161,8 +167,8 @@ StatusOr<std::unique_ptr<Literal>> ElementWiseUnaryOpImpl(
 
   auto result = Literal::CreateFromShape(shape);
 
-  TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-      [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+  TF_RETURN_IF_ERROR(
+      result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
         return unary_op(operand_literal.Get<NativeT>(multi_index));
       }));
   return std::move(result);
@@ -174,7 +180,7 @@ StatusOr<std::unique_ptr<Literal>> ElementWiseUnaryOpImpl(
 // with the base index.
 void IterateThroughWindow(
     const Shape& window_shape, const Window& window, const Shape& base_shape,
-    const tensorflow::gtl::ArraySlice<int64>& window_count_index,
+    const ArraySlice<int64>& window_count_index,
     const std::function<void(const std::vector<int64>&)>& f) {
   const int64 rank = ShapeUtil::Rank(base_shape);
   DimensionVector window_index(rank);
@@ -250,17 +256,37 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
   template <
       typename NativeT,
-      typename std::enable_if<std::is_signed<NativeT>::value ||
-                              is_complex_t<NativeT>::value>::type* = nullptr>
+      typename std::enable_if<std::is_signed<NativeT>::value>::type* = nullptr>
   Status HandleAbs(HloInstruction* abs) {
     TF_ASSIGN_OR_RETURN(parent_->evaluated_[abs],
-                        ElementWiseUnaryOp(abs, [](ElementwiseT elem_operand) {
+                        ElementWiseUnaryOp(abs, [](NativeT elem_operand) {
                           return std::abs(elem_operand);
                         }));
     return Status::OK();
   }
 
+  template <
+      typename NativeT,
+      typename std::enable_if<is_complex64_t<NativeT>::value>::type* = nullptr>
+  Status HandleAbs(HloInstruction* abs) {
+    const Literal& operand_literal =
+        parent_->GetEvaluatedLiteralFor(abs->operand(0));
+    TF_ASSIGN_OR_RETURN(
+        parent_->evaluated_[abs],
+        (ElementWiseUnaryOpImpl<float, NativeT>(
+            abs, [](NativeT elem_operand) { return std::abs(elem_operand); },
+            operand_literal)));
+
+    return Status::OK();
+  }
+
   Status HandleAbs(HloInstruction* abs) override {
+    // If the operand is of C64 type, the return type of abs will be F32.
+    // However, ElementwiseT would still be the return type, F32, and thus
+    // specifying the ElementwiseT explicitly as C64 is needed below.
+    if (abs->operand(0)->shape().element_type() == C64) {
+      return HandleAbs<complex64>(abs);
+    }
     return HandleAbs<ElementwiseT>(abs);
   }
 
@@ -308,13 +334,12 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
                    operand_to_broadcast.shape().dimensions(i));
     }
 
-    return output->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
-          for (int64 i = 0; i < broadcast->dimensions().size(); ++i) {
-            broadcast_indices[i] = multi_index[broadcast->dimensions(i)];
-          }
-          return operand_to_broadcast.Get<ReturnT>(broadcast_indices);
-        });
+    return output->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
+      for (int64 i = 0; i < broadcast->dimensions().size(); ++i) {
+        broadcast_indices[i] = multi_index[broadcast->dimensions(i)];
+      }
+      return operand_to_broadcast.Get<ReturnT>(broadcast_indices);
+    });
   }
 
   template <
@@ -588,14 +613,25 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     return Status::OK();
   }
 
-  template <
-      typename NativeT,
-      typename std::enable_if<!is_complex_t<NativeT>::value>::type* = nullptr>
+  template <typename NativeT,
+            typename std::enable_if<std::is_integral<NativeT>::value>::type* =
+                nullptr>
+  Status HandleMaximum(HloInstruction* maximum) {
+    TF_ASSIGN_OR_RETURN(
+        parent_->evaluated_[maximum],
+        ElementWiseBinaryOp(maximum, [](ElementwiseT lhs, ElementwiseT rhs) {
+          return std::max(lhs, rhs);
+        }));
+    return Status::OK();
+  }
+
+  template <typename NativeT, typename std::enable_if<std::is_floating_point<
+                                  NativeT>::value>::type* = nullptr>
   Status HandleMaximum(HloInstruction* maximum) {
     TF_ASSIGN_OR_RETURN(
         parent_->evaluated_[maximum],
         ElementWiseBinaryOp(maximum, [](ElementwiseT lhs, ElementwiseT rhs) {
-          return std::fmax(lhs, rhs);
+          return ((lhs >= rhs) || std::isnan(lhs)) ? lhs : rhs;
         }));
     return Status::OK();
   }
@@ -611,18 +647,30 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     return HandleMaximum<ElementwiseT>(maximum);
   }
 
-  template <
-      typename NativeT,
-      typename std::enable_if<!is_complex_t<NativeT>::value>::type* = nullptr>
+  template <typename NativeT,
+            typename std::enable_if<std::is_integral<NativeT>::value>::type* =
+                nullptr>
   Status HandleMinimum(HloInstruction* minimum) {
     TF_ASSIGN_OR_RETURN(parent_->evaluated_[minimum],
                         ElementWiseBinaryOp(minimum, [](ElementwiseT lhs_el,
                                                         ElementwiseT rhs_el) {
-                          return std::fmin(lhs_el, rhs_el);
+                          return std::min(lhs_el, rhs_el);
                         }));
     return Status::OK();
   }
 
+  template <typename NativeT, typename std::enable_if<std::is_floating_point<
+                                  NativeT>::value>::type* = nullptr>
+  Status HandleMinimum(HloInstruction* minimum) {
+    TF_ASSIGN_OR_RETURN(
+        parent_->evaluated_[minimum],
+        ElementWiseBinaryOp(minimum, [](ElementwiseT lhs_el,
+                                        ElementwiseT rhs_el) {
+          return ((lhs_el <= rhs_el) || std::isnan(lhs_el)) ? lhs_el : rhs_el;
+        }));
+    return Status::OK();
+  }
+
   template <
       typename NativeT,
       typename std::enable_if<is_complex_t<NativeT>::value>::type* = nullptr>
@@ -742,7 +790,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     TF_ASSIGN_OR_RETURN(
         parent_->evaluated_[shl],
         ElementWiseBinaryOp(shl, [](NativeT lhs_elem, NativeT rhs_elem) {
-          return lhs_elem << rhs_elem;
+          return IsShiftOutOfBounds<NativeT>(rhs_elem) ? 0
+                                                       : (lhs_elem << rhs_elem);
         }));
     return Status::OK();
   }
@@ -767,8 +816,12 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     TF_ASSIGN_OR_RETURN(
         parent_->evaluated_[shr],
         ElementWiseBinaryOp(shr, [](NativeT lhs_elem, NativeT rhs_elem) {
-          return static_cast<NativeT>(static_cast<SignedT>(lhs_elem) >>
-                                      rhs_elem);
+          SignedT lhs_signed = static_cast<SignedT>(lhs_elem);
+          if (IsShiftOutOfBounds<NativeT>(rhs_elem)) {
+            return lhs_signed < 0 ? static_cast<SignedT>(-1) : 0;
+          } else {
+            return lhs_signed >> rhs_elem;
+          }
         }));
     return Status::OK();
   }
@@ -795,7 +848,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
         parent_->evaluated_[shr],
         ElementWiseBinaryOp(shr, [](NativeT lhs_elem, NativeT rhs_elem) {
           // If shift amount is greater than the number of bits, then return 0.
-          if (rhs_elem >= sizeof(UnsignedT) * CHAR_BIT) {
+          if (IsShiftOutOfBounds<NativeT>(rhs_elem)) {
             return static_cast<NativeT>(0);
           }
           return static_cast<NativeT>(static_cast<UnsignedT>(lhs_elem) >>
@@ -873,8 +926,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     const Literal& operand_literal = parent_->GetEvaluatedLiteralFor(operand);
     auto result = Literal::CreateFromShape(result_shape);
 
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> out_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> out_index) {
           std::vector<int64> from_index(out_index.begin(), out_index.end());
           for (const int64 dim : reverse_dimensions) {
             from_index[dim] = result_shape.dimensions(dim) - 1 - out_index[dim];
@@ -949,7 +1002,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     DimensionVector rhs_index(rhs_rank);
     DimensionVector rhs_spatial_index(dnums.kernel_spatial_dimensions_size());
 
-    auto func = [&](tensorflow::gtl::ArraySlice<int64> out_index) {
+    auto func = [&](ArraySlice<int64> out_index) {
       ElementwiseT result_val = static_cast<ElementwiseT>(0);
 
       std::fill(lhs_index.begin(), lhs_index.end(), 0);
@@ -1071,9 +1124,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     }
 
     std::vector<int64> rhs_non_batch_non_contracting_dims;
-    tensorflow::gtl::FlatSet<int64> batch_dims_set(
-        dnums.rhs_batch_dimensions().begin(),
-        dnums.rhs_batch_dimensions().end());
+    FlatSet<int64> batch_dims_set(dnums.rhs_batch_dimensions().begin(),
+                                  dnums.rhs_batch_dimensions().end());
     for (int64 i = 0; i < rhs_rank; i++) {
       if (i != rhs_contracting_dimension && batch_dims_set.count(i) == 0) {
         rhs_non_batch_non_contracting_dims.push_back(i);
@@ -1085,8 +1137,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     DimensionVector lhs_index(lhs_rank);
     DimensionVector rhs_index(rhs_rank);
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> result_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> result_index) {
           ElementwiseT result_val = static_cast<ElementwiseT>(0);
 
           // Find the corresponding non-contracting indices for lhs and rhs.
@@ -1180,9 +1232,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
         parent_->GetEvaluatedLiteralFor(pad->operand(1)).Get<ReturnT>({});
     auto result = Literal::CreateFromShape(pad->shape());
     TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&scalar](tensorflow::gtl::ArraySlice<int64> multi_index) {
-          return scalar;
-        }));
+        [&scalar](ArraySlice<int64> multi_index) { return scalar; }));
 
     const Literal& evaluated_operand =
         parent_->GetEvaluatedLiteralFor(pad->operand(0));
@@ -1195,7 +1245,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     // corresponding index of the resulting padded literal.
     const PaddingConfig& pad_config = pad->padding_config();
 
-    auto func = [&](const std::vector<int64>& input_index) {
+    auto func = [&](ArraySlice<int64> input_index) {
       for (auto i = 0; i < input_index.size(); ++i) {
         // Interior padding occurs logically before edge padding, so in the case
         // of negative edge padding elements are removed from the
@@ -1345,9 +1395,9 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     auto result = Literal::CreateFromShape(map->shape());
 
-    HloEvaluator embedded_evaluator;
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+    HloEvaluator embedded_evaluator(parent_->max_loop_iterations_);
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
           std::vector<std::unique_ptr<Literal>> arg_literals;
           arg_literals.reserve(operands.size());
 
@@ -1437,7 +1487,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
   Status HandleReduce(HloInstruction* reduce) override {
     auto arg = reduce->operand(0);
     auto init_value = reduce->operand(1);
-    tensorflow::gtl::ArraySlice<int64> dimensions(reduce->dimensions());
+    ArraySlice<int64> dimensions(reduce->dimensions());
     HloComputation* function = reduce->to_apply();
     TF_RET_CHECK(ShapeUtil::Rank(reduce->shape()) ==
                  ShapeUtil::Rank(arg->shape()) - dimensions.size());
@@ -1480,10 +1530,10 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
       }
     }
 
-    HloEvaluator embedded_evaluator;
+    HloEvaluator embedded_evaluator(parent_->max_loop_iterations_);
     // For each resulting dimension, calculate and assign computed value.
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
           ReturnT result_val = init_scalar;
 
           std::vector<int64> base(arg_dimensions.size());
@@ -1491,7 +1541,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
             base[result_to_arg_index[i]] = multi_index[i];
           }
 
-          auto func = [&](const std::vector<int64>& input_index) {
+          auto func = [&](ArraySlice<int64> input_index) {
             auto curr_val = arg_literal.Get<ReturnT>(input_index);
 
             // Evaluate computation with specified literal operands.
@@ -1537,9 +1587,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     // Initialize result array with the init value.
     TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> output_index) {
-          return init_scalar;
-        }));
+        [&](ArraySlice<int64> output_index) { return init_scalar; }));
 
     std::vector<int64> window_dimension_sizes;
     for (const auto& window_dimension : window.dimensions()) {
@@ -1556,7 +1604,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     int64 rank = ShapeUtil::Rank(operand_literal.shape());
 
-    HloEvaluator embedded_evaluator;
+    HloEvaluator embedded_evaluator(parent_->max_loop_iterations_);
     DimensionVector source_index(rank);
 
     std::fill(source_index.begin(), source_index.end(), 0);
@@ -1572,8 +1620,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
       // 2. Using the selected index, scatter value from `source` to result. We
       // do this by iterating through the window, and compare each index with
       // the selected index.
-      tensorflow::gtl::optional<ReturnT> selected_val;
-      tensorflow::gtl::optional<std::vector<int64>> selected_index;
+      optional<ReturnT> selected_val;
+      optional<std::vector<int64>> selected_index;
 
       IterateThroughWindow(
           window_shape, window, operand_literal.shape(), source_index,
@@ -1667,10 +1715,10 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     DimensionVector window_index(window.dimensions_size());
     DimensionVector operand_index(ShapeUtil::Rank(operand_literal.shape()));
 
-    HloEvaluator embedded_evaluator;
+    HloEvaluator embedded_evaluator(parent_->max_loop_iterations_);
     // For each resulting dimension, calculate and assign computed value.
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> output_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> output_index) {
           ReturnT result_val = init_scalar;
 
           std::fill(window_index.begin(), window_index.end(), 0);
@@ -1720,7 +1768,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     const int64 rank = ShapeUtil::Rank(operand->shape());
     const Literal& operand_literal = parent_->GetEvaluatedLiteralFor(operand);
-    auto func = [&](tensorflow::gtl::ArraySlice<int64> out_index) {
+    auto func = [&](ArraySlice<int64> out_index) {
       DimensionVector operand_index(rank);
       for (int64 i = 0; i < rank; ++i) {
         operand_index[i] =
@@ -1901,8 +1949,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     std::vector<int64> operand_indices(start.size());
 
     auto result = Literal::CreateFromShape(result_shape);
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
           for (int64 i = 0; i < operand_indices.size(); ++i) {
             CHECK_GE(multi_index[i] + start[i], 0);
             // Mod is only used here to be consistent with the existing
@@ -1929,7 +1977,7 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     auto result = operand_literal.CloneToUnique();
     std::vector<int64> result_index(ShapeUtil::Rank(result->shape()), 0);
 
-    auto func = [&](const std::vector<int64>& update_index) {
+    auto func = [&](ArraySlice<int64> update_index) {
       std::transform(update_index.begin(), update_index.end(), start.begin(),
                      result_index.begin(), std::plus<int64>());
 
@@ -1985,8 +2033,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     auto result = Literal::CreateFromShape(shape);
 
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
           return ConvertBinaryFunction(binary_op)(
               lhs_literal.Get<ReturnT>(multi_index),
               rhs_literal.Get<ReturnT>(multi_index));
@@ -2023,8 +2071,8 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
 
     auto result = Literal::CreateFromShape(shape);
 
-    TF_RETURN_IF_ERROR(result->Populate<ReturnT>(
-        [&](tensorflow::gtl::ArraySlice<int64> multi_index) {
+    TF_RETURN_IF_ERROR(
+        result->Populate<ReturnT>([&](ArraySlice<int64> multi_index) {
           return ternary_op(lhs_literal.Get<LhsType>(multi_index),
                             rhs_literal.Get<RhsType>(multi_index),
                             ehs_literal.Get<EhsType>(multi_index));
@@ -2033,10 +2081,19 @@ class HloEvaluator::TypedVisitor : public DfsHloVisitorWithDefault {
     return std::move(result);
   }
 
+  template <typename NativeT>
+  static bool IsShiftOutOfBounds(NativeT rhs) {
+    typedef typename std::make_unsigned<NativeT>::type UnsignedT;
+    UnsignedT lhs_size_unsigned = sizeof(NativeT) * CHAR_BIT;
+    UnsignedT rhs_unsigned = static_cast<UnsignedT>(rhs);
+    return rhs_unsigned >= lhs_size_unsigned;
+  }
+
   HloEvaluator* parent_;
 };  // class HloEvaluator::TypedVisitor
 
-HloEvaluator::HloEvaluator() {
+HloEvaluator::HloEvaluator(int64 max_loop_iterations)
+    : max_loop_iterations_(max_loop_iterations) {
   typed_visitors_[PRED] = MakeUnique<TypedVisitor<bool>>(this);
   typed_visitors_[U8] = MakeUnique<TypedVisitor<uint8>>(this);
   typed_visitors_[U16] = MakeUnique<FunctionVisitor>([](HloInstruction*) {
@@ -2070,8 +2127,7 @@ HloEvaluator::HloEvaluator() {
 
 template <typename LiteralPtr>
 StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate(
-    const HloModule& module,
-    tensorflow::gtl::ArraySlice<LiteralPtr> arg_literals) {
+    const HloModule& module, ArraySlice<LiteralPtr> arg_literals) {
   XLA_VLOG_LINES(2, "HloEvaluator::Evaluate module:\n" + module.ToString());
 
   evaluated_.clear();
@@ -2088,8 +2144,7 @@ StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate(
 
 template <typename LiteralPtr>
 StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate(
-    const HloComputation& computation,
-    tensorflow::gtl::ArraySlice<LiteralPtr> arg_literals) {
+    const HloComputation& computation, ArraySlice<LiteralPtr> arg_literals) {
   XLA_VLOG_LINES(
       2, "HloEvaluator::Evaluate computation:\n" + computation.ToString());
 
@@ -2105,8 +2160,7 @@ StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate(
 
 template <typename LiteralPtr>
 StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate(
-    HloInstruction* instruction,
-    tensorflow::gtl::ArraySlice<LiteralPtr> arg_literals) {
+    HloInstruction* instruction, ArraySlice<LiteralPtr> arg_literals) {
   TF_RET_CHECK(hlo_query::AllOperandsAreParametersOrConstants(*instruction));
   TF_RETURN_IF_ERROR(ShapeUtil::ValidateShape(instruction->shape()));
 
@@ -2231,8 +2285,7 @@ Status HloEvaluator::HandleTranspose(HloInstruction* transpose) {
 }
 
 Status HloEvaluator::HandleConcatenate(HloInstruction* concatenate) {
-  tensorflow::gtl::ArraySlice<HloInstruction*> operands(
-      concatenate->operands());
+  ArraySlice<HloInstruction*> operands(concatenate->operands());
   // The result concatenate dimension is going to be the sum of all
   // concatenate dimensions of the operands taking part of the operation.
   const Shape& reference_shape = operands[0]->shape();
@@ -2434,6 +2487,84 @@ Status HloEvaluator::HandleCopy(HloInstruction* copy) {
   return Status::OK();
 }
 
+Status HloEvaluator::HandleCall(HloInstruction* call) {
+  auto* computation = call->to_apply();
+  auto operands = call->operands();
+
+  std::vector<const Literal*> arg_literals;
+  arg_literals.reserve(operands.size());
+  for (auto operand : operands) {
+    const Literal& arg_literal = GetEvaluatedLiteralFor(operand);
+    arg_literals.push_back(&arg_literal);
+  }
+
+  HloEvaluator embedded_evaluator(max_loop_iterations_);
+  std::unique_ptr<Literal> result =
+      embedded_evaluator.Evaluate<const Literal*>(*computation, arg_literals)
+          .ConsumeValueOrDie();
+
+  evaluated_[call] = std::move(result);
+  return Status::OK();
+}
+
+Status HloEvaluator::HandleConditional(HloInstruction* conditional) {
+  const auto& pred = GetEvaluatedLiteralFor(conditional->operand(0));
+  const auto& true_computation_arg =
+      GetEvaluatedLiteralFor(conditional->operand(1));
+  const auto& false_computation_arg =
+      GetEvaluatedLiteralFor(conditional->operand(2));
+
+  auto* true_computation = conditional->true_computation();
+  auto* false_computation = conditional->false_computation();
+
+  auto result = Literal::CreateFromShape(conditional->shape());
+  HloEvaluator embedded_evaluator(max_loop_iterations_);
+  if (pred.Get<bool>({})) {
+    result = embedded_evaluator
+                 .Evaluate<const Literal*>(*true_computation,
+                                           {&true_computation_arg})
+                 .ConsumeValueOrDie();
+  } else {
+    result = embedded_evaluator
+                 .Evaluate<const Literal*>(*false_computation,
+                                           {&false_computation_arg})
+                 .ConsumeValueOrDie();
+  }
+
+  evaluated_[conditional] = std::move(result);
+  return Status::OK();
+}
+
+Status HloEvaluator::HandleWhile(HloInstruction* while_hlo) {
+  HloComputation* cond_comp = while_hlo->while_condition();
+  HloComputation* body_comp = while_hlo->while_body();
+  // Initialize the loop-carried value with the input to the While instruction.
+  auto lcv = GetEvaluatedLiteralFor(while_hlo->operand(0)).CloneToUnique();
+  bool keep_going = true;
+  int64 iteration_count = 0;
+  HloEvaluator cond_evaluator(max_loop_iterations_);
+  HloEvaluator loop_body_evaluator(max_loop_iterations_);
+  while (keep_going) {
+    if (max_loop_iterations_ >= 0 && iteration_count++ > max_loop_iterations_) {
+      return InvalidArgument("Loop %s exceeded loop iteration limit (%lld).",
+                             while_hlo->name().c_str(), max_loop_iterations_);
+    }
+    TF_ASSIGN_OR_RETURN(auto cond_val, cond_evaluator.Evaluate<Literal*>(
+                                           *cond_comp, {lcv.get()}));
+    keep_going = cond_val->GetFirstElement<bool>();
+    if (keep_going) {
+      TF_ASSIGN_OR_RETURN(auto body_val, loop_body_evaluator.Evaluate<Literal*>(
+                                             *body_comp, {lcv.get()}));
+      VLOG(3) << "Loop iteration result: " << body_val->ToString();
+      lcv = std::move(body_val);
+      cond_evaluator.ResetVisitStates();
+      loop_body_evaluator.ResetVisitStates();
+    }
+  }
+  evaluated_[while_hlo] = std::move(lcv);
+  return Status::OK();
+}
+
 Status HloEvaluator::Preprocess(HloInstruction* hlo) {
   VLOG(2) << "About to visit HLO: " << hlo->ToString();
   return Status::OK();
@@ -2447,28 +2578,27 @@ Status HloEvaluator::Postprocess(HloInstruction* hlo) {
 
 // Explicit instantiation of templatized Evaluate* methods.
 //
-template StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate<
-    const Literal*>(const HloModule& module,
-                    tensorflow::gtl::ArraySlice<const Literal*> arg_literals);
+template StatusOr<std::unique_ptr<Literal>>
+HloEvaluator::Evaluate<const Literal*>(const HloModule& module,
+                                       ArraySlice<const Literal*> arg_literals);
 template StatusOr<std::unique_ptr<Literal>>
 HloEvaluator::Evaluate<std::unique_ptr<Literal>>(
-    const HloModule& module,
-    tensorflow::gtl::ArraySlice<std::unique_ptr<Literal>> arg_literals);
+    const HloModule& module, ArraySlice<std::unique_ptr<Literal>> arg_literals);
 
-template StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate<
-    const Literal*>(const HloComputation& computation,
-                    tensorflow::gtl::ArraySlice<const Literal*> arg_literals);
+template StatusOr<std::unique_ptr<Literal>>
+HloEvaluator::Evaluate<const Literal*>(const HloComputation& computation,
+                                       ArraySlice<const Literal*> arg_literals);
 template StatusOr<std::unique_ptr<Literal>>
 HloEvaluator::Evaluate<std::unique_ptr<Literal>>(
     const HloComputation& computation,
-    tensorflow::gtl::ArraySlice<std::unique_ptr<Literal>> arg_literals);
+    ArraySlice<std::unique_ptr<Literal>> arg_literals);
 
-template StatusOr<std::unique_ptr<Literal>> HloEvaluator::Evaluate<
-    const Literal*>(HloInstruction* instruction,
-                    tensorflow::gtl::ArraySlice<const Literal*> arg_literals);
+template StatusOr<std::unique_ptr<Literal>>
+HloEvaluator::Evaluate<const Literal*>(HloInstruction* instruction,
+                                       ArraySlice<const Literal*> arg_literals);
 template StatusOr<std::unique_ptr<Literal>>
 HloEvaluator::Evaluate<std::unique_ptr<Literal>>(
     HloInstruction* instruction,
-    tensorflow::gtl::ArraySlice<std::unique_ptr<Literal>> arg_literals);
+    ArraySlice<std::unique_ptr<Literal>> arg_literals);
 
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/hlo_evaluator.h b/tensorflow/compiler/xla/service/hlo_evaluator.h
index 3b2b697e492a78a06a4e5ae6bf056ff8676f2ff5..8a27cf9a3a70ee695d6a2f871f5a7770c429e616 100644
--- a/tensorflow/compiler/xla/service/hlo_evaluator.h
+++ b/tensorflow/compiler/xla/service/hlo_evaluator.h
@@ -36,7 +36,10 @@ namespace xla {
 // This class is not thread-safe.
 class HloEvaluator : public DfsHloVisitorWithDefault {
  public:
-  HloEvaluator();
+  // If non-negative, bounds the number of iterations evaluated per while-loop
+  // execution. A negative value (the default) means no limit.
+  explicit HloEvaluator(int64 max_loop_iterations = -1);
+
   // Evaluates an HLO module and an array of pointers to literals.
   // Returns the evaluated result as a literal if successful.
   // Precondition: The indices of arg_literals correspond to the parameter
@@ -153,6 +156,12 @@ class HloEvaluator : public DfsHloVisitorWithDefault {
 
   Status HandleCopy(HloInstruction* copy) override;
 
+  Status HandleConditional(HloInstruction* conditional) override;
+
+  Status HandleCall(HloInstruction* call) override;
+
+  Status HandleWhile(HloInstruction* while_hlo) override;
+
  private:
   // Returns the already-evaluated literal result for the instruction.
   // A Constant instruction is considered evaluated and its literal will be
@@ -190,6 +199,9 @@ class HloEvaluator : public DfsHloVisitorWithDefault {
   // Must be cleared for each evaluation.
   std::vector<const Literal*> arg_literals_;
 
+  // Maximum number of loop iterations to execute; no maximum if negative.
+  int64 max_loop_iterations_;
+
   TF_DISALLOW_COPY_AND_ASSIGN(HloEvaluator);
 };
 
diff --git a/tensorflow/compiler/xla/service/hlo_graph_dumper.cc b/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
index 44fcd36370dcd0cf77601aa1cd2b92810947bd5f..1dc72355cf179e996caab4d6b52068dc99d02244 100644
--- a/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
+++ b/tensorflow/compiler/xla/service/hlo_graph_dumper.cc
@@ -157,52 +157,60 @@ enum ColorScheme {
   kDashedBorder,
 };
 
+// Graphviz attributes/colors that make up a color scheme.
+struct NodeColors {
+  const char* style;
+  const char* fill_color;
+  const char* stroke_color;
+  const char* font_color;
+};
+
+NodeColors NodeColorsForScheme(ColorScheme color) {
+  switch (color) {
+    case kBlue:
+      return NodeColors{"filled", "#bbdefb", "#8aacc8", "black"};
+    case kBrown:
+      return NodeColors{"filled", "#bcaaa4", "#8c7b75", "black"};
+    case kDarkBlue:
+      return NodeColors{"filled", "#1565c0", "#003c8f", "white"};
+    case kDarkGreen:
+      return NodeColors{"filled", "#2e7d32", "#005005", "white"};
+    case kDarkRed:
+      return NodeColors{"filled", "#b71c1c", "#7f0000", "white"};
+    case kGray:
+      return NodeColors{"filled", "#cfd8dc", "#9ea7aa", "black"};
+    case kGreen:
+      return NodeColors{"filled", "#c8e6c9", "#97b498", "black"};
+    case kOrange:
+      return NodeColors{"filled", "#ffe0b2", "#cbae82", "black"};
+    case kPurple:
+      return NodeColors{"filled", "#e1bee7", "#af8eb5", "black"};
+    case kRed:
+      return NodeColors{"filled", "#ffcdd2", "#cb9ca1", "black"};
+    case kWhite:
+      return NodeColors{"filled", "white", "black", "black"};
+    case kYellow:
+      return NodeColors{"filled", "#fff9c4", "#cbc693", "black"};
+    case kDashedBorder:
+      // "filled,dashed" looks the same as "dashed", since we have a white
+      // background.  But we use "filled,dashed" so that when you hover over
+      // any part of the node (not just the text inside the node), our css
+      // :hover rule is triggered.
+      return NodeColors{"filled,dashed", "white", "#757575", "#757575"};
+  }
+}
+
 // Given a ColorScheme, returns an attribute string for a node of that color.
 // Sets the node's style and fill/stroke/text colors.
 //
 // Colors are from https://material.io/color.
 string NodeColorAttributes(ColorScheme color) {
-  using std::make_tuple;
-
-  const char *style, *fill_color, *stroke_color, *font_color;
-  std::tie(style, fill_color, stroke_color, font_color) = [color] {
-    switch (color) {
-      case kBlue:
-        return make_tuple("filled", "#bbdefb", "#8aacc8", "black");
-      case kBrown:
-        return make_tuple("filled", "#bcaaa4", "#8c7b75", "black");
-      case kDarkBlue:
-        return make_tuple("filled", "#1565c0", "#003c8f", "white");
-      case kDarkGreen:
-        return make_tuple("filled", "#2e7d32", "#005005", "white");
-      case kDarkRed:
-        return make_tuple("filled", "#b71c1c", "#7f0000", "white");
-      case kGray:
-        return make_tuple("filled", "#cfd8dc", "#9ea7aa", "black");
-      case kGreen:
-        return make_tuple("filled", "#c8e6c9", "#97b498", "black");
-      case kOrange:
-        return make_tuple("filled", "#ffe0b2", "#cbae82", "black");
-      case kPurple:
-        return make_tuple("filled", "#e1bee7", "#af8eb5", "black");
-      case kRed:
-        return make_tuple("filled", "#ffcdd2", "#cb9ca1", "black");
-      case kWhite:
-        return make_tuple("filled", "white", "black", "black");
-      case kYellow:
-        return make_tuple("filled", "#fff9c4", "#cbc693", "black");
-      case kDashedBorder:
-        // "filled,dashed" looks the same as "dashed", since we have a white
-        // background.  But we use "filled,dashed" so that when you hover over
-        // any part of the node (not just the text inside the node), our css
-        // :hover rule is triggered.
-        return make_tuple("filled,dashed", "white", "#757575", "#757575");
-    }
-  }();
+  NodeColors node_colors = NodeColorsForScheme(color);
 
   return Printf(
-      R"(style="%s", fontcolor="%s", color="%s", fillcolor="%s")", style,
-      font_color, stroke_color, fill_color);
+      R"(style="%s", fontcolor="%s", color="%s", fillcolor="%s")",
+      node_colors.style, node_colors.font_color, node_colors.stroke_color,
+      node_colors.fill_color);
 }
 
 // Replaces <> with &lt;&gt;, so that this string is safe(er) for use in a
@@ -604,11 +612,21 @@ tooltip = " ";
       StrAppend(&subcomp_label, "<br/>", extra_info);
     }
 
-    // Subcomputation's fill/stroke color is light/dark red/gray, depending on
-    // whether or not the subcomputation's fusion node is highlighted.
     bool highlight = filter_.Highlight(parent_instr);
-    const char* fillcolor = highlight ? "#ffcdd2" : "#f5f5f5";
-    const char* strokecolor = highlight ? "#b71c1c" : "#c2c2c2";
+    const char* fillcolor;
+    const char* strokecolor;
+    if (debug_options_.xla_hlo_graph_sharding_color() && !highlight) {
+      // Use the sharding color, if the node isn't highlighted.
+      NodeColors node_colors =
+          NodeColorsForScheme(GetInstructionColor(parent_instr));
+      fillcolor = node_colors.fill_color;
+      strokecolor = node_colors.stroke_color;
+    } else {
+      // Subcomputation's fill/stroke color is light/dark red/gray, depending on
+      // whether or not the subcomputation's fusion node is highlighted.
+      fillcolor = highlight ? "#ffcdd2" : "#f5f5f5";
+      strokecolor = highlight ? "#b71c1c" : "#c2c2c2";
+    }
     style =
         Printf(R"(style="rounded,filled,bold"; fillcolor="%s"; color="%s;")",
                fillcolor, strokecolor);
@@ -782,6 +800,14 @@ string HloDotDumper::GetInstructionNodeInlinedOperands(
   auto stringify_constant = [](const HloInstruction* constant) {
     const auto& shape = constant->shape();
 
+    // If the shape has a dimension of size zero, print it as e.g.
+    // "{} (f32[42, 0, 10])".  The alternative, calling Literal::ToString(),
+    // enumerates all of its empty dimensions (e.g.  "{ { {}, {} }, ..."), which
+    // is just noise.
+    if (ShapeUtil::HasZeroElements(shape)) {
+      return Printf("{} (%s)", ShapeUtil::HumanString(constant->shape()));
+    }
+
     // Print the literal value of constants with <= K elements.
     optional<int64> elem_count;
     if (!ShapeUtil::IsOpaque(shape) && !ShapeUtil::IsTuple(shape)) {
@@ -940,6 +966,7 @@ ColorScheme HloDotDumper::GetInstructionColor(const HloInstruction* instr) {
     case HloOpcode::kConcatenate:
     case HloOpcode::kCopy:
     case HloOpcode::kDynamicSlice:
+    case HloOpcode::kGather:
     case HloOpcode::kPad:
     case HloOpcode::kReshape:
     case HloOpcode::kReverse:
@@ -988,6 +1015,7 @@ ColorScheme HloDotDumper::GetInstructionColor(const HloInstruction* instr) {
     case HloOpcode::kCall:
     case HloOpcode::kConditional:
     case HloOpcode::kCustomCall:
+    case HloOpcode::kHostCompute:
     case HloOpcode::kWhile:
       return kDarkGreen;
     case HloOpcode::kConstant:
diff --git a/tensorflow/compiler/xla/service/hlo_instruction.cc b/tensorflow/compiler/xla/service/hlo_instruction.cc
index 0981f1f4fe57751d5b7059b4b08099385369e4b9..af9d772b0070be0c10e728940721d72d1ab6c04f 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction.cc
+++ b/tensorflow/compiler/xla/service/hlo_instruction.cc
@@ -801,6 +801,22 @@ static string FusionNodeName(HloInstruction::FusionKind fusion_kind) {
   return instruction;
 }
 
+HloInstruction* HloInstruction::AddFusionOperand(HloInstruction* new_operand) {
+  CHECK_EQ(opcode(), HloOpcode::kFusion);
+  CHECK_EQ(operand_count(),
+           fused_instructions_computation()->parameter_instructions().size());
+  const int64 param_no = operand_count();
+  // Name the parameter after the instruction it represents in the outer
+  // (non-fusion) computation.
+  string param_name = StrCat(new_operand->name(), ".param_", param_no);
+  HloInstruction* fused_parameter =
+      fused_instructions_computation()->AddParameter(
+          HloInstruction::CreateParameter(param_no, new_operand->shape(),
+                                          param_name));
+  AppendOperand(new_operand);
+  return fused_parameter;
+}
+
 void HloInstruction::MergeFusionInstruction(
     HloInstruction* instruction_to_merge) {
   CHECK_EQ(opcode_, HloOpcode::kFusion);
@@ -993,13 +1009,7 @@ HloInstruction* HloInstruction::CloneAndFuseInternal(
       // Clone's operand was not already an operand of the fusion
       // instruction. Add it as an operand and add a corresponding fused
       // parameter instruction.
-      int64 param_no = fused_parameters.size();
-      // Name the parameter after the instruction it represents in the outer
-      // (non-fusion) computation.
-      string param_name = StrCat(operand->name(), ".param_", param_no);
-      fused_param = fused_instructions_computation()->AddParameter(
-          CreateParameter(param_no, operand->shape(), param_name));
-      AppendOperand(operand);
+      fused_param = AddFusionOperand(operand);
     }
     TF_CHECK_OK(clone->ReplaceOperandWith(operand_num, fused_param));
   }
@@ -1084,6 +1094,7 @@ bool HloInstruction::HasSideEffect() const {
     case HloOpcode::kInfeed:
     case HloOpcode::kOutfeed:
     case HloOpcode::kTrace:
+    case HloOpcode::kHostCompute:
       return true;
     default: {
       // Check if any of the called computations has a side effect.
@@ -1121,6 +1132,19 @@ bool HloInstruction::HasSideEffect() const {
   return instruction;
 }
 
+/* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateHostCompute(
+    const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+    tensorflow::StringPiece channel_name, const int64 cost_estimate_ns) {
+  std::unique_ptr<HloInstruction> instruction =
+      WrapUnique(new HloInstruction(HloOpcode::kHostCompute, shape));
+  for (auto operand : operands) {
+    instruction->AppendOperand(operand);
+  }
+  instruction->channel_name_ = channel_name.ToString();
+  instruction->cost_estimate_ns_ = cost_estimate_ns;
+  return instruction;
+}
+
 /* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateTuple(
     tensorflow::gtl::ArraySlice<HloInstruction*> elements) {
   std::vector<Shape> element_shapes;
@@ -1131,6 +1155,40 @@ bool HloInstruction::HasSideEffect() const {
   return CreateVariadic(tuple_shape, HloOpcode::kTuple, elements);
 }
 
+/* static */ std::unique_ptr<HloInstruction> HloInstruction::CreateGather(
+    const Shape& shape, HloInstruction* operand, HloInstruction* gather_indices,
+    const GatherDimensionNumbers& gather_dim_numbers,
+    tensorflow::gtl::ArraySlice<int64> window_bounds) {
+  std::unique_ptr<HloInstruction> instruction =
+      WrapUnique(new HloInstruction(HloOpcode::kGather, shape));
+  instruction->AppendOperand(operand);
+  instruction->AppendOperand(gather_indices);
+  instruction->gather_dimension_numbers_ =
+      MakeUnique<GatherDimensionNumbers>(gather_dim_numbers);
+  c_copy(window_bounds, std::back_inserter(instruction->gather_window_bounds_));
+  return instruction;
+}
+
+/* static */ GatherDimensionNumbers HloInstruction::MakeGatherDimNumbers(
+    tensorflow::gtl::ArraySlice<int64> output_window_dims,
+    tensorflow::gtl::ArraySlice<int64> elided_window_dims,
+    tensorflow::gtl::ArraySlice<int64> gather_dims_to_operand_dims,
+    int64 index_vector_dim) {
+  GatherDimensionNumbers gather_dim_numbers;
+  for (int64 output_window_dim : output_window_dims) {
+    gather_dim_numbers.add_output_window_dims(output_window_dim);
+  }
+  for (int64 elided_window_dim : elided_window_dims) {
+    gather_dim_numbers.add_elided_window_dims(elided_window_dim);
+  }
+  for (int64 gather_dim_to_input_dim : gather_dims_to_operand_dims) {
+    gather_dim_numbers.add_gather_dims_to_operand_dims(gather_dim_to_input_dim);
+  }
+
+  gather_dim_numbers.set_index_vector_dim(index_vector_dim);
+  return gather_dim_numbers;
+}
+
 std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
     const Shape& shape,
     tensorflow::gtl::ArraySlice<HloInstruction*> new_operands,
@@ -1212,6 +1270,10 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
     case HloOpcode::kCustomCall:
       clone = CreateCustomCall(shape, new_operands, custom_call_target_);
       break;
+    case HloOpcode::kHostCompute:
+      clone = CreateHostCompute(shape, new_operands, channel_name_,
+                                cost_estimate_ns_);
+      break;
     case HloOpcode::kConcatenate:
       clone = CreateConcatenate(shape, new_operands, dimensions(0));
       break;
@@ -1361,12 +1423,19 @@ std::unique_ptr<HloInstruction> HloInstruction::CloneWithNewOperands(
       break;
     case HloOpcode::kRecv:
       CHECK_EQ(new_operands.size(), 0);
-      clone = CreateRecv(shape, channel_id());
+      // The shape is a tuple, but CreateRecv() wants the raw data shape.
+      clone =
+          CreateRecv(ShapeUtil::GetTupleElementShape(shape, 0), channel_id());
       break;
     case HloOpcode::kRecvDone:
       CHECK_EQ(new_operands.size(), 1);
       clone = CreateRecvDone(new_operands[0]);
       break;
+    case HloOpcode::kGather:
+      CHECK_EQ(new_operands.size(), 2);
+      clone = CreateGather(shape, new_operands[0], new_operands[1],
+                           *gather_dimension_numbers_, gather_window_bounds_);
+      break;
     case HloOpcode::kTrace:
       LOG(FATAL) << "Not yet implemented, clone: " << HloOpcodeString(opcode_);
   }
@@ -1710,6 +1779,11 @@ bool HloInstruction::IdenticalSlowPath(
       return protobuf_util::ProtobufEquals(dot_dimension_numbers(),
                                            other.dot_dimension_numbers());
 
+    case HloOpcode::kGather:
+      return protobuf_util::ProtobufEquals(gather_dimension_numbers(),
+                                           other.gather_dimension_numbers()) &&
+             gather_window_bounds() == other.gather_window_bounds();
+
     // FFT has various types & lengths.
     case HloOpcode::kFft:
       return fft_type() == other.fft_type() &&
@@ -1780,6 +1854,7 @@ bool HloInstruction::IdenticalSlowPath(
     case HloOpcode::kRecvDone:
     case HloOpcode::kSend:
     case HloOpcode::kSendDone:
+    case HloOpcode::kHostCompute:
       return false;
   }
 }
@@ -2140,6 +2215,11 @@ std::vector<string> HloInstruction::ExtraAttributesToString(
   if (dot_dimension_numbers_ != nullptr) {
     extra.push_back(DotDimensionNumbersToString());
   }
+  if (gather_dimension_numbers_ != nullptr) {
+    extra.push_back(GatherDimensionNumbersToString());
+    extra.push_back(
+        StrCat("window_bounds={", Join(gather_window_bounds(), ","), "}"));
+  }
   if (opcode() == HloOpcode::kFft) {
     extra.push_back(StrCat("fft_type=", FftType_Name(fft_type())));
     extra.push_back(StrCat("fft_length={", Join(fft_length(), ","), "}"));
@@ -2271,6 +2351,14 @@ HloInstructionProto HloInstruction::ToProto() const {
   if (dot_dimension_numbers_ != nullptr) {
     *proto.mutable_dot_dimension_numbers() = *dot_dimension_numbers_;
   }
+  if (gather_dimension_numbers_ != nullptr) {
+    *proto.mutable_gather_dimension_numbers() = *gather_dimension_numbers_;
+  }
+  if (opcode() == HloOpcode::kGather) {
+    for (int64 bound : gather_window_bounds()) {
+      proto.add_gather_window_bounds(bound);
+    }
+  }
   for (int i = 0; i < slice_starts_.size(); ++i) {
     auto* slice_dimension = proto.add_slice_dimensions();
     slice_dimension->set_start(slice_starts_[i]);
@@ -2565,6 +2653,8 @@ Status HloInstruction::Visit(DfsHloVisitorBase<HloInstructionPtr>* visitor) {
       return visitor->HandleInfeed(this);
     case HloOpcode::kOutfeed:
       return visitor->HandleOutfeed(this);
+    case HloOpcode::kHostCompute:
+      return visitor->HandleHostCompute(this);
     case HloOpcode::kRng:
       return visitor->HandleRng(this);
     case HloOpcode::kWhile:
@@ -2585,13 +2675,17 @@ Status HloInstruction::Visit(DfsHloVisitorBase<HloInstructionPtr>* visitor) {
       return visitor->HandleSend(this);
     case HloOpcode::kSendDone:
       return visitor->HandleSendDone(this);
+    case HloOpcode::kGather:
+      return visitor->HandleGather(this);
 
     // These opcodes are not handled here.
     case HloOpcode::kTrace:
       break;
   }
-  return Unimplemented("unhandled HloOpcode for DfsHloVisitor: %s",
-                       HloOpcodeString(opcode_).c_str());
+  return InternalError(
+      "Unhandled HloOpcode for DfsHloVisitor: %s. This should not happen - "
+      "please file a bug for XLA.",
+      HloOpcodeString(opcode_).c_str());
 }
 
 // Explicit instantiations.
@@ -3268,6 +3362,26 @@ string HloInstruction::DotDimensionNumbersToString() const {
   return Join(result, ", ");
 }
 
+string HloInstruction::GatherDimensionNumbersToString() const {
+  CHECK_NE(gather_dimension_numbers_.get(), nullptr);
+  string output_window_dims =
+      StrCat("output_window_dims={",
+             Join(gather_dimension_numbers_->output_window_dims(), ","), "}");
+  string elided_window_dims =
+      StrCat("elided_window_dims={",
+             Join(gather_dimension_numbers_->elided_window_dims(), ","), "}");
+  string gather_dims_to_operand_dims = StrCat(
+      "gather_dims_to_operand_dims={",
+      Join(gather_dimension_numbers_->gather_dims_to_operand_dims(), ","), "}");
+  string index_vector_dim = StrCat(
+      "index_vector_dim=", gather_dimension_numbers_->index_vector_dim());
+
+  return Join<std::initializer_list<string>>(
+      {output_window_dims, elided_window_dims, gather_dims_to_operand_dims,
+       index_vector_dim},
+      ", ");
+}
+
 bool HloInstruction::CouldBeBitcast() const {
   switch (opcode_) {
     case HloOpcode::kTranspose:
diff --git a/tensorflow/compiler/xla/service/hlo_instruction.h b/tensorflow/compiler/xla/service/hlo_instruction.h
index 3170746157fbcfa7d0a7eaba6d226d46691105f9..e4c86214c2014095b2e171ff10691e1221574cb7 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction.h
+++ b/tensorflow/compiler/xla/service/hlo_instruction.h
@@ -451,6 +451,12 @@ class HloInstruction {
       HloInstruction* true_computation_arg, HloComputation* true_computation,
       HloInstruction* false_computation_arg, HloComputation* false_computation);
 
+  static std::unique_ptr<HloInstruction> CreateGather(
+      const Shape& shape, HloInstruction* operand,
+      HloInstruction* gather_indices,
+      const GatherDimensionNumbers& gather_dim_numbers,
+      tensorflow::gtl::ArraySlice<int64> window_bounds);
+
   // Creates a fusion instruction. A fusion instruction contains one or more
   // fused instructions forming an expression with a single root
   // "fused_root". Additional instructions can be added to the fusion
@@ -475,6 +481,12 @@ class HloInstruction {
       const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
       tensorflow::StringPiece custom_call_target);
 
+  // Creates a HostCompute instruction, which records host-side control and
+  // data dependencies for use in instruction scheduling.
+  static std::unique_ptr<HloInstruction> CreateHostCompute(
+      const Shape& shape, tensorflow::gtl::ArraySlice<HloInstruction*> operands,
+      tensorflow::StringPiece channel_name, const int64 cost_estimate_ns);
+
   // Creates a tuple instruction with the given elements. This is a convenience
   // wrapper around CreateVariadic.
   static std::unique_ptr<HloInstruction> CreateTuple(
@@ -486,6 +498,13 @@ class HloInstruction {
       const Shape& shape, HloInstruction* operand,
       tensorflow::gtl::ArraySlice<int64> dimensions);
 
+  // Creates an instance of GatherDimensionNumbers.
+  static GatherDimensionNumbers MakeGatherDimNumbers(
+      tensorflow::gtl::ArraySlice<int64> output_window_dims,
+      tensorflow::gtl::ArraySlice<int64> elided_window_dims,
+      tensorflow::gtl::ArraySlice<int64> gather_dims_to_operand_dims,
+      int64 index_vector_dim);
+
   // Returns the opcode for this instruction.
   HloOpcode opcode() const { return opcode_; }
 
@@ -767,6 +786,10 @@ class HloInstruction {
   //
   // (We express the default options using an overload rather than a default
   // param because gdb ignores default params, but does resolve overloads.)
+  //
+  // TODO(b/73348663): Make ToString() adaptive to the size of the string by
+  // default, backing off on providing full information for very large strings,
+  // or provide a different name for a ToString-like function that does that.
   string ToString() const { return ToString(HloPrintOptions()); }
   string ToString(const HloPrintOptions& options) const;
 
@@ -802,6 +825,12 @@ class HloInstruction {
   // Precondition: opcode() == HloOpcode::kSend or HloOpcode::kRecv
   int64 channel_id() const { return channel_id_; }
 
+  // Returns the channel name associated with the instruction. The name is
+  // used to identify host Send/Recv operations.
+  //
+  // Precondition: opcode() == HloOpcode::kHostCompute
+  string channel_name() const { return channel_name_; }
+
   // Returns feature_index field associated with the instruction. The index
   // represents the index of the feature dimension.
   //
@@ -914,6 +943,9 @@ class HloInstruction {
   // Return true if this operator has a sharding assigned.
   bool has_sharding() const { return sharding_ != nullptr; }
 
+  // Adds a new operand to the fusion instruction.
+  HloInstruction* AddFusionOperand(HloInstruction* new_operand);
+
   // Merges the fused instructions from 'instruction_to_merge' into the
   // fused instruction set of 'this', updating operands as necessary.
   //
@@ -1086,6 +1118,19 @@ class HloInstruction {
   // Returns the dump string of the dot dimension numbers.
   string DotDimensionNumbersToString() const;
 
+  const GatherDimensionNumbers& gather_dimension_numbers() const {
+    CHECK(gather_dimension_numbers_ != nullptr);
+    return *gather_dimension_numbers_;
+  }
+
+  tensorflow::gtl::ArraySlice<int64> gather_window_bounds() const {
+    CHECK_EQ(opcode(), HloOpcode::kGather);
+    return gather_window_bounds_;
+  }
+
+  // Returns the dump string of the gather dimension numbers.
+  string GatherDimensionNumbersToString() const;
+
   // Returns the random distribution for this rng node.
   //
   // Precondition: opcode() == HloOpcode::kRng
@@ -1350,6 +1395,9 @@ class HloInstruction {
   // Describes the dimension numbers used for a dot.
   std::unique_ptr<DotDimensionNumbers> dot_dimension_numbers_;
 
+  std::unique_ptr<GatherDimensionNumbers> gather_dimension_numbers_;
+  std::vector<int64> gather_window_bounds_;
+
   // Describes FFT type for an FFT instruction.
   FftType fft_type_ = FftType::FFT;
 
@@ -1388,6 +1436,12 @@ class HloInstruction {
   // Name of a global symbol to call, only present for kCustomCall.
   string custom_call_target_;
 
+  // Name to use for host send/recv channels, only present for kHostCompute.
+  string channel_name_;
+
+  // Estimate of the duration of a host computation in nanoseconds.
+  int64 cost_estimate_ns_ = 0;
+
   // Computations called by this instruction.
   std::vector<HloComputation*> called_computations_;
 
diff --git a/tensorflow/compiler/xla/service/hlo_instruction_test.cc b/tensorflow/compiler/xla/service/hlo_instruction_test.cc
index 94e9bfe56eb445ec0b459a55342cd3cc4c6f68ef..f2980d309d01fdf3b3e601bc260a0ad0895b3064 100644
--- a/tensorflow/compiler/xla/service/hlo_instruction_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_instruction_test.cc
@@ -1271,5 +1271,77 @@ TEST_F(HloInstructionTest, Stringification) {
             "true_computation=%TransposeDot, false_computation=%TransposeDot");
 }
 
+TEST_F(HloInstructionTest, StringifyGather_0) {
+  Shape input_tensor_shape = ShapeUtil::MakeShape(F32, {50, 49, 48, 47, 46});
+  Shape gather_indices_tensor_shape =
+      ShapeUtil::MakeShape(S64, {10, 9, 8, 7, 5});
+  Shape gather_result_shape =
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28, 27, 26});
+
+  HloComputation::Builder builder("Gather");
+  HloInstruction* input = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, input_tensor_shape, "input_tensor"));
+  HloInstruction* gather_indices =
+      builder.AddInstruction(HloInstruction::CreateParameter(
+          1, gather_indices_tensor_shape, "gather_indices"));
+
+  HloInstruction* gather_instruction =
+      builder.AddInstruction(HloInstruction::CreateGather(
+          gather_result_shape, input, gather_indices,
+          HloInstruction::MakeGatherDimNumbers(
+              /*output_window_dims=*/{4, 5, 6, 7, 8},
+              /*elided_window_dims=*/{},
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/4),
+          /*window_bounds=*/{30, 29, 28, 27, 26}));
+
+  HloModule module(TestName());
+  module.AddEntryComputation(builder.Build());
+
+  EXPECT_EQ(gather_instruction->ToString(),
+            "%gather = f32[10,9,8,7,30,29,28,27,26]{8,7,6,5,4,3,2,1,0} "
+            "gather(f32[50,49,48,47,46]{4,3,2,1,0} %input_tensor, "
+            "s64[10,9,8,7,5]{4,3,2,1,0} %gather_indices), "
+            "output_window_dims={4,5,6,7,8}, elided_window_dims={}, "
+            "gather_dims_to_operand_dims={0,1,2,3,4}, "
+            "index_vector_dim=4, window_bounds={30,29,28,27,26}");
+}
+
+TEST_F(HloInstructionTest, StringifyGather_1) {
+  Shape input_tensor_shape = ShapeUtil::MakeShape(F32, {50, 49, 48, 47, 46});
+  Shape gather_indices_tensor_shape =
+      ShapeUtil::MakeShape(S64, {10, 9, 5, 7, 6});
+  Shape gather_result_shape =
+      ShapeUtil::MakeShape(F32, {10, 9, 7, 6, 30, 29, 28, 27, 26});
+
+  HloComputation::Builder builder("Gather");
+  HloInstruction* input = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, input_tensor_shape, "input_tensor"));
+  HloInstruction* gather_indices =
+      builder.AddInstruction(HloInstruction::CreateParameter(
+          1, gather_indices_tensor_shape, "gather_indices"));
+
+  HloInstruction* gather_instruction =
+      builder.AddInstruction(HloInstruction::CreateGather(
+          gather_result_shape, input, gather_indices,
+          HloInstruction::MakeGatherDimNumbers(
+              /*output_window_dims=*/{4, 5, 6, 7, 8},
+              /*elided_window_dims=*/{},
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/2),
+          /*window_bounds=*/{30, 29, 28, 27, 26}));
+
+  HloModule module(TestName());
+  module.AddEntryComputation(builder.Build());
+
+  EXPECT_EQ(gather_instruction->ToString(),
+            "%gather = f32[10,9,7,6,30,29,28,27,26]{8,7,6,5,4,3,2,1,0} "
+            "gather(f32[50,49,48,47,46]{4,3,2,1,0} %input_tensor, "
+            "s64[10,9,5,7,6]{4,3,2,1,0} %gather_indices), "
+            "output_window_dims={4,5,6,7,8}, elided_window_dims={}, "
+            "gather_dims_to_operand_dims={0,1,2,3,4}, "
+            "index_vector_dim=2, window_bounds={30,29,28,27,26}");
+}
+
 }  // namespace
 }  // namespace xla
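Review note: the two `StringifyGather` tests above pin down the exact text that `GatherDimensionNumbersToString()` and the `window_bounds` attribute contribute to `ToString()`. The sketch below reproduces only that attribute formatting with the standard library, so the expected strings can be checked by hand. `GatherDims`, `BracedList`, and `GatherAttributesToString` are hypothetical names introduced for illustration; they are not part of XLA, and the plain struct stands in for the `GatherDimensionNumbers` proto.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical plain-struct stand-in for the GatherDimensionNumbers proto.
struct GatherDims {
  std::vector<int64_t> output_window_dims;
  std::vector<int64_t> elided_window_dims;
  std::vector<int64_t> gather_dims_to_operand_dims;
  int64_t index_vector_dim = 0;
};

// Formats `values` as "name={v0,v1,...}"; an empty list yields "name={}".
std::string BracedList(const std::string& name,
                       const std::vector<int64_t>& values) {
  std::string out = name + "={";
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (i > 0) out += ",";
    out += std::to_string(values[i]);
  }
  return out + "}";
}

// Joins the gather attributes with ", " in the order the tests expect:
// the three dimension lists, index_vector_dim, then window_bounds.
std::string GatherAttributesToString(
    const GatherDims& dims, const std::vector<int64_t>& window_bounds) {
  return BracedList("output_window_dims", dims.output_window_dims) + ", " +
         BracedList("elided_window_dims", dims.elided_window_dims) + ", " +
         BracedList("gather_dims_to_operand_dims",
                    dims.gather_dims_to_operand_dims) +
         ", " + "index_vector_dim=" + std::to_string(dims.index_vector_dim) +
         ", " + BracedList("window_bounds", window_bounds);
}
```

With the values from `StringifyGather_0`, the helper produces the attribute suffix that follows the operand list in the expected string of that test.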
diff --git a/tensorflow/compiler/xla/service/hlo_module.cc b/tensorflow/compiler/xla/service/hlo_module.cc
index 60270b0595dcfca8f1fcea5ab0914428880f35b5..cb2fe9f874012a51e1e6cbd1dd086dbb26994bde 100644
--- a/tensorflow/compiler/xla/service/hlo_module.cc
+++ b/tensorflow/compiler/xla/service/hlo_module.cc
@@ -145,6 +145,21 @@ void HloModule::ReplaceComputations(
           }
           break;
         }
+        case HloOpcode::kConditional: {
+          HloComputation* new_true_computation =
+              tensorflow::gtl::FindWithDefault(
+                  replacements, instruction->true_computation(), nullptr);
+          if (new_true_computation != nullptr) {
+            instruction->set_true_computation(new_true_computation);
+          }
+          HloComputation* new_false_computation =
+              tensorflow::gtl::FindWithDefault(
+                  replacements, instruction->false_computation(), nullptr);
+          if (new_false_computation != nullptr) {
+            instruction->set_false_computation(new_false_computation);
+          }
+          break;
+        }
         case HloOpcode::kSelectAndScatter: {
           HloComputation* new_select = tensorflow::gtl::FindWithDefault(
               replacements, instruction->select(), nullptr);
@@ -563,6 +578,18 @@ std::unique_ptr<HloModule> HloModule::Clone(const string& suffix) const {
   return module;
 }
 
+HloComputation* HloModule::DeepCloneComputation(HloComputation* computation) {
+  HloComputation* clone = AddEmbeddedComputation(computation->Clone("", this));
+  TF_CHECK_OK(
+      clone->root_instruction()->Accept([this](HloInstruction* instruction) {
+        instruction->ReplaceCalledComputations([this](HloComputation* callee) {
+          return DeepCloneComputation(callee);
+        });
+        return Status::OK();
+      }));
+  return clone;
+}
+
 uint64 HloModule::RandomNew64() const {
   tensorflow::mutex_lock l(rng_mutex_);
   return rng_();
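Review note: `DeepCloneComputation` clones a computation and then replaces every computation called from the clone with a deep clone of its own. The sketch below shows the same recursion on a toy call graph. `Computation` and `Module` here are hypothetical miniatures, not the XLA classes, and the `.clone` suffix is an arbitrary choice for the example. As far as the hunk shows, the patch does not memoize clones, so a callee reachable along two paths is cloned twice; the sketch mirrors that and assumes an acyclic call graph.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of a computation: a name plus the computations it
// calls.
struct Computation {
  std::string name;
  std::vector<Computation*> callees;
};

// Hypothetical owner of computations, standing in for a module.
class Module {
 public:
  // Creates a computation owned by this module and returns a stable pointer.
  Computation* Add(std::string name) {
    computations_.push_back(std::unique_ptr<Computation>(
        new Computation{std::move(name), {}}));
    return computations_.back().get();
  }

  // Clones `computation` into this module, then recursively clones each
  // callee so the clone shares no called computation with the original.
  Computation* DeepClone(const Computation* computation) {
    Computation* clone = Add(computation->name + ".clone");
    for (const Computation* callee : computation->callees) {
      clone->callees.push_back(DeepClone(callee));
    }
    return clone;
  }

  std::size_t size() const { return computations_.size(); }

 private:
  std::vector<std::unique_ptr<Computation>> computations_;
};
```

If shared callees should stay shared in the clone, one option would be to thread a map from original to clone through the recursion; whether that is desirable for the real `HloModule` is not something this diff settles.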
diff --git a/tensorflow/compiler/xla/service/hlo_module.h b/tensorflow/compiler/xla/service/hlo_module.h
index 4bfe8d89ce0a285de6d05d4867aaa6b266d78d12..ca94118763566c91295cb35d80bc459d008e824c 100644
--- a/tensorflow/compiler/xla/service/hlo_module.h
+++ b/tensorflow/compiler/xla/service/hlo_module.h
@@ -85,6 +85,10 @@ class HloModule {
   // Returns a deep copy of this module including all computations.
   std::unique_ptr<HloModule> Clone(const string& suffix = "clone") const;
 
+  // Performs a deep clone of the computation, by recursively cloning all
+  // the called computations as well.
+  HloComputation* DeepCloneComputation(HloComputation* computation);
+
   // Return a pointer to the entry computation of the module..
   const HloComputation* entry_computation() const {
     CHECK_NE(nullptr, entry_computation_);
@@ -183,11 +187,6 @@ class HloModule {
   // Returns a randomly generated uint64.
   uint64 RandomNew64() const;
 
-  // Returns the unique name for a computation in this module.
-  string GetUniqueCompuationName(const string& prefix) {
-    return computation_name_uniquer_.GetUniqueName(prefix);
-  }
-
   // Returns the NameUniquer for uniquing instruction names in this module.
   NameUniquer& instruction_name_uniquer() { return instruction_name_uniquer_; }
 
diff --git a/tensorflow/compiler/xla/service/hlo_module_config.h b/tensorflow/compiler/xla/service/hlo_module_config.h
index a5ee895e48448fbb8fa3879dc1b6764c1f9f6966..d3c1fae592bb465609ffbde2d0262e2600912e63 100644
--- a/tensorflow/compiler/xla/service/hlo_module_config.h
+++ b/tensorflow/compiler/xla/service/hlo_module_config.h
@@ -67,6 +67,15 @@ class HloModuleConfig {
   bool hlo_profiling_enabled() const { return hlo_profiling_enabled_; }
   void enable_hlo_profiling(bool enabled) { hlo_profiling_enabled_ = enabled; }
 
+  // Sets/returns whether this is a "host module".  Host modules are used to
+  // record the data- and control-flow dependencies of host side computation
+  // that communicates with compiled code.  They are used for analysis and
+  // scheduling purposes, but no code is generated.
+  bool is_host_module() const { return is_host_module_; }
+  void set_is_host_module(bool is_host_module) {
+    is_host_module_ = is_host_module;
+  }
+
   // Sets/returns the module seed set during execution.
   void set_seed(uint64 seed) { seed_ = seed; }
   uint64 seed() const { return seed_; }
@@ -104,6 +113,9 @@ class HloModuleConfig {
   // Whether to enable HLO-level profiling.
   bool hlo_profiling_enabled_ = false;
 
+  // Whether this is a 'host module'.
+  bool is_host_module_ = false;
+
   // Module/graph-level seed handle.
   uint64 seed_ = 0;
 
diff --git a/tensorflow/compiler/xla/service/hlo_opcode.h b/tensorflow/compiler/xla/service/hlo_opcode.h
index 3d64523a79fc50638fdf378b5d521a5cd4482b90..af24604c39b554f146793594958f373999844b4c 100644
--- a/tensorflow/compiler/xla/service/hlo_opcode.h
+++ b/tensorflow/compiler/xla/service/hlo_opcode.h
@@ -76,9 +76,11 @@ namespace xla {
   V(kFft, "fft")                                             \
   V(kFloor, "floor")                                         \
   V(kFusion, "fusion", kHloOpcodeIsVariadic)                 \
+  V(kGather, "gather")                                       \
   V(kGe, "greater-than-or-equal-to", kHloOpcodeIsComparison) \
   V(kGetTupleElement, "get-tuple-element")                   \
   V(kGt, "greater-than", kHloOpcodeIsComparison)             \
+  V(kHostCompute, "host-compute")                            \
   V(kImag, "imag")                                           \
   V(kInfeed, "infeed")                                       \
   V(kIsFinite, "is-finite")                                  \
diff --git a/tensorflow/compiler/xla/service/hlo_ordering.cc b/tensorflow/compiler/xla/service/hlo_ordering.cc
index 68e3c9618c1fe9daacb0aee3ee98862c8b9e4bc4..1b24d8da9e832e6847cb6f405e15af3c455f695a 100644
--- a/tensorflow/compiler/xla/service/hlo_ordering.cc
+++ b/tensorflow/compiler/xla/service/hlo_ordering.cc
@@ -186,6 +186,22 @@ bool HloOrdering::UseIsBeforeValueDefinition(
     }
   }
 
+  if (use.instruction->opcode() == HloOpcode::kConditional) {
+    const HloInstruction* conditional = use.instruction;
+    if (call_graph_->InstructionIsNestedIn(value.defining_instruction(),
+                                           conditional->true_computation())) {
+      VLOG(4) << "  use is conditional " << use.instruction->name()
+              << " and def is in TRUE computation";
+      return true;
+    }
+    if (call_graph_->InstructionIsNestedIn(value.defining_instruction(),
+                                           conditional->false_computation())) {
+      VLOG(4) << "  use is conditional " << use.instruction->name()
+              << " and def is in FALSE computation";
+      return true;
+    }
+  }
+
   VLOG(4) << "  use is not before value";
   return false;
 }
diff --git a/tensorflow/compiler/xla/service/hlo_ordering_test.cc b/tensorflow/compiler/xla/service/hlo_ordering_test.cc
index aba66114de649ce7667ae77174e9c4073b010b90..a989fce63234cb860d08c48b02462e96bec879bc 100644
--- a/tensorflow/compiler/xla/service/hlo_ordering_test.cc
+++ b/tensorflow/compiler/xla/service/hlo_ordering_test.cc
@@ -262,8 +262,8 @@ TEST_F(HloOrderingTest, ValuesInWhileComputations) {
       scalar_shape, HloOpcode::kAdd, constant, xla_while));
   module->AddEntryComputation(builder.Build());
 
-  TF_ASSERT_OK_AND_ASSIGN(
-      auto dataflow, HloDataflowAnalysis::Run(module.get(), /*ssa_form=*/true));
+  TF_ASSERT_OK_AND_ASSIGN(auto dataflow,
+                          HloDataflowAnalysis::Run(*module, /*ssa_form=*/true));
   DependencyHloOrdering ordering(module.get());
 
   // Init value is defined before the while, but live range is not before the
diff --git a/tensorflow/compiler/xla/service/hlo_rematerialization.cc b/tensorflow/compiler/xla/service/hlo_rematerialization.cc
index c6b4dc0368d92fd477decdfb38045f74f8696803..98b8d34be1f331aaeac94e952deeae1e76379861 100644
--- a/tensorflow/compiler/xla/service/hlo_rematerialization.cc
+++ b/tensorflow/compiler/xla/service/hlo_rematerialization.cc
@@ -60,6 +60,7 @@ bool IsRematerializable(const HloInstruction* instruction) {
   switch (instruction->opcode()) {
     case HloOpcode::kCall:
     case HloOpcode::kConstant:
+    case HloOpcode::kConditional:
     case HloOpcode::kCrossReplicaSum:
     case HloOpcode::kCustomCall:
     case HloOpcode::kParameter:
diff --git a/tensorflow/compiler/xla/service/hlo_scheduling.cc b/tensorflow/compiler/xla/service/hlo_scheduling.cc
index 8dc4d4f7bac1b2007f2b9f60d126fa07e314dac9..f6e33403f538bd8492b04c34d46a458f7f06cc06 100644
--- a/tensorflow/compiler/xla/service/hlo_scheduling.cc
+++ b/tensorflow/compiler/xla/service/hlo_scheduling.cc
@@ -15,7 +15,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/service/hlo_scheduling.h"
 
-#include <queue>
+#include <map>
 #include <utility>
 #include <vector>
 
@@ -151,8 +151,10 @@ class ListScheduler {
     int64 bytes_defined;
 
     // For each buffer B used by this instruction, we keep a pair (B, U), where
-    // U is the number of uses of B that have not yet been scheduled.
-    std::vector<std::pair<const LogicalBuffer* const, int64>>
+    // U is the number of uses of B that have not yet been scheduled. This pair
+    // is a pointer into the unscheduled_use_count_ map, so it gets updated for
+    // free when we update counts in the map.
+    std::vector<const std::pair<const LogicalBuffer* const, int64>*>
         used_buffer_unscheduled_use_counts;
   };
 
@@ -175,8 +177,8 @@ class ListScheduler {
       }
       auto unscheduled_use_count_it = unscheduled_use_count_.find(buffer);
       CHECK(unscheduled_use_count_it != unscheduled_use_count_.end());
-      entry.used_buffer_unscheduled_use_counts.emplace_back(
-          unscheduled_use_count_it->first, unscheduled_use_count_it->second);
+      entry.used_buffer_unscheduled_use_counts.push_back(
+          &*unscheduled_use_count_it);
     }
     return entry;
   }
@@ -185,8 +187,8 @@ class ListScheduler {
   int64 BytesFreedIfScheduled(const ReadyListEntry& entry) {
     int64 freed_bytes = 0;
     for (const auto& kv : entry.used_buffer_unscheduled_use_counts) {
-      auto buffer = kv.first;
-      auto use_count = kv.second;
+      auto buffer = kv->first;
+      auto use_count = kv->second;
       if (use_count == 1) {
         freed_bytes += size_function_(*buffer);
       }
@@ -217,23 +219,18 @@ class ListScheduler {
       }
     }
 
-    auto priority_comparator =
-        [this](const std::pair<Priority, ReadyListEntry>& lhs,
-               const std::pair<Priority, ReadyListEntry>& rhs) {
-          return lhs.first < rhs.first;
-        };
-    std::priority_queue<std::pair<Priority, ReadyListEntry>,
-                        std::vector<std::pair<Priority, ReadyListEntry>>,
-                        decltype(priority_comparator)>
-        ready_queue(priority_comparator);
+    // Use a multimap to keep ReadyListEntry objects sorted by priority.
+    std::multimap<Priority, ReadyListEntry> ready_queue;
 
-    // Set of instructions in the ready list.
-    tensorflow::gtl::FlatSet<const HloInstruction*> ready_instructions;
+    // Map of ready instructions to their iterators in ready_queue.
+    tensorflow::gtl::FlatMap<const HloInstruction*,
+                             std::multimap<Priority, ReadyListEntry>::iterator>
+        ready_instructions;
 
     auto add_to_ready_queue = [&](HloInstruction* inst) {
       auto entry = MakeReadyListEntry(inst);
-      ready_queue.emplace(GetPriority(entry), std::move(entry));
-      ready_instructions.insert(inst);
+      auto it = ready_queue.emplace(GetPriority(entry), std::move(entry));
+      ready_instructions[inst] = it;
     };
 
     for (auto* instruction : computation_.instructions()) {
@@ -247,14 +244,10 @@ class ListScheduler {
     while (!ready_queue.empty()) {
       // Remove the selected instruction from the ready list and add it to the
       // schedule.
-      const HloInstruction* best = ready_queue.top().second.instruction;
-      ready_queue.pop();
-      // We may have duplicates in the priority queue, because when a ready
-      // instruction's priority goes up, we reinsert it to the priority queue.
-      // Skip the duplicate.
-      if (scheduled_instructions_.find(best) != scheduled_instructions_.end()) {
-        continue;
-      }
+      auto best_it = ready_queue.end();
+      --best_it;
+      const HloInstruction* best = best_it->second.instruction;
+      ready_queue.erase(best_it);
       ready_instructions.erase(best);
       schedule.push_back(best);
       scheduled_instructions_.insert(best);
@@ -287,16 +280,27 @@ class ListScheduler {
         update_pred_count(succ);
       }
       // The unscheduled use count for a buffer has changed to 1, so the
-      // priorities of some ready instructions may go up. We reinsert them to
-      // the priority queue, so that they can appear earlier. The old entries
-      // will become duplicates and will be skipped.
+      // priorities of some ready instructions may go up. We update them in the
+      // ready queue, so that they can appear earlier.
       if (adjust_ready_queue) {
         for (HloInstruction* operand : best->operands()) {
           for (HloInstruction* operand_user : operand->users()) {
-            if (ready_instructions.find(operand_user) !=
-                ready_instructions.end()) {
-              add_to_ready_queue(operand_user);
+            auto ready_instructions_it = ready_instructions.find(operand_user);
+            if (ready_instructions_it == ready_instructions.end()) {
+              continue;
+            }
+            auto ready_queue_it = ready_instructions_it->second;
+            auto& entry = ready_queue_it->second;
+            Priority new_priority = GetPriority(entry);
+            if (new_priority == ready_queue_it->first) {
+              continue;
             }
+            // Create a new entry in ready_queue, then update
+            // ready_instructions[operand_user] to refer to the new entry.
+            ready_instructions_it->second =
+                ready_queue.emplace(new_priority, std::move(entry));
+            // Remove the old entry in ready_queue.
+            ready_queue.erase(ready_queue_it);
           }
         }
       }
@@ -317,8 +321,9 @@ class ListScheduler {
       buffer_uses_;
 
   // A map containing the count of unscheduled HLOs which using a particular
-  // LogicalBuffer.  We rely on iterator stability in this map.
-  tensorflow::gtl::FlatMap<const LogicalBuffer*, int64> unscheduled_use_count_;
+  // LogicalBuffer.  We rely on pointer stability of the entries in this map,
+  // and on the entries being std::pairs.
+  std::unordered_map<const LogicalBuffer*, int64> unscheduled_use_count_;
 
   // Set of instructions which have been scheduled.
   tensorflow::gtl::FlatSet<const HloInstruction*> scheduled_instructions_;
diff --git a/tensorflow/compiler/xla/service/hlo_sharding.cc b/tensorflow/compiler/xla/service/hlo_sharding.cc
index 447c2446668253c932b44b51b2db22bfd47f9957..afe79c9f17befdcb2812c0a08b205f21b0715b19 100644
--- a/tensorflow/compiler/xla/service/hlo_sharding.cc
+++ b/tensorflow/compiler/xla/service/hlo_sharding.cc
@@ -183,6 +183,10 @@ Status HloSharding::ValidateTuple(const Shape& shape, int64 num_devices) const {
   // shape tree.
   ShapeTree<HloSharding> shape_tree = GetAsShapeTree(shape);
   for (const auto& index_to_sharding : shape_tree.leaves()) {
+    if (index_to_sharding.first.empty()) {
+      // An empty tuple has a ShapeTree with a single leaf at the empty index.
+      continue;
+    }
     Status status = index_to_sharding.second.ValidateNonTuple(
         ShapeUtil::GetSubshape(shape, index_to_sharding.first), num_devices);
     if (!status.ok()) {
@@ -222,7 +226,7 @@ Status HloSharding::ValidateNonTuple(const Shape& shape,
   Status status = Status::OK();
   std::set<int64> seen_cores;
   tile_assignment_.Each(
-      [&](tensorflow::gtl::ArraySlice<int64> indices, uint32 core) {
+      [&](tensorflow::gtl::ArraySlice<int64> indices, int32 core) {
         // Don't overwrite a bad status, so we report the first error.
         if (status.ok()) {
           if (core >= num_devices) {
diff --git a/tensorflow/compiler/xla/service/hlo_verifier.cc b/tensorflow/compiler/xla/service/hlo_verifier.cc
index e2b3bb9d71497c352b0b92add2d2f6b4b777bee8..b1fd068115e1d104a11d880675ef84e07d6d5602 100644
--- a/tensorflow/compiler/xla/service/hlo_verifier.cc
+++ b/tensorflow/compiler/xla/service/hlo_verifier.cc
@@ -125,6 +125,10 @@ Status ShapeVerifier::HandleOutfeed(HloInstruction* outfeed) {
   return CheckShape(outfeed, ShapeUtil::MakeNil());
 }
 
+Status ShapeVerifier::HandleHostCompute(HloInstruction*) {
+  return tensorflow::Status::OK();
+}
+
 Status ShapeVerifier::HandleRng(HloInstruction*) {
   return tensorflow::Status::OK();
 }
@@ -420,6 +424,14 @@ Status CheckMixedPrecisionOperands(const HloInstruction* instruction) {
 
 }  // namespace
 
+Status ShapeVerifier::HandleGather(HloInstruction* gather) {
+  return CheckShape(
+      gather,
+      ShapeInference::InferGatherShape(
+          gather->operand(0)->shape(), gather->operand(1)->shape(),
+          gather->gather_dimension_numbers(), gather->gather_window_bounds()));
+}
+
 Status ShapeVerifier::CheckShape(const HloInstruction* instruction,
                                  const Shape& inferred_shape) {
   // If allow_mixed_precision_ is false, check if there are operands with
diff --git a/tensorflow/compiler/xla/service/hlo_verifier.h b/tensorflow/compiler/xla/service/hlo_verifier.h
index 7eccf834bbd3ac6af0d5762a7241758b416a3523..1dd7ec3c51e18dcfe89bd478de87798ba3858119 100644
--- a/tensorflow/compiler/xla/service/hlo_verifier.h
+++ b/tensorflow/compiler/xla/service/hlo_verifier.h
@@ -60,6 +60,7 @@ class ShapeVerifier : public DfsHloVisitor {
   Status HandleFusion(HloInstruction*) override;
   Status HandleCall(HloInstruction* call) override;
   Status HandleCustomCall(HloInstruction*) override;
+  Status HandleHostCompute(HloInstruction*) override;
   Status HandleSlice(HloInstruction* slice) override;
   Status HandleDynamicSlice(HloInstruction* dynamic_slice) override;
   Status HandleDynamicUpdateSlice(
@@ -79,6 +80,7 @@ class ShapeVerifier : public DfsHloVisitor {
   Status HandleBatchNormInference(
       HloInstruction* batch_norm_inference) override;
   Status HandleBatchNormGrad(HloInstruction* batch_norm_grad) override;
+  Status HandleGather(HloInstruction* gather) override;
 
   Status FinishVisit(HloInstruction*) override {
     return tensorflow::Status::OK();
diff --git a/tensorflow/compiler/xla/service/instruction_fusion.cc b/tensorflow/compiler/xla/service/instruction_fusion.cc
index 90e1f0acdc4cdeda280dabaab2df66b181d0f407..f494748e17fc2d0de74dec67f7414d4791f76a07 100644
--- a/tensorflow/compiler/xla/service/instruction_fusion.cc
+++ b/tensorflow/compiler/xla/service/instruction_fusion.cc
@@ -102,6 +102,8 @@ namespace xla {
     case HloOpcode::kExp:
     case HloOpcode::kFft:
     case HloOpcode::kFusion:
+    case HloOpcode::kGather:
+    case HloOpcode::kHostCompute:
     case HloOpcode::kLog:
     case HloOpcode::kMap:
     case HloOpcode::kParameter:
diff --git a/tensorflow/compiler/xla/service/interpreter/executable.cc b/tensorflow/compiler/xla/service/interpreter/executable.cc
index 0cb9b5d8107cd8bf468b07d5fe2a22930d9e8b8c..883063d0f075f5b0d79edc01bcd27a7c579272f4 100644
--- a/tensorflow/compiler/xla/service/interpreter/executable.cc
+++ b/tensorflow/compiler/xla/service/interpreter/executable.cc
@@ -93,7 +93,7 @@ StatusOr<std::unique_ptr<ShapedBuffer>> InterpreterExecutable::ExecuteOnStream(
   TF_ASSIGN_OR_RETURN(std::unique_ptr<ShapedBuffer> result,
                       transfer_manager->AllocateShapedBuffer(
                           result_literal->shape(), run_options->allocator(),
-                          run_options->device_ordinal()));
+                          executor->device_ordinal()));
   TF_RETURN_IF_ERROR(transfer_manager->TransferLiteralToDevice(
       executor, *result_literal, *result));
 
diff --git a/tensorflow/compiler/xla/service/layout_assignment.cc b/tensorflow/compiler/xla/service/layout_assignment.cc
index fce135ef61a7868386b869def1a79167c428d928..39f9120e552f014dd2759bff2892157402d9c47a 100644
--- a/tensorflow/compiler/xla/service/layout_assignment.cc
+++ b/tensorflow/compiler/xla/service/layout_assignment.cc
@@ -53,6 +53,83 @@ limitations under the License.
 
 namespace xla {
 
+// For now moving only one API here, but we should have a single top level
+// anonymous namespace, instead of three or four spread all over this file.
+namespace {
+
+// Creates and returns a copy of the given instruction with a different
+// layout. Tuple-shaped instructions will be deep-copied, and the last Tuple
+// instruction producing the copy is returned.
+StatusOr<HloInstruction*> CreateCopyWithNewLayout(
+    const Shape& shape_with_layout, HloInstruction* instruction) {
+  TF_RET_CHECK(LayoutUtil::HasLayout(shape_with_layout));
+  DCHECK(ShapeUtil::Compatible(shape_with_layout, instruction->shape()))
+      << ShapeUtil::HumanString(shape_with_layout) << " "
+      << ShapeUtil::HumanString(instruction->shape())
+      << " instruction: " << instruction->ToString();
+
+  if (ShapeUtil::IsTuple(instruction->shape())) {
+    // Deep-copy tuples.
+    std::vector<HloInstruction*> element_copies;
+    for (int64 i = 0; i < ShapeUtil::TupleElementCount(instruction->shape());
+         ++i) {
+      HloInstruction* gte = instruction->parent()->AddInstruction(
+          HloInstruction::CreateGetTupleElement(
+              ShapeUtil::GetSubshape(instruction->shape(), {i}), instruction,
+              i));
+
+      // Recurse to copy each element.
+      TF_ASSIGN_OR_RETURN(
+          HloInstruction * element_copy,
+          CreateCopyWithNewLayout(
+              ShapeUtil::GetSubshape(shape_with_layout, {i}), gte));
+      element_copies.push_back(element_copy);
+    }
+    // Gather element copies into a tuple with a new Tuple instruction.
+    HloInstruction* tuple_copy = instruction->parent()->AddInstruction(
+        HloInstruction::CreateTuple(element_copies));
+    LayoutUtil::ClearLayout(tuple_copy->mutable_shape());
+    TF_RETURN_IF_ERROR(LayoutUtil::CopyLayoutBetweenShapes(
+        shape_with_layout, tuple_copy->mutable_shape()));
+    return tuple_copy;
+  } else if (ShapeUtil::IsArray(instruction->shape())) {
+    HloInstruction* copy =
+        instruction->parent()->AddInstruction(HloInstruction::CreateUnary(
+            instruction->shape(), HloOpcode::kCopy, instruction));
+    LayoutUtil::ClearLayout(copy->mutable_shape());
+    TF_RETURN_IF_ERROR(LayoutUtil::CopyLayoutBetweenShapes(
+        shape_with_layout, copy->mutable_shape()));
+
+    return copy;
+  } else {
+    return FailedPrecondition(
+        "Can only copy array and tuple shaped instructions");
+  }
+}
+
+// Creates a copy of the given operand if the operand's layout does not match
+// the given layout. This copy replaces the use in the given instruction. Tuple
+// operands will be deep-copied.
+Status CopyOperandIfLayoutsDiffer(const ShapeLayout& operand_layout,
+                                  HloInstruction* instruction,
+                                  int64 operand_no) {
+  HloInstruction* operand = instruction->mutable_operand(operand_no);
+  TF_RET_CHECK(operand_layout.LayoutIsSet());
+  TF_RET_CHECK(LayoutUtil::HasLayout(operand->shape()));
+
+  if (ShapeUtil::Equal(operand_layout.shape(), operand->shape())) {
+    // Operand layout already matches our constraint. Nothing to do.
+    return Status::OK();
+  }
+
+  TF_ASSIGN_OR_RETURN(HloInstruction * operand_copy,
+                      CreateCopyWithNewLayout(operand_layout.shape(), operand));
+
+  return instruction->ReplaceOperandWith(operand_no, operand_copy);
+}
+
+}  // namespace
+
 std::ostream& operator<<(std::ostream& out,
                          const LayoutConstraint& constraint) {
   out << constraint.ToString();
@@ -115,17 +192,34 @@ LayoutConstraints::LayoutConstraints(
   }
 }
 
+PointsToSet::BufferSet* LayoutConstraints::GetBufferSet(
+    const HloInstruction* instruction) const {
+  auto it = buffer_sets_cache_.find(instruction);
+  if (it != buffer_sets_cache_.end()) {
+    return it->second.get();
+  }
+  auto& buffer_set =
+      buffer_sets_cache_
+          .emplace(instruction, MakeUnique<PointsToSet::BufferSet>())
+          .first->second;
+  const auto& points_to_set = points_to_analysis_.GetPointsToSet(instruction);
+  points_to_set.ForEachElement(
+      [&buffer_set](const ShapeIndex& /*index*/,
+                    const PointsToSet::BufferList& buffers) {
+        buffer_set->insert(buffers.begin(), buffers.end());
+      });
+  return buffer_set.get();
+}
+
 bool LayoutConstraints::OperandBufferForwarded(
     const HloInstruction* instruction, int64 operand_no) const {
   // The operand is potentially forwarded if the intersection of points-to sets
   // of the operand and the instruction is non-empty.
-  auto output_buffers =
-      points_to_analysis_.GetPointsToSet(instruction).CreateFlattenedSet();
-  auto operand_buffers =
-      points_to_analysis_.GetPointsToSet(instruction->operand(operand_no))
-          .CreateFlattenedSet();
-  for (const LogicalBuffer* output_buffer : output_buffers) {
-    if (operand_buffers.count(output_buffer) > 0) {
+  PointsToSet::BufferSet* output_buffers = GetBufferSet(instruction);
+  PointsToSet::BufferSet* operand_buffers =
+      GetBufferSet(instruction->operand(operand_no));
+  for (const LogicalBuffer* output_buffer : *output_buffers) {
+    if (operand_buffers->count(output_buffer) > 0) {
       return true;
     }
   }
@@ -512,6 +606,36 @@ Status LayoutAssignment::AddMandatoryConstraints(
           body_layout.result_shape(), instruction));
       TF_RETURN_IF_ERROR(constraints->SetOperandLayout(
           body_layout.result_shape(), instruction, 0));
+    } else if (instruction->opcode() == HloOpcode::kConditional) {
+      // The layout of the true and false computations must match, and must
+      // be the layout of the kConditional instruction.
+      TF_RET_CHECK(instruction->operand_count() == 3);
+
+      HloComputation* true_computation = instruction->true_computation();
+      HloComputation* false_computation = instruction->false_computation();
+      const HloInstruction* true_operand = instruction->operand(1);
+      const HloInstruction* false_operand = instruction->operand(2);
+
+      TF_RET_CHECK(true_computation->num_parameters() == 1);
+      TF_RET_CHECK(false_computation->num_parameters() == 1);
+      ComputationLayout& true_computation_layout =
+          FindOrDie(computation_layouts_, true_computation);
+      ComputationLayout& false_computation_layout =
+          FindOrDie(computation_layouts_, false_computation);
+
+      DCHECK(ShapeUtil::Compatible(true_operand->shape(),
+                                   true_computation_layout.parameter_shape(0)));
+      DCHECK(ShapeUtil::Compatible(
+          false_operand->shape(), false_computation_layout.parameter_shape(0)));
+
+      TF_RETURN_IF_ERROR(constraints->SetInstructionLayout(
+          true_computation_layout.result_shape(), instruction));
+      TF_RETURN_IF_ERROR(constraints->SetOperandLayout(
+          true_computation_layout.parameter_shape(0), instruction, 1,
+          /*mandatory=*/true));
+      TF_RETURN_IF_ERROR(constraints->SetOperandLayout(
+          false_computation_layout.parameter_shape(0), instruction, 2,
+          /*mandatory=*/true));
     } else if (instruction->opcode() == HloOpcode::kCustomCall) {
       if (!CustomCallRequiresMajorFirstLayout(instruction)) {
         continue;
@@ -598,6 +722,33 @@ Status CheckWhileLayout(HloInstruction* while_inst,
   return Status::OK();
 }
 
+Status CheckConditionalLayout(
+    HloInstruction* instruction,
+    const ComputationLayout& true_computation_layout,
+    const ComputationLayout& false_computation_layout) {
+  HloComputation* true_computation = instruction->true_computation();
+  HloComputation* false_computation = instruction->false_computation();
+  const HloInstruction* true_operand = instruction->operand(1);
+  const HloInstruction* false_operand = instruction->operand(2);
+
+  TF_RET_CHECK(true_computation_layout.result_layout() ==
+               false_computation_layout.result_layout());
+  TF_RET_CHECK(true_computation_layout.result_layout().MatchesLayoutInShape(
+      instruction->shape()));
+  TF_RET_CHECK(true_computation_layout.result_layout().MatchesLayoutInShape(
+      true_computation->root_instruction()->shape()));
+  TF_RET_CHECK(false_computation_layout.result_layout().MatchesLayoutInShape(
+      instruction->shape()));
+  TF_RET_CHECK(false_computation_layout.result_layout().MatchesLayoutInShape(
+      false_computation->root_instruction()->shape()));
+  TF_RET_CHECK(true_computation_layout.parameter_layout(0).MatchesLayoutInShape(
+      true_operand->shape()));
+  TF_RET_CHECK(
+      false_computation_layout.parameter_layout(0).MatchesLayoutInShape(
+          false_operand->shape()));
+  return Status::OK();
+}
+
 // Fusion parameters must match the layout of the fusion instructions operands,
 // and the root of the fusion expression must match the layout of the fusion
 // instruction.
@@ -710,6 +861,13 @@ Status LayoutAssignment::CheckLayouts(HloModule* module) {
               FindOrDie(computation_layouts_, instruction->while_condition()),
               FindOrDie(computation_layouts_, instruction->while_body())));
           break;
+        case HloOpcode::kConditional:
+          TF_RETURN_IF_ERROR(CheckConditionalLayout(
+              instruction,
+              FindOrDie(computation_layouts_, instruction->true_computation()),
+              FindOrDie(computation_layouts_,
+                        instruction->false_computation())));
+          break;
         default:
           break;
       }
@@ -1165,77 +1323,6 @@ StatusOr<Layout> InferArrayLayout(
   return *first_buffer_layout;
 }
 
-// Creates and returns a copy of the given instruction with a different
-// layout. Tuple-shaped instructions will be deep-copied, and the last Tuple
-// instruction producing the copy is returned.
-StatusOr<HloInstruction*> CreateCopyWithNewLayout(
-    const Shape& shape_with_layout, HloInstruction* instruction) {
-  TF_RET_CHECK(LayoutUtil::HasLayout(shape_with_layout));
-  DCHECK(ShapeUtil::Compatible(shape_with_layout, instruction->shape()))
-      << ShapeUtil::HumanString(shape_with_layout) << " "
-      << ShapeUtil::HumanString(instruction->shape())
-      << " instruction: " << instruction->ToString();
-
-  if (ShapeUtil::IsTuple(instruction->shape())) {
-    // Deep-copy tuples.
-    std::vector<HloInstruction*> element_copies;
-    for (int64 i = 0; i < ShapeUtil::TupleElementCount(instruction->shape());
-         ++i) {
-      HloInstruction* gte = instruction->parent()->AddInstruction(
-          HloInstruction::CreateGetTupleElement(
-              ShapeUtil::GetSubshape(instruction->shape(), {i}), instruction,
-              i));
-
-      // Recurse to copy each elements.
-      TF_ASSIGN_OR_RETURN(
-          HloInstruction * element_copy,
-          CreateCopyWithNewLayout(
-              ShapeUtil::GetSubshape(shape_with_layout, {i}), gte));
-      element_copies.push_back(element_copy);
-    }
-    // Gather element copies into a tuple with a new Tuple instruction.
-    HloInstruction* tuple_copy = instruction->parent()->AddInstruction(
-        HloInstruction::CreateTuple(element_copies));
-    LayoutUtil::ClearLayout(tuple_copy->mutable_shape());
-    TF_RETURN_IF_ERROR(LayoutUtil::CopyLayoutBetweenShapes(
-        shape_with_layout, tuple_copy->mutable_shape()));
-    return tuple_copy;
-  } else if (ShapeUtil::IsArray(instruction->shape())) {
-    HloInstruction* copy =
-        instruction->parent()->AddInstruction(HloInstruction::CreateUnary(
-            instruction->shape(), HloOpcode::kCopy, instruction));
-    LayoutUtil::ClearLayout(copy->mutable_shape());
-    TF_RETURN_IF_ERROR(LayoutUtil::CopyLayoutBetweenShapes(
-        shape_with_layout, copy->mutable_shape()));
-
-    return copy;
-  } else {
-    return FailedPrecondition(
-        "Can only copy array and tuple shaped instructions");
-  }
-}
-
-// Creates a copy of the given operand if the operand's layout does not match
-// the given layout. This copy replaces the use in the given instruction. Tuple
-// operands will be deep-copied.
-Status CopyOperandIfLayoutsDiffer(const ShapeLayout& operand_layout,
-                                  HloInstruction* instruction,
-                                  int64 operand_no) {
-  HloInstruction* operand = instruction->mutable_operand(operand_no);
-  TF_RET_CHECK(operand_layout.LayoutIsSet());
-  TF_RET_CHECK(LayoutUtil::HasLayout(operand->shape()));
-
-  if (ShapeUtil::Equal(operand_layout.shape(), operand->shape())) {
-    // Operand layout already matches our constraint. Nothing to do.
-    return Status::OK();
-  }
-
-  TF_ASSIGN_OR_RETURN(HloInstruction * operand_copy,
-                      CreateCopyWithNewLayout(operand_layout.shape(), operand));
-
-  return instruction->ReplaceOperandWith(operand_no, operand_copy);
-}
-
 // For fusion instructions, set the layout of each fused parameter instruction
 // to match the layout of its corresponding fusion instruction operand. Also,
 // set the layout of the fused root to match the layout of the fusion
@@ -1474,6 +1561,13 @@ StatusOr<bool> LayoutAssignment::Run(HloModule* module) {
     // infeeds.  Clearing the layouts here avoids hiding potential bugs in the
     // layout assignment pass that may accidently use the existing layout.
     for (HloInstruction* instruction : computation->instructions()) {
+      if (instruction->opcode() == HloOpcode::kBitcast) {
+        // Bitcasts are inherently layout-sensitive, so a bitcast instruction
+        // present in the IR before layout assignment is a bug.
+        return InternalError(
+            "Unexpected bitcast operation seen during layout assignment: %s.",
+            instruction->ToString().c_str());
+      }
       if (instruction->opcode() != HloOpcode::kInfeed) {
         LayoutUtil::ClearLayout(instruction->mutable_shape());
       }
diff --git a/tensorflow/compiler/xla/service/layout_assignment.h b/tensorflow/compiler/xla/service/layout_assignment.h
index 29018584487cabfd740d7914625c2a50f552d6ff..7126cb50cf168241979178c9e1077051cc935e53 100644
--- a/tensorflow/compiler/xla/service/layout_assignment.h
+++ b/tensorflow/compiler/xla/service/layout_assignment.h
@@ -38,6 +38,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/gtl/flatmap.h"
 #include "tensorflow/core/platform/types.h"
 
 namespace xla {
@@ -199,6 +200,11 @@ class LayoutConstraints {
   string ToString() const;
 
  private:
+  // Returns the flattened buffer set for the instruction, building and caching
+  // it on first use. Without the cache, the flattened set for the same
+  // instruction is rebuilt many times, which is often slow.
+  PointsToSet::BufferSet* GetBufferSet(const HloInstruction* instruction) const;
+
   // The set of BufferLayoutConstraints applied to the computation.
   std::unordered_map<const LogicalBuffer*, BufferLayoutConstraint>
       buffer_constraints_;
@@ -221,6 +227,10 @@ class LayoutConstraints {
   // Array-shaped buffers which have not yet been constrained.
   std::set<LogicalBuffer::Id> unconstrained_buffer_ids_;
 
+  mutable tensorflow::gtl::FlatMap<const HloInstruction*,
+                                   std::unique_ptr<PointsToSet::BufferSet>>
+      buffer_sets_cache_;
+
   HloComputation* computation_;
 };
 
diff --git a/tensorflow/compiler/xla/service/layout_assignment_test.cc b/tensorflow/compiler/xla/service/layout_assignment_test.cc
index e269a13459f1146f1d2952870399827d9e705e38..4b1c9bad41de8030cf14bc6d1c0db21b9c56c3bf 100644
--- a/tensorflow/compiler/xla/service/layout_assignment_test.cc
+++ b/tensorflow/compiler/xla/service/layout_assignment_test.cc
@@ -590,6 +590,85 @@ TEST_F(LayoutAssignmentTest, TransposeToBitcastToUser) {
                                             transpose->shape(), {2, 3, 0, 1}));
 }
 
+// TransposeIsBitcast shouldn't be called without layout information.
+TEST_F(LayoutAssignmentTest, TransposeIsBitcastFail) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape input_shape = ShapeUtil::MakeShape(F32, {2, 2, 2});
+  Shape input_shape_with_layout(input_shape);
+  *input_shape_with_layout.mutable_layout() = LayoutUtil::MakeLayout({2, 1, 0});
+  auto param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, input_shape_with_layout, "param"));
+  auto hlo = builder.AddInstruction(
+      HloInstruction::CreateTranspose(input_shape, param, {0, 2, 1}));
+  // Clear the default layout assigned to the instruction.
+  LayoutUtil::ClearLayout(hlo->mutable_shape());
+  EXPECT_DEATH(ShapeUtil::TransposeIsBitcast(hlo->operand(0)->shape(),
+                                             hlo->shape(), hlo->dimensions()),
+               "LayoutUtil::HasLayout");
+}
+
+// ReshapeIsBitcast shouldn't be called without layout information.
+TEST_F(LayoutAssignmentTest, ReshapeIsBitcastFail) {
+  auto builder = HloComputation::Builder(TestName());
+  Shape input_shape = ShapeUtil::MakeShape(F32, {2, 2, 2});
+  Shape input_shape_with_layout(input_shape);
+  *input_shape_with_layout.mutable_layout() = LayoutUtil::MakeLayout({2, 1, 0});
+  auto param = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, input_shape_with_layout, "param"));
+  auto hlo =
+      builder.AddInstruction(HloInstruction::CreateReshape(input_shape, param));
+  // Clear the default layout assigned to the instruction.
+  LayoutUtil::ClearLayout(hlo->mutable_shape());
+  EXPECT_DEATH(
+      ShapeUtil::ReshapeIsBitcast(hlo->operand(0)->shape(), hlo->shape()),
+      "LayoutUtil::HasLayout");
+}
+
+// Check that the computation below doesn't crash the compiler.
+//
+// Within a fusion computation, only the parameters and result get assigned a
+// layout.  When we run the algebraic simplifier on this computation post layout
+// assignment, it should not call TransposeIsBitcast on the `transpose` node
+// inside the fusion computation as TransposeIsBitcast checks both input_shape
+// and output_shape have layouts.
+TEST_F(LayoutAssignmentTest, TransposeWithinFusionDoesNotCrash) {
+  const char* module_str = R"(
+    HloModule test_module
+
+    fused_computation {
+      param_1 = f32[2,2,2]{2,1,0} parameter(1)
+      transpose = f32[2,2,2]{2,1,0} transpose(param_1), dimensions={0,2,1}
+      reduce_1 = f32[] parameter(0)
+      broadcast_1 = f32[2,2,2]{2,1,0} broadcast(reduce_1), dimensions={}
+      ROOT divide_1 = f32[2,2,2]{2,1,0} divide(transpose, broadcast_1)
+    }
+
+    ENTRY entry_computation {
+      fusion.1 = f32[2,2,2]{2,1,0} parameter(1)
+      reduce.1 = f32[] parameter(0)
+      fusion.2 = f32[2,2,2]{2,1,0} fusion(reduce.1, fusion.1), kind=kLoop, calls=fused_computation
+     ROOT tuple.1 = (f32[2,2,2]{2,1,0}) tuple(fusion.2)
+    }
+  )";
+
+  auto module = tools::Parse(module_str).ValueOrDie();
+
+  module =
+      backend()
+          .compiler()
+          ->RunHloPasses(std::move(module), backend().default_stream_executor(),
+                         /*device_allocator=*/nullptr)
+          .ConsumeValueOrDie();
+
+  EXPECT_EQ(
+      ::tensorflow::Status::OK(),
+      backend()
+          .compiler()
+          ->RunBackend(std::move(module), backend().default_stream_executor(),
+                       /*device_allocator=*/nullptr)
+          .status());
+}
+
 // A GTE inside of a fusion node inherits the layout of its operand (which
 // should, if we keep following operands, eventually be a parameter).
 TEST_F(LayoutAssignmentTest, GTEInheritsLayoutFromOperand) {
@@ -629,33 +708,113 @@ TEST_F(LayoutAssignmentTest, GTEInheritsLayoutFromOperand) {
       LayoutUtil::MakeLayout({2, 1, 0}));
   AssignLayouts(module.get(), &computation_layout);
 
-  HloComputation* fused_computation = *std::find_if(
-      module->computations().begin(), module->computations().end(),
-      [](const HloComputation* c) { return c->name() == "fused_computation"; });
-
-  auto fused_instr = [&](const string& name) {
-    auto it = std::find_if(
-        fused_computation->instructions().begin(),
-        fused_computation->instructions().end(),
-        [&](const HloInstruction* i) { return i->name() == name; });
-    CHECK(it != fused_computation->instructions().end());
-    return *it;
+  auto layout_of = [&](tensorflow::StringPiece name) {
+    return FindInstruction(module.get(), name)
+        ->shape()
+        .layout()
+        .minor_to_major();
   };
 
-  EXPECT_THAT(fused_instr("gte0")->shape().layout().minor_to_major(),
-              ElementsAre(0, 1, 2));
-  EXPECT_THAT(
-      fused_instr("gte1")->shape().tuple_shapes(0).layout().minor_to_major(),
-      ElementsAre(1, 2, 0));
-  EXPECT_THAT(
-      fused_instr("gte1")->shape().tuple_shapes(1).layout().minor_to_major(),
-      ElementsAre(2, 0, 1));
-  EXPECT_THAT(fused_instr("gte1a")->shape().layout().minor_to_major(),
+  EXPECT_THAT(layout_of("gte0"), ElementsAre(0, 1, 2));
+  EXPECT_THAT(layout_of("gte1a"), ElementsAre(1, 2, 0));
+  EXPECT_THAT(layout_of("gte1b"), ElementsAre(2, 0, 1));
+  EXPECT_THAT(layout_of("fresult"), ElementsAre(2, 1, 0));
+  EXPECT_THAT(FindInstruction(module.get(), "gte1")
+                  ->shape()
+                  .tuple_shapes(0)
+                  .layout()
+                  .minor_to_major(),
               ElementsAre(1, 2, 0));
-  EXPECT_THAT(fused_instr("gte1b")->shape().layout().minor_to_major(),
+  EXPECT_THAT(FindInstruction(module.get(), "gte1")
+                  ->shape()
+                  .tuple_shapes(1)
+                  .layout()
+                  .minor_to_major(),
               ElementsAre(2, 0, 1));
-  EXPECT_THAT(fused_instr("fresult")->shape().layout().minor_to_major(),
-              ElementsAre(2, 1, 0));
+}
+
+TEST_F(LayoutAssignmentTest, ConditionalAsymmetricLayout) {
+  auto builder = HloComputation::Builder(TestName());
+  auto module = CreateNewModule();
+  Shape shape = ShapeUtil::MakeShape(F32, {128, 8});
+  Shape tshape = ShapeUtil::MakeTupleShape({shape, shape});
+  Shape result_tshape = ShapeUtil::MakeTupleShape({shape});
+
+  auto param0 = builder.AddInstruction(
+      HloInstruction::CreateParameter(0, shape, "param0"));
+  auto param1 = builder.AddInstruction(
+      HloInstruction::CreateParameter(1, shape, "param1"));
+  auto pred = builder.AddInstruction(HloInstruction::CreateParameter(
+      2, ShapeUtil::MakeShape(PRED, {}), "param2"));
+  auto tuple =
+      builder.AddInstruction(HloInstruction::CreateTuple({param0, param1}));
+
+  auto true_builder = HloComputation::Builder(TestName() + "_TrueBranch");
+  {
+    auto param = true_builder.AddInstruction(
+        HloInstruction::CreateParameter(0, tshape, "param"));
+    auto gte0 = true_builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(shape, param, 0));
+    auto gte1 = true_builder.AddInstruction(
+        HloInstruction::CreateGetTupleElement(shape, param, 1));
+    auto add = true_builder.AddInstruction(
+        HloInstruction::CreateBinary(shape, HloOpcode::kAdd, gte0, gte1));
+    true_builder.AddInstruction(HloInstruction::CreateTuple({add}));
+  }
+  HloComputation* true_computation =
+      module->AddEmbeddedComputation(true_builder.Build());
+
+  auto false_builder = HloComputation::Builder(TestName() + "_FalseBranch");
+  {
+    Shape xshape = ShapeUtil::MakeShapeWithLayout(F32, {128, 8}, {0, 1});
+    false_builder.AddInstruction(
+        HloInstruction::CreateParameter(0, tshape, "param"));
+    // Use an infeed, since layout assignment does not modify its layout.
+    auto infeed =
+        false_builder.AddInstruction(HloInstruction::CreateInfeed(xshape, ""));
+    false_builder.AddInstruction(HloInstruction::CreateTuple({infeed}));
+  }
+  HloComputation* false_computation =
+      module->AddEmbeddedComputation(false_builder.Build());
+  builder.AddInstruction(HloInstruction::CreateConditional(
+      result_tshape, pred, tuple, true_computation, tuple, false_computation));
+
+  HloComputation* computation = module->AddEntryComputation(builder.Build());
+  ComputationLayout computation_layout(computation->ComputeProgramShape());
+
+  AssignLayouts(module.get(), &computation_layout);
+
+  const HloInstruction* true_root = true_computation->root_instruction();
+  const HloInstruction* false_root = false_computation->root_instruction();
+  EXPECT_THAT(true_root->opcode(), HloOpcode::kTuple);
+  EXPECT_THAT(false_root->opcode(), HloOpcode::kTuple);
+
+  const HloInstruction* true_result = true_root->operand(0);
+  const HloInstruction* false_result = false_root->operand(0);
+  EXPECT_TRUE(LayoutUtil::Equal(true_result->shape().layout(),
+                                false_result->shape().layout()));
+  EXPECT_THAT(false_result->opcode(), HloOpcode::kCopy);
+}
+
+TEST_F(LayoutAssignmentTest, InternalErrorOnBitcast) {
+  auto builder = HloComputation::Builder(TestName());
+  auto constant0 = builder.AddInstruction(
+      HloInstruction::CreateConstant(Literal::CreateR2WithLayout<float>(
+          {{1.0, 2.0}, {3.0, 4.0}}, LayoutUtil::MakeLayout({0, 1}))));
+  builder.AddInstruction(HloInstruction::CreateUnary(
+      constant0->shape(), HloOpcode::kBitcast, constant0));
+  auto module = CreateNewModule();
+  module->AddEntryComputation(builder.Build());
+
+  ComputationLayout computation_layout(
+      module->entry_computation()->ComputeProgramShape());
+  LayoutAssignment layout_assignment(&computation_layout);
+  Status error_status = layout_assignment.Run(module.get()).status();
+  EXPECT_FALSE(error_status.ok());
+  EXPECT_THAT(
+      error_status.error_message(),
+      ::testing::HasSubstr(
+          "Unexpected bitcast operation seen during layout assignment"));
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/service/liveness_util_test.cc b/tensorflow/compiler/xla/service/liveness_util_test.cc
index 2c2a02f6375343d67dfb155bbb03729ff6e490d2..f8b309488eeb5391b1cad5db760934ec1f7e3521 100644
--- a/tensorflow/compiler/xla/service/liveness_util_test.cc
+++ b/tensorflow/compiler/xla/service/liveness_util_test.cc
@@ -35,8 +35,7 @@ class PointsToAnalysisTestBase : public HloTestBase {
     CHECK_NOTNULL(module_.get());
     points_to_analysis_ =
         TuplePointsToAnalysis::Run(module_.get()).ConsumeValueOrDie();
-    dataflow_analysis_ =
-        HloDataflowAnalysis::Run(module_.get()).ConsumeValueOrDie();
+    dataflow_analysis_ = HloDataflowAnalysis::Run(*module_).ConsumeValueOrDie();
   }
 
   void BuildModuleAndRunAnalysis(std::unique_ptr<HloComputation> computation) {
diff --git a/tensorflow/compiler/xla/service/llvm_compiler.cc b/tensorflow/compiler/xla/service/llvm_compiler.cc
index 68c35c0c1f8717957d5ab8cb067b8f8892aa426d..911b243fe28a5baf8a4b8ed752b892265f5388ac 100644
--- a/tensorflow/compiler/xla/service/llvm_compiler.cc
+++ b/tensorflow/compiler/xla/service/llvm_compiler.cc
@@ -14,6 +14,7 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/compiler/xla/service/llvm_compiler.h"
+#include "tensorflow/core/platform/denormal.h"
 
 #ifdef __FAST_MATH__
 #error "Don't build XLA with -ffast-math"
@@ -24,6 +25,18 @@ StatusOr<std::vector<std::unique_ptr<Executable>>> LLVMCompiler::Compile(
     std::vector<std::unique_ptr<HloModule>> modules,
     std::vector<std::vector<perftools::gputools::StreamExecutor*>> stream_execs,
     DeviceMemoryAllocator* device_allocator) {
+  // TensorFlow tries to enable the following behaviors in all its threads:
+  //
+  //  - Denormals are zero (DAZ): roughly, operations treat denormal floats as
+  //    zero.
+  //  - Flush denormals to zero (FTZ): roughly, operations produce zero instead
+  //    of denormal floats.
+  //
+  // In theory enabling these shouldn't matter since the compiler should ideally
+  // not leak its environment into generated code, but we turn off DAZ and FTZ
+  // to get some defense-in-depth.
+  tensorflow::port::ScopedDontFlushDenormal dont_flush_denormals;
+
   std::vector<std::unique_ptr<Executable>> result;
   for (size_t i = 0; i < modules.size(); i++) {
     if (stream_execs[i].size() != 1) {
diff --git a/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc b/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc
index 6384c7f46f5ebbedaeda232b40095611a5d738a4..f3642cf0a1c202e785d8e2d3fe469f95eff212c8 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/ir_array.cc
@@ -160,7 +160,8 @@ IrArray::Index IrArray::Index::SourceIndexOfReshape(
     }
   }
 
-  if (linear() != nullptr &&
+  if (linear() != nullptr && LayoutUtil::HasLayout(input_shape) &&
+      LayoutUtil::HasLayout(output_shape) &&
       ShapeUtil::ReshapeIsBitcast(input_shape, output_shape)) {
     return Index(source_multidim_index, linear(), input_shape);
   }
@@ -195,10 +196,13 @@ IrArray::Index IrArray::Index::SourceIndexOfTranspose(
     llvm::IRBuilder<>* builder) const {
   std::vector<llvm::Value*> operand_multidim_index =
       Permute(dimension_mapping, multidim());
-  if (linear() != nullptr &&
+
+  if (linear() != nullptr && LayoutUtil::HasLayout(operand_shape) &&
+      LayoutUtil::HasLayout(shape) &&
       ShapeUtil::TransposeIsBitcast(operand_shape, shape, dimension_mapping)) {
     return Index(operand_multidim_index, linear(), operand_shape);
   }
+
   return Index(operand_multidim_index);
 }
 
diff --git a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
index 5c1866311d1ae1e0c33ab061ee326d86d647a908..2a282f3be79f847a6569416794d1a2a3fcd69148 100644
--- a/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
+++ b/tensorflow/compiler/xla/service/llvm_ir/llvm_util.cc
@@ -106,8 +106,10 @@ llvm::Value* EmitFloatMax(llvm::Value* lhs_value, llvm::Value* rhs_value,
     auto cmp = ir_builder->CreateFCmpUGE(lhs_value, rhs_value);
     return ir_builder->CreateSelect(cmp, lhs_value, rhs_value);
   } else {
-    return EmitCallToIntrinsic(llvm::Intrinsic::maxnum, {lhs_value, rhs_value},
-                               {lhs_value->getType()}, ir_builder);
+    auto cmp_ge = ir_builder->CreateFCmpOGE(lhs_value, rhs_value);
+    auto lhs_is_nan = ir_builder->CreateFCmpUNE(lhs_value, lhs_value);
+    auto sel_lhs = ir_builder->CreateOr(cmp_ge, lhs_is_nan);
+    return ir_builder->CreateSelect(sel_lhs, lhs_value, rhs_value);
   }
 }
 
@@ -117,8 +119,10 @@ llvm::Value* EmitFloatMin(llvm::Value* lhs_value, llvm::Value* rhs_value,
     auto cmp = ir_builder->CreateFCmpULE(lhs_value, rhs_value);
     return ir_builder->CreateSelect(cmp, lhs_value, rhs_value);
   } else {
-    return EmitCallToIntrinsic(llvm::Intrinsic::minnum, {lhs_value, rhs_value},
-                               {lhs_value->getType()}, ir_builder);
+    auto cmp_le = ir_builder->CreateFCmpOLE(lhs_value, rhs_value);
+    auto lhs_is_nan = ir_builder->CreateFCmpUNE(lhs_value, lhs_value);
+    auto sel_lhs = ir_builder->CreateOr(cmp_le, lhs_is_nan);
+    return ir_builder->CreateSelect(sel_lhs, lhs_value, rhs_value);
   }
 }
 
diff --git a/tensorflow/compiler/xla/service/service.cc b/tensorflow/compiler/xla/service/service.cc
index 98dfc89867ab33788c4cc837a66d6751a1ef2507..43d0f605985819afdaf2db2309a0bfb86f230fe3 100644
--- a/tensorflow/compiler/xla/service/service.cc
+++ b/tensorflow/compiler/xla/service/service.cc
@@ -44,6 +44,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/util.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/gtl/cleanup.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
@@ -1445,6 +1446,9 @@ tensorflow::Status Service::Op(const OpRequest* arg, OpResponse* result) {
     case OpRequest::kFftRequest:
       handle_status = computation->AddFftInstruction(arg->fft_request());
       break;
+    case OpRequest::kGatherRequest:
+      handle_status = computation->AddGatherInstruction(arg->gather_request());
+      break;
     case OpRequest::kGetTupleElementRequest:
       handle_status = computation->AddGetTupleElementInstruction(
           arg->get_tuple_element_request());
@@ -1456,6 +1460,10 @@ tensorflow::Status Service::Op(const OpRequest* arg, OpResponse* result) {
       handle_status =
           computation->AddOutfeedInstruction(arg->outfeed_request());
       break;
+    case OpRequest::kHostComputeRequest:
+      handle_status =
+          computation->AddHostComputeInstruction(arg->host_compute_request());
+      break;
     case OpRequest::kMapRequest: {
       TF_ASSIGN_OR_RETURN(
           UserComputation * to_apply,
@@ -1548,8 +1556,10 @@ tensorflow::Status Service::Op(const OpRequest* arg, OpResponse* result) {
     case OpRequest::kSendRequest: {
       TF_RETURN_IF_ERROR(
           channel_tracker_.RegisterSend(arg->send_request().channel_handle()));
-      TF_RETURN_IF_ERROR(computation->AddSendInstruction(arg->send_request()));
-      return tensorflow::Status::OK();
+      // Send does not return a value, but we need a handle to be able to
+      // set OpMetadata and OpSharding (device assignment).
+      handle_status = computation->AddSendInstruction(arg->send_request());
+      break;
     }
     case OpRequest::kRecvRequest: {
       TF_RETURN_IF_ERROR(
diff --git a/tensorflow/compiler/xla/service/shape_inference.cc b/tensorflow/compiler/xla/service/shape_inference.cc
index 004889b5f216015ee1e1308702b2bf4cb0deb344..c54cb3b48d962bffa36e3688f8b3c79a72d62ee7 100644
--- a/tensorflow/compiler/xla/service/shape_inference.cc
+++ b/tensorflow/compiler/xla/service/shape_inference.cc
@@ -169,11 +169,11 @@ bool AllUnique(tensorflow::gtl::ArraySlice<int64> slice) {
 tensorflow::Status ExpectNotTupleOrOpaque(const Shape& shape,
                                           tensorflow::StringPiece op_type) {
   if (ShapeUtil::IsTuple(shape)) {
-    return InvalidArgument("Expected non-tuple argument for %s. Got: %s",
+    return InvalidArgument("Expected non-tuple argument for %s, but got %s.",
                            op_type.ToString().c_str(),
                            ShapeUtil::HumanString(shape).c_str());
   } else if (ShapeUtil::IsOpaque(shape)) {
-    return InvalidArgument("Expected non-opaque argument for %s. Got: %s",
+    return InvalidArgument("Expected non-opaque argument for %s, but got %s.",
                            op_type.ToString().c_str(),
                            ShapeUtil::HumanString(shape).c_str());
   } else {
@@ -193,8 +193,7 @@ tensorflow::Status VerifyReducerShape(const ProgramShape& reducer_shape,
 
   const Shape& accumulator_shape = reducer_shape.result();
   if (ShapeUtil::Rank(accumulator_shape) != 0) {
-    return Unimplemented(
-        "Reduction function currently must have rank-0 result.");
+    return InvalidArgument("Reduction function must have rank 0.");
   }
 
   // Check that the accumulator can be passed in as the first argument.
@@ -235,8 +234,8 @@ tensorflow::Status VerifyReducerShape(const ProgramShape& reducer_shape,
   if (!ShapeUtil::CompatibleIgnoringFpPrecision(accumulator_shape,
                                                 reducer_shape.parameters(1))) {
     return InvalidArgument(
-        "Reduction function's second parameter shape currently must "
-        "match the result shape. Got %s vs %s",
+        "Reduction function's second parameter shape must "
+        "match the result shape, but got %s vs %s.",
         ShapeUtil::HumanString(reducer_shape.parameters(1)).c_str(),
         ShapeUtil::HumanString(accumulator_shape).c_str());
   }
@@ -258,29 +257,29 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
   for (int64 i = 0; i < window.dimensions_size(); ++i) {
     const auto& dim = window.dimensions(i);
     if (dim.size() <= 0) {
-      return InvalidArgument("Window has a non-positive dimension. Window: %s",
+      return InvalidArgument("Window %s has a non-positive dimension.",
                              window.DebugString().c_str());
     }
     if (dim.stride() <= 0) {
-      return InvalidArgument("Window has a non-positive stride. Window: %s",
+      return InvalidArgument("Window %s has a non-positive stride.",
                              window.DebugString().c_str());
     }
     if (!allow_negative_padding && dim.padding_low() < 0) {
-      return InvalidArgument("Window has a negative low padding. Window: %s",
+      return InvalidArgument("Window %s has a negative low padding.",
                              window.DebugString().c_str());
     }
     if (!allow_negative_padding && dim.padding_high() < 0) {
-      return InvalidArgument("Window has a negative high padding. Window: %s",
+      return InvalidArgument("Window %s has a negative high padding.",
                              window.DebugString().c_str());
     }
     if (dim.base_dilation() < 1) {
       return InvalidArgument(
-          "Window has a non-positive base area dilation factor. Window: %s",
+          "Window %s has a non-positive base area dilation factor.",
           window.DebugString().c_str());
     }
     if (dim.window_dilation() < 1) {
       return InvalidArgument(
-          "Window has a non-positive window dilation factor. Window: %s",
+          "Window %s has a non-positive window dilation factor.",
           window.DebugString().c_str());
     }
 
@@ -320,8 +319,8 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     case UNOP_CEIL:
       if (!ShapeUtil::ElementIsFloating(arg)) {
         return InvalidArgument(
-            "expected element type in shape to be floating for floor/ceil "
-            "operation; got %s",
+            "Expected element type in shape to be floating for floor/ceil "
+            "operation; got %s.",
             PrimitiveType_Name(arg.element_type()).c_str());
       }
       return arg;
@@ -333,8 +332,8 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
       if (!ShapeUtil::ElementIsFloating(arg) &&
           !ShapeUtil::ElementIsComplex(arg)) {
         return InvalidArgument(
-            "expected element type in shape to be floating or complex for "
-            "sin/cos/exp/log/tanh operation; got %s",
+            "Expected element type in shape to be floating or complex for "
+            "sin/cos/exp/log/tanh operation; got %s.",
             PrimitiveType_Name(arg.element_type()).c_str());
       }
       return arg;
@@ -342,8 +341,8 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     case UNOP_IMAG:
       if (!ShapeUtil::ElementIsComplex(arg)) {
         return InvalidArgument(
-            "expected element type in shape to be complex for real/imag "
-            "operation; got %s",
+            "Expected element type in shape to be complex for real/imag "
+            "operation; got %s.",
             PrimitiveType_Name(arg.element_type()).c_str());
       }
       return ShapeUtil::ChangeElementType(arg, F32);
@@ -363,8 +362,8 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
       if (arg.element_type() != PRED &&
           !primitive_util::IsIntegralType(arg.element_type())) {
         return InvalidArgument(
-            "expected pred or an integral element type in argument to not "
-            "operation; got %s",
+            "Expected pred or an integral element type in argument to Not "
+            "operation; got %s.",
             PrimitiveType_Name(arg.element_type()).c_str());
       }
       return arg;
@@ -372,8 +371,8 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     case UNOP_IS_FINITE:
       if (!ShapeUtil::ElementIsFloating(arg)) {
         return InvalidArgument(
-            "expected element type in shape to be floating point for IsFinite "
-            "operation; got %s",
+            "Expected element type in shape to be floating point for IsFinite "
+            "operation; got %s.",
             PrimitiveType_Name(arg.element_type()).c_str());
       }
       return ShapeUtil::ChangeElementType(arg, PRED);
@@ -389,10 +388,10 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     tensorflow::gtl::ArraySlice<const Shape*> arg_shapes,
     const int64 dimension) {
   if (arg_shapes.empty()) {
-    return InvalidArgument("Concatenate expects at least one argument");
+    return InvalidArgument("Concatenate expects at least one argument.");
   }
   if (dimension < 0 || dimension >= ShapeUtil::Rank(*arg_shapes[0])) {
-    return InvalidArgument("dimension to concatenate along out of bounds: %lld",
+    return InvalidArgument("Concatenate dimension out of bounds: %lld.",
                            dimension);
   }
   const Shape* arg_shape = nullptr;
@@ -408,14 +407,14 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     if (ShapeUtil::Rank(*arg_shape) != ShapeUtil::Rank(*shape)) {
       return InvalidArgument(
           "Cannot concatenate arrays with different ranks: %lld (%s) vs %lld "
-          "(%s)",
+          "(%s).",
           ShapeUtil::Rank(*arg_shape),
           ShapeUtil::HumanString(*arg_shape).c_str(), ShapeUtil::Rank(*shape),
           ShapeUtil::HumanString(*shape).c_str());
     }
     if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(*arg_shape, *shape)) {
       return InvalidArgument(
-          "cannot concatenate arrays with different element types: %s vs %s",
+          "Cannot concatenate arrays with different element types: %s vs %s.",
           PrimitiveType_Name(arg_shape->element_type()).c_str(),
           PrimitiveType_Name(shape->element_type()).c_str());
     }
@@ -428,9 +427,9 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
                      // concatenating.
         }
         return InvalidArgument(
-            "cannot concatenate arrays that differ in dimensions other than "
+            "Cannot concatenate arrays that differ in dimensions other than "
             "the one being concatenated (the other array dimensions must be "
-            "the same): %s vs %s in dimension %lld",
+            "the same): %s vs %s in dimension %lld.",
             ShapeUtil::HumanString(*arg_shape).c_str(),
             ShapeUtil::HumanString(*shape).c_str(), dimension);
       }
@@ -452,7 +451,7 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
   if (primitive_util::IsComplexType(old_element_type) &&
       !primitive_util::IsComplexType(new_element_type)) {
     return Unimplemented(
-        "Unsupported conversion from complex to real type: %s => %s",
+        "Conversion from complex to real type %s => %s is not implemented.",
         ShapeUtil::HumanString(operand_shape).c_str(),
         PrimitiveType_Name(new_element_type).c_str());
   }
@@ -461,7 +460,7 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     // future, by recursing into the tuple elements to check all sub-conversions
     // are valid. For now we just reject them, though.
     return InvalidArgument(
-        "cannot convert from or to tuple type; requested conversion: %s => %s",
+        "Convert does not allow tuples, so cannot convert from %s to %s.",
         ShapeUtil::HumanString(operand_shape).c_str(),
         PrimitiveType_Name(new_element_type).c_str());
   }
@@ -474,24 +473,23 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
   auto old_element_type = operand_shape.element_type();
   if (primitive_util::IsComplexType(old_element_type) !=
       primitive_util::IsComplexType(new_element_type)) {
-    return Unimplemented(
-        "Unsupported conversion between real and complex types: %s => %s",
-        ShapeUtil::HumanString(operand_shape).c_str(),
-        PrimitiveType_Name(new_element_type).c_str());
+    return InvalidArgument("Cannot convert real to/from complex: %s => %s.",
+                           ShapeUtil::HumanString(operand_shape).c_str(),
+                           PrimitiveType_Name(new_element_type).c_str());
   }
   if (ShapeUtil::IsTuple(operand_shape) || new_element_type == TUPLE) {
     // Note: we may want to support tuple conversions via this operation in the
     // future, by recursing into the tuple elements to check all sub-conversions
     // are valid. For now we just reject them, though.
     return InvalidArgument(
-        "cannot convert from or to tuple type; requested conversion: %s => %s",
+        "Cannot convert from or to tuple type; requested conversion: %s => %s.",
         ShapeUtil::HumanString(operand_shape).c_str(),
         PrimitiveType_Name(new_element_type).c_str());
   }
   if (primitive_util::BitWidth(old_element_type) !=
       primitive_util::BitWidth(new_element_type)) {
     return InvalidArgument(
-        "cannot bitcast types with different bit-widths: %s => %s",
+        "Cannot bitcast types with different bit-widths: %s => %s.",
         PrimitiveType_Name(old_element_type).c_str(),
         PrimitiveType_Name(new_element_type).c_str());
   }
@@ -504,20 +502,20 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     const int mantissa_bits) {
   if (!ShapeUtil::ElementIsFloating(operand_shape)) {
     return InvalidArgument(
-        "expected element type in shape to be floating point for "
-        "ReducePrecision operation; got %s",
+        "Expected element type in shape to be floating point for "
+        "ReducePrecision operation; got %s.",
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
   if (exponent_bits < 1) {
     // One exponent bit is necessary to distinguish 0 from infinity.  Having
     // no exponent bits doesn't produce a sensible number, so we require at
     // least one.
-    return InvalidArgument("expected exponent_bits >= 1; got %d",
+    return InvalidArgument("Expected exponent_bits >= 1; got %d.",
                            exponent_bits);
   }
   if (mantissa_bits < 0) {
     // A number with no mantissa bits is still meaningful, however.
-    return InvalidArgument("expected non-negative mantissa_bits; got %d",
+    return InvalidArgument("Expected non-negative mantissa_bits; got %d.",
                            mantissa_bits);
   }
   return operand_shape;
@@ -528,23 +526,23 @@ StatusOr<Shape> InferWindowOutputShape(const Shape& base_shape,
     const PaddingConfig& padding_config) {
   if (ShapeUtil::IsTuple(operand_shape)) {
     return InvalidArgument(
-        "pad operation does not support tuple-shape operands");
+        "Pad operation does not support tuple-shape operands.");
   }
   if (!ShapeUtil::IsScalar(padding_value_shape)) {
     return InvalidArgument(
-        "pad operation does not support non-scalar padding values");
+        "Pad operation does not support non-scalar padding values.");
   }
   if (ShapeUtil::Rank(operand_shape) != padding_config.dimensions_size()) {
     return InvalidArgument(
         "The rank of the operand and the padding configuration do not match: "
-        "%s vs %s",
+        "%s vs %s.",
         ShapeUtil::HumanString(operand_shape).c_str(),
         padding_config.ShortDebugString().c_str());
   }
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(operand_shape,
                                                      padding_value_shape)) {
     return InvalidArgument(
-        "the element types of the operands to pad do not match");
+        "The element types of the operands to Pad do not match.");
   }
   std::vector<int64> dimensions(ShapeUtil::Rank(operand_shape));
   for (int64 i = 0; i < operand_shape.dimensions_size(); ++i) {
@@ -605,7 +603,7 @@ Status ValidateDotDimensionNumbers(
                      lhs_batch_dimensions) ||
       !dims_in_range(ShapeUtil::Rank(rhs), rhs_contracting_dimensions,
                      rhs_batch_dimensions)) {
-    return InvalidArgument("A dimension number is out of range in dot: %s",
+    return InvalidArgument("A dimension number is out of range in Dot: %s.",
                            dimension_numbers.DebugString().c_str());
   }
 
@@ -623,7 +621,7 @@ Status ValidateDotDimensionNumbers(
 
   if (!dims_unique(lhs_contracting_dimensions, lhs_batch_dimensions) ||
       !dims_unique(rhs_contracting_dimensions, rhs_batch_dimensions)) {
-    return InvalidArgument("A dimension number is not unique in dot: %s",
+    return InvalidArgument("A dimension number is not unique in Dot: %s.",
                            dimension_numbers.DebugString().c_str());
   }
 
@@ -641,8 +639,7 @@ Status ValidateDotDimensionNumbers(
       rhs_non_contracting_non_batch_dims < 0 ||
       rhs_non_contracting_non_batch_dims > 1) {
     return InvalidArgument(
-        "batch and contracting dimension number mismatch "
-        "with rank ");
+        "Batch and contracting dimension number mismatch with rank.");
   }
 
   // Check that batch dimension numbers are ordered before all others, and
@@ -654,7 +651,7 @@ Status ValidateDotDimensionNumbers(
       !std::equal(batch_dim_numbers.begin(), batch_dim_numbers.end(),
                   rhs_batch_dimensions.begin())) {
     return InvalidArgument(
-        "batch dimension numbers must precede non-batch dimensions and be"
+        "Batch dimension numbers must precede non-batch dimensions and be "
         "monotonically increasing.");
   }
 
@@ -671,22 +668,22 @@ Status ValidateDotDimensionNumbers(
 
   auto fail = [lhs, rhs](const string& addendum) -> Status {
     string message = tensorflow::strings::Printf(
-        "cannot infer shape for dot operation: %s <dot> %s",
+        "Cannot infer shape for dot operation: %s <dot> %s.",
         ShapeUtil::HumanString(lhs).c_str(),
         ShapeUtil::HumanString(rhs).c_str());
     if (!addendum.empty()) {
-      message += ": " + addendum;
+      message += " " + addendum;
     }
     return InvalidArgument("%s", message.c_str());
   };
 
   // Check if both element types are the same.
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(lhs, rhs)) {
-    return fail("element types do not match");
+    return fail("Element types do not match.");
   }
 
   if ((ShapeUtil::Rank(lhs) < 1) || (ShapeUtil::Rank(rhs) < 1)) {
-    return fail("dot only supports rank 1 or above.");
+    return fail("Dot only supports rank 1 or above.");
   }
 
   // Validate basic properties of dot dimension numbers.
@@ -696,7 +693,7 @@ Status ValidateDotDimensionNumbers(
   if (dimension_numbers.lhs_contracting_dimensions_size() !=
           dimension_numbers.rhs_contracting_dimensions_size() ||
       dimension_numbers.lhs_contracting_dimensions_size() != 1) {
-    return fail("must specify one contracting dimension for both lhs and rhs.");
+    return fail("Must specify one contracting dimension for both lhs and rhs.");
   }
 
   // Check that contracting dimension sizes match.
@@ -706,13 +703,13 @@ Status ValidateDotDimensionNumbers(
       dimension_numbers.rhs_contracting_dimensions(0);
   if (lhs.dimensions(lhs_contracting_dimension) !=
       rhs.dimensions(rhs_contracting_dimension)) {
-    return fail("contracting dimension sizes do not match.");
+    return fail("Contracting dimension sizes do not match.");
   }
 
   // Check that number of batch dimensions match.
   if (dimension_numbers.lhs_batch_dimensions_size() !=
       dimension_numbers.rhs_batch_dimensions_size()) {
-    return fail("must the same number of batch dimensions for lhs and rhs.");
+    return fail("Lhs and rhs must have the same number of batch dimensions.");
   }
 
   // Check that batch dimension numbers and sizes match.
@@ -721,7 +718,7 @@ Status ValidateDotDimensionNumbers(
             dimension_numbers.rhs_batch_dimensions(i) ||
         lhs.dimensions(dimension_numbers.lhs_batch_dimensions(i)) !=
             rhs.dimensions(dimension_numbers.rhs_batch_dimensions(i))) {
-      return fail("batch dimension numbers and sizes must match for lhs/rhs.");
+      return fail("Batch dimension numbers and sizes must match for lhs/rhs.");
     }
   }
 
@@ -770,10 +767,11 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     } else if (rhs.dimensions(i) == 1) {
       output_dimensions[i] = lhs.dimensions(i);
     } else {
-      return InvalidArgument("binary op %s with incompatible shapes: %s and %s",
-                             BinaryOperation_Name(operation).c_str(),
-                             ShapeUtil::HumanString(lhs).c_str(),
-                             ShapeUtil::HumanString(rhs).c_str());
+      return InvalidArgument(
+          "Binary op %s with incompatible shapes: %s and %s.",
+          BinaryOperation_Name(operation).c_str(),
+          ShapeUtil::HumanString(lhs).c_str(),
+          ShapeUtil::HumanString(rhs).c_str());
     }
   }
   return ShapeUtil::MakeShape(ShapeUtil::HigherPrecisionElementType(lhs, rhs),
@@ -788,15 +786,15 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     // Reject "magic" inference for binops on different shapes, requiring
     // the user to provide an explicit broadcast dimension in this case.
     // See b/25177275 for more details.
-    return InvalidArgument("automatic shape inference not supported: %s and %s",
+    return InvalidArgument("Automatic shape inference not supported: %s and %s",
                            ShapeUtil::HumanString(smaller_shape).c_str(),
                            ShapeUtil::HumanString(larger_shape).c_str());
   } else if (broadcast_dimensions.size() != ShapeUtil::Rank(smaller_shape)) {
     return InvalidArgument(
-        "size of broadcast_dimensions has to match lower-rank operand's "
+        "Size of broadcast_dimensions has to match lower-rank operand's "
         "rank; "
         " lower-rank operand's rank is %lld, size of broadcast_dimensions is "
-        "%zu",
+        "%zu.",
         ShapeUtil::Rank(smaller_shape), broadcast_dimensions.size());
   }
 
@@ -846,13 +844,13 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     int64 dimension_to_match = broadcast_dimensions.at(i);
     if (dimension_to_match < 0) {
       return InvalidArgument(
-          "broadcast dimension number (%lld) cannot be negative",
+          "Broadcast dimension number (%lld) cannot be negative.",
           dimension_to_match);
     }
     if (dimension_to_match >= larger_shape.dimensions_size()) {
       return InvalidArgument(
-          "broadcast dimension number (%lld) too large; higher-rank "
-          "operand has rank %d",
+          "Broadcast dimension number (%lld) too large; higher-rank "
+          "operand has rank %d.",
           dimension_to_match, larger_shape.dimensions_size());
     }
     int64 small_dimension_size = smaller_shape.dimensions(i);
@@ -863,7 +861,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     if (small_dimension_size != large_dimension_size &&
         small_dimension_size != 1 && large_dimension_size != 1) {
       return InvalidArgument(
-          "broadcast dimension %d mismatch: %lld != %lld; %s and %s", i,
+          "Broadcast dimension %d mismatch: %lld != %lld; %s and %s.", i,
           small_dimension_size, large_dimension_size,
           ShapeUtil::HumanString(smaller_shape).c_str(),
           ShapeUtil::HumanString(larger_shape).c_str());
@@ -872,7 +870,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     // order.
     if (i > 0 && broadcast_dimensions.at(i - 1) >= dimension_to_match) {
       return InvalidArgument(
-          "broadcast dimensions order is wrong: %lld comes after %lld",
+          "Broadcast dimensions order is wrong: %lld comes after %lld.",
           dimension_to_match, broadcast_dimensions.at(i - 1));
     }
 
@@ -892,7 +890,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(lhs, rhs)) {
     return InvalidArgument(
-        "binary op %s with different element types: %s and %s",
+        "Binary op %s with different element types: %s and %s.",
         BinaryOperation_Name(operation).c_str(),
         ShapeUtil::HumanString(lhs).c_str(),
         ShapeUtil::HumanString(rhs).c_str());
@@ -904,8 +902,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     if (!broadcast_dimensions.empty() &&
         broadcast_dimensions != identity_dims) {
       return InvalidArgument(
-          "broadcast dimensions field must either be not set or be the "
-          "identity on binary operations with operands of the same rank");
+          "Broadcast dimensions field must either be not set or be the "
+          "identity on binary operations with operands of the same rank.");
     }
   }
 
@@ -979,8 +977,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     case BINOP_COMPLEX: {
       if (!ShapeUtil::ElementIsFloating(lhs)) {
         return InvalidArgument(
-            "expected element type in shape to be floating for complex compose "
-            "operation; got %s",
+            "Expected element type in shape to be floating for complex compose "
+            "operation; got %s.",
             PrimitiveType_Name(lhs.element_type()).c_str());
       }
       TF_ASSIGN_OR_RETURN(const Shape& shape,
@@ -989,7 +987,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       if (lhs.element_type() == F32 && rhs.element_type() == F32) {
         return ShapeUtil::ChangeElementType(shape, C64);
       } else {
-        return Unimplemented("complex component type not supported");
+        return Unimplemented("Complex component type is not implemented.");
       }
     }
     case BINOP_AND:
@@ -997,8 +995,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       if (lhs.element_type() != PRED &&
           !primitive_util::IsIntegralType(lhs.element_type())) {
         return InvalidArgument(
-            "expected pred or integral type in argument to and/or operation; "
-            "got %s",
+            "Expected pred or integral type in argument to and/or operation; "
+            "got %s.",
             PrimitiveType_Name(lhs.element_type()).c_str());
       }
       return InferElementwiseBinaryOpShape(operation, lhs, rhs,
@@ -1016,7 +1014,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     }
     default:
       return Unimplemented(
-          "not yet implemented; infer binary op shape: %s; lhs: %s; rhs: %s",
+          "Binary op shape inference: %s; lhs: %s; rhs: %s is not implemented.",
           BinaryOperation_Name(operation).c_str(),
           lhs.ShortDebugString().c_str(), rhs.ShortDebugString().c_str());
   }
@@ -1041,7 +1039,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     case TRIOP_SELECT:
       return InferSelectShape(lhs, rhs, ehs);
     default:
-      return InvalidArgument("unknown operation %s",
+      return InvalidArgument("Unknown operation %s.",
                              TernaryOperation_Name(operation).c_str());
   }
 }
@@ -1072,7 +1070,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       return result;
     }
     default:
-      return InvalidArgument("unknown operation %s",
+      return InvalidArgument("Unknown operation %s.",
                              VariadicOperation_Name(operation).c_str());
   }
 }
@@ -1082,7 +1080,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const ProgramShape& to_apply,
     tensorflow::gtl::ArraySlice<int64> dimensions) {
   if (arg_shapes.empty()) {
-    return InvalidArgument("Map expects at least one argument");
+    return InvalidArgument("Map expects at least one argument.");
   }
 
   // All arguments must have the same shape.
@@ -1113,7 +1111,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     }
     return InvalidArgument(
         "Map operation requires all operands to have the same shape; got: "
-        "%s",
+        "%s.",
         Join(pieces, ", ").c_str());
   }
 
@@ -1122,7 +1120,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (dimensions.size() != arg_shape->dimensions_size()) {
     return InvalidArgument(
         "Map applied to a subset of dimensions currently not supported: "
-        "arg_dimension_size: %d, requested_map_dimensions_size: %zu",
+        "arg_dimension_size: %d, requested_map_dimensions_size: %zu.",
         arg_shape->dimensions_size(), dimensions.size());
   }
 
@@ -1130,7 +1128,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   for (int i = 0; i < dimensions.size(); ++i) {
     if (dimensions[i] != i) {
       return InvalidArgument(
-          "Map requires monotonically increasing dimension numbers, found: %s ",
+          "Map requires monotonically increasing dimension numbers; got: %s.",
           Join(dimensions, ", ").c_str());
     }
   }
@@ -1139,7 +1137,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (arg_shapes.size() != to_apply.parameters_size()) {
     return InvalidArgument(
         "Map applied function arity must match number of arguments; got: "
-        "arity: %d, arguments: %zu",
+        "arity: %d, arguments: %zu.",
         to_apply.parameters_size(), arg_shapes.size());
   }
 
@@ -1147,8 +1145,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   const Shape& output_shape = to_apply.result();
   if (!ShapeUtil::IsScalar(output_shape)) {
     return InvalidArgument(
-        "mapped computation's result has to be a scalar; "
-        "got: %s",
+        "Mapped computation's result has to be a scalar; got: %s.",
         ShapeUtil::HumanString(output_shape).c_str());
   }
 
@@ -1157,16 +1154,16 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
     if (!ShapeUtil::IsScalar(parameter_shape)) {
       return InvalidArgument(
-          "mapped computation's parameter has to be a scalar; "
-          "got parameter %d shape: %s",
+          "Mapped computation's parameter has to be a scalar; "
+          "got parameter %d shape: %s.",
           i, ShapeUtil::HumanString(parameter_shape).c_str());
     }
 
     if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(parameter_shape,
                                                        *arg_shape)) {
       return InvalidArgument(
-          "mapped computation's parameter type has to match argument element "
-          "type; got parameter %d shape: %s, argument shape: %s",
+          "Mapped computation's parameter type has to match argument element "
+          "type; got parameter %d shape: %s, argument shape: %s.",
           i, ShapeUtil::HumanString(parameter_shape).c_str(),
           ShapeUtil::HumanString(*arg_shape).c_str());
     }
@@ -1197,21 +1194,21 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Expected feature_index of batch-norm-training to be "
         "smaller than the rank of operand_shape; "
-        "got feature_index %lld, and rank %lld",
+        "got feature_index %lld, and rank %lld.",
         feature_index, ShapeUtil::Rank(operand_shape));
   }
 
   if (feature_index < 0) {
     return InvalidArgument(
         "Expected feature_index of batch-norm-training to "
-        "be a non-negative number, got %lld",
+        "be a non-negative number, got %lld.",
         feature_index);
   }
 
   if (ShapeUtil::Rank(operand_shape) < 1) {
     return InvalidArgument(
         "Expected the rank of operand to "
-        "batch-norm-training to be at least 1; got %lld",
+        "batch-norm-training to be at least 1; got %lld.",
         ShapeUtil::Rank(operand_shape));
   }
 
@@ -1232,7 +1229,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (!ShapeUtil::ElementIsFloating(operand_shape)) {
     return InvalidArgument(
         "The operand to batch-norm-training must have a floating point "
-        "element type, but the shape is %s",
+        "element type, but the shape is %s.",
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
 
@@ -1241,7 +1238,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-training, "
         "but the shape of offset factor is %s "
-        "and the shape of operand is %s",
+        "and the shape of operand is %s.",
         PrimitiveType_Name(offset_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1251,7 +1248,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-training, "
         "but the shape of scale factor is %s "
-        "and the shape of operand is %s",
+        "and the shape of operand is %s.",
         PrimitiveType_Name(scale_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1264,7 +1261,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
-        "The size of offset factor should be the same as feature count,"
+        "The size of offset factor should be the same as feature count, "
         "but the size of offset factor is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(offset_shape, 0), feature_count);
   }
 
@@ -1272,7 +1269,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
-        "The size of scale factor should be the same as feature count,"
+        "The size of scale factor should be the same as feature count, "
         "but the size of scale factor is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(scale_shape, 0), feature_count);
   }
 
@@ -1307,21 +1304,21 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Expected feature_index of batch-norm-inference to be "
         "smaller than the rank of operand_shape; "
-        "got feature_index %lld, and rank %lld",
+        "got feature_index %lld, and rank %lld.",
         feature_index, ShapeUtil::Rank(operand_shape));
   }
 
   if (feature_index < 0) {
     return InvalidArgument(
         "Expected feature_index of batch-norm-inference to "
-        "be a non-negative number, got %lld",
+        "be a non-negative number, got %lld.",
         feature_index);
   }
 
   if (ShapeUtil::Rank(operand_shape) < 1) {
     return InvalidArgument(
         "Expected the rank of operand to "
-        "batch-norm-inference to be at least 1; got %lld",
+        "batch-norm-inference to be at least 1; got %lld.",
         ShapeUtil::Rank(operand_shape));
   }
 
@@ -1342,7 +1339,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (!ShapeUtil::ElementIsFloating(operand_shape)) {
     return InvalidArgument(
         "The operand to batch-norm-inference must have a floating point "
-        "element type, but the shape is %s",
+        "element type, but the shape is %s.",
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
 
@@ -1352,7 +1349,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         "The inputs should have the same element type for "
         "batch-norm-inference, "
         "but the shape of offset factor is %s "
-        "and the shape of operand is %s",
+        "and the shape of operand is %s.",
         PrimitiveType_Name(offset_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1363,7 +1360,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         "The inputs should have the same element type for "
         "batch-norm-inference, "
         "but the shape of scale factor is %s "
-        "and the shape of operand is %s",
+        "and the shape of operand is %s.",
         PrimitiveType_Name(scale_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1374,7 +1371,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         "The inputs should have the same element type for "
         "batch-norm-inference, "
         "but the shape of mean is %s "
-        "and the shape of operand is %s",
+        "and the shape of operand is %s.",
         PrimitiveType_Name(mean_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1385,7 +1382,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         "The inputs should have the same element type for "
         "batch-norm-inference, "
         "but the shape of variance is %s "
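-        "and the shape of operand is %s",
-        PrimitiveType_Name(mean_shape.element_type()).c_str(),
-        PrimitiveType_Name(variance_shape.element_type()).c_str());
+        "and the shape of operand is %s.",
+        PrimitiveType_Name(variance_shape.element_type()).c_str(),
+        PrimitiveType_Name(operand_shape.element_type()).c_str());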
   }
@@ -1398,7 +1395,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
-        "The size of offset factor should be the same as feature count,"
+        "The size of offset factor should be the same as feature count, "
         "but the size of offset factor is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(offset_shape, 0), feature_count);
   }
 
@@ -1406,7 +1403,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
-        "The size of scale factor should be the same as feature count,"
+        "The size of scale factor should be the same as feature count, "
         "but the size of scale factor is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(scale_shape, 0), feature_count);
   }
 
@@ -1414,7 +1411,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
-        "The size of mean should be the same as feature count,"
+        "The size of mean should be the same as feature count, "
         "but the size of mean is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(mean_shape, 0), feature_count);
   }
 
@@ -1422,7 +1419,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
-        "The size of variance should be the same as feature count,"
+        "The size of variance should be the same as feature count, "
         "but the size of variance is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(variance_shape, 0), feature_count);
   }
 
@@ -1455,7 +1452,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Expected feature_index of batch-norm-grad to be "
         "smaller than the rank of operand_shape; "
-        "got feature_index %lld, and rank %lld",
+        "got feature_index %lld, and rank %lld.",
         feature_index, ShapeUtil::Rank(operand_shape));
   }
 
@@ -1463,7 +1460,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Expected operand_shape of batch-norm-grad to have the same rank as"
-        " output_grad_shape; got rank(oprand_shape) %lld, and"
+        " output_grad_shape; got rank(operand_shape) %lld, and"
-        " rank(output_grad_shape) %lld",
+        " rank(output_grad_shape) %lld.",
         ShapeUtil::Rank(operand_shape), ShapeUtil::Rank(output_grad_shape));
   }
 
@@ -1491,14 +1488,14 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (!ShapeUtil::ElementIsFloating(operand_shape)) {
     return InvalidArgument(
         "The operand to batch-norm-grad must have a floating point "
-        "element type, but the shape is %s",
+        "element type, but the shape is %s.",
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
 
   if (!ShapeUtil::ElementIsFloating(output_grad_shape)) {
     return InvalidArgument(
         "The output_grad to batch-norm-grad must have a floating point "
-        "element type, but the shape is %s",
+        "element type, but the shape is %s.",
         PrimitiveType_Name(output_grad_shape.element_type()).c_str());
   }
 
@@ -1507,7 +1504,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-grad, "
         "but the element type of output_grad is %s "
-        "and the element type of operand is %s",
+        "and the element type of operand is %s.",
         PrimitiveType_Name(output_grad_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1517,7 +1514,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-grad, "
         "but the element type of scale factor is %s "
-        "and the element type of operand is %s",
+        "and the element type of operand is %s.",
         PrimitiveType_Name(scale_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1527,7 +1524,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-grad, "
         "but the element type of mean is %s "
-        "and the element type of operand is %s",
+        "and the element type of operand is %s.",
         PrimitiveType_Name(mean_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1537,7 +1534,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "The inputs should have the same element type for batch-norm-grad, "
         "but the element type of mean is %s "
-        "and the element type of operand is %s",
+        "and the element type of operand is %s.",
         PrimitiveType_Name(mean_shape.element_type()).c_str(),
         PrimitiveType_Name(operand_shape.element_type()).c_str());
   }
@@ -1551,7 +1548,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
-        "The size of mean should be the same as feature count,"
-        "but the size of offset factor is %lld "
-        "and the feature count is %lld",
+        "The size of mean should be the same as feature count, "
+        "but the size of mean is %lld "
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(mean_shape, 0), feature_count);
   }
 
@@ -1559,7 +1556,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
-        "The size of scale factor should be the same as feature count,"
+        "The size of scale factor should be the same as feature count, "
         "but the size of scale factor is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(scale_shape, 0), feature_count);
   }
 
@@ -1567,7 +1564,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
-        "The size of variance should be the same as feature count,"
+        "The size of variance should be the same as feature count, "
         "but the size of variance is %lld "
-        "and the feature count is %lld",
+        "and the feature count is %lld.",
         ShapeUtil::GetDimension(var_shape, 0), feature_count);
   }
 
@@ -1578,7 +1575,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       return InvalidArgument(
-          "The bounds of operand shape should be the same as output_grad's,"
+          "The bounds of operand shape should be the same as output_grad's, "
           "but the bound of operand_shape at dimension %lld is %lld "
-          "and the bound of output_grad_shape is %lld",
+          "and the bound of output_grad_shape is %lld.",
           i, ShapeUtil::GetDimension(operand_shape, i),
           ShapeUtil::GetDimension(output_grad_shape, i));
     }
@@ -1596,7 +1593,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(lhs, rhs)) {
     return InvalidArgument(
-        "Convolution with different element types: %s and %s",
+        "Convolution with different element types: %s and %s.",
         ShapeUtil::HumanString(lhs).c_str(),
         ShapeUtil::HumanString(rhs).c_str());
   }
@@ -1612,21 +1609,19 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (window.dimensions_size() != num_spatial_dims) {
     return InvalidArgument(
         "Window must have same number of dimensions as dimension numbers.\n"
-        "Window: %s\nDimension numbers: %s",
+        "Window: %s\nDimension numbers: %s.",
         window.DebugString().c_str(), dnums.DebugString().c_str());
   }
 
   const int num_dims = num_spatial_dims + 2;
   if (ShapeUtil::Rank(lhs) != num_dims) {
     return InvalidArgument(
-        "The LHS argument to a convolution should have rank %d.\n"
-        "lhs: %s",
+        "The LHS argument to a convolution should have rank %d; lhs: %s.",
         num_dims, ShapeUtil::HumanString(lhs).c_str());
   }
   if (ShapeUtil::Rank(rhs) != num_dims) {
     return InvalidArgument(
-        "The RHS argument to a convolution should have rank %d.\n"
-        "lhs: %s",
-        num_dims, ShapeUtil::HumanString(lhs).c_str());
+        "The RHS argument to a convolution should have rank %d; rhs: %s.",
+        num_dims, ShapeUtil::HumanString(rhs).c_str());
   }
   TF_DCHECK_OK(ShapeUtil::ValidateShapeWithOptionalLayout(lhs));
@@ -1663,26 +1658,26 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       !std::all_of(window_dnums.begin(), window_dnums.end(), in_range) ||
       !std::all_of(output_dnums.begin(), output_dnums.end(), in_range)) {
     return InvalidArgument(
-        "A dimension number is out of range in convolution: %s",
+        "A dimension number is out of range in convolution: %s.",
         dnums.DebugString().c_str());
   }
 
   if (input_dnums != expected_dnums) {
     return InvalidArgument(
         "Input dimensions of convolution must contain each dimension exactly "
-        "once: %s",
+        "once: %s.",
         dnums.DebugString().c_str());
   }
   if (window_dnums != expected_dnums) {
     return InvalidArgument(
         "Window dimensions of convolution must contain each dimension exactly "
-        "once: %s",
+        "once: %s.",
         dnums.DebugString().c_str());
   }
   if (output_dnums != expected_dnums) {
     return InvalidArgument(
         "Output dimensions of convolution must contain each dimension exactly "
-        "once: %s",
+        "once: %s.",
         dnums.DebugString().c_str());
   }
 
@@ -1706,7 +1701,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Expected LHS feature dimension (value %lld) to match RHS "
         "input feature dimension (value %lld); got <conv>(%s, %s)\n"
-        "Dimension numbers: {%s}",
+        "Dimension numbers: {%s}.",
         input_features, kernel_input_features,
         ShapeUtil::HumanString(lhs).c_str(),
         ShapeUtil::HumanString(rhs).c_str(), dnums.DebugString().c_str());
@@ -1720,7 +1715,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
         "Window dimensions do not match RHS shape:\n\t"
         "RHS shape: %s\n\t"
         "Window: {%s}\n\t"
-        "Dimension numbers: {%s}",
+        "Dimension numbers: {%s}.",
         ShapeUtil::HumanString(rhs).c_str(), window.ShortDebugString().c_str(),
         dnums.ShortDebugString().c_str());
   }
@@ -1748,8 +1743,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const tensorflow::gtl::ArraySlice<int64> fft_length) {
   const int64 fft_rank = fft_length.size();
   if (fft_rank < 1 || fft_rank > 3) {
-    return InvalidArgument("FFT only supports ranks 1-3, but got %lld",
-                           fft_rank);
+    return InvalidArgument("FFT only supports ranks 1-3; got %lld.", fft_rank);
   }
 #define RET_CHECK_RANK(x)                              \
   if (x.dimensions_size() < fft_rank) {                \
@@ -1762,7 +1756,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     case FFT:
     case IFFT:
       if (in.element_type() != C64) {
-        return InvalidArgument("%s requires C64 input type, found %s",
+        return InvalidArgument("%s requires C64 input type, found %s.",
                                FftType_Name(fft_type).c_str(),
                                PrimitiveType_Name(in.element_type()).c_str());
       }
@@ -1770,7 +1764,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
       return in;
     case RFFT: {
       if (in.element_type() != F32) {
-        return InvalidArgument("RFFT requires F32 input type, found %s",
+        return InvalidArgument("RFFT requires F32 input type, found %s.",
                                PrimitiveType_Name(in.element_type()).c_str());
       }
       RET_CHECK_RANK(in);
@@ -1779,7 +1773,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
             fft_length[i]) {
           return InvalidArgument(
               "RFFT requires innermost dimensions match fft_length but "
-              "dimension %lld is %lld and should be %lld",
+              "dimension %lld is %lld and should be %lld.",
               in.dimensions_size() - fft_rank + i,
               in.dimensions(in.dimensions_size() - fft_rank + i),
               fft_length[i]);
@@ -1792,7 +1786,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     }
     case IRFFT: {
       if (in.element_type() != C64) {
-        return InvalidArgument("IRFFT requires C64 input type, found %s",
+        return InvalidArgument("IRFFT requires C64 input type, found %s.",
                                PrimitiveType_Name(in.element_type()).c_str());
       }
       RET_CHECK_RANK(in);
@@ -1802,7 +1796,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
             fft_length[i]) {
           return InvalidArgument(
               "IRFFT requires all but one innermost dimensions match "
-              "fft_length, but dimension %lld is %lld and should be %lld",
+              "fft_length, but dimension %lld is %lld and should be %lld.",
               in.dimensions_size() - fft_rank + i,
               in.dimensions(in.dimensions_size() - fft_rank + i),
               fft_length[i]);
@@ -1812,7 +1806,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
           fft_length[fft_rank - 1] / 2 + 1) {
         return InvalidArgument(
             "IRFFT requires innermost dimension matches fft_length/2+1, but "
-            "dimension %d is %lld and should be %lld",
+            "dimension %d is %lld and should be %lld.",
             in.dimensions_size() - 1, in.dimensions(in.dimensions_size() - 1),
             fft_length[fft_rank - 1] / 2 + 1);
       }
@@ -1850,8 +1844,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   for (int64 dimension : dimensions_to_reduce) {
     if (dimension >= ShapeUtil::Rank(arg) || dimension < 0) {
       return InvalidArgument(
-          "attempting to reduce out-of-bounds dimension %lld in shape %s",
-          dimension, ShapeUtil::HumanString(arg).c_str());
+          "Reducing out-of-bounds dimension %lld in shape %s.", dimension,
+          ShapeUtil::HumanString(arg).c_str());
     }
   }
   TF_RETURN_IF_ERROR(
@@ -1891,30 +1885,30 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   // Check if the select function has a proper shape of (T,T) -> PRED.
   if (select_shape.parameters_size() != 2) {
     return InvalidArgument(
-        "select function must take 2 parameters, but "
+        "Select function must take 2 parameters, but "
         "takes %d parameter(s).",
         select_shape.parameters_size());
   }
   const Shape& select_result_shape = select_shape.result();
   if (!ShapeUtil::Compatible(select_result_shape,
                              ShapeUtil::MakeShape(PRED, {}))) {
-    return Unimplemented("select function must have rank-0 PRED result.");
+    return InvalidArgument("Select function must have rank-0 PRED result.");
   }
   const Shape& operand_element_shape =
       ShapeUtil::MakeShape(operand_shape.element_type(), {});
   if (!ShapeUtil::CompatibleIgnoringFpPrecision(operand_element_shape,
                                                 select_shape.parameters(0))) {
     return InvalidArgument(
-        "select function's first parameter shape currently must "
-        "match the operand element shape. Got %s vs %s",
+        "Select function's first parameter shape currently must "
+        "match the operand element shape, but got %s vs %s.",
         ShapeUtil::HumanString(select_shape.parameters(0)).c_str(),
         ShapeUtil::HumanString(operand_element_shape).c_str());
   }
   if (!ShapeUtil::CompatibleIgnoringFpPrecision(operand_element_shape,
                                                 select_shape.parameters(1))) {
     return InvalidArgument(
-        "select function's second parameter shape currently must "
-        "match the operand element shape. Got %s vs %s",
+        "Select function's second parameter shape currently must "
+        "match the operand element shape, but got %s vs %s.",
         ShapeUtil::HumanString(select_shape.parameters(1)).c_str(),
         ShapeUtil::HumanString(operand_element_shape).c_str());
   }
@@ -1931,8 +1925,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   if (!ShapeUtil::CompatibleIgnoringFpPrecision(source_shape,
                                                 window_result_shape)) {
     return InvalidArgument(
-        "source shape does not match the shape of window-reduced operand: "
-        "source(%s), window-reduced operand(%s)",
+        "Source shape does not match the shape of window-reduced operand: "
+        "source(%s), window-reduced operand(%s).",
         ShapeUtil::HumanString(source_shape).c_str(),
         ShapeUtil::HumanString(window_result_shape).c_str());
   }
@@ -1946,7 +1940,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   auto error = [&](const string& message) {
     return InvalidArgument(
         "%s in slice operation; argument shape: %s; starts: {%s}; limits: "
-        "{%s}; strides: {%s}",
+        "{%s}; strides: {%s}.",
         message.c_str(), ShapeUtil::HumanString(arg).c_str(),
         Join(starts, ",").c_str(), Join(limits, ",").c_str(),
         Join(strides, ",").c_str());
@@ -1969,7 +1963,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (starts.size() != ShapeUtil::Rank(arg)) {
     return InvalidArgument(
-        "slice index count does not match argument rank: %zu vs %lld",
+        "Slice index count does not match argument rank: %zu vs %lld.",
         starts.size(), ShapeUtil::Rank(arg));
   }
 
@@ -1979,7 +1973,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     int64 limit_index = limits[dimension];
     int64 stride = strides[dimension];
     if (start_index < 0) {
-      return InvalidArgument("negative start index to slice: %lld",
+      return InvalidArgument("Negative start index to slice: %lld.",
                              start_index);
     }
     if (limit_index > arg.dimensions(dimension)) {
@@ -1999,7 +1993,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
                  limit_index, start_index));
     }
     if (stride <= 0) {
-      return InvalidArgument("stride (%lld) must be positive", stride);
+      return InvalidArgument("Stride (%lld) must be positive.", stride);
     }
     sizes.push_back((limit_index - start_index + stride - 1) / stride);
   }
@@ -2023,20 +2017,20 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (ShapeUtil::Rank(start_indices_shape) != 1) {
     return InvalidArgument(
-        "dynamic slice start indices of rank %lld must be rank1.",
+        "Dynamic slice start indices of rank %lld must be rank1.",
         ShapeUtil::Rank(start_indices_shape));
   }
 
   if (!ShapeUtil::ElementIsIntegral(start_indices_shape)) {
     return InvalidArgument(
-        "dynamic slice start indices must be of integral type.");
+        "Dynamic slice start indices must be of integral type.");
   }
 
   const int64 start_num_dims = start_indices_shape.dimensions(0);
   if (ShapeUtil::Rank(operand_shape) != start_num_dims) {
     return InvalidArgument(
-        "dynamic slice start number of dimensions %lld (%s) must match rank "
-        "%lld of slice input (%s)",
+        "Dynamic slice start number of dimensions %lld (%s) must match rank "
+        "%lld of slice input (%s).",
         start_num_dims, ShapeUtil::HumanString(start_indices_shape).c_str(),
         ShapeUtil::Rank(operand_shape),
         ShapeUtil::HumanString(operand_shape).c_str());
@@ -2044,7 +2038,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (slice_sizes.size() != ShapeUtil::Rank(operand_shape)) {
     return InvalidArgument(
-        "dynamic slice index count does not match argument rank: %zu vs %lld",
+        "Dynamic slice index count does not match argument rank: %zu vs %lld.",
         slice_sizes.size(), ShapeUtil::Rank(operand_shape));
   }
 
@@ -2052,12 +2046,12 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const int64 input_dim_size = operand_shape.dimensions(dim);
     const int64 slice_dim_size = slice_sizes[dim];
     if (slice_dim_size < 0) {
-      return InvalidArgument("negative size index to dynamic slice: %lld",
+      return InvalidArgument("Negative size index to dynamic slice: %lld.",
                              slice_dim_size);
     }
     if (slice_dim_size > input_dim_size) {
       return InvalidArgument(
-          "slice dim size %lld greater than dynamic slice dimension: %lld",
+          "Slice dim size %lld greater than dynamic slice dimension: %lld.",
           slice_dim_size, input_dim_size);
     }
     VLOG(2) << tensorflow::strings::Printf("slice_sizes[%lld] = %lld", dim,
@@ -2086,20 +2080,20 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (ShapeUtil::Rank(start_indices_shape) != 1) {
     return InvalidArgument(
-        "dynamic update slice start indices of rank %lld must be rank1.",
+        "Dynamic update slice start indices of rank %lld must be rank1.",
         ShapeUtil::Rank(start_indices_shape));
   }
 
   if (!ShapeUtil::ElementIsIntegral(start_indices_shape)) {
     return InvalidArgument(
-        "dynamic update slice start indices must be of integral type.");
+        "Dynamic update slice start indices must be of integral type.");
   }
 
   const int64 start_num_dims = start_indices_shape.dimensions(0);
   if (ShapeUtil::Rank(operand_shape) != start_num_dims) {
     return InvalidArgument(
-        "dynamic slice start number of dimensions %lld (%s) must match rank "
-        "%lld of slice input (%s)",
+        "Dynamic slice start number of dimensions %lld (%s) must match rank "
+        "%lld of slice input (%s).",
         start_num_dims, ShapeUtil::HumanString(start_indices_shape).c_str(),
         ShapeUtil::Rank(operand_shape),
         ShapeUtil::HumanString(operand_shape).c_str());
@@ -2107,16 +2101,16 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (ShapeUtil::Rank(update_shape) != ShapeUtil::Rank(operand_shape)) {
     return InvalidArgument(
-        "dynamic update slice update rank does not match argument rank: "
-        "%lld vs %lld",
+        "Dynamic update slice update rank does not match argument rank: "
+        "%lld vs %lld.",
         ShapeUtil::Rank(update_shape), ShapeUtil::Rank(operand_shape));
   }
 
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(operand_shape,
                                                      update_shape)) {
     return InvalidArgument(
-        "dynamic update slice update element type does not match argument. "
-        "operand.element_type: %s vs update.element_type: %s",
+        "Dynamic update slice update element type does not match argument. "
+        "operand.element_type: %s vs update.element_type: %s.",
         PrimitiveType_Name(operand_shape.element_type()).c_str(),
         PrimitiveType_Name(update_shape.element_type()).c_str());
   }
@@ -2126,12 +2120,12 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const int64 update_dim_size = update_shape.dimensions(dim);
     if (update_dim_size < 0) {
       return InvalidArgument(
-          "size index %lld to dynamic update slice must be >= 0",
+          "Size index %lld to dynamic update slice must be >= 0.",
           update_dim_size);
     }
     if (update_dim_size > input_dim_size) {
       return InvalidArgument(
-          "update dim size %lld greater than dynamic slice dimension: %lld",
+          "Update dim size %lld greater than dynamic slice dimension: %lld.",
           update_dim_size, input_dim_size);
     }
     VLOG(2) << tensorflow::strings::Printf("update_sizes[%lld] = %lld", dim,
@@ -2151,7 +2145,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   for (int64 dimension : dimensions) {
     if (dimension >= ShapeUtil::Rank(operand_shape) || dimension < 0) {
       return InvalidArgument(
-          "one of the reverse dimensions (%lld) is out-of-bounds in shape %s",
+          "One of the reverse dimensions (%lld) is out-of-bounds in shape %s.",
           dimension, ShapeUtil::HumanString(operand_shape).c_str());
     }
   }
@@ -2162,14 +2156,14 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const Shape& arg, int64 index) {
   if (!ShapeUtil::IsTuple(arg)) {
     return InvalidArgument(
-        "cannot infer shape: attempting to index into non-tuple: %s",
+        "Cannot infer shape: attempting to index into non-tuple: %s.",
         ShapeUtil::HumanString(arg).c_str());
   }
 
   if (index >= arg.tuple_shapes_size()) {
     return InvalidArgument(
-        "cannot infer shape: attempt to index out of tuple bounds: %lld "
-        ">= %d in shape %s",
+        "Cannot infer shape: attempt to index out of tuple bounds: %lld "
+        ">= %d in shape %s.",
         index, arg.tuple_shapes_size(), ShapeUtil::HumanString(arg).c_str());
   }
 
@@ -2181,17 +2175,17 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const Shape& init) {
   // Check the number of parameters for given computations.
   if (condition.parameters_size() != 1) {
-    return InvalidArgument("condition must take 1 arguments; got %d",
+    return InvalidArgument("Condition must take 1 arguments; got %d.",
                            condition.parameters_size());
   }
   if (body.parameters_size() != 1) {
-    return InvalidArgument("body must take 1 arguments; got %d",
+    return InvalidArgument("Body must take 1 arguments; got %d.",
                            body.parameters_size());
   }
 
   auto shape_string = [&]() {
     return tensorflow::strings::Printf(
-        "condition: %s; body: %s; init: %s",
+        "Condition: %s; body: %s; init: %s.",
         ShapeUtil::HumanString(condition).c_str(),
         ShapeUtil::HumanString(body).c_str(),
         ShapeUtil::HumanString(init).c_str());
@@ -2199,15 +2193,15 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   // Check the shapes of computation parameters and return types.
   if (!ShapeUtil::ShapeIs(condition.result(), PRED, {})) {
-    return InvalidArgument("condition must return a boolean; got %s",
+    return InvalidArgument("Condition must return a boolean; got %s.",
                            shape_string().c_str());
   }
   if (!ShapeUtil::Compatible(body.result(), condition.parameters(0)) ||
       !ShapeUtil::Compatible(body.result(), body.parameters(0)) ||
       !ShapeUtil::Compatible(body.result(), init)) {
     return InvalidArgument(
-        "the parameter of condition and body, the result of the body, and init "
-        "must all have the same shape; got %s",
+        "The parameter of condition and body, the result of the body, and init "
+        "must all have the same shape; got %s.",
         shape_string().c_str());
   }
 
@@ -2219,7 +2213,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     const Shape& false_operand, const ProgramShape& true_computation,
     const ProgramShape& false_computation) {
   if (!ShapeUtil::ShapeIs(predicate, PRED, {})) {
-    return InvalidArgument("predicate must be a boolean; got %s.",
+    return InvalidArgument("Predicate must be a boolean; got %s.",
                            ShapeUtil::HumanString(predicate).c_str());
   }
 
@@ -2302,8 +2296,8 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
 
   if (ShapeUtil::ElementsIn(operand) != ShapeUtil::ElementsIn(inferred_shape)) {
     return InvalidArgument(
-        "reshape operation has mismatched element counts: from=%lld (%s) "
-        "to=%lld (%s)",
+        "Reshape operation has mismatched element counts: from=%lld (%s) "
+        "to=%lld (%s).",
         ShapeUtil::ElementsIn(operand), ShapeUtil::HumanString(operand).c_str(),
         ShapeUtil::ElementsIn(inferred_shape),
         ShapeUtil::HumanString(inferred_shape).c_str());
@@ -2351,7 +2345,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   TF_RETURN_IF_ERROR(ExpectNotTupleOrOpaque(max, "clamp max"));
   if (!ShapeUtil::SameElementTypeIgnoringFpPrecision(min, operand) ||
       !ShapeUtil::SameElementTypeIgnoringFpPrecision(max, operand)) {
-    return InvalidArgument("clamp op with different operand types: %s, %s, %s",
+    return InvalidArgument("Clamp with different operand types: %s, %s, %s.",
                            ShapeUtil::HumanString(min).c_str(),
                            ShapeUtil::HumanString(operand).c_str(),
                            ShapeUtil::HumanString(max).c_str());
@@ -2372,7 +2366,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     }
   }
   return Unimplemented(
-      "not yet implemented: %s, %s <clamp> %s", min.ShortDebugString().c_str(),
+      "%s, %s <clamp> %s is not implemented.", min.ShortDebugString().c_str(),
       max.ShortDebugString().c_str(), operand.ShortDebugString().c_str());
 }
 
@@ -2391,13 +2385,13 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   }
   if (!compatible) {
     return InvalidArgument(
-        "operands to select must be the same shape; got %s and %s",
+        "Operands to select must be the same shape; got %s and %s.",
         ShapeUtil::HumanString(on_true).c_str(),
         ShapeUtil::HumanString(on_false).c_str());
   }
   if (pred.element_type() != PRED) {
     return InvalidArgument(
-        "select's pred operand must have PRED element type; got %s",
+        "Select's pred operand must have PRED element type; got %s.",
         ShapeUtil::HumanString(pred).c_str());
   }
   if (ShapeUtil::SameDimensions(pred, on_true) || ShapeUtil::Rank(pred) == 0) {
@@ -2407,9 +2401,9 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return ShapeUtil::ChangeElementType(
         on_true, ShapeUtil::HigherPrecisionElementType(on_true, on_false));
   } else {
-    return Unimplemented(
-        "select operation with non-scalar predicate with dimensionality "
-        " different from the other operands: %s",
+    return InvalidArgument(
+        "Select operation with non-scalar predicate with dimensionality "
+        "different from the other operands: %s.",
         ShapeUtil::HumanString(pred).c_str());
   }
 }
@@ -2427,7 +2421,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     return InvalidArgument(
         "Call applied function arity must match number of arguments; got: "
         "arity: %d, arguments: %zu; computation signature: %s; argument "
-        "shapes: [%s]",
+        "shapes: [%s].",
         to_apply.parameters_size(), arg_shapes.size(),
         computation_signature.c_str(), argument_shapes.c_str());
   }
@@ -2439,7 +2433,7 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
     if (!ShapeUtil::Compatible(arg_shape, param_shape)) {
       return InvalidArgument(
           "Call parameter must match argument; got parameter %d shape: %s, "
-          "argument shape: %s",
+          "argument shape: %s.",
           i, ShapeUtil::HumanString(param_shape).c_str(),
           ShapeUtil::HumanString(arg_shape).c_str());
     }
@@ -2448,4 +2442,209 @@ ShapeInference::InferDegenerateDimensionBroadcastShape(
   return to_apply.result();
 }
 
+static Status ValidateGatherDimensionNumbers(
+    const Shape& input_shape,
+    tensorflow::gtl::ArraySlice<int64> gather_indices_shape,
+    const GatherDimensionNumbers& dim_numbers) {
+  if (!c_is_sorted(dim_numbers.output_window_dims())) {
+    return InvalidArgument(
+        "Output window dimensions in gather op must be ascending; got: %s.",
+        Join(dim_numbers.output_window_dims(), ", ").c_str());
+  }
+
+  if (c_adjacent_find(dim_numbers.output_window_dims()) !=
+      dim_numbers.output_window_dims().end()) {
+    return InvalidArgument(
+        "Output window dimensions in gather op must not repeat; got: %s.",
+        Join(dim_numbers.output_window_dims(), ", ").c_str());
+  }
+
+  const int64 output_window_dim_count = dim_numbers.output_window_dims_size();
+  const int64 output_shape_rank =
+      output_window_dim_count + gather_indices_shape.size() - 1;
+
+  for (int i = 0; i < dim_numbers.output_window_dims_size(); ++i) {
+    int64 window_index = dim_numbers.output_window_dims(i);
+    if (window_index < 0 || window_index >= output_shape_rank) {
+      return InvalidArgument(
+          "Window index %d in gather op is out of bounds; got %lld, but should "
+          "have been in [0,%lld).",
+          i, window_index, output_shape_rank);
+    }
+  }
+
+  if (dim_numbers.gather_dims_to_operand_dims_size() !=
+      gather_indices_shape[dim_numbers.index_vector_dim()]) {
+    return InvalidArgument(
+        "Gather op has %d elements in gather_dims_to_operand_dims and the "
+        "bound of dimension index_vector_dim=%lld of gather_indices is "
+        "%lld. These two numbers must be equal.",
+        dim_numbers.gather_dims_to_operand_dims_size(),
+        dim_numbers.index_vector_dim(),
+        gather_indices_shape[dim_numbers.index_vector_dim()]);
+  }
+
+  for (int i = 0; i < dim_numbers.gather_dims_to_operand_dims_size(); i++) {
+    int64 gather_dim_to_input_dim = dim_numbers.gather_dims_to_operand_dims(i);
+    if (gather_dim_to_input_dim < 0 ||
+        gather_dim_to_input_dim >= input_shape.dimensions_size()) {
+      return InvalidArgument(
+          "Invalid gather_dims_to_operand_dims mapping; domain is [0, %d), "
+          "got: %d->%lld.",
+          input_shape.dimensions_size(), i, gather_dim_to_input_dim);
+    }
+  }
+
+  std::vector<int64> sorted_gather_dims_to_operand_dims(
+      dim_numbers.gather_dims_to_operand_dims().begin(),
+      dim_numbers.gather_dims_to_operand_dims().end());
+
+  c_sort(sorted_gather_dims_to_operand_dims);
+
+  if (c_adjacent_find(sorted_gather_dims_to_operand_dims) !=
+      sorted_gather_dims_to_operand_dims.end()) {
+    return InvalidArgument(
+        "Repeated dimensions are not allowed in gather_dims_to_operand_dims; "
+        "got: %s.",
+        Join(dim_numbers.gather_dims_to_operand_dims(), ", ").c_str());
+  }
+
+  for (int64 elided_dim : dim_numbers.elided_window_dims()) {
+    if (elided_dim < 0 || elided_dim >= input_shape.dimensions_size()) {
+      return InvalidArgument(
+          "Invalid elided_window_dims set in gather op; valid range is [0, "
+          "%d), got: %lld.",
+          input_shape.dimensions_size(), elided_dim);
+    }
+  }
+
+  if (!c_is_sorted(dim_numbers.elided_window_dims())) {
+    return InvalidArgument(
+        "elided_window_dims in gather op must be sorted; got: %s.",
+        Join(dim_numbers.elided_window_dims(), ", ").c_str());
+  }
+
+  if (c_adjacent_find(dim_numbers.elided_window_dims()) !=
+      dim_numbers.elided_window_dims().end()) {
+    return InvalidArgument(
+        "Repeated dimensions not allowed in elided_window_dims in gather op; "
+        "got: %s.",
+        Join(dim_numbers.elided_window_dims(), ", ").c_str());
+  }
+
+  return Status::OK();
+}
+
+/*static*/ StatusOr<Shape> ShapeInference::InferGatherShape(
+    const Shape& input_shape, const Shape& gather_indices_shape,
+    const GatherDimensionNumbers& gather_dim_numbers,
+    tensorflow::gtl::ArraySlice<int64> window_bounds) {
+  TF_RETURN_IF_ERROR(
+      ExpectNotTupleOrOpaque(input_shape, "input tensor operand of gather op"));
+  TF_RETURN_IF_ERROR(ExpectNotTupleOrOpaque(
+      gather_indices_shape, "gather indices operand of gather op"));
+
+  if (!ShapeUtil::ElementIsIntegral(gather_indices_shape)) {
+    return InvalidArgument(
+        "Gather indices parameter must be an integral tensor; got %s.",
+        ShapeUtil::HumanString(gather_indices_shape).c_str());
+  }
+
+  // We implicitly reshape gather indices of shape P[A,B,C] to P[A,B,C,1] if
+  // index_vector_dim is rank(P).  The bounds of this expanded shape are
+  // stored in expanded_gather_indices_shape.
+
+  if (gather_indices_shape.dimensions_size() <
+          gather_dim_numbers.index_vector_dim() ||
+      gather_dim_numbers.index_vector_dim() < 0) {
+    return InvalidArgument(
+        "Gather index leaf dimension must be within [0, rank(gather_indices) + "
+        "1). rank(gather_indices) is %d and gather index leaf dimension is "
+        "%lld.",
+        gather_indices_shape.dimensions_size(),
+        gather_dim_numbers.index_vector_dim());
+  }
+
+  std::vector<int64> expanded_gather_indices_shape;
+  expanded_gather_indices_shape.reserve(gather_indices_shape.dimensions_size());
+  c_copy(gather_indices_shape.dimensions(),
+         std::back_inserter(expanded_gather_indices_shape));
+  if (expanded_gather_indices_shape.size() ==
+      gather_dim_numbers.index_vector_dim()) {
+    expanded_gather_indices_shape.push_back(1);
+  }
+
+  TF_RETURN_IF_ERROR(ValidateGatherDimensionNumbers(
+      input_shape, expanded_gather_indices_shape, gather_dim_numbers));
+
+  if (window_bounds.size() != input_shape.dimensions_size()) {
+    return InvalidArgument(
+        "Gather op must have one window bound for every input dimension; got: "
+        "len(window_bounds)=%zu, input_shape.rank=%d.",
+        window_bounds.size(), input_shape.dimensions_size());
+  }
+
+  if (window_bounds.size() !=
+      gather_dim_numbers.output_window_dims_size() +
+          gather_dim_numbers.elided_window_dims_size()) {
+    return InvalidArgument(
+        "All components of the window index in a gather op must either be a "
+        "output window index or explicitly elided; got len(window_bounds)=%zu, "
+        "output_window_bounds=%s, elided_window_bounds=%s.",
+        window_bounds.size(),
+        Join(gather_dim_numbers.output_window_dims(), ",").c_str(),
+        Join(gather_dim_numbers.elided_window_dims(), ",").c_str());
+  }
+
+  for (int i = 0; i < window_bounds.size(); i++) {
+    int64 window_bound = window_bounds[i];
+    int64 corresponding_input_bound = input_shape.dimensions(i);
+    if (window_bound < 0 || window_bound > corresponding_input_bound) {
+      return InvalidArgument(
+          "Window bound at index %d in gather op is out of range, must be "
+          "within "
+          "[0, %lld), got %lld.",
+          i, corresponding_input_bound + 1, window_bound);
+    }
+  }
+
+  for (int i = 0; i < gather_dim_numbers.elided_window_dims_size(); i++) {
+    if (window_bounds[gather_dim_numbers.elided_window_dims(i)] != 1) {
+      return InvalidArgument(
+          "Gather op can only elide window indices with bound 1, but bound is "
+          "%lld for index %lld at position %d.",
+          window_bounds[gather_dim_numbers.elided_window_dims(i)],
+          gather_dim_numbers.elided_window_dims(i), i);
+    }
+  }
+
+  int64 result_rank = gather_dim_numbers.output_window_dims_size() +
+                      (expanded_gather_indices_shape.size() - 1);
+  int64 window_dims_seen = 0;
+  int64 gather_dims_seen = 0;
+  std::vector<int64> output_dim_bounds;
+  output_dim_bounds.reserve(result_rank);
+  for (int64 i = 0; i < result_rank; i++) {
+    int64 current_bound;
+    bool is_window_index =
+        c_binary_search(gather_dim_numbers.output_window_dims(), i);
+    if (is_window_index) {
+      while (c_binary_search(gather_dim_numbers.elided_window_dims(),
+                             window_dims_seen)) {
+        window_dims_seen++;
+      }
+      current_bound = window_bounds[window_dims_seen++];
+    } else {
+      if (gather_dims_seen == gather_dim_numbers.index_vector_dim()) {
+        gather_dims_seen++;
+      }
+      current_bound = expanded_gather_indices_shape[gather_dims_seen++];
+    }
+
+    output_dim_bounds.push_back(current_bound);
+  }
+
+  return ShapeUtil::MakeShape(input_shape.element_type(), output_dim_bounds);
+}
+
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/service/shape_inference.h b/tensorflow/compiler/xla/service/shape_inference.h
index b39151ebbc19f5d0b702a80da5069f58c8dfb07d..0d3045213db2230da3e18ffcb1a9923250560b64 100644
--- a/tensorflow/compiler/xla/service/shape_inference.h
+++ b/tensorflow/compiler/xla/service/shape_inference.h
@@ -37,6 +37,11 @@ namespace xla {
 // the expected result type for computations that are built up via the API --
 // the shape that results from an operation is inferred. Some methods have
 // overloads for inferring shape at the HLO level.
+//
+// TODO(b/73352135): Shape inference does not issue very good error messages, in
+// part because HloInstruction::ToString() is not available since shape
+// inference runs before the HloInstruction object is created. We need a
+// solution for this.
 class ShapeInference {
  public:
   // Infers the shape produced by applying the given unary operation to the
@@ -248,6 +253,14 @@ class ShapeInference {
       const Shape& lhs, const Shape& rhs,
       const DotDimensionNumbers& dimension_numbers);
 
+  // Helper that infers the shape of the tensor produced by a gather operation
+  // with the given input shape, gather indices shape, gather dimension
+  // numbers and window bounds.
+  static StatusOr<Shape> InferGatherShape(
+      const Shape& input_shape, const Shape& gather_indices_shape,
+      const GatherDimensionNumbers& gather_dim_numbers,
+      tensorflow::gtl::ArraySlice<int64> window_bounds);
+
  private:
   // Helper that infers the shape produced by performing an element-wise binary
   // operation with the given LHS and RHS shapes.
diff --git a/tensorflow/compiler/xla/service/shape_inference_test.cc b/tensorflow/compiler/xla/service/shape_inference_test.cc
index 026c021165785bd3945d6a846dae446ad45da9b7..0e61994a786b53a295ef9c9c2287b28fbf754d9b 100644
--- a/tensorflow/compiler/xla/service/shape_inference_test.cc
+++ b/tensorflow/compiler/xla/service/shape_inference_test.cc
@@ -18,15 +18,16 @@ limitations under the License.
 #include <string>
 
 #include "tensorflow/compiler/xla/shape_util.h"
-#include "tensorflow/compiler/xla/xla_data.pb.h"
-
 #include "tensorflow/compiler/xla/test.h"
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/types.h"
+#include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/gtl/array_slice.h"
 
 namespace xla {
 namespace {
 
+using ::tensorflow::gtl::ArraySlice;
 using ::testing::ContainsRegex;
 using ::testing::HasSubstr;
 
@@ -134,7 +135,7 @@ TEST_F(ShapeInferenceTest, SelectBadShapes) {
       TernaryOperation::TRIOP_SELECT, pred_, matrix_64_48_, matrix_32_64_);
   ASSERT_FALSE(inferred_status_error1.ok());
   ASSERT_THAT(inferred_status_error1.status().error_message(),
-              HasSubstr("operands to select must be the same shape"));
+              HasSubstr("Operands to select must be the same shape"));
 
   auto inferred_status_error2 = ShapeInference::InferTernaryOpShape(
       TernaryOperation::TRIOP_SELECT, s32_, matrix_64_48_, matrix_64_48_);
@@ -339,7 +340,7 @@ TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSourceShape) {
       init_value_shape_, scatter_program_shape_);
   ASSERT_FALSE(inferred_status_fail.ok());
   ASSERT_THAT(inferred_status_fail.status().error_message(),
-              HasSubstr("source shape does not match"));
+              HasSubstr("Source shape does not match"));
 }
 
 TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape1) {
@@ -350,7 +351,7 @@ TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape1) {
       init_value_shape_, scatter_program_shape_);
   ASSERT_FALSE(inferred_status_fail.ok());
   ASSERT_THAT(inferred_status_fail.status().error_message(),
-              HasSubstr("select function must take 2 parameters"));
+              HasSubstr("Select function must take 2 parameters"));
 }
 
 TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape2) {
@@ -361,7 +362,7 @@ TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape2) {
       init_value_shape_, scatter_program_shape_);
   ASSERT_FALSE(inferred_status_fail.ok());
   ASSERT_THAT(inferred_status_fail.status().error_message(),
-              HasSubstr("select function must have rank-0 PRED"));
+              HasSubstr("Select function must have rank-0 PRED"));
 }
 
 TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape3) {
@@ -372,7 +373,7 @@ TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape3) {
       init_value_shape_, scatter_program_shape_);
   ASSERT_FALSE(inferred_status_fail.ok());
   ASSERT_THAT(inferred_status_fail.status().error_message(),
-              HasSubstr("select function's first parameter"));
+              HasSubstr("Select function's first parameter"));
 }
 
 TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape4) {
@@ -383,7 +384,7 @@ TEST_F(SelectAndScatterShapeInferenceTest, SelectAndScatterWrongSelectShape4) {
       init_value_shape_, scatter_program_shape_);
   ASSERT_FALSE(inferred_status_fail.ok());
   ASSERT_THAT(inferred_status_fail.status().error_message(),
-              HasSubstr("select function's second parameter"));
+              HasSubstr("Select function's second parameter"));
 }
 
 TEST_F(ShapeInferenceTest, Convolve) {
@@ -905,7 +906,7 @@ TEST_F(ShapeInferenceTest, ScalarDotVector) {
       ShapeInference::InferDotOpShape(f32_, vector_32_, dot_dnums);
   ASSERT_FALSE(inferred_status.ok());
   ASSERT_THAT(inferred_status.status().error_message(),
-              HasSubstr("dot only supports rank"));
+              HasSubstr("Dot only supports rank"));
 }
 
 // 3D <dot> 2D: error
@@ -917,7 +918,7 @@ TEST_F(ShapeInferenceTest, DotWithRankHigherThanTwo) {
       ShapeUtil::MakeShape(F32, {32, 32, 32}), matrix_32_64_, dot_dnums);
   ASSERT_FALSE(inferred_status.ok());
   ASSERT_THAT(inferred_status.status().error_message(),
-              HasSubstr("batch and contracting dimension number mismatch"));
+              HasSubstr("Batch and contracting dimension number mismatch"));
 }
 
 // vector <dot> vector -> scalar
@@ -1023,7 +1024,7 @@ TEST_F(ShapeInferenceTest, DotWithTwoContractingDimsFails) {
       ShapeInference::InferDotOpShape(lhs_shape, rhs_shape, dot_dnums);
   ASSERT_FALSE(inferred_status.ok());
   ASSERT_THAT(inferred_status.status().error_message(),
-              HasSubstr("must specify one contracting dimension for both "
+              HasSubstr("Must specify one contracting dimension for both "
                         "lhs and rhs"));
 }
 
@@ -1043,7 +1044,7 @@ TEST_F(ShapeInferenceTest, DotWithMisatchedBatchDimSizesFails) {
       ShapeInference::InferDotOpShape(lhs_shape, rhs_shape, dot_dnums);
   ASSERT_FALSE(inferred_status.ok());
   ASSERT_THAT(inferred_status.status().error_message(),
-              HasSubstr("batch dimension numbers and sizes must match"));
+              HasSubstr("Batch dimension numbers and sizes must match"));
 }
 
 // BatchMatMul with different batch dimension numbers fails.
@@ -1062,7 +1063,7 @@ TEST_F(ShapeInferenceTest, DotWithMisatchedBatchDimNumbersFails) {
       ShapeInference::InferDotOpShape(lhs_shape, rhs_shape, dot_dnums);
   ASSERT_FALSE(inferred_status.ok());
   ASSERT_THAT(inferred_status.status().error_message(),
-              HasSubstr("batch dimension numbers must precede non-batch"));
+              HasSubstr("Batch dimension numbers must precede non-batch"));
 }
 
 // BatchMatMul with out-of-range dimension numbers fails.
@@ -1165,42 +1166,42 @@ TEST_F(ShapeInferenceTest, BinOpBroadcastBadDimension) {
       BinaryOperation::BINOP_ADD, tensor, vec8, {});
   ASSERT_FALSE(inferred_status_error1.ok());
   ASSERT_THAT(inferred_status_error1.status().error_message(),
-              HasSubstr("automatic"));
+              HasSubstr("Automatic"));
 
   // broadcast_dimension out of bounds for tensor's rank
   auto inferred_status_error2 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor, vec8, {3});
   ASSERT_FALSE(inferred_status_error2.ok());
   ASSERT_THAT(inferred_status_error2.status().error_message(),
-              ContainsRegex("broadcast dimension number .* too large"));
+              ContainsRegex("Broadcast dimension number .* too large"));
 
   // broadcast_dimension doesn't match corresponding dimension
   auto inferred_status_error3 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor, vec8, {0});
   ASSERT_FALSE(inferred_status_error3.ok());
   ASSERT_THAT(inferred_status_error3.status().error_message(),
-              HasSubstr("broadcast dimension 0 mismatch"));
+              HasSubstr("Broadcast dimension 0 mismatch"));
 
   // broadcast_dimensions list too long
   auto inferred_status_error4 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor, matrix8_4, {0, 1, 2});
   ASSERT_FALSE(inferred_status_error4.ok());
   ASSERT_THAT(inferred_status_error4.status().error_message(),
-              HasSubstr("size of broadcast_dimensions has to match"));
+              HasSubstr("broadcast_dimensions has to match"));
 
   // there's a dimension above the rank of the tensor
   auto inferred_status_error5 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor, matrix8_4, {3, 0});
   ASSERT_FALSE(inferred_status_error5.ok());
   ASSERT_THAT(inferred_status_error5.status().error_message(),
-              ContainsRegex("broadcast dimension number .* too large"));
+              ContainsRegex("dimension number .* too large"));
 
   // broadcasting dimensions don't match in this order
   auto inferred_status_error6 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor, matrix8_4, {2, 1});
   ASSERT_FALSE(inferred_status_error6.ok());
   ASSERT_THAT(inferred_status_error6.status().error_message(),
-              HasSubstr("broadcast dimension 0 mismatch"));
+              HasSubstr("dimension 0 mismatch"));
 
   // The following two tests make sure that broadcasting dimensions are listed
   // in a proper (strictly increasing) order, even if the lower-rank array
@@ -1209,13 +1210,13 @@ TEST_F(ShapeInferenceTest, BinOpBroadcastBadDimension) {
       BinaryOperation::BINOP_ADD, tensor8_8_8, matrix8_8, {0, 0});
   ASSERT_FALSE(inferred_status_error7.ok());
   ASSERT_THAT(inferred_status_error7.status().error_message(),
-              HasSubstr("broadcast dimensions order is wrong"));
+              HasSubstr("dimensions order is wrong"));
 
   auto inferred_status_error8 = ShapeInference::InferBinaryOpShape(
       BinaryOperation::BINOP_ADD, tensor8_8_8, matrix8_8, {1, 0});
   ASSERT_FALSE(inferred_status_error8.ok());
   ASSERT_THAT(inferred_status_error8.status().error_message(),
-              HasSubstr("broadcast dimensions order is wrong"));
+              HasSubstr("dimensions order is wrong"));
 }
 
 // Tests for the while instruction with proper shapes.
@@ -1241,7 +1242,7 @@ TEST_F(ShapeInferenceTest, WhileWithBadShapes) {
       ShapeInference::InferWhileShape(bad_shape_1, body, result_shape);
   ASSERT_FALSE(inferred_status_error1.ok());
   ASSERT_THAT(inferred_status_error1.status().error_message(),
-              HasSubstr("condition must take 1 arguments"));
+              HasSubstr("Condition must take 1 arguments"));
 
   auto bad_shape_2 =
       ShapeUtil::MakeProgramShape({s32_, result_shape}, result_shape);
@@ -1249,14 +1250,14 @@ TEST_F(ShapeInferenceTest, WhileWithBadShapes) {
       ShapeInference::InferWhileShape(cond, bad_shape_2, result_shape);
   ASSERT_FALSE(inferred_status_error2.ok());
   ASSERT_THAT(inferred_status_error2.status().error_message(),
-              HasSubstr("body must take 1 arguments"));
+              HasSubstr("Body must take 1 arguments"));
 
   auto bad_shape_3 = ShapeUtil::MakeProgramShape({result_shape}, s32_);
   auto inferred_status_error3 =
       ShapeInference::InferWhileShape(bad_shape_3, body, result_shape);
   ASSERT_FALSE(inferred_status_error3.ok());
   ASSERT_THAT(inferred_status_error3.status().error_message(),
-              HasSubstr("condition must return a boolean"));
+              HasSubstr("Condition must return a boolean"));
 
   auto bad_shape_4 = ShapeUtil::MakeProgramShape({result_shape}, vector_32_);
   auto inferred_status_error4 =
@@ -1300,13 +1301,13 @@ TEST_F(ShapeInferenceTest, ConcatenateWithBadShapes) {
       ShapeInference::InferConcatOpShape({&vector_32_}, /*dimension=*/-1);
   ASSERT_FALSE(inferred_status_error2.ok());
   ASSERT_THAT(inferred_status_error2.status().error_message(),
-              HasSubstr("dimension to concatenate along out of bounds: -1"));
+              HasSubstr("dimension out of bounds: -1"));
 
   auto inferred_status_error3 =
       ShapeInference::InferConcatOpShape({&vector_32_}, /*dimension=*/1);
   ASSERT_FALSE(inferred_status_error3.ok());
   ASSERT_THAT(inferred_status_error3.status().error_message(),
-              HasSubstr("dimension to concatenate along out of bounds: 1"));
+              HasSubstr("dimension out of bounds: 1"));
 
   Shape tuple = ShapeUtil::MakeTupleShape({vector_32_});
   auto inferred_status_error4 = ShapeInference::InferConcatOpShape(
@@ -1314,21 +1315,20 @@ TEST_F(ShapeInferenceTest, ConcatenateWithBadShapes) {
   ASSERT_FALSE(inferred_status_error4.ok());
   ASSERT_THAT(
       inferred_status_error4.status().error_message(),
-      HasSubstr("Expected non-tuple argument for operand of concatenation."));
+      HasSubstr("Expected non-tuple argument for operand of concatenation"));
 
   const Shape vector_s32 = ShapeUtil::MakeShape(S32, {32});
   auto inferred_status_error5 = ShapeInference::InferConcatOpShape(
       {&vector_32_, &vector_s32}, /*dimension=*/0);
   ASSERT_FALSE(inferred_status_error5.ok());
-  ASSERT_THAT(
-      inferred_status_error5.status().error_message(),
-      HasSubstr("cannot concatenate arrays with different element types"));
+  ASSERT_THAT(inferred_status_error5.status().error_message(),
+              HasSubstr("concatenate arrays with different element types"));
 
   auto inferred_status_error6 = ShapeInference::InferConcatOpShape(
       {&matrix_32_48_, &matrix_32_64_}, /*dimension=*/0);
   ASSERT_FALSE(inferred_status_error6.ok());
   ASSERT_THAT(inferred_status_error6.status().error_message(),
-              HasSubstr("cannot concatenate arrays that differ in "
+              HasSubstr("concatenate arrays that differ in "
                         "dimensions other than the one being "
                         "concatenated"));
 }
@@ -1466,7 +1466,7 @@ TEST_F(ShapeInferenceTest, Conditional) {
       ShapeUtil::MakeProgramShape({vector_64_}, f32_));
   EXPECT_FALSE(inferred_status_error0.ok());
   EXPECT_THAT(inferred_status_error0.status().error_message(),
-              HasSubstr("predicate must be a boolean"));
+              HasSubstr("Predicate must be a boolean"));
 
   auto inferred_status_error1 = ShapeInference::InferConditionalShape(
       pred_, ShapeUtil::MakeTupleShape({f32_, vector_32_}), matrix_32_48_,
@@ -1527,5 +1527,458 @@ TEST_F(ShapeInferenceTest, BadSlice) {
       << statusor.status();
 }
 
+class GatherShapeInferenceTest : public ShapeInferenceTest {
+ protected:
+  const Shape s64_scalar_ = ShapeUtil::MakeShape(S64, {});
+  const Shape s64_vector_5_ = ShapeUtil::MakeShape(S64, {5});
+  const Shape s64_vector_32_ = ShapeUtil::MakeShape(S64, {32});
+  const Shape s64_4d_tensor_10_9_8_7_1_ =
+      ShapeUtil::MakeShape(S64, {10, 9, 8, 7, 1});
+  const Shape s64_4d_tensor_10_9_8_7_5_ =
+      ShapeUtil::MakeShape(S64, {10, 9, 8, 7, 5});
+  const Shape s64_4d_tensor_5_10_9_7_6_ =
+      ShapeUtil::MakeShape(S64, {5, 10, 9, 7, 6});
+  const Shape s64_4d_tensor_10_9_5_7_6_ =
+      ShapeUtil::MakeShape(S64, {10, 9, 5, 7, 6});
+  const Shape f32_5d_tensor_50_49_48_47_46_ =
+      ShapeUtil::MakeShape(F32, {50, 49, 48, 47, 46});
+  const Shape tuple_shape_ = ShapeUtil::MakeTupleShape(
+      {s64_4d_tensor_10_9_8_7_1_, s64_4d_tensor_10_9_8_7_1_});
+};
+
+TEST_F(GatherShapeInferenceTest, TensorFlowGather) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape gather_shape,
+      ShapeInference::InferGatherShape(matrix_64_48_, s64_vector_32_,
+                                       HloInstruction::MakeGatherDimNumbers(
+                                           /*output_window_dims=*/{0},
+                                           /*elided_window_dims=*/{1},
+                                           /*gather_dims_to_operand_dims=*/{1},
+                                           /*index_vector_dim=*/1),
+                                       /*window_bounds=*/{64, 1}));
+  EXPECT_TRUE(
+      ShapeUtil::Equal(gather_shape, ShapeUtil::MakeShape(F32, {64, 32})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
+TEST_F(GatherShapeInferenceTest, TensorFlowGatherV2) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape gather_shape,
+      ShapeInference::InferGatherShape(matrix_64_48_, s64_vector_32_,
+                                       HloInstruction::MakeGatherDimNumbers(
+                                           /*output_window_dims=*/{1},
+                                           /*elided_window_dims=*/{0},
+                                           /*gather_dims_to_operand_dims=*/{0},
+                                           /*index_vector_dim=*/1),
+                                       /*window_bounds=*/{1, 48}));
+  EXPECT_TRUE(
+      ShapeUtil::Equal(gather_shape, ShapeUtil::MakeShape(F32, {32, 48})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
+TEST_F(GatherShapeInferenceTest, TensorFlowGatherNd) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape gather_shape,
+      ShapeInference::InferGatherShape(matrix_64_48_, s64_4d_tensor_10_9_8_7_1_,
+                                       HloInstruction::MakeGatherDimNumbers(
+                                           /*output_window_dims=*/{4},
+                                           /*elided_window_dims=*/{0},
+                                           /*gather_dims_to_operand_dims=*/{0},
+                                           /*index_vector_dim=*/4),
+                                       /*window_bounds=*/{1, 48}));
+  EXPECT_TRUE(ShapeUtil::Equal(gather_shape,
+                               ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 48})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
+TEST_F(GatherShapeInferenceTest, TensorFlowBatchDynamicSlice) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape gather_shape,
+      ShapeInference::InferGatherShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+          HloInstruction::MakeGatherDimNumbers(
+              /*output_window_dims=*/{4, 5, 6, 7, 8},
+              /*elided_window_dims=*/{},
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/4),
+          /*window_bounds=*/{30, 29, 28, 27, 26}));
+  EXPECT_TRUE(ShapeUtil::Equal(
+      gather_shape,
+      ShapeUtil::MakeShape(F32, {10, 9, 8, 7, 30, 29, 28, 27, 26})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
+TEST_F(GatherShapeInferenceTest, NonDefaultGatherIndicesLeafDim_A) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape gather_shape,
+      ShapeInference::InferGatherShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_5_7_6_,
+          HloInstruction::MakeGatherDimNumbers(
+              /*output_window_dims=*/{4, 5, 6, 7, 8},
+              /*elided_window_dims=*/{},
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/2),
+          /*window_bounds=*/{30, 29, 28, 27, 26}));
+
+  EXPECT_TRUE(ShapeUtil::Equal(
+      gather_shape,
+      ShapeUtil::MakeShape(F32, {10, 9, 7, 6, 30, 29, 28, 27, 26})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
+TEST_F(GatherShapeInferenceTest, NonDefaultGatherIndicesLeafDim_B) {
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape gather_shape,
+      ShapeInference::InferGatherShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_5_10_9_7_6_,
+          HloInstruction::MakeGatherDimNumbers(
+              /*output_window_dims=*/{4, 5, 6, 7, 8},
+              /*elided_window_dims=*/{},
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/0),
+          /*window_bounds=*/{30, 29, 28, 27, 26}));
+
+  EXPECT_TRUE(ShapeUtil::Equal(
+      gather_shape,
+      ShapeUtil::MakeShape(F32, {10, 9, 7, 6, 30, 29, 28, 27, 26})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
+TEST_F(GatherShapeInferenceTest, NoOutputGatherDims) {
+  // This is equivalent to a dynamic slice.
+  TF_ASSERT_OK_AND_ASSIGN(
+      Shape gather_shape,
+      ShapeInference::InferGatherShape(
+          f32_5d_tensor_50_49_48_47_46_, s64_vector_5_,
+          HloInstruction::MakeGatherDimNumbers(
+              /*output_window_dims=*/{0, 1, 2, 3, 4},
+              /*elided_window_dims=*/{},
+              /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+              /*index_vector_dim=*/0),
+          /*window_bounds=*/{30, 29, 28, 27, 26}));
+
+  EXPECT_TRUE(ShapeUtil::Equal(gather_shape,
+                               ShapeUtil::MakeShape(F32, {30, 29, 28, 27, 26})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
+TEST_F(GatherShapeInferenceTest, ScalarGatherIndices) {
+  // The gather indices "tensor" is a scalar S here; it selects the window with
+  // bounds [1,30,29,28,27] at offset [S,0,0,0,0], giving shape [30,29,28,27].
+  TF_ASSERT_OK_AND_ASSIGN(Shape gather_shape,
+                          ShapeInference::InferGatherShape(
+                              f32_5d_tensor_50_49_48_47_46_, s64_scalar_,
+                              HloInstruction::MakeGatherDimNumbers(
+                                  /*output_window_dims=*/{0, 1, 2, 3},
+                                  /*elided_window_dims=*/{0},
+                                  /*gather_dims_to_operand_dims=*/{0},
+                                  /*index_vector_dim=*/0),
+                              /*window_bounds=*/{1, 30, 29, 28, 27}));
+
+  EXPECT_TRUE(ShapeUtil::Equal(gather_shape,
+                               ShapeUtil::MakeShape(F32, {30, 29, 28, 27})))
+      << ShapeUtil::HumanString(gather_shape);
+}
+
+TEST_F(GatherShapeInferenceTest, TupleShapedTensorInput) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      tuple_shape_, s64_vector_32_,
+      HloInstruction::MakeGatherDimNumbers(/*output_window_dims=*/{0},
+                                           /*elided_window_dims=*/{1},
+                                           /*gather_dims_to_operand_dims=*/{1},
+                                           /*index_vector_dim=*/1),
+      /*window_bounds=*/{64, 1});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Expected non-tuple argument for input"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest, TupleShapedGatherIndicesInput) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      s64_vector_32_, tuple_shape_,
+      HloInstruction::MakeGatherDimNumbers(/*output_window_dims=*/{0},
+                                           /*elided_window_dims=*/{1},
+                                           /*gather_dims_to_operand_dims=*/{1},
+                                           /*index_vector_dim=*/0),
+      /*window_bounds=*/{64, 1});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Expected non-tuple argument for gather indices"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest, FloatingPointGatherIndicesInput) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      s64_vector_32_, vector_32_,
+      HloInstruction::MakeGatherDimNumbers(/*output_window_dims=*/{0},
+                                           /*elided_window_dims=*/{1},
+                                           /*gather_dims_to_operand_dims=*/{1},
+                                           /*index_vector_dim=*/0),
+      /*window_bounds=*/{64, 1});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Gather indices parameter must be an integral tensor"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_NonAscendingWindowIndices) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 8, 7},
+          /*elided_window_dims=*/{},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr("Output window dimensions in gather op must be ascending"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_RepeatedWindowIndices) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 7},
+          /*elided_window_dims=*/{},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr("Output window dimensions in gather op must not repeat"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_WindowIndexOutOfBounds) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 99, 100, 101},
+          /*elided_window_dims=*/{},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Window index 2 in gather op is out of bounds"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_WindowIndexBarelyOutOfBounds) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 9},
+          /*elided_window_dims=*/{},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Window index 4 in gather op is out of bounds"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_MismatchingElidedWindowDims) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 8},
+          /*elided_window_dims=*/{4},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr("All components of the window index in a gather op must either "
+                "be a output window index or explicitly elided"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_OutOfBoundsWindowToInputMapping) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 8},
+          /*elided_window_dims=*/{0, 1, 2, 3, 19},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Invalid elided_window_dims set in gather op; valid "
+                        "range is [0, 5), got: 19"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_RepeatedWindowToInputMapping) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 8},
+          /*elided_window_dims=*/{0, 1, 2, 3, 3},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr(
+          "Repeated dimensions not allowed in elided_window_dims in gather op"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_MismatchingGatherToInputMapping) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 8},
+          /*elided_window_dims=*/{},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr("Gather op has 4 elements in gather_dims_to_operand_dims and "
+                "the bound of dimension index_vector_dim=4 of "
+                "gather_indices is 5. These two numbers must be equal."))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_OutOfBoundsGatherToInputMapping) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 8},
+          /*elided_window_dims=*/{},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 7},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr("Invalid gather_dims_to_operand_dims mapping; domain is "
+                "[0, 5), got: 4->7"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_RepeatedGatherToInputMapping) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 8},
+          /*elided_window_dims=*/{},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 3},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr(
+          "Repeated dimensions are not allowed in gather_dims_to_operand_dims"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_NonAscendingElidedWindowDims) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 8},
+          /*elided_window_dims=*/{2, 1},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{1, 1, 28, 27, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("elided_window_dims in gather op must be sorted"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest, InvalidGatherDimNumbers_WindowBoundsTooLarge) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7},
+          /*elided_window_dims=*/{2},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 1, 300, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Window bound at index 3 in gather op is out of range, "
+                        "must be within [0, 48), got 300"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_MismatchingNumberOfWindowBounds) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 8},
+          /*elided_window_dims=*/{},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 26});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(
+      statusor.status().error_message(),
+      HasSubstr(
+          "Gather op must have one window bound for every input dimension"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest,
+       InvalidGatherDimNumbers_WindowBoundsNot1ForElidedDim) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_8_7_5_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7},
+          /*elided_window_dims=*/{1},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/4),
+      /*window_bounds=*/{30, 29, 28, 26, 20});
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Gather op can only elide window indices with bound 1, "
+                        "but bound is 29 for index 1 at position 0"))
+      << statusor.status();
+}
+
+TEST_F(GatherShapeInferenceTest, OutOfBoundsGatherIndicesLeafDim) {
+  StatusOr<Shape> statusor = ShapeInference::InferGatherShape(
+      f32_5d_tensor_50_49_48_47_46_, s64_4d_tensor_10_9_5_7_6_,
+      HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/{4, 5, 6, 7, 8},
+          /*elided_window_dims=*/{},
+          /*gather_dims_to_operand_dims=*/{0, 1, 2, 3, 4},
+          /*index_vector_dim=*/32),
+      /*window_bounds=*/{30, 29, 28, 27, 26});
+
+  ASSERT_FALSE(statusor.ok());
+  EXPECT_THAT(statusor.status().error_message(),
+              HasSubstr("Gather index leaf dimension must be within [0, "
+                        "rank(gather_indices) + 1)"))
+      << statusor.status();
+}
+
 }  // namespace
 }  // namespace xla
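Note on the gather tests above: the expected result shapes appear to follow one rule, which the sketch below reproduces. It is inferred only from the test expectations in this patch and is not the code in `shape_inference.cc`. The helper name `SketchGatherDims` is hypothetical, and all of the validation exercised by the `InvalidGatherDimNumbers_*` tests is assumed to have already passed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of XLA): computes the dimensions of a gather
// result from arguments that are assumed to be already validated.
std::vector<int64_t> SketchGatherDims(
    const std::vector<int64_t>& gather_indices_dims,
    const std::vector<int64_t>& output_window_dims,
    const std::vector<int64_t>& elided_window_dims,
    int64_t index_vector_dim,
    const std::vector<int64_t>& window_bounds) {
  const int64_t indices_rank =
      static_cast<int64_t>(gather_indices_dims.size());
  assert(index_vector_dim >= 0 && index_vector_dim <= indices_rank);

  // Every gather_indices dimension except index_vector_dim contributes one
  // output dimension. When index_vector_dim == rank, nothing is dropped.
  std::vector<int64_t> gather_dims;
  for (int64_t i = 0; i < indices_rank; ++i) {
    if (i != index_vector_dim) gather_dims.push_back(gather_indices_dims[i]);
  }

  // Window bounds of the operand dimensions that are not elided.
  std::vector<int64_t> window_dims;
  for (int64_t i = 0; i < static_cast<int64_t>(window_bounds.size()); ++i) {
    const bool elided =
        std::find(elided_window_dims.begin(), elided_window_dims.end(), i) !=
        elided_window_dims.end();
    if (!elided) window_dims.push_back(window_bounds[i]);
  }
  assert(window_dims.size() == output_window_dims.size());

  // Output positions listed in output_window_dims take the next window
  // bound; all other positions take the next gather dimension, in order.
  std::vector<int64_t> result;
  size_t next_window = 0;
  size_t next_gather = 0;
  const int64_t out_rank =
      static_cast<int64_t>(window_dims.size() + gather_dims.size());
  for (int64_t i = 0; i < out_rank; ++i) {
    const bool is_window =
        std::find(output_window_dims.begin(), output_window_dims.end(), i) !=
        output_window_dims.end();
    // at() throws instead of reading out of range on unvalidated input.
    result.push_back(is_window ? window_dims.at(next_window++)
                               : gather_dims.at(next_gather++));
  }
  return result;
}
```

For the `TensorFlowGather` case, `SketchGatherDims({32}, {0}, {1}, 1, {64, 1})` yields `{64, 32}`, matching the shape that test expects.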
diff --git a/tensorflow/compiler/xla/service/user_computation.cc b/tensorflow/compiler/xla/service/user_computation.cc
index fead9b92362bcd1974f2dff6e030bc47dfc5aa85..06735e9442942f3c69d1cd679857fe22f2fa6756 100644
--- a/tensorflow/compiler/xla/service/user_computation.cc
+++ b/tensorflow/compiler/xla/service/user_computation.cc
@@ -226,7 +226,8 @@ StatusOr<ComputationDataHandle> UserComputation::AddParameterInstruction(
   return handle;
 }
 
-Status UserComputation::AddSendInstruction(const SendRequest& send_request) {
+StatusOr<ComputationDataHandle> UserComputation::AddSendInstruction(
+    const SendRequest& send_request) {
   tensorflow::mutex_lock lock(mutex_);
 
   // Check if the operand of the instruction is valid.
@@ -244,7 +245,7 @@ Status UserComputation::AddSendInstruction(const SendRequest& send_request) {
   VLOG(1) << "AddSendInstruction (" << GetVersionedHandleInternal()
           << "), data handle " << handle.handle() << ": "
           << send_request.ShortDebugString();
-  return Status::OK();
+  return handle;
 }
 
 StatusOr<ComputationDataHandle> UserComputation::AddRecvInstruction(
@@ -315,6 +316,36 @@ StatusOr<ComputationDataHandle> UserComputation::AddConstantInstruction(
   return handle;
 }
 
+StatusOr<ComputationDataHandle> UserComputation::AddGatherInstruction(
+    const GatherRequest& gather_request) {
+  tensorflow::mutex_lock lock(mutex_);
+
+  TF_ASSIGN_OR_RETURN(const OperationRequest* input_request,
+                      LookUpRequest(gather_request.input()));
+  TF_ASSIGN_OR_RETURN(const OperationRequest* gather_indices_request,
+                      LookUpRequest(gather_request.gather_indices()));
+
+  TF_ASSIGN_OR_RETURN(
+      Shape shape,
+      ShapeInference::InferGatherShape(
+          input_request->output_shape(), gather_indices_request->output_shape(),
+          gather_request.dimension_numbers(),
+          AsInt64Slice(gather_request.window_bounds())));
+
+  const ComputationDataHandle handle = CreateComputationDataHandle();
+
+  OperationRequest& request =
+      (*session_computation_.mutable_requests())[handle.handle()];
+  *request.mutable_output_handle() = handle;
+  *request.mutable_output_shape() = shape;
+  *request.mutable_request()->mutable_gather_request() = gather_request;
+
+  VLOG(1) << "AddGatherInstruction (" << GetVersionedHandleInternal()
+          << "), data handle " << handle.handle() << ": "
+          << gather_request.ShortDebugString();
+  return handle;
+}
+
 StatusOr<ComputationDataHandle> UserComputation::AddGetTupleElementInstruction(
     const GetTupleElementRequest& get_tuple_element_request) {
   tensorflow::mutex_lock lock(mutex_);
@@ -1276,6 +1307,28 @@ StatusOr<ComputationDataHandle> UserComputation::AddCustomCallInstruction(
   return handle;
 }
 
+StatusOr<ComputationDataHandle> UserComputation::AddHostComputeInstruction(
+    const HostComputeRequest& host_compute_request) {
+  tensorflow::mutex_lock lock(mutex_);
+
+  for (const ComputationDataHandle& handle : host_compute_request.operands()) {
+    TF_RETURN_IF_ERROR(LookUpRequest(handle).status());
+  }
+
+  ComputationDataHandle handle = CreateComputationDataHandle();
+  OperationRequest& request =
+      (*session_computation_.mutable_requests())[handle.handle()];
+  *request.mutable_output_handle() = handle;
+  *request.mutable_output_shape() = host_compute_request.shape();
+  *request.mutable_request()->mutable_host_compute_request() =
+      host_compute_request;
+
+  VLOG(1) << "AddHostComputeInstruction (" << GetVersionedHandleInternal()
+          << "), data handle " << handle.handle() << ": "
+          << host_compute_request.ShortDebugString();
+  return handle;
+}
+
 StatusOr<ComputationDataHandle> UserComputation::AddDotInstruction(
     const DotRequest& dot_request) {
   tensorflow::mutex_lock lock(mutex_);
@@ -1713,6 +1766,11 @@ void PureFunctionalVisitor(const SessionComputation& session_computation,
       break;
     }
 
+    case OpRequest::kHostComputeRequest: {
+      *is_functional = false;
+      break;
+    }
+
     case OpRequest::kCallRequest: {
       const CallRequest& call_request = request.request().call_request();
       for (const ComputationDataHandle& handle : call_request.operands()) {
@@ -1991,6 +2049,16 @@ void PureFunctionalVisitor(const SessionComputation& session_computation,
       break;
     }
 
+    case OpRequest::kGatherRequest: {
+      PureFunctionalVisitor(session_computation,
+                            request.request().gather_request().input(),
+                            num_parameters, visited, is_functional);
+      PureFunctionalVisitor(session_computation,
+                            request.request().gather_request().gather_indices(),
+                            num_parameters, visited, is_functional);
+      break;
+    }
+
     case OpRequest::OP_NOT_SET:
       LOG(FATAL) << "OperationRequest doesn't contain a request";
 
@@ -2643,6 +2711,15 @@ static void ForEachOperand(
       break;
     }
 
+    case OpRequest::kHostComputeRequest: {
+      const HostComputeRequest& hc_request =
+          request.request().host_compute_request();
+      for (const ComputationDataHandle& operand : hc_request.operands()) {
+        apply(operand);
+      }
+      break;
+    }
+
     case OpRequest::kDotRequest: {
       const DotRequest& dot_request = request.request().dot_request();
       apply(dot_request.rhs());
@@ -2684,6 +2761,13 @@ static void ForEachOperand(
       break;
     }
 
+    case OpRequest::kGatherRequest: {
+      const GatherRequest& gather_request = request.request().gather_request();
+      apply(gather_request.input());
+      apply(gather_request.gather_indices());
+      break;
+    }
+
     case OpRequest::OP_NOT_SET:
       LOG(FATAL) << "OperationRequest doesn't contain a request";
 
@@ -3299,6 +3383,22 @@ void ComputationLowerer::Visit(
       break;
     }
 
+    case OpRequest::kHostComputeRequest: {
+      const HostComputeRequest& host_compute_request =
+          request.request().host_compute_request();
+      std::vector<HloInstruction*> operands;
+      for (const ComputationDataHandle& operand :
+           host_compute_request.operands()) {
+        operands.push_back(lookup_instruction(operand));
+      }
+      auto output_shape = host_compute_request.shape();
+      auto channel_name = host_compute_request.channel_name();
+      auto cost_estimate_ns = host_compute_request.cost_estimate_ns();
+      hlo_instruction = add_instruction(HloInstruction::CreateHostCompute(
+          output_shape, operands, channel_name, cost_estimate_ns));
+      break;
+    }
+
     case OpRequest::kUnaryOpRequest: {
       const UnaryOpRequest& unary_op_request =
           request.request().unary_op_request();
@@ -3401,6 +3501,20 @@ void ComputationLowerer::Visit(
       break;
     }
 
+    case OpRequest::kGatherRequest: {
+      const GatherRequest& gather_request = request.request().gather_request();
+      HloInstruction* input_operand =
+          lookup_instruction(gather_request.input());
+      HloInstruction* gather_indices_operand =
+          lookup_instruction(gather_request.gather_indices());
+      std::vector<int64> window_bounds;
+      c_copy(gather_request.window_bounds(), std::back_inserter(window_bounds));
+      hlo_instruction = add_instruction(HloInstruction::CreateGather(
+          request.output_shape(), input_operand, gather_indices_operand,
+          gather_request.dimension_numbers(), window_bounds));
+      break;
+    }
+
     case OpRequest::OP_NOT_SET:
       LOG(FATAL) << "OperationRequest doesn't contain a request";
 
diff --git a/tensorflow/compiler/xla/service/user_computation.h b/tensorflow/compiler/xla/service/user_computation.h
index 54bb24d6d7fe7aa8cc7c684795e40464e4eb6614..5544c868fe905c1ca7e6cab32738440add2e3b4f 100644
--- a/tensorflow/compiler/xla/service/user_computation.h
+++ b/tensorflow/compiler/xla/service/user_computation.h
@@ -149,6 +149,10 @@ class UserComputation {
   StatusOr<ComputationDataHandle> AddOutfeedInstruction(
       const OutfeedRequest& outfeed_request);
 
+  // Enqueues a host compute instruction onto this user computation.
+  StatusOr<ComputationDataHandle> AddHostComputeInstruction(
+      const HostComputeRequest& host_compute_request);
+
   // Enqueues a call instruction onto this user computation.
   StatusOr<ComputationDataHandle> AddCallInstruction(
       const CallRequest& call_request,
@@ -232,12 +236,17 @@ class UserComputation {
       const UserComputation& false_computation);
 
   // Enqueues a Send instruction onto this user computation.
-  Status AddSendInstruction(const SendRequest& send_request);
+  StatusOr<ComputationDataHandle> AddSendInstruction(
+      const SendRequest& send_request);
 
   // Enqueues a Recv instruction onto this user computation.
   StatusOr<ComputationDataHandle> AddRecvInstruction(
       const RecvRequest& recv_request);
 
+  // Enqueues a Gather instruction onto this user computation.
+  StatusOr<ComputationDataHandle> AddGatherInstruction(
+      const GatherRequest& gather_request);
+
   // Returns the user-provided name of this user computation, which is provided
   // via the XLA computation-building API.
   const string& name() const { return name_; }
diff --git a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc
index a5f9b01f011ce04f1114c74391a967c62f015221..3ef0cdff6751258e4489ce350deb0931fdf69ef9 100644
--- a/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc
+++ b/tensorflow/compiler/xla/service/while_loop_invariant_code_motion.cc
@@ -106,20 +106,12 @@ static bool NotWorthHoistingIndividually(const HloInstruction& instruction) {
     case HloOpcode::kBitcast:
     case HloOpcode::kBroadcast:
     case HloOpcode::kConstant:
+    case HloOpcode::kReshape:
     case HloOpcode::kReverse:
     case HloOpcode::kSlice:
+    case HloOpcode::kTranspose:
     case HloOpcode::kTuple:
       return true;
-
-    case HloOpcode::kTranspose:
-      return ShapeUtil::TransposeIsBitcast(
-          /*input_shape=*/instruction.operand(0)->shape(),
-          /*output_shape=*/instruction.shape(), instruction.dimensions());
-
-    case HloOpcode::kReshape:
-      return ShapeUtil::ReshapeIsBitcast(
-          /*input_shape=*/instruction.operand(0)->shape(),
-          /*output_shape=*/instruction.shape());
   }
 }
 
diff --git a/tensorflow/compiler/xla/service/while_loop_simplifier.cc b/tensorflow/compiler/xla/service/while_loop_simplifier.cc
index 981de9b2200a9ae8938db21299580f510834d2f0..c9d77c9376ffa5e992c97e77fbd632e5e62e18cd 100644
--- a/tensorflow/compiler/xla/service/while_loop_simplifier.cc
+++ b/tensorflow/compiler/xla/service/while_loop_simplifier.cc
@@ -212,7 +212,7 @@ static optional<int64> GetLoopTripCount(HloInstruction* while_op) {
   // Now that we know the index of the induction variable, we can we can try to
   // compute how many times the loop executes.  Start by computing the induction
   // variable's initial value.
-  HloEvaluator evaluator;
+  HloEvaluator evaluator(/*max_loop_iterations=*/0);
   auto* while_init = while_op->mutable_operand(0);
   auto* indvar_init = while_init->mutable_operand(*indvar_tuple_idx);
   StatusOr<std::unique_ptr<Literal>> indvar_init_result =
diff --git a/tensorflow/compiler/xla/shape_tree.h b/tensorflow/compiler/xla/shape_tree.h
index 280f02e88675381bd75108bfae0dd22c462ba718..ffaa40c2d673a2365342371ed8dab59565d1d08f 100644
--- a/tensorflow/compiler/xla/shape_tree.h
+++ b/tensorflow/compiler/xla/shape_tree.h
@@ -53,7 +53,7 @@ struct ShapeTreeNode {
   ShapeTreeNode(const ShapeTreeNode& other)
       : data(other.data), children(other.children.size()) {
     for (size_t i = 0; i < children.size(); ++i) {
-      children[i] = MakeUnique<ShapeTreeNode>(*other.children[i]);
+      children[i] = ::xla::MakeUnique<ShapeTreeNode>(*other.children[i]);
     }
   }
 
@@ -62,7 +62,7 @@ struct ShapeTreeNode {
       data = other.data;
       children.resize(other.children.size());
       for (size_t i = 0; i < children.size(); ++i) {
-        children[i] = MakeUnique<ShapeTreeNode>(*other.children[i]);
+        children[i] = ::xla::MakeUnique<ShapeTreeNode>(*other.children[i]);
       }
     }
     return *this;
@@ -445,7 +445,7 @@ class ShapeTreeIterator : public std::iterator<std::forward_iterator_tag,
     for (auto& node_and_index : stack_) {
       index.push_back(node_and_index.second);
     }
-    current_ = MakeUnique<value_type>(index, node_->data);
+    current_ = ::xla::MakeUnique<value_type>(index, node_->data);
     return *current_;
   }
 
@@ -492,7 +492,7 @@ void ShapeTree<T>::InitChildren(const Shape& shape, Node* node) {
 template <typename T>
 ShapeTree<T>::ShapeTree(Shape shape)
     : root_(),
-      shape_storage_(MakeUnique<Shape>(std::move(shape))),
+      shape_storage_(::xla::MakeUnique<Shape>(std::move(shape))),
       shape_(shape_storage_.get()) {
   // The shape_ field is just used to hold the structure of the shape.
   // It should not be relied upon to store layout information.
@@ -508,7 +508,7 @@ ShapeTree<T>::ShapeTree(const Shape* shape) : root_(), shape_(shape) {
 template <typename T>
 ShapeTree<T>::ShapeTree(Shape shape, const T& init_value)
     : root_(init_value),
-      shape_storage_(MakeUnique<Shape>(std::move(shape))),
+      shape_storage_(::xla::MakeUnique<Shape>(std::move(shape))),
       shape_(shape_storage_.get()) {
   // The shape_ field is just used to hold the structure of the shape.
   // It should not be relied upon to store layout information.
diff --git a/tensorflow/compiler/xla/shape_util.cc b/tensorflow/compiler/xla/shape_util.cc
index 604e0173e789348923316174873f58058eaf2815..9810e818f6cd5a8c4f602a63984b51e6c0f7bdf3 100644
--- a/tensorflow/compiler/xla/shape_util.cc
+++ b/tensorflow/compiler/xla/shape_util.cc
@@ -1073,9 +1073,10 @@ ShapeUtil::DimensionsUnmodifiedByReshape(const Shape& input_shape,
 /* static */ bool ShapeUtil::TransposeIsBitcast(
     const Shape& input_shape, const Shape& output_shape,
     tensorflow::gtl::ArraySlice<int64> dimension_mapping) {
-  // Can't insert bitcasts without layout information.
-  if (!LayoutUtil::HasLayout(input_shape) &&
-      !LayoutUtil::HasLayout(output_shape)) {
+  CHECK(LayoutUtil::HasLayout(input_shape) &&
+        LayoutUtil::HasLayout(output_shape));
+
+  if (!SameElementType(input_shape, output_shape)) {
     return false;
   }
 
@@ -1106,9 +1107,10 @@ ShapeUtil::DimensionsUnmodifiedByReshape(const Shape& input_shape,
 
 /* static */ bool ShapeUtil::ReshapeIsBitcast(const Shape& input_shape,
                                               const Shape& output_shape) {
-  // Can't convert reshapes into bitcasts without layout information.
-  if (!LayoutUtil::HasLayout(input_shape) ||
-      !LayoutUtil::HasLayout(output_shape)) {
+  CHECK(LayoutUtil::HasLayout(input_shape) &&
+        LayoutUtil::HasLayout(output_shape));
+
+  if (!SameElementType(input_shape, output_shape)) {
     return false;
   }
 
diff --git a/tensorflow/compiler/xla/shape_util.h b/tensorflow/compiler/xla/shape_util.h
index 19b1aa93bd373ebd5f502d0dca56c9b31ab4fd7f..fb66f6970915b258acd91bd8be28c882de9ede99 100644
--- a/tensorflow/compiler/xla/shape_util.h
+++ b/tensorflow/compiler/xla/shape_util.h
@@ -24,6 +24,7 @@ limitations under the License.
 
 #include "tensorflow/compiler/xla/layout_util.h"
 #include "tensorflow/compiler/xla/primitive_util.h"
+#include "tensorflow/compiler/xla/status_macros.h"
 #include "tensorflow/compiler/xla/statusor.h"
 #include "tensorflow/compiler/xla/types.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
@@ -320,6 +321,15 @@ class ShapeUtil {
   static Shape MakeShape(PrimitiveType element_type,
                          tensorflow::gtl::ArraySlice<int64> dimensions);
 
+  // Creates a Shape with element type corresponding to T and the given
+  // dimensions.
+  template <typename T>
+  static Shape MakeShapeWithType(
+      tensorflow::gtl::ArraySlice<int64> dimensions) {
+    return ShapeUtil::MakeShape(primitive_util::NativeToPrimitiveType<T>(),
+                                dimensions);
+  }
+
   // Constructs a new shape with the given minor_to_major order in its Layout.
   // Returns a value shape such that shape.has_layout().
   static Shape MakeShapeWithLayout(
@@ -522,12 +532,16 @@ class ShapeUtil {
   // Returns whether a transpose from input_shape to output_shape with dimension
   // mapping "dimension_mapping" produces a result which is bit-wise identical
   // to its input and thus may be replaced with a bitcast.
+  //
+  // Precondition: Both input_shape and output_shape have explicit layouts.
   static bool TransposeIsBitcast(
       const Shape& input_shape, const Shape& output_shape,
       tensorflow::gtl::ArraySlice<int64> dimension_mapping);
 
   // Returns whether a reshape from "input_shape" to "output_shape" is a
   // bitcast.
+  //
+  // Precondition: Both input_shape and output_shape have explicit layouts.
   static bool ReshapeIsBitcast(const Shape& input_shape,
                                const Shape& output_shape);
 
@@ -560,16 +574,16 @@ class ShapeUtil {
   // The visitor_function visitor function should return true if it wants to
   // continue, or false otherwise.
   //
-  // visitor_function must be a callable of type bool(const std::vector<int64>&)
-  // or compatible.
+  // visitor_function must be a callable of type
+  // StatusOr<bool>(ArraySlice<int64>) or compatible.
   template <typename FnType>
-  static void ForEachIndex(const Shape& shape,
-                           tensorflow::gtl::ArraySlice<int64> base,
-                           tensorflow::gtl::ArraySlice<int64> count,
-                           tensorflow::gtl::ArraySlice<int64> incr,
-                           const FnType& visitor_function) {
+  static Status ForEachIndexWithStatus(const Shape& shape,
+                                       tensorflow::gtl::ArraySlice<int64> base,
+                                       tensorflow::gtl::ArraySlice<int64> count,
+                                       tensorflow::gtl::ArraySlice<int64> incr,
+                                       const FnType& visitor_function) {
     if (ShapeUtil::HasZeroElements(shape)) {
-      return;
+      return Status::OK();
     }
     CHECK_EQ(Rank(shape), base.size());
     CHECK_EQ(incr.size(), base.size());
@@ -579,7 +593,11 @@ class ShapeUtil {
     // once with the proper empty indexes.
     int64 n = -1;
     std::vector<int64> indexes(base.begin(), base.end());
-    while (n < rank && visitor_function(indexes)) {
+    while (n < rank) {
+      TF_ASSIGN_OR_RETURN(bool should_continue, visitor_function(indexes));
+      if (!should_continue) {
+        break;
+      }
       // Increments dimensions in minor to major order.
       for (n = 0; n < rank; ++n) {
         int64 dim = LayoutUtil::Minor(shape.layout(), n);
@@ -590,6 +608,21 @@ class ShapeUtil {
         indexes[dim] = base[dim];
       }
     }
+
+    return Status::OK();
+  }
+
+  template <typename FnType>
+  static void ForEachIndex(const Shape& shape,
+                           tensorflow::gtl::ArraySlice<int64> base,
+                           tensorflow::gtl::ArraySlice<int64> count,
+                           tensorflow::gtl::ArraySlice<int64> incr,
+                           const FnType& visitor_function) {
+    ForEachIndexWithStatus(shape, base, count, incr,
+                           [&](tensorflow::gtl::ArraySlice<int64> indices) {
+                             return StatusOr<bool>(visitor_function(indices));
+                           })
+        .IgnoreError();
   }
 
  private:
diff --git a/tensorflow/compiler/xla/shape_util_test.cc b/tensorflow/compiler/xla/shape_util_test.cc
index 4db97d45b20b86dc60531845c6e28a223203ff7f..a3574156983cfcb53cd240bdf83feef107f11d7e 100644
--- a/tensorflow/compiler/xla/shape_util_test.cc
+++ b/tensorflow/compiler/xla/shape_util_test.cc
@@ -573,10 +573,11 @@ TEST(ShapeUtilTest, ForEachIndex) {
     Shape shape = ShapeUtil::MakeShape(F32, data.dimensions);
     // Increments at every invocation.
     int invocations = 0;
-    auto increment_func = [&invocations](const std::vector<int64>& indexes) {
-      invocations++;
-      return true;
-    };
+    auto increment_func =
+        [&invocations](tensorflow::gtl::ArraySlice<int64> indexes) {
+          invocations++;
+          return true;
+        };
 
     std::vector<int64> zero_base(data.dimensions.size(), 0);
     std::vector<int64> step(data.dimensions.size(), 1);
@@ -588,6 +589,29 @@ TEST(ShapeUtilTest, ForEachIndex) {
   }
 }
 
+TEST(ShapeUtilTest, ForEachIndexWithStatus) {
+  Shape shape = ShapeUtil::MakeShape(F32, {10, 10});
+  // Increments at every invocation.
+  int invocations = 0;
+  auto increment_func =
+      [&invocations](
+          tensorflow::gtl::ArraySlice<int64> indexes) -> StatusOr<bool> {
+    if (++invocations == 5) {
+      return Unimplemented("Cannot increment beyond 5.");
+    }
+    return true;
+  };
+
+  Status error_status = ShapeUtil::ForEachIndexWithStatus(
+      shape, /*base=*/{0, 0}, /*count=*/{10, 10}, /*incr=*/{0, 1},
+      increment_func);
+
+  EXPECT_FALSE(error_status.ok());
+  EXPECT_THAT(error_status.error_message(),
+              ::testing::HasSubstr("Cannot increment beyond 5."));
+  EXPECT_EQ(invocations, 5);
+}
+
 TEST(ShapeUtilTest, DimensionsUnmodifiedByReshape_1x1x1x1_to_1x1x1) {
   // All output dimensions should be unmodified. One of the input dimensions is
   // modified because the input rank is larger by one.
diff --git a/tensorflow/compiler/xla/tests/BUILD b/tensorflow/compiler/xla/tests/BUILD
index 8339d08ef4d7455f9739b80074ab0405a404e8e8..63f4a4430fd3d3103ffabf9a857443496df42768 100644
--- a/tensorflow/compiler/xla/tests/BUILD
+++ b/tensorflow/compiler/xla/tests/BUILD
@@ -271,6 +271,9 @@ cc_library(
 xla_test(
     name = "bad_rng_shape_validation_test",
     srcs = ["bad_rng_shape_validation_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:statusor",
         "//tensorflow/compiler/xla:test",
@@ -290,6 +293,9 @@ xla_test(
 xla_test(
     name = "check_execution_arity_test",
     srcs = ["check_execution_arity_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
@@ -309,6 +315,9 @@ xla_test(
 xla_test(
     name = "query_inferred_shape_test",
     srcs = ["query_inferred_shape_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:shape_util",
         "//tensorflow/compiler/xla:statusor",
@@ -325,6 +334,9 @@ xla_test(
 xla_test(
     name = "while_test",
     srcs = ["while_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
@@ -366,6 +378,9 @@ xla_test(
 xla_test(
     name = "axpy_simple_test",
     srcs = ["axpy_simple_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla/client:computation_builder",
         "//tensorflow/compiler/xla/client:local_client",
@@ -430,6 +445,9 @@ xla_test(
 xla_test(
     name = "pred_test",
     srcs = ["pred_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla/client:computation_builder",
@@ -444,6 +462,9 @@ xla_test(
 xla_test(
     name = "select_test",
     srcs = ["select_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:computation_builder",
@@ -460,6 +481,7 @@ xla_test(
 xla_test(
     name = "conditional_test",
     srcs = ["conditional_test.cc"],
+    tags = ["enable_for_xla_interpreter"],
     deps = [
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:computation_builder",
@@ -476,6 +498,7 @@ xla_test(
 xla_test(
     name = "unary_op_test",
     srcs = ["unary_op_test.cc"],
+    tags = ["enable_for_xla_interpreter"],
     deps = [
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:computation_builder",
@@ -622,8 +645,10 @@ xla_test(
 xla_test(
     name = "dot_operation_test",
     srcs = ["dot_operation_test.cc"],
+    shard_count = 20,
     tags = [
         "enable_for_xla_interpreter",
+        "optonly",
     ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
@@ -642,32 +667,7 @@ xla_test(
     ],
 )
 
-# Tests the dot operation in some cases that can be performed via a
-# runtime call on some backends - e.g. a runtime call to Eigen.
-xla_test(
-    name = "dot_operation_runtime_test",
-    srcs = ["dot_operation_test.cc"],
-    tags = [
-        "enable_for_xla_interpreter",
-    ],
-    deps = [
-        "//tensorflow/compiler/xla:array2d",
-        "//tensorflow/compiler/xla:array3d",
-        "//tensorflow/compiler/xla:reference_util",
-        "//tensorflow/compiler/xla:shape_util",
-        "//tensorflow/compiler/xla/client:computation_builder",
-        "//tensorflow/compiler/xla/client:local_client",
-        "//tensorflow/compiler/xla/tests:client_library_test_base",
-        "//tensorflow/compiler/xla/tests:literal_test_util",
-        "//tensorflow/compiler/xla/tests:test_utils",
-        "//tensorflow/compiler/xla/tests:xla_internal_test_main",
-        "//tensorflow/core:framework_internal",
-        "//tensorflow/core:lib",
-        "//tensorflow/core:test",
-    ],
-)
-
-# Repeat dot_operation_runtime_test with single-threded eigen.
+# Repeat dot_operation_test with single-threaded Eigen.
 xla_test(
     name = "dot_operation_single_threaded_runtime_test",
     srcs = ["dot_operation_test.cc"],
@@ -679,6 +679,8 @@ xla_test(
             "--xla_cpu_multi_thread_eigen=false",
         ],
     },
+    shard_count = 20,
+    tags = ["optonly"],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array3d",
@@ -699,6 +701,9 @@ xla_test(
 xla_test(
     name = "transpose_test",
     srcs = ["transpose_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:reference_util",
@@ -717,6 +722,9 @@ xla_test(
 xla_test(
     name = "constants_test",
     srcs = ["constants_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array3d",
@@ -877,8 +885,7 @@ xla_test(
     name = "half_test",
     srcs = ["half_test.cc"],
     backends = [
-        # TODO(b/72509305): Flaky (fails with SEGV) as of 2018-01-25
-        # "cpu",
+        "cpu",
         "gpu",
     ],
     deps = [
@@ -902,6 +909,9 @@ xla_test(
     name = "slice_test",
     srcs = ["slice_test.cc"],
     shard_count = 40,
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:reference_util",
@@ -918,6 +928,9 @@ xla_test(
 xla_test(
     name = "multidimensional_slice_test",
     srcs = ["multidimensional_slice_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array3d",
@@ -980,6 +993,9 @@ xla_test(
 xla_test(
     name = "vector_ops_reduce_test",
     srcs = ["vector_ops_reduce_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array3d",
@@ -998,6 +1014,10 @@ xla_test(
     name = "reduce_test",
     srcs = ["reduce_test.cc"],
     shard_count = 40,
+    tags = [
+        "enable_for_xla_interpreter",
+        "optonly",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
@@ -1091,6 +1111,9 @@ xla_test(
 xla_test(
     name = "copy_test",
     srcs = ["copy_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         ":client_library_test_base",
         "//tensorflow/compiler/xla:array2d",
@@ -1109,6 +1132,9 @@ xla_test(
 xla_test(
     name = "reduce_hlo_test",
     srcs = ["reduce_hlo_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         ":client_library_test_base",
         "//tensorflow/compiler/xla/tests:hlo_test_base",
@@ -1122,6 +1148,9 @@ xla_test(
 xla_test(
     name = "call_test",
     srcs = ["call_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
@@ -1159,6 +1188,9 @@ xla_test(
 xla_test(
     name = "binop_scaling_test",
     srcs = ["binop_scaling_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
@@ -1175,6 +1207,9 @@ xla_test(
 xla_test(
     name = "broadcast_simple_test",
     srcs = ["broadcast_simple_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
@@ -1192,6 +1227,9 @@ xla_test(
 xla_test(
     name = "pad_test",
     srcs = ["pad_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
@@ -1212,6 +1250,9 @@ xla_test(
 xla_test(
     name = "fmax_test",
     srcs = ["fmax_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla/client:computation_builder",
         "//tensorflow/compiler/xla/client:local_client",
@@ -1225,6 +1266,9 @@ xla_test(
 xla_test(
     name = "log_test",
     srcs = ["log_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla/client:computation_builder",
         "//tensorflow/compiler/xla/client:local_client",
@@ -1238,6 +1282,9 @@ xla_test(
 xla_test(
     name = "matrix_ops_simple_test",
     srcs = ["matrix_ops_simple_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:literal_util",
@@ -1252,6 +1299,7 @@ xla_test(
         "//tensorflow/compiler/xla/client:local_client",
         "//tensorflow/compiler/xla/tests:client_library_test_base",
         "//tensorflow/compiler/xla/tests:literal_test_util",
+        "//tensorflow/compiler/xla/tests:test_utils",
         "//tensorflow/compiler/xla/tests:xla_internal_test_main",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
@@ -1280,6 +1328,9 @@ xla_test(
     name = "reshape_test",
     srcs = ["reshape_test.cc"],
     shard_count = 30,
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
@@ -1306,6 +1357,9 @@ xla_test(
 xla_test(
     name = "reverse_test",
     srcs = ["reverse_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array4d",
@@ -1322,6 +1376,9 @@ xla_test(
 xla_test(
     name = "vector_ops_simple_test",
     srcs = ["vector_ops_simple_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array4d",
         "//tensorflow/compiler/xla:shape_util",
@@ -1345,6 +1402,9 @@ xla_test(
 xla_test(
     name = "concat_test",
     srcs = ["concat_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:array2d",
         "//tensorflow/compiler/xla:array3d",
@@ -1365,8 +1425,12 @@ xla_test(
 xla_test(
     name = "convert_test",
     srcs = ["convert_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:shape_util",
+        "//tensorflow/compiler/xla:types",
         "//tensorflow/compiler/xla:xla_data_proto",
         "//tensorflow/compiler/xla/client:computation_builder",
         "//tensorflow/compiler/xla/client:local_client",
@@ -1421,6 +1485,9 @@ xla_test(
 xla_test(
     name = "floor_ceil_test",
     srcs = ["floor_ceil_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla/client:computation_builder",
         "//tensorflow/compiler/xla/client:local_client",
@@ -1504,6 +1571,9 @@ xla_test(
 xla_test(
     name = "replay_test",
     srcs = ["replay_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:protobuf_util",
@@ -1526,6 +1596,9 @@ xla_test(
 xla_test(
     name = "broadcast_test",
     srcs = ["broadcast_test.cc"],
+    tags = [
+        "enable_for_xla_interpreter",
+    ],
     deps = [
         "//tensorflow/compiler/xla:literal_util",
         "//tensorflow/compiler/xla:shape_util",
diff --git a/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc b/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc
index 7e9005001db34d403ea923eb9c152d114bf32803..6e21dda25d8e5151b31b8c2328253260595a94c4 100644
--- a/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc
+++ b/tensorflow/compiler/xla/tests/array_elementwise_ops_test.cc
@@ -101,6 +101,33 @@ XLA_TEST_F(ArrayElementwiseOpTest, NegConstantC64) {
       {}, error_spec_);
 }
 
+XLA_TEST_F(ArrayElementwiseOpTest, NegConstantS64) {
+  ComputationBuilder builder(client_, TestName());
+  auto a = builder.ConstantR1<int64>({
+      -1,
+      1,
+      0,
+      0x12345678,
+      static_cast<int64>(0xffffffff12345678LL),
+      static_cast<int64>(0x8000000000000000LL),
+      static_cast<int64>(0x8000000000000001LL),
+  });
+  auto result = builder.Neg(a);
+  LOG(INFO) << -static_cast<int64>(0x7FFFFFFFFFFFFFFFLL);
+
+  ComputeAndCompareR1<int64>(&builder,
+                             {
+                                 1,
+                                 -1,
+                                 0,
+                                 -0x12345678,
+                                 0xedcba988,
+                                 static_cast<int64>(0x8000000000000000LL),
+                                 -static_cast<int64>(0x8000000000000001LL),
+                             },
+                             {});
+}
+
 XLA_TEST_F(ArrayElementwiseOpTest, IsFiniteZeroElementF32s) {
   ComputationBuilder builder(client_, TestName());
   auto a = builder.ConstantR1<float>({});
@@ -186,6 +213,86 @@ XLA_TEST_F(ArrayElementwiseOpTest, AddTwoConstantZeroElementC64s) {
   ComputeAndCompareR1<complex64>(&builder, {}, {}, error_spec_);
 }
 
+XLA_TEST_F(ArrayElementwiseOpTest, AddTwoConstantU64s) {
+  ComputationBuilder b(client_, TestName());
+
+  std::vector<uint64> lhs{0xFFFFFFFF,
+                          static_cast<uint64>(-1),
+                          0,
+                          0,
+                          0x7FFFFFFFFFFFFFFFLL,
+                          0x7FFFFFFFFFFFFFFLL,
+                          0x8000000000000000LL,
+                          0x8000000000000000LL,
+                          1};
+  std::unique_ptr<Literal> lhs_literal = Literal::CreateR1<uint64>({lhs});
+  auto lhs_param = b.Parameter(0, lhs_literal->shape(), "lhs_param");
+  std::unique_ptr<GlobalData> lhs_data =
+      client_->TransferToServer(*lhs_literal).ConsumeValueOrDie();
+
+  std::vector<uint64> rhs{1,
+                          0x7FFFFFFFFFFFFFFLL,
+                          0x7FFFFFFFFFFFFFFFLL,
+                          0x8000000000000000LL,
+                          0,
+                          static_cast<uint64>(-1),
+                          0,
+                          1,
+                          0x8000000000000000LL};
+  std::unique_ptr<Literal> rhs_literal = Literal::CreateR1<uint64>({rhs});
+  auto rhs_param = b.Parameter(1, rhs_literal->shape(), "rhs_param");
+  std::unique_ptr<GlobalData> rhs_data =
+      client_->TransferToServer(*rhs_literal).ConsumeValueOrDie();
+
+  auto add = b.Add(lhs_param, rhs_param);
+
+  std::vector<uint64> expected(lhs.size());
+  for (int64 i = 0; i < lhs.size(); ++i) {
+    expected[i] = lhs[i] + rhs[i];
+  }
+
+  ComputeAndCompareR1<uint64>(&b, expected, {lhs_data.get(), rhs_data.get()});
+}
+
+XLA_TEST_F(ArrayElementwiseOpTest, SubTwoConstantS64s) {
+  ComputationBuilder b(client_, TestName());
+
+  std::vector<int64> lhs{static_cast<int64>(0x8000000000000000LL),
+                         static_cast<int64>(0x8000000000000000LL),
+                         -1,
+                         0x7FFFFFFFFFFFFFFLL,
+                         0x7FFFFFFFFFFFFFFFLL,
+                         1,
+                         0,
+                         -1};
+  std::unique_ptr<Literal> lhs_literal = Literal::CreateR1<int64>({lhs});
+  auto lhs_param = b.Parameter(0, lhs_literal->shape(), "lhs_param");
+  std::unique_ptr<GlobalData> lhs_data =
+      client_->TransferToServer(*lhs_literal).ConsumeValueOrDie();
+
+  std::vector<int64> rhs{-1,
+                         0,
+                         static_cast<int64>(0x8000000000000000LL),
+                         1,
+                         0,
+                         0x7FFFFFFFFFFFFFFLL,
+                         0x7FFFFFFFFFFFFFFFLL,
+                         0x7FFFFFFFFFFFFFFFLL};
+  std::unique_ptr<Literal> rhs_literal = Literal::CreateR1<int64>({rhs});
+  auto rhs_param = b.Parameter(1, rhs_literal->shape(), "rhs_param");
+  std::unique_ptr<GlobalData> rhs_data =
+      client_->TransferToServer(*rhs_literal).ConsumeValueOrDie();
+
+  auto sub = b.Sub(lhs_param, rhs_param);
+
+  std::vector<int64> expected(lhs.size());
+  for (int64 i = 0; i < lhs.size(); ++i) {
+    expected[i] = lhs[i] - rhs[i];
+  }
+
+  ComputeAndCompareR1<int64>(&b, expected, {lhs_data.get(), rhs_data.get()});
+}
+
 TEST_P(ArrayElementwiseOpTestParamCount, AddManyValues) {
   const int count = GetParam();
   ComputationBuilder builder(client_, TestName());
@@ -847,68 +954,76 @@ XLA_TEST_F(ArrayElementwiseOpTest, NotZeroElementU32R1) {
 
 XLA_TEST_F(ArrayElementwiseOpTest, ShiftLeftS32) {
   ComputationBuilder builder(client_, TestName());
-  auto a =
-      builder.ConstantR1<int32>({static_cast<int32>(0x12345678),
-                                 static_cast<int32>(0xF0001000), 1, 3, 77});
-  auto b = builder.ConstantR1<int32>({4, 8, 2, 7, 15});
+  auto a = builder.ConstantR1<int32>({static_cast<int32>(0x12345678),
+                                      static_cast<int32>(0xF0001000), 1, 3, 77,
+                                      1, -3, 77});
+  auto b = builder.ConstantR1<int32>({4, 8, 2, 7, 15, 32, 100, -1});
   auto out = builder.ShiftLeft(a, b);
 
-  ComputeAndCompareR1<int32>(
-      &builder,
-      {static_cast<int32>(0x23456780), 0x00100000, 0x4, 0x180, 2523136}, {});
+  ComputeAndCompareR1<int32>(&builder,
+                             {static_cast<int32>(0x23456780), 0x00100000, 0x4,
+                              0x180, 2523136, 0, 0, 0},
+                             {});
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, ShiftRightArithmeticS32) {
   ComputationBuilder builder(client_, TestName());
-  auto a =
-      builder.ConstantR1<int32>({static_cast<int32>(0x92345678),
-                                 static_cast<int32>(0x10001000), 1, 3, 77});
-  auto b = builder.ConstantR1<int32>({4, 8, 2, 7, 2});
+  auto a = builder.ConstantR1<int32>({static_cast<int32>(0x92345678),
+                                      static_cast<int32>(0x10001000), 1, 3, 77,
+                                      1, -3, 77});
+  auto b = builder.ConstantR1<int32>({4, 8, 2, 7, 2, 32, 100, -1});
   auto out = builder.ShiftRightArithmetic(a, b);
 
-  ComputeAndCompareR1<int32>(&builder,
-                             {static_cast<int32>(0xF9234567),
-                              static_cast<int32>(0x00100010), 0, 0, 19},
-                             {});
+  ComputeAndCompareR1<int32>(
+      &builder,
+      {static_cast<int32>(0xF9234567), static_cast<int32>(0x00100010), 0, 0, 19,
+       0, -1, 0},
+      {});
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, ShiftRightLogicalS32) {
   ComputationBuilder builder(client_, TestName());
-  auto a =
-      builder.ConstantR1<int32>({static_cast<int32>(0x92345678),
-                                 static_cast<int32>(0x10001000), 1, 3, 77});
-  auto b = builder.ConstantR1<int32>({4, 8, 2, 7, 5});
+  auto a = builder.ConstantR1<int32>({static_cast<int32>(0x92345678),
+                                      static_cast<int32>(0x10001000), 1, 3, 77,
+                                      1, -3, 77});
+  auto b = builder.ConstantR1<int32>({4, 8, 2, 7, 5, 32, 100, -1});
   auto out = builder.ShiftRightLogical(a, b);
 
-  ComputeAndCompareR1<int32>(&builder, {0x09234567, 0x00100010, 0, 0, 2}, {});
+  ComputeAndCompareR1<int32>(&builder,
+                             {0x09234567, 0x00100010, 0, 0, 2, 0, 0, 0}, {});
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, ShiftLeftU32) {
   ComputationBuilder builder(client_, TestName());
-  auto a = builder.ConstantR1<uint32>({0x12345678, 0xF0001000, 1, 3, 77});
-  auto b = builder.ConstantR1<uint32>({4, 8, 2, 7, 15});
+  auto a = builder.ConstantR1<uint32>(
+      {0x12345678, 0xF0001000, 1, 3, 77, 1, ~3u, 77});
+  auto b = builder.ConstantR1<uint32>({4, 8, 2, 7, 15, 32, 100, ~0u});
   auto out = builder.ShiftLeft(a, b);
 
   ComputeAndCompareR1<uint32>(
-      &builder, {0x23456780, 0x00100000, 0x4, 0x180, 2523136}, {});
+      &builder, {0x23456780, 0x00100000, 0x4, 0x180, 2523136, 0, 0, 0}, {});
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, ShiftRightArithmeticU32) {
   ComputationBuilder builder(client_, TestName());
-  auto a = builder.ConstantR1<uint32>({0x92345678, 0x10001000, 1, 3, 77});
-  auto b = builder.ConstantR1<uint32>({4, 8, 2, 7, 2});
+  auto a = builder.ConstantR1<uint32>(
+      {0x92345678, 0x10001000, 1, 3, 77, 1, ~3u, 77});
+  auto b = builder.ConstantR1<uint32>({4, 8, 2, 7, 2, 32, 100, ~0u});
   auto out = builder.ShiftRightArithmetic(a, b);
 
-  ComputeAndCompareR1<uint32>(&builder, {0xF9234567, 0x00100010, 0, 0, 19}, {});
+  ComputeAndCompareR1<uint32>(
+      &builder, {0xF9234567, 0x00100010, 0, 0, 19, 0, ~0u, 0}, {});
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, ShiftRightLogicalU32) {
   ComputationBuilder builder(client_, TestName());
-  auto a = builder.ConstantR1<uint32>({0x92345678, 0x10001000, 1, 3, 77});
-  auto b = builder.ConstantR1<uint32>({4, 8, 2, 7, 5});
+  auto a = builder.ConstantR1<uint32>(
+      {0x92345678, 0x10001000, 1, 3, 77, 1, ~3u, 77});
+  auto b = builder.ConstantR1<uint32>({4, 8, 2, 7, 5, 32, 100, ~0u});
   auto out = builder.ShiftRightLogical(a, b);
 
-  ComputeAndCompareR1<uint32>(&builder, {0x09234567, 0x00100010, 0, 0, 2}, {});
+  ComputeAndCompareR1<uint32>(&builder,
+                              {0x09234567, 0x00100010, 0, 0, 2, 0, 0, 0}, {});
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, CompareEqF32s) {
@@ -1533,33 +1648,15 @@ XLA_TEST_F(ArrayElementwiseOpTest, SquareIn4DZeroElements) {
   ComputeAndCompareR4<float>(&builder, expected, {}, error_spec_);
 }
 
-// GPU backend emits nvvm intrinsic for fmin and fmax, whose semantics is NOT
-// such
-// * fmin(NaN, x) = x
-// * fmax(NaN, x) = x
-// so we only test NAN on CPU.
-//
-// TODO(b/28180546): Make this compile in a way that is consistent
-// among backends.
 XLA_TEST_F(ArrayElementwiseOpTest, MinF32s) {
   ComputationBuilder builder(client_, TestName());
-#if !defined(XLA_TEST_BACKEND_CPU)
-  auto lhs = builder.ConstantR1<float>({1.0f, 1.0f, 2.25f});
-  auto rhs = builder.ConstantR1<float>({2.0f, -5.0f, 1.0f});
-#else
   SetFastMathDisabled(true);
   auto lhs = builder.ConstantR1<float>({1.0f, 1.0f, 2.25f, NAN, 6.0f});
   auto rhs = builder.ConstantR1<float>({2.0f, -5.0f, 1.0f, 10.0f, NAN});
-#endif
   auto minimum = builder.Min(lhs, rhs);
 
-  ComputeAndCompareR1<float>(&builder,
-#if !defined(XLA_TEST_BACKEND_CPU)
-                             {1.0f, -5.0f, 1.0f},
-#else
-                             {1.0f, -5.0f, 1.0f, 10.0f, 6.0f},
-#endif
-                             {}, error_spec_);
+  ComputeAndCompareR1<float>(&builder, {1.0f, -5.0f, 1.0f, NAN, NAN}, {},
+                             error_spec_);
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, MinZeroElementF32s) {
@@ -1570,50 +1667,26 @@ XLA_TEST_F(ArrayElementwiseOpTest, MinZeroElementF32s) {
   ComputeAndCompareR1<float>(&builder, {}, {}, error_spec_);
 }
 
-// TODO(b/28180546): Make this compile in a way that is consistent
-// among backends. See comment on MinF32s test above.
 XLA_TEST_F(ArrayElementwiseOpTest, MinF64s) {
   ComputationBuilder builder(client_, TestName());
-#if !defined(XLA_TEST_BACKEND_CPU)
-  auto lhs = builder.ConstantR1<double>({1.0, 1.0, 2.25});
-  auto rhs = builder.ConstantR1<double>({2.0, -5.0, 1.0});
-#else
   SetFastMathDisabled(true);
   auto lhs = builder.ConstantR1<double>({1.0, 1.0, 2.25, NAN, 6.0});
   auto rhs = builder.ConstantR1<double>({2.0, -5.0, 1.0, 10.0, NAN});
-#endif
   auto minimum = builder.Min(lhs, rhs);
 
-  ComputeAndCompareR1<double>(&builder,
-#if !defined(XLA_TEST_BACKEND_CPU)
-                              {1.0, -5.0, 1.0},
-#else
-                              {1.0, -5.0, 1.0, 10.0, 6.0},
-#endif
-                              {}, error_spec_);
+  ComputeAndCompareR1<double>(&builder, {1.0, -5.0, 1.0, NAN, NAN}, {},
+                              error_spec_);
 }
 
-// TODO(b/28180546): Make this compile in a way that is consistent
-// among backends. See comment on MinF32s test above.
 XLA_TEST_F(ArrayElementwiseOpTest, MaxF32s) {
   ComputationBuilder builder(client_, TestName());
-#if !defined(XLA_TEST_BACKEND_CPU)
-  auto lhs = builder.ConstantR1<float>({1.0f, 1.0f, 2.25f});
-  auto rhs = builder.ConstantR1<float>({2.0f, -5.0f, 1.0f});
-#else
   SetFastMathDisabled(true);
   auto lhs = builder.ConstantR1<float>({1.0f, 1.0f, 2.25f, NAN, 6.0f});
   auto rhs = builder.ConstantR1<float>({2.0f, -5.0f, 1.0f, 10.0f, NAN});
-#endif
   auto maximum = builder.Max(lhs, rhs);
 
-  ComputeAndCompareR1<float>(&builder,
-#if !defined(XLA_TEST_BACKEND_CPU)
-                             {2.0f, 1.0f, 2.25f},
-#else
-                             {2.0f, 1.0f, 2.25f, 10.0f, 6.0f},
-#endif
-                             {}, error_spec_);
+  ComputeAndCompareR1<float>(&builder, {2.0f, 1.0f, 2.25f, NAN, NAN}, {},
+                             error_spec_);
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, MaxZeroElementF32s) {
@@ -1624,27 +1697,15 @@ XLA_TEST_F(ArrayElementwiseOpTest, MaxZeroElementF32s) {
   ComputeAndCompareR1<float>(&builder, {}, {}, error_spec_);
 }
 
-// TODO(b/28180546): Make this compile in a way that is consistent
-// among backends. See comment on MinF32s test above.
 XLA_TEST_F(ArrayElementwiseOpTest, MaxF64s) {
   ComputationBuilder builder(client_, TestName());
-#if !defined(XLA_TEST_BACKEND_CPU)
-  auto lhs = builder.ConstantR1<double>({1.0, 1.0, 2.25});
-  auto rhs = builder.ConstantR1<double>({2.0, -5.0, 1.0});
-#else
   SetFastMathDisabled(true);
   auto lhs = builder.ConstantR1<double>({1.0, 1.0, 2.25, NAN, 6.0});
   auto rhs = builder.ConstantR1<double>({2.0, -5.0, 1.0, 10.0, NAN});
-#endif
   auto maximum = builder.Max(lhs, rhs);
 
-  ComputeAndCompareR1<double>(&builder,
-#if !defined(XLA_TEST_BACKEND_CPU)
-                              {2.0, 1.0, 2.25},
-#else
-                              {2.0, 1.0, 2.25, 10.0, 6.0},
-#endif
-                              {}, error_spec_);
+  ComputeAndCompareR1<double>(&builder, {2.0, 1.0, 2.25, NAN, NAN}, {},
+                              error_spec_);
 }
 
 XLA_TEST_F(ArrayElementwiseOpTest, MaxS32s) {
diff --git a/tensorflow/compiler/xla/tests/broadcast_simple_test.cc b/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
index 03f5e08315bfed2bcb43ebb7098aaa0b97228605..97095f1cc427789845051a8fea24c95475286fe2 100644
--- a/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
+++ b/tensorflow/compiler/xla/tests/broadcast_simple_test.cc
@@ -662,7 +662,7 @@ XLA_TEST_F(BroadcastSimpleTest, InvalidBinaryAndDegenerateBroadcasting) {
   auto result_status = Execute(&b, {});
   EXPECT_FALSE(result_status.ok());
   EXPECT_THAT(result_status.status().error_message(),
-              HasSubstr("broadcast dimension 0 mismatch"));
+              HasSubstr("dimension 0 mismatch"));
 }
 
 XLA_TEST_F(BroadcastSimpleTest, InvalidInDimensionBroadcasting) {
@@ -675,7 +675,7 @@ XLA_TEST_F(BroadcastSimpleTest, InvalidInDimensionBroadcasting) {
   auto result_status = Execute(&b, {});
   EXPECT_FALSE(result_status.ok());
   EXPECT_THAT(result_status.status().error_message(),
-              HasSubstr("binary op BINOP_ADD with incompatible shapes"));
+              HasSubstr("op BINOP_ADD with incompatible shapes"));
 }
 
 XLA_TEST_F(BroadcastSimpleTest, InvalidDegenerateBroadcasting) {
@@ -688,7 +688,7 @@ XLA_TEST_F(BroadcastSimpleTest, InvalidDegenerateBroadcasting) {
   auto result_status = Execute(&b, {});
   EXPECT_FALSE(result_status.ok());
   EXPECT_THAT(result_status.status().error_message(),
-              HasSubstr("binary op BINOP_ADD with incompatible shapes"));
+              HasSubstr("op BINOP_ADD with incompatible shapes"));
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/tests/concat_test.cc b/tensorflow/compiler/xla/tests/concat_test.cc
index 1bcad5a3f37a37c9d482f3a5a899ac527666cca3..fb0e9c724a69b61801e6e0c2d07ef75b63a00465 100644
--- a/tensorflow/compiler/xla/tests/concat_test.cc
+++ b/tensorflow/compiler/xla/tests/concat_test.cc
@@ -75,7 +75,7 @@ XLA_TEST_F(ConcatTest, CannotConcatR0WithR0) {
   StatusOr<Computation> computation_status = builder.Build();
   ASSERT_FALSE(computation_status.ok());
   EXPECT_THAT(computation_status.status().ToString(),
-              HasSubstr("dimension to concatenate along out of bounds: 0"));
+              HasSubstr("out of bounds: 0"));
 }
 
 XLA_TEST_F(ConcatTest, Concat_R1_L0_With_R1_L0) {
diff --git a/tensorflow/compiler/xla/tests/convert_test.cc b/tensorflow/compiler/xla/tests/convert_test.cc
index f66e3b57bf45fbc9f8ea786146d6fffe5d55a262..59d6d7a4153be1b76ed8195a12a90cb103baa422 100644
--- a/tensorflow/compiler/xla/tests/convert_test.cc
+++ b/tensorflow/compiler/xla/tests/convert_test.cc
@@ -25,6 +25,7 @@ limitations under the License.
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
 #include "tensorflow/compiler/xla/tests/test_macros.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
+#include "tensorflow/core/lib/core/casts.h"
 #include "tensorflow/core/platform/stream_executor_no_cuda.h"
 #include "tensorflow/core/platform/test.h"
 #include "tensorflow/core/platform/types.h"
@@ -106,11 +107,108 @@ TEST_F(ConvertTest, ConvertR1F32ToR1S32) {
 
 XLA_TEST_F(ConvertTest, ConvertR1S64ToR1F32) {
   ComputationBuilder builder(client_, TestName());
-  auto a = builder.ConstantR1<int64>({32, 64});
-  builder.ConvertElementType(a, F32);
+  std::vector<int64> arg{
+      -9223371216516022272,
+      -2,
+      -1,
+      -0x7FFFFFFF,
+      -0x80000000,
+      0,
+      1,
+      2,
+      1073742145,
+      1073742656,
+      0x7FFFFFFF,
+      0x80000000,
+      826720496944058148,
+      4296062029846194332,
+      0x0007FB72E4000000LL,
+      0x0007FB72E4000001LL,
+      0x0007FB72E6000000LL,
+      0x0007FB72E7000000LL,
+      0x0007FB72E7FFFFFFLL,
+      0x0007FB72E8000000LL,
+      0x0007FB72E8000001LL,
+      0x0007FB72EA000000LL,
+      0x0007FB72EB000000LL,
+      0x0007FB72EBFFFFFFLL,
+      0x0007FB72EC000000LL,
+      0x7FFFFF0000000000LL,
+      0x7FFFFF8000000000LL,
+      0x7FFFFFFFFFFFFF00,
+      static_cast<int64>(0xFFFFFFFFFFFFFFFF),
+      static_cast<int64>(0x0000f234e67e0001LL),
+      static_cast<int64>(0x8000000000000000),
+      static_cast<int64>(0x8000000000000000LL),
+      static_cast<int64>(0x8000000000000001LL),
+      static_cast<int64>(0x8000008000000000LL),
+      static_cast<int64>(0x8000010000000000LL),
+  };
+  std::unique_ptr<Literal> arg_literal = Literal::CreateR1<int64>({arg});
+  auto arg_param = builder.Parameter(0, arg_literal->shape(), "arg_param");
+  std::unique_ptr<GlobalData> arg_data =
+      client_->TransferToServer(*arg_literal).ConsumeValueOrDie();
+
+  builder.ConvertElementType(arg_param, F32);
+
+  std::vector<float> expected(arg.size());
+  for (int64 i = 0; i < arg.size(); ++i) {
+    expected[i] = static_cast<float>(arg[i]);
+  }
+  ComputeAndCompareR1<float>(&builder, expected, {arg_data.get()});
+}
 
-  std::vector<float> expected = {32.0, 64.0};
-  ComputeAndCompareR1<float>(&builder, expected, {});
+XLA_TEST_F(ConvertTest, ConvertR1U32ToR1F32) {
+  ComputationBuilder builder(client_, TestName());
+  std::vector<uint32> arg{0,          1,          0x1000,     0x7fffffff,
+                          0x80000000, 0x80000001, 0x80000002, 0x80000003,
+                          0x80000080, 0x80000081, 0x80000082, 0xFFFFFFFF};
+  std::unique_ptr<Literal> arg_literal = Literal::CreateR1<uint32>({arg});
+  auto arg_param = builder.Parameter(0, arg_literal->shape(), "arg_param");
+  std::unique_ptr<GlobalData> arg_data =
+      client_->TransferToServer(*arg_literal).ConsumeValueOrDie();
+
+  builder.ConvertElementType(arg_param, F32);
+
+  std::vector<float> expected(arg.size());
+  for (int64 i = 0; i < arg.size(); ++i) {
+    expected[i] = static_cast<float>(arg[i]);
+  }
+  ComputeAndCompareR1<float>(&builder, expected, {arg_data.get()});
+}
+
+XLA_TEST_F(ConvertTest, ConvertR1U32ToR1S64) {
+  ComputationBuilder builder(client_, TestName());
+  std::vector<uint32> arg{0, 1, 0x1000, 0x7fffffff, 0x80000082, 0xFFFFFFFF};
+  std::unique_ptr<Literal> arg_literal = Literal::CreateR1<uint32>({arg});
+  auto arg_param = builder.Parameter(0, arg_literal->shape(), "arg_param");
+  std::unique_ptr<GlobalData> arg_data =
+      client_->TransferToServer(*arg_literal).ConsumeValueOrDie();
+
+  builder.ConvertElementType(arg_param, S64);
+
+  std::vector<int64> expected(arg.size());
+  for (int64 i = 0; i < arg.size(); ++i) {
+    expected[i] = static_cast<int64>(arg[i]);
+  }
+  ComputeAndCompareR1<int64>(&builder, expected, {arg_data.get()});
+}
+
+XLA_TEST_F(ConvertTest, ConvertR1S32ToR1S64) {
+  ComputationBuilder builder(client_, TestName());
+  std::vector<int32> arg{0, 1, 0x1000, -1, -0x1000};
+  std::unique_ptr<Literal> arg_literal = Literal::CreateR1<int32>({arg});
+  auto arg_param = builder.Parameter(0, arg_literal->shape(), "arg_param");
+  std::unique_ptr<GlobalData> arg_data =
+      client_->TransferToServer(*arg_literal).ConsumeValueOrDie();
+
+  builder.ConvertElementType(arg_param, S64);
+
+  std::vector<int64> expected(arg.size());
+  for (int64 i = 0; i < arg.size(); ++i) {
+    expected[i] = static_cast<int64>(arg[i]);
+  }
+  ComputeAndCompareR1<int64>(&builder, expected, {arg_data.get()});
 }
 
 XLA_TEST_F(ConvertTest, ConvertR1U8ToR1F32) {
@@ -208,5 +306,65 @@ TEST_F(ConvertTest, ConvertReshape) {
   ComputeAndCompareR0<float>(&builder, 42.0f, {}, ErrorSpec(0.0001));
 }
 
+std::vector<float> GetInterestingF16ConversionTestCases() {
+  float infinity = std::numeric_limits<float>::infinity();
+  float half_min_positive_normal =
+      tensorflow::bit_cast<float, uint32>(0x38800000);
+  float half_max_subnormal = tensorflow::bit_cast<float, uint32>(0x387fc000);
+  float half_min_positive_subnormal =
+      tensorflow::bit_cast<float, uint32>(0x33800000);
+  float half_max = 65504.0f;
+
+  std::vector<float> test_cases(
+      {-infinity, -(half_max * 2 + 1), -half_max, -42.0f, -1.0f,
+       -half_min_positive_subnormal, -half_max_subnormal,
+       -half_min_positive_normal, -0.0f, 0.0f, half_min_positive_subnormal,
+       half_max_subnormal, half_min_positive_normal, 1.0f, 42.0f, half_max,
+       (half_max * 2 + 1), infinity});
+  return test_cases;
+}
+
+XLA_TEST_F(ConvertTest, ConvertR1F16ToR1F32) {
+  std::vector<float> test_cases = GetInterestingF16ConversionTestCases();
+  std::vector<half> input;
+  c_transform(test_cases, std::back_inserter(input),
+              [](float f) { return Eigen::half(f); });
+  std::vector<float> expected_output;
+  c_transform(input, std::back_inserter(expected_output),
+              [](Eigen::half h) { return static_cast<float>(h); });
+
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::unique_ptr<GlobalData> dot_lhs_handle,
+      client_->TransferToServer(*Literal::CreateR1<half>(input)));
+
+  ComputationBuilder builder(client_, TestName());
+  builder.ConvertElementType(
+      builder.Parameter(
+          0, ShapeUtil::MakeShape(F16, {static_cast<int64>(input.size())}),
+          "param"),
+      F32);
+
+  ComputeAndCompareR1<float>(&builder, expected_output, {dot_lhs_handle.get()});
+}
+
+XLA_TEST_F(ConvertTest, ConvertR1F32ToR1F16) {
+  std::vector<float> input = GetInterestingF16ConversionTestCases();
+  std::vector<half> expected_output;
+  c_transform(input, std::back_inserter(expected_output),
+              [](float f) { return Eigen::half(f); });
+
+  TF_ASSERT_OK_AND_ASSIGN(
+      std::unique_ptr<GlobalData> dot_lhs_handle,
+      client_->TransferToServer(*Literal::CreateR1<float>(input)));
+
+  ComputationBuilder builder(client_, TestName());
+  builder.ConvertElementType(
+      builder.Parameter(
+          0, ShapeUtil::MakeShape(F32, {static_cast<int64>(input.size())}),
+          "param"),
+      F16);
+
+  ComputeAndCompareR1<half>(&builder, expected_output, {dot_lhs_handle.get()});
+}
 }  // namespace
 }  // namespace xla
diff --git a/tensorflow/compiler/xla/tests/convolution_test.cc b/tensorflow/compiler/xla/tests/convolution_test.cc
index 1385b437fc47fe5289c401581fab8b5278872382..99640f5bb561a463a9f66af95a6495513a7e63b3 100644
--- a/tensorflow/compiler/xla/tests/convolution_test.cc
+++ b/tensorflow/compiler/xla/tests/convolution_test.cc
@@ -53,27 +53,12 @@ class ConvolutionTest : public ClientLibraryTestBase {
 #endif
 };
 
-// TODO(b/72509305): Enable half data type tests for CPU
-#if (XLA_TEST_BACKEND_GPU)
-using TestTypes = ::testing::Types<float, Eigen::half>;
-#else
+#ifdef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
 using TestTypes = ::testing::Types<float>;
+#else
+using TestTypes = ::testing::Types<float, Eigen::half>;
 #endif
 
-template <typename T>
-Shape MakeShapeWrapper(tensorflow::gtl::ArraySlice<int64> dimensions);
-
-template <>
-Shape MakeShapeWrapper<float>(tensorflow::gtl::ArraySlice<int64> dimensions) {
-  return ShapeUtil::MakeShape(F32, dimensions);
-}
-
-template <>
-Shape MakeShapeWrapper<Eigen::half>(
-    tensorflow::gtl::ArraySlice<int64> dimensions) {
-  return ShapeUtil::MakeShape(F16, dimensions);
-}
-
 template <typename T>
 class ForwardPassConvolution_3x3x256_256_OutputZ_Iota : public ConvolutionTest {
  public:
@@ -122,8 +107,8 @@ class Convolve_1x1x1x2_1x1x1x2_Valid : public ConvolutionTest {
  public:
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
-    Shape input_shape = MakeShapeWrapper<T>({1, 1, 1, 2});
-    Shape filter_shape = MakeShapeWrapper<T>({1, 1, 1, 2});
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 1, 2});
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 1, 2});
     auto input = builder.Parameter(0, input_shape, "input");
     auto filter = builder.Parameter(1, filter_shape, "filter");
     auto conv = builder.Conv(input, filter, {1, 1}, Padding::kValid);
@@ -153,8 +138,8 @@ class Convolve_1x1x4x4_1x1x2x2_Valid : public ConvolutionTest {
  public:
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
-    Shape input_shape = MakeShapeWrapper<T>({1, 1, 4, 4});
-    Shape filter_shape = MakeShapeWrapper<T>({1, 1, 2, 2});
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 4, 4});
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 2, 2});
     auto input = builder.Parameter(0, input_shape, "input");
     auto filter = builder.Parameter(1, filter_shape, "filter");
     auto conv = builder.Conv(input, filter, {1, 1}, Padding::kValid);
@@ -187,8 +172,8 @@ class Convolve_1x1x4x4_1x1x2x2_Same : public ConvolutionTest {
  public:
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
-    Shape input_shape = MakeShapeWrapper<T>({1, 1, 4, 4});
-    Shape filter_shape = MakeShapeWrapper<T>({1, 1, 2, 2});
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 4, 4});
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 2, 2});
     auto input = builder.Parameter(0, input_shape, "input");
     auto filter = builder.Parameter(1, filter_shape, "filter");
     auto conv = builder.Conv(input, filter, {1, 1}, Padding::kSame);
@@ -223,8 +208,8 @@ class Convolve_1x1x4x4_1x1x3x3_Same : public ConvolutionTest {
  public:
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
-    Shape input_shape = MakeShapeWrapper<T>({1, 1, 4, 4});
-    Shape filter_shape = MakeShapeWrapper<T>({1, 1, 3, 3});
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 4, 4});
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 1, 3, 3});
     auto input = builder.Parameter(0, input_shape, "input");
     auto filter = builder.Parameter(1, filter_shape, "filter");
     auto conv = builder.Conv(input, filter, {1, 1}, Padding::kSame);
@@ -281,8 +266,8 @@ class Convolve1D_1x2x5_1x2x2_WithRHSDilation : public ConvolutionTest {
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
     {
-      Shape input_shape = MakeShapeWrapper<T>({1, 2, 5});
-      Shape filter_shape = MakeShapeWrapper<T>({1, 2, 2});
+      Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 2, 5});
+      Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 2, 2});
       auto input = builder.Parameter(0, input_shape, "input");
       auto filter = builder.Parameter(1, filter_shape, "filter");
       // Convolution dimensions are bf0_oi0->bo0.
@@ -382,8 +367,8 @@ class Convolve1D_1x2x5_1x2x2_WithPadding : public ConvolutionTest {
   void RunTest() {
     ComputationBuilder builder(client_, TestName());
     {
-      Shape input_shape = MakeShapeWrapper<T>({1, 2, 5});
-      Shape filter_shape = MakeShapeWrapper<T>({1, 2, 2});
+      Shape input_shape = ShapeUtil::MakeShapeWithType<T>({1, 2, 5});
+      Shape filter_shape = ShapeUtil::MakeShapeWithType<T>({1, 2, 2});
       auto input = builder.Parameter(0, input_shape, "input");
       auto filter = builder.Parameter(1, filter_shape, "filter");
       // Convolution dimensions are bf0_oi0->bo0.
@@ -487,8 +472,8 @@ class Convolve2D_1x3x3x5_3x3x5x5_Valid : public ConvolutionTest {
     ComputationBuilder builder(client_, TestName());
     std::vector<int64> input_dims = {1, 3, 3, 5};
     std::vector<int64> filter_dims = {3, 3, 5, 3};
-    Shape input_shape = MakeShapeWrapper<T>(input_dims);
-    Shape filter_shape = MakeShapeWrapper<T>(filter_dims);
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>(input_dims);
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>(filter_dims);
     {
       auto input = builder.Parameter(0, input_shape, "input");
       auto filter = builder.Parameter(1, filter_shape, "filter");
@@ -612,8 +597,8 @@ class Convolve1D1WindowTestBase
                                      input_feature};
     std::vector<int64> filter_dims = {window_size, input_feature,
                                       output_feature};
-    Shape input_shape = MakeShapeWrapper<T>(input_dims);
-    Shape filter_shape = MakeShapeWrapper<T>(filter_dims);
+    Shape input_shape = ShapeUtil::MakeShapeWithType<T>(input_dims);
+    Shape filter_shape = ShapeUtil::MakeShapeWithType<T>(filter_dims);
     {
       auto input = builder.Parameter(0, input_shape, "input");
       auto filter = builder.Parameter(1, filter_shape, "filter");
@@ -699,9 +684,7 @@ INSTANTIATE_TEST_CASE_P(
 #if (XLA_TEST_BACKEND_GPU || XLA_TEST_BACKEND_CPU)
 class Convolve1D1WindowTestHalf : public Convolve1D1WindowTestBase {};
 
-// TODO(b/72509305): Enable half data type tests for CPU.
-XLA_TEST_P(Convolve1D1WindowTestHalf,
-           DISABLED_ON_CPU_PARALLEL(DISABLED_ON_CPU(Convolve1D1Window))) {
+XLA_TEST_P(Convolve1D1WindowTestHalf, Convolve1D1Window) {
   TestImpl<Eigen::half>();
 }
 
@@ -719,14 +702,16 @@ INSTANTIATE_TEST_CASE_P(
                       Convolve1DTestParam{130, 1, 1, 1, 3},
                       Convolve1DTestParam{64, 1, 1, 1, 1},
                       Convolve1DTestParam{128, 1, 1, 1, 1},
-                      // TODO(b/72566306): the following three tests fail on CPU
-                      // backend due to result miscompare.
+// TODO(b/72566306): The following five tests failed on CPU with unreasonable
+// relative errors.  Last run on 2018-02-22.
+#if XLA_TEST_BACKEND_GPU
                       Convolve1DTestParam{139, 1, 1, 128, 1},
                       Convolve1DTestParam{640, 3, 3, 128, 1},
                       Convolve1DTestParam{900, 1, 1, 10, 1},
                       Convolve1DTestParam{1, 10, 10, 1, 10},
-                      Convolve1DTestParam{1, 10, 130, 1, 2},
                       Convolve1DTestParam{1, 10, 130, 1, 1},
+#endif
+                      Convolve1DTestParam{1, 10, 130, 1, 2},
                       Convolve1DTestParam{1, 64, 64, 1, 10},
                       Convolve1DTestParam{1, 65, 65, 1, 1},
                       Convolve1DTestParam{1, 128, 128, 1, 1},
diff --git a/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc b/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc
index 032c06cd3c9f872f57674d3d7b5adc201c91ea77..3ab0ea4ad48c00724d48e7d285ec024e10d5db31 100644
--- a/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc
+++ b/tensorflow/compiler/xla/tests/deconstruct_tuple_test.cc
@@ -195,7 +195,7 @@ XLA_TEST_F(DeconstructTupleTest, DeconstructNestedTuple) {
   auto result_status = client_->DeconstructTuple(*global_data);
   EXPECT_FALSE(result_status.ok());
   EXPECT_THAT(result_status.status().error_message(),
-              HasSubstr("deconstructing nested tuples not yet supported"));
+              HasSubstr("Deconstructing nested tuples is not implemented"));
 }
 
 }  // namespace
diff --git a/tensorflow/compiler/xla/tests/dot_operation_test.cc b/tensorflow/compiler/xla/tests/dot_operation_test.cc
index 6b0c04c2c083bbfce267dd92d24ef15c06186d26..09b1dd283e4d026a2f0007240d88cd9ac38acb19 100644
--- a/tensorflow/compiler/xla/tests/dot_operation_test.cc
+++ b/tensorflow/compiler/xla/tests/dot_operation_test.cc
@@ -34,169 +34,194 @@ limitations under the License.
 namespace xla {
 namespace {
 
-// TODO(b/34468543): use GUnit typed tests when we can do all tests on all
-// backends.
 class DotOperationTest : public ClientLibraryTestBase {
  public:
   ErrorSpec error_spec_{0.0001, 1e-5};
-
- protected:
-  template <typename Element>
-  void TestOneElementVectorDot();
-  template <typename Element>
-  void TestVectorDot();
-  template <typename Element>
-  void TestSquareMatrixDot(bool lhs_row_major = false,
-                           bool rhs_row_major = false);
-  template <typename Element>
-  void TestNonsquareMatrixDot(bool lhs_row_major = false,
-                              bool rhs_row_major = false);
 };
 
-XLA_TEST_F(DotOperationTest, ZeroElementVectorDotF32) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR1<float>({});
-  auto rhs = builder.ConstantR1<float>({});
+#if defined(XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16) && \
+    defined(XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT64)
+using TypesF16F32 = ::testing::Types<float>;
+using TypesF16F32F64 = ::testing::Types<float>;
+using TypesF16F32F64CF64 = ::testing::Types<float>;
+#elif !defined(XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16) && \
+    !defined(XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT64)
+using TypesF16F32 = ::testing::Types<Eigen::half, float>;
+using TypesF16F32F64 = ::testing::Types<Eigen::half, float, double>;
+using TypesF16F32F64CF64 =
+    ::testing::Types<Eigen::half, float, double, complex64>;
+#else
+#error "Situation not handled yet"
+#endif
+
+template <typename T>
+class DotOperationTest_F16F32F64CF64 : public DotOperationTest {};
+TYPED_TEST_CASE(DotOperationTest_F16F32F64CF64, TypesF16F32F64CF64);
+
+XLA_TYPED_TEST(DotOperationTest_F16F32F64CF64, ZeroElementVectorDot) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+
+  auto lhs = builder.ConstantR1<T>({});
+  auto rhs = builder.ConstantR1<T>({});
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR0<float>(&builder, 0.0, {}, error_spec_);
+  this->template ComputeAndCompareR0<T>(&builder, static_cast<T>(0.0), {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, TrivialMatrixVectorDotF32) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR2<float>({{3.0, 4.0}});
-  auto rhs = builder.ConstantR1<float>({3.0, 4.0});
-  auto result = builder.Dot(lhs, rhs);
+template <typename T>
+class DotOperationTest_F16F32F64 : public DotOperationTest {};
+TYPED_TEST_CASE(DotOperationTest_F16F32F64, TypesF16F32F64);
 
-  ComputeAndCompareR1<float>(&builder, {25.0}, {}, error_spec_);
-}
-
-template <typename Element>
-void DotOperationTest::TestOneElementVectorDot() {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR1<Element>({2.0});
-  auto rhs = builder.ConstantR1<Element>({3.0});
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, TrivialMatrixVectorDot) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR2FromArray2D<T>({{3.0f, 4.0f}});
+  auto rhs = builder.ConstantFromArray<T>({3.0f, 4.0f});
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR0<Element>(&builder, 6.0, {}, error_spec_);
+  this->template ComputeAndCompareR1<T>(&builder, {static_cast<T>(25.0f)}, {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, OneElementVectorDotF32) {
-  TestOneElementVectorDot<float>();
-}
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, OneElementVectorDot) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR1<T>({static_cast<T>(2.0f)});
+  auto rhs = builder.ConstantR1<T>({static_cast<T>(3.0f)});
+  auto result = builder.Dot(lhs, rhs);
 
-XLA_TEST_F(DotOperationTest, OneElementVectorDotF64) {
-  TestOneElementVectorDot<double>();
+  this->template ComputeAndCompareR0<T>(&builder, static_cast<T>(6.0f), {},
+                                        this->error_spec_);
 }
 
-template <typename Element>
-void DotOperationTest::TestVectorDot() {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR1<Element>({1.0, 2.5, 42.0});
-  auto rhs = builder.ConstantR1<Element>({11.0, -1.0, 0.5});
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, VectorDot) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantFromArray<T>({1.0f, 2.5f, 42.0f});
+  auto rhs = builder.ConstantFromArray<T>({11.0f, -1.0f, 0.5f});
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR0<Element>(&builder, 29.5, {}, error_spec_);
+  this->template ComputeAndCompareR0<T>(&builder, static_cast<T>(29.5f), {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, VectorDotF32) { TestVectorDot<float>(); }
-
-XLA_TEST_F(DotOperationTest, VectorDotF64) { TestVectorDot<double>(); }
-
-namespace {
-
 std::vector<int64> MinorToMajorForIsRowMajor(bool row_major) {
   return {row_major ? 1 : 0, row_major ? 0 : 1};
 }
 
-}  // namespace
-
-XLA_TEST_F(DotOperationTest, Dot_0x2_2x0) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(0, 2));
-  auto rhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(2, 0));
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_0x2_2x0) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(0, 2));
+  auto rhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(2, 0));
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR2<float>(&builder, Array2D<float>(0, 0), {}, error_spec_);
+  this->template ComputeAndCompareR2<T>(&builder, Array2D<T>(0, 0), {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, Dot_0x2_2x3) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(0, 2));
-  auto rhs = builder.ConstantR2<float>({{7.0, 8.0, 9.0}, {42.0, 77.0, 101.0}});
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_0x2_2x3) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(0, 2));
+  auto rhs = builder.ConstantR2FromArray2D<T>(
+      {{7.0f, 8.0f, 9.0f}, {42.0f, 77.0f, 101.0f}});
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR2<float>(&builder, Array2D<float>(0, 3), {}, error_spec_);
+  this->template ComputeAndCompareR2<T>(&builder, Array2D<T>(0, 3), {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, Dot_3x2_2x0) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs =
-      builder.ConstantR2<float>({{7.0, 8.0}, {9.0, 42.0}, {77.0, 101.0}});
-  auto rhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(2, 0));
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_3x2_2x0) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR2FromArray2D<T>(
+      {{7.0f, 8.0f}, {9.0f, 42.0f}, {77.0f, 101.0f}});
+  auto rhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(2, 0));
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR2<float>(&builder, Array2D<float>(3, 0), {}, error_spec_);
+  this->template ComputeAndCompareR2<T>(&builder, Array2D<T>(3, 0), {},
+                                        this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, Dot_2x0_0x2) {
-  ComputationBuilder builder(client_, TestName());
-  auto lhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(2, 0));
-  auto rhs = builder.ConstantR2FromArray2D<float>(Array2D<float>(0, 2));
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, Dot_2x0_0x2) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto lhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(2, 0));
+  auto rhs = builder.ConstantR2FromArray2D<T>(Array2D<T>(0, 2));
   auto result = builder.Dot(lhs, rhs);
 
-  ComputeAndCompareR2<float>(&builder, Array2D<float>(2, 2, 0.0f), {},
-                             error_spec_);
+  this->template ComputeAndCompareR2<T>(
+      &builder, Array2D<T>(2, 2, static_cast<T>(0.0f)), {}, this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, FusedDot) {
-  ComputationBuilder builder(client_, TestName());
-  auto param0 = builder.Parameter(0, ShapeUtil::MakeShape(F32, {2, 4}), "arg0");
-  auto param1 = builder.Parameter(1, ShapeUtil::MakeShape(F32, {4, 1}), "arg1");
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, FusedDot) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto param0 =
+      builder.Parameter(0, ShapeUtil::MakeShapeWithType<T>({2, 4}), "arg0");
+  auto param1 =
+      builder.Parameter(1, ShapeUtil::MakeShapeWithType<T>({4, 1}), "arg1");
   auto exp0 = builder.Exp(param0);
   auto result = builder.Dot(exp0, param1);
 
-  auto lhs_handle = client_
-                        ->TransferToServer(*Literal::CreateR2<float>(
-                            {{1.0, 2.0, 3.0, 4.0}, {-1.0, -2.0, -3.0, -4.0}}))
-                        .ConsumeValueOrDie();
-  auto rhs_handle = client_
-                        ->TransferToServer(*Literal::CreateR2<float>(
-                            {{1.0}, {2.0}, {3.0}, {4.0}}))
-                        .ConsumeValueOrDie();
-
-  ComputeAndCompareR2<float>(
-      &builder, Array2D<float>({{296.14560492846033}, {0.8611737683031964}}),
-      {lhs_handle.get(), rhs_handle.get()}, error_spec_);
-}
-
-template <typename Element>
-void DotOperationTest::TestSquareMatrixDot(bool lhs_row_major,
-                                           bool rhs_row_major) {
   auto lhs_handle =
-      client_
-          ->TransferToServer(*Literal::CreateR2WithLayout<Element>(
-              {{1.0, 2.0}, {3.0, -4.0}},
-              LayoutUtil::MakeLayout(MinorToMajorForIsRowMajor(lhs_row_major))))
-          .ConsumeValueOrDie();
-  auto rhs_handle =
-      client_
-          ->TransferToServer(*Literal::CreateR2WithLayout<Element>(
-              {{1.0, 6.0}, {7.0, -4.0}},
-              LayoutUtil::MakeLayout(MinorToMajorForIsRowMajor(rhs_row_major))))
+      this->client_
+          ->TransferToServer(*Literal::CreateR2FromArray2D<T>(
+              {{1.0f, 2.0f, 3.0f, 4.0f}, {-1.0f, -2.0f, -3.0f, -4.0f}}))
           .ConsumeValueOrDie();
+  auto rhs_handle = this->client_
+                        ->TransferToServer(*Literal::CreateR2FromArray2D<T>(
+                            {{1.0f}, {2.0f}, {3.0f}, {4.0f}}))
+                        .ConsumeValueOrDie();
 
-  ComputationBuilder builder(client_, TestName());
-  auto prim_type = primitive_util::NativeToPrimitiveType<Element>();
-  auto result = builder.Dot(
-      builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 2}), "lhs"),
-      builder.Parameter(1, ShapeUtil::MakeShape(prim_type, {2, 2}), "rhs"));
+  if (std::is_same<Eigen::half, T>::value) {
+    this->error_spec_ = ErrorSpec{0.0001, 1e-3};
+  }
 
-  Array2D<Element> expected({{15.0, -2.0}, {-25.0, 34.0}});
-  ComputeAndCompareR2<Element>(
-      &builder, expected, {lhs_handle.get(), rhs_handle.get()}, error_spec_);
+  this->template ComputeAndCompareR2<T>(
+      &builder, Array2D<T>({{296.14560492846033f}, {0.8611737683031964f}}),
+      {lhs_handle.get(), rhs_handle.get()}, this->error_spec_);
 }
 
+template <typename T>
+class SquareMatrixDot : public DotOperationTest {
+ public:
+  void TestImpl(bool lhs_row_major, bool rhs_row_major) {
+    auto lhs_handle =
+        client_
+            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+                {{1.0f, 2.0f}, {3.0f, -4.0f}},
+                LayoutUtil::MakeLayout(
+                    MinorToMajorForIsRowMajor(lhs_row_major))))
+            .ConsumeValueOrDie();
+    auto rhs_handle =
+        client_
+            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+                {{1.0f, 6.0f}, {7.0f, -4.0f}},
+                LayoutUtil::MakeLayout(
+                    MinorToMajorForIsRowMajor(rhs_row_major))))
+            .ConsumeValueOrDie();
+    ComputationBuilder builder(client_, TestName());
+    auto prim_type = primitive_util::NativeToPrimitiveType<T>();
+    auto result = builder.Dot(
+        builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 2}), "lhs"),
+        builder.Parameter(1, ShapeUtil::MakeShape(prim_type, {2, 2}), "rhs"));
+
+    Array2D<T> expected({{15.0f, -2.0f}, {-25.0f, 34.0f}});
+    ComputeAndCompareR2<T>(&builder, expected,
+                           {lhs_handle.get(), rhs_handle.get()}, error_spec_);
+  }
+};
+
+TYPED_TEST_CASE(SquareMatrixDot, TypesF16F32F64CF64);
+XLA_TYPED_TEST(SquareMatrixDot, TypesFF) { this->TestImpl(false, false); }
+XLA_TYPED_TEST(SquareMatrixDot, TypesFT) { this->TestImpl(false, true); }
+XLA_TYPED_TEST(SquareMatrixDot, TypesTF) { this->TestImpl(true, false); }
+XLA_TYPED_TEST(SquareMatrixDot, TypesTT) { this->TestImpl(true, true); }
+
 struct DotTestParam {
   int m;
   int k;
@@ -225,33 +250,39 @@ string PrintDotTestParam(
 }
 
 class ParametricDotTest : public DotOperationTest,
-                          public ::testing::WithParamInterface<DotTestParam> {};
+                          public ::testing::WithParamInterface<DotTestParam> {
+ protected:
+  template <typename NativeT>
+  void TestImpl();
+};
 
-XLA_TEST_P(ParametricDotTest, TestF32) {
+template <typename NativeT>
+void ParametricDotTest::TestImpl() {
   DotTestParam param = GetParam();
 
-  std::unique_ptr<Array2D<float>> dot_lhs_data =
-      MakeLinspaceArray2D(0.0, 1.0, param.m, param.k);
+  std::unique_ptr<Array2D<NativeT>> dot_lhs_data =
+      MakeLinspaceArray2D<NativeT>(0.0, 1.0, param.m, param.k);
   std::unique_ptr<Literal> dot_lhs_lit = Literal::CreateR2FromArray2DWithLayout(
       *dot_lhs_data, LayoutUtil::MakeLayout(
                          MinorToMajorForIsRowMajor(param.dot_lhs_row_major)));
   std::unique_ptr<GlobalData> dot_lhs_handle =
       client_->TransferToServer(*dot_lhs_lit).ConsumeValueOrDie();
 
-  std::unique_ptr<Array2D<float>> dot_rhs_data =
-      MakeLinspaceArray2D(0.0, 1.0, param.k, param.n);
-  std::unique_ptr<Literal> dot_rhs_lit = Literal::CreateR2FromArray2DWithLayout(
-      *dot_rhs_data, LayoutUtil::MakeLayout(
-                         MinorToMajorForIsRowMajor(param.dot_rhs_row_major)));
+  std::unique_ptr<Array2D<NativeT>> dot_rhs_data =
+      MakeLinspaceArray2D<NativeT>(0.0, 1.0, param.k, param.n);
+  Layout rhs_layout = LayoutUtil::MakeLayout(
+      MinorToMajorForIsRowMajor(param.dot_rhs_row_major));
+  std::unique_ptr<Literal> dot_rhs_lit =
+      Literal::CreateR2FromArray2DWithLayout(*dot_rhs_data, rhs_layout);
   std::unique_ptr<GlobalData> dot_rhs_handle =
       client_->TransferToServer(*dot_rhs_lit).ConsumeValueOrDie();
 
-  std::unique_ptr<Array2D<float>> addend_data;
+  std::unique_ptr<Array2D<NativeT>> addend_data;
   std::unique_ptr<Literal> addend_lit;
   std::unique_ptr<GlobalData> addend_handle;
 
   if (param.has_addend) {
-    addend_data = MakeLinspaceArray2D(0.0, 1.0, param.m, param.n);
+    addend_data = MakeLinspaceArray2D<NativeT>(0.0, 1.0, param.m, param.n);
     addend_lit = Literal::CreateR2FromArray2DWithLayout(
         *addend_data, LayoutUtil::MakeLayout(
                           MinorToMajorForIsRowMajor(param.addend_row_major)));
@@ -259,24 +290,33 @@ XLA_TEST_P(ParametricDotTest, TestF32) {
   }
 
   ComputationBuilder builder(client_, TestName());
-  auto prim_type = primitive_util::NativeToPrimitiveType<float>();
+  auto prim_type = primitive_util::NativeToPrimitiveType<NativeT>();
   auto result = builder.Dot(
-      builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {param.m, param.k}),
+      builder.Parameter(0,
+                        ShapeUtil::MakeShapeWithLayout(
+                            prim_type, {param.m, param.k},
+                            MinorToMajorForIsRowMajor(param.dot_lhs_row_major)),
                         "dot_lhs"),
-      builder.Parameter(1, ShapeUtil::MakeShape(prim_type, {param.k, param.n}),
+      builder.Parameter(1,
+                        ShapeUtil::MakeShapeWithLayout(
+                            prim_type, {param.k, param.n},
+                            MinorToMajorForIsRowMajor(param.dot_rhs_row_major)),
                         "dot_rhs"));
 
   if (param.has_addend) {
     result = builder.Add(
-        result,
-        builder.Parameter(
-            2, ShapeUtil::MakeShape(prim_type, {param.m, param.n}), "addend"));
+        result, builder.Parameter(
+                    2,
+                    ShapeUtil::MakeShapeWithLayout(
+                        prim_type, {param.m, param.n},
+                        MinorToMajorForIsRowMajor(param.addend_row_major)),
+                    "addend"));
   }
 
-  std::unique_ptr<Array2D<float>> expected;
+  std::unique_ptr<Array2D<NativeT>> expected;
   if (param.has_addend) {
     expected = ReferenceUtil::ApplyElementwise2D(
-        std::plus<float>(),
+        std::plus<NativeT>(),
         *ReferenceUtil::MatmulArray2D(*dot_lhs_data, *dot_rhs_data),
         *addend_data);
   } else {
@@ -287,8 +327,11 @@ XLA_TEST_P(ParametricDotTest, TestF32) {
   if (param.has_addend) {
     args.push_back(addend_handle.get());
   }
-
-  ComputeAndCompareR2<float>(&builder, *expected, args, ErrorSpec(0.3, 3e-3));
+  ErrorSpec error_spec(0.3, 3e-3);
+  if (std::is_same<Eigen::half, NativeT>::value) {
+    error_spec = ErrorSpec(0.3, 5e-3);
+  }
+  ComputeAndCompareR2<NativeT>(&builder, *expected, args, error_spec);
 }
 
 std::vector<DotTestParam> CreateDotTestParameters() {
@@ -305,30 +348,77 @@ std::vector<DotTestParam> CreateDotTestParameters() {
     }
   };
 
+  add_matrix_matrix_dot_test(/*m=*/12, /*k=*/117, /*n=*/7);
+  add_matrix_matrix_dot_test(/*m=*/270, /*k=*/270, /*n=*/520);
+  add_matrix_matrix_dot_test(/*m=*/260, /*k=*/3, /*n=*/520);
+
+  return params;
+}
+
+#ifndef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
+XLA_TEST_P(ParametricDotTest, TestF16) { TestImpl<Eigen::half>(); }
+#endif
+XLA_TEST_P(ParametricDotTest, TestF32) { TestImpl<float>(); }
+XLA_TEST_P(ParametricDotTest, TestF64) { TestImpl<double>(); }
+
+INSTANTIATE_TEST_CASE_P(DotTests, ParametricDotTest,
+                        ::testing::ValuesIn(CreateDotTestParameters()),
+                        PrintDotTestParam);
+
+class ParametricDotTestWithoutLayoutAssignment : public ParametricDotTest {
+ public:
+  ParametricDotTestWithoutLayoutAssignment() {
+    execution_options_.mutable_debug_options()->add_xla_disable_hlo_passes(
+        "layout-assignment");
+  }
+};
+
+std::vector<DotTestParam> CreateNoLayoutAssignmentDotTestParameters() {
+  std::vector<DotTestParam> params;
+
   auto add_matrix_vector_dot_test = [&](int k, int n) {
-    for (bool has_addend : {false, true}) {
-      params.push_back({/*m=*/1, /*k=*/k, /*n=*/n,
-                        /*dot_lhs_row_major=*/true, /*dot_rhs_row_major=*/true,
-                        /*has_addend=*/has_addend, /*addend_row_major=*/true});
-      if (n != 1) {
-        params.push_back(
-            {/*m=*/n, /*k=*/k, /*n=*/1,
-             /*dot_lhs_row_major=*/true, /*dot_rhs_row_major=*/true,
-             /*has_addend=*/has_addend, /*addend_row_major=*/true});
+    for (bool lhs_row_major : {true, false}) {
+      for (bool rhs_row_major : {true, false}) {
+        for (bool has_addend : {true, false}) {
+          params.push_back({/*m=*/1, /*k=*/k, /*n=*/n,
+                            /*dot_lhs_row_major=*/lhs_row_major,
+                            /*dot_rhs_row_major=*/rhs_row_major,
+                            /*has_addend=*/has_addend,
+                            /*addend_row_major=*/true});
+          if (has_addend) {
+            params.push_back({/*m=*/1, /*k=*/k, /*n=*/n,
+                              /*dot_lhs_row_major=*/lhs_row_major,
+                              /*dot_rhs_row_major=*/rhs_row_major,
+                              /*has_addend=*/has_addend,
+                              /*addend_row_major=*/false});
+          }
+          if (n != 1) {
+            params.push_back({/*m=*/n, /*k=*/k, /*n=*/1,
+                              /*dot_lhs_row_major=*/lhs_row_major,
+                              /*dot_rhs_row_major=*/rhs_row_major,
+                              /*has_addend=*/has_addend,
+                              /*addend_row_major=*/true});
+            if (has_addend) {
+              params.push_back({/*m=*/n, /*k=*/k, /*n=*/1,
+                                /*dot_lhs_row_major=*/lhs_row_major,
+                                /*dot_rhs_row_major=*/rhs_row_major,
+                                /*has_addend=*/has_addend,
+                                /*addend_row_major=*/false});
+            }
+          }
+        }
       }
     }
   };
 
-  add_matrix_matrix_dot_test(/*m=*/12, /*k=*/117, /*n=*/7);
-  add_matrix_matrix_dot_test(/*m=*/270, /*k=*/270, /*n=*/520);
-  add_matrix_matrix_dot_test(/*m=*/260, /*k=*/3, /*n=*/520);
-
   add_matrix_vector_dot_test(/*k=*/8, /*n=*/8);
   add_matrix_vector_dot_test(/*k=*/130, /*n=*/8);
   add_matrix_vector_dot_test(/*k=*/8, /*n=*/130);
   add_matrix_vector_dot_test(/*k=*/290, /*n=*/130);
   add_matrix_vector_dot_test(/*k=*/1, /*n=*/1);
   add_matrix_vector_dot_test(/*k=*/1, /*n=*/16);
+  add_matrix_vector_dot_test(/*k=*/1, /*n=*/4);
+  add_matrix_vector_dot_test(/*k=*/1, /*n=*/3);
   add_matrix_vector_dot_test(/*k=*/3, /*n=*/16);
   add_matrix_vector_dot_test(/*k=*/3, /*n=*/3);
   add_matrix_vector_dot_test(/*k=*/29, /*n=*/29);
@@ -339,109 +429,60 @@ std::vector<DotTestParam> CreateDotTestParameters() {
   return params;
 }
 
-INSTANTIATE_TEST_CASE_P(DotTests, ParametricDotTest,
-                        ::testing::ValuesIn(CreateDotTestParameters()),
-                        PrintDotTestParam);
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotF32MinorToMajorFF) {
-  TestSquareMatrixDot<float>(false, false);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotF32MinorToMajorFT) {
-  TestSquareMatrixDot<float>(false, true);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotF32MinorToMajorTF) {
-  TestSquareMatrixDot<float>(true, false);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotF32MinorToMajorTT) {
-  TestSquareMatrixDot<float>(true, true);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotC64MinorToMajorFF) {
-  TestSquareMatrixDot<complex64>(false, false);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotC64MinorToMajorFT) {
-  TestSquareMatrixDot<complex64>(false, true);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotC64MinorToMajorTF) {
-  TestSquareMatrixDot<complex64>(true, false);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotC64MinorToMajorTT) {
-  TestSquareMatrixDot<complex64>(true, true);
-}
-
-XLA_TEST_F(DotOperationTest, SquareMatrixDotF64) {
-  TestSquareMatrixDot<double>();
-}
-
-template <typename Element>
-void DotOperationTest::TestNonsquareMatrixDot(bool lhs_row_major,
-                                              bool rhs_row_major) {
-  auto lhs_handle =
-      client_
-          ->TransferToServer(*Literal::CreateR2WithLayout<Element>(
-              {{1.0, 2.0, 3.0}, {3.0, -4.0, -1.0}},
-              LayoutUtil::MakeLayout(MinorToMajorForIsRowMajor(lhs_row_major))))
-          .ConsumeValueOrDie();
-  auto rhs_handle =
-      client_
-          ->TransferToServer(*Literal::CreateR2WithLayout<Element>(
-              {{1.0, 6.0}, {2.0, 3.0}, {7.0, -4.0}},
-              LayoutUtil::MakeLayout(MinorToMajorForIsRowMajor(rhs_row_major))))
-          .ConsumeValueOrDie();
-
-  ComputationBuilder builder(client_, TestName());
-  auto prim_type = primitive_util::NativeToPrimitiveType<Element>();
-  auto result = builder.Dot(
-      builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 3}), "lhs"),
-      builder.Parameter(1, ShapeUtil::MakeShape(prim_type, {3, 2}), "rhs"));
-
-  Array2D<Element> expected({{26.0, 0.0}, {-12.0, 10.0}});
-
-  ComputeAndCompareR2<Element>(
-      &builder, expected, {lhs_handle.get(), rhs_handle.get()}, error_spec_);
-}
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotF32MajorToMinorFF) {
-  TestNonsquareMatrixDot<float>(false, false);
+#ifndef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
+XLA_TEST_P(ParametricDotTestWithoutLayoutAssignment, TestF16) {
+  TestImpl<Eigen::half>();
 }
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotF32MajorToMinorFT) {
-  TestNonsquareMatrixDot<float>(false, true);
+#endif
+XLA_TEST_P(ParametricDotTestWithoutLayoutAssignment, TestF32) {
+  TestImpl<float>();
 }
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotF32MajorToMinorTF) {
-  TestNonsquareMatrixDot<float>(true, false);
-}
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotF32MajorToMinorTT) {
-  TestNonsquareMatrixDot<float>(true, true);
-}
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotF64) {
-  TestNonsquareMatrixDot<double>();
-}
-
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotC64MajorToMinorFF) {
-  TestNonsquareMatrixDot<complex64>(false, false);
+XLA_TEST_P(ParametricDotTestWithoutLayoutAssignment, TestF64) {
+  TestImpl<double>();
 }
 
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotC64MajorToMinorFT) {
-  TestNonsquareMatrixDot<complex64>(false, true);
-}
+INSTANTIATE_TEST_CASE_P(
+    DotTests, ParametricDotTestWithoutLayoutAssignment,
+    ::testing::ValuesIn(CreateNoLayoutAssignmentDotTestParameters()),
+    PrintDotTestParam);
 
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotC64MajorToMinorTF) {
-  TestNonsquareMatrixDot<complex64>(true, false);
-}
+template <typename T>
+class NonsquareMatrixDot : public DotOperationTest {
+ public:
+  void TestImpl(bool lhs_row_major, bool rhs_row_major) {
+    auto lhs_handle =
+        client_
+            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+                {{1.0f, 2.0f, 3.0f}, {3.0f, -4.0f, -1.0f}},
+                LayoutUtil::MakeLayout(
+                    MinorToMajorForIsRowMajor(lhs_row_major))))
+            .ConsumeValueOrDie();
+    auto rhs_handle =
+        client_
+            ->TransferToServer(*Literal::CreateFromArrayWithLayout<T>(
+                {{1.0f, 6.0f}, {2.0f, 3.0f}, {7.0f, -4.0f}},
+                LayoutUtil::MakeLayout(
+                    MinorToMajorForIsRowMajor(rhs_row_major))))
+            .ConsumeValueOrDie();
+
+    ComputationBuilder builder(client_, TestName());
+    auto prim_type = primitive_util::NativeToPrimitiveType<T>();
+    auto result = builder.Dot(
+        builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 3}), "lhs"),
+        builder.Parameter(1, ShapeUtil::MakeShape(prim_type, {3, 2}), "rhs"));
+
+    Array2D<T> expected({{26.0f, 0.0f}, {-12.0f, 10.0f}});
+
+    ComputeAndCompareR2<T>(&builder, expected,
+                           {lhs_handle.get(), rhs_handle.get()}, error_spec_);
+  }
+};
 
-XLA_TEST_F(DotOperationTest, NonsquareMatrixDotC64MajorToMinorTT) {
-  TestNonsquareMatrixDot<complex64>(true, true);
-}
+TYPED_TEST_CASE(NonsquareMatrixDot, TypesF16F32F64CF64);
+XLA_TYPED_TEST(NonsquareMatrixDot, TestFF) { this->TestImpl(false, false); }
+XLA_TYPED_TEST(NonsquareMatrixDot, TestFT) { this->TestImpl(false, true); }
+XLA_TYPED_TEST(NonsquareMatrixDot, TestTF) { this->TestImpl(true, false); }
+XLA_TYPED_TEST(NonsquareMatrixDot, TestTT) { this->TestImpl(true, true); }
 
 XLA_TEST_F(DotOperationTest, MatrixVectorC64) {
   auto lhs_handle =
@@ -468,25 +509,35 @@ XLA_TEST_F(DotOperationTest, MatrixVectorC64) {
       &builder, expected, {lhs_handle.get(), rhs_handle.get()}, error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, ConcurrentMatMul) {
-  ComputationBuilder builder(client_, TestName());
-  auto matrix1 = builder.ConstantR2<float>({{1.0, 2.0}, {3.0, 4.0}});
-  auto matrix2 = builder.ConstantR2<float>({{5.0, 6.0}, {7.0, 8.0}});
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, ConcurrentMatMult) {
+  using T = TypeParam;
+
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto matrix1 = builder.ConstantR2FromArray2D<T>({{1.0f, 2.0f}, {3.0f, 4.0f}});
+  auto matrix2 = builder.ConstantR2FromArray2D<T>({{5.0f, 6.0f}, {7.0f, 8.0f}});
   auto matrix12 = builder.Dot(matrix1, matrix2);
   auto matrix21 = builder.Dot(matrix2, matrix1);
   builder.Add(matrix12, matrix21);
 
-  Array2D<float> expected({{42.0, 56.0}, {74.0, 96.0}});
-  ComputeAndCompareR2<float>(&builder, expected, {}, error_spec_);
+  Array2D<T> expected({{42.0f, 56.0f}, {74.0f, 96.0f}});
+  this->template ComputeAndCompareR2<T>(&builder, expected, {},
+                                        this->error_spec_);
 }
 
+template <typename T>
+class DotOperationTestForBatchMatMul : public DotOperationTest {};
+TYPED_TEST_CASE(DotOperationTestForBatchMatMul, TypesF16F32F64);
+
 // Regression test for b/32055648. The root of the graph is a kFusion of 4
 // bitcasts. Although bitcasts don't map to thunks, the root should still be
 // sync-dependent on bitcasts' operands.
-XLA_TEST_F(DotOperationTest, BatchMatMul) {
-  ComputationBuilder builder(client_, TestName());
-  auto x = builder.Parameter(0, ShapeUtil::MakeShape(F32, {2, 2, 2, 2}), "x");
-  auto y = builder.Parameter(1, ShapeUtil::MakeShape(F32, {2, 2, 2, 2}), "y");
+XLA_TYPED_TEST(DotOperationTestForBatchMatMul, Types) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto x =
+      builder.Parameter(0, ShapeUtil::MakeShapeWithType<T>({2, 2, 2, 2}), "x");
+  auto y =
+      builder.Parameter(1, ShapeUtil::MakeShapeWithType<T>({2, 2, 2, 2}), "y");
 
   auto x_flat = builder.Reshape(x, {0, 1, 2, 3}, {4, 2, 2});
   auto y_flat = builder.Reshape(y, {0, 1, 2, 3}, {4, 2, 2});
@@ -507,29 +558,42 @@ XLA_TEST_F(DotOperationTest, BatchMatMul) {
   auto out_flat = builder.ConcatInDim(out_slices, 0);
   builder.Reshape(out_flat, {0, 1, 2}, {2, 2, 2, 2});
 
-  auto x_data = client_
-                    ->TransferToServer(*Literal::CreateR4<float>(
-                        {{{{1000, 100}, {10, 1}}, {{2000, 200}, {20, 2}}},
-                         {{{3000, 300}, {30, 3}}, {{4000, 400}, {40, 4}}}}))
-                    .ConsumeValueOrDie();
-  auto y_data = client_
-                    ->TransferToServer(*Literal::CreateR4<float>(
-                        {{{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}},
-                         {{{11, 22}, {33, 44}}, {{55, 66}, {77, 88}}}}))
+  auto x_data = this->client_
+                    ->TransferToServer(*Literal::CreateR4FromArray4D<T>(
+                        {{{{1000.0f, 100.0f}, {10.0f, 1.0f}},
+                          {{2000.0f, 200.0f}, {20.0f, 2.0f}}},
+                         {{{3000.0f, 300.0f}, {30.0f, 3.0f}},
+                          {{4000.0f, 400.0f}, {40.0f, 4.0f}}}}))
                     .ConsumeValueOrDie();
+  auto y_data =
+      this->client_
+          ->TransferToServer(*Literal::CreateR4FromArray4D<T>(
+              {{{{1.0f, 2.0f}, {3.0f, 4.0f}}, {{5.0f, 6.0f}, {7.0f, 8.0f}}},
+               {{{11.0f, 22.0f}, {33.0f, 44.0f}},
+                {{55.0f, 66.0f}, {77.0f, 88.0f}}}}))
+          .ConsumeValueOrDie();
 
-  ComputeAndCompareR4<float>(
+  if (std::is_same<Eigen::half, T>::value) {
+    this->error_spec_ = ErrorSpec{0.0001, 1e-3};
+  }
+  this->template ComputeAndCompareR4<T>(
       &builder,
       /*expected=*/
-      {{{{1300, 2400}, {13, 24}}, {{11400, 13600}, {114, 136}}},
-       {{{42900, 79200}, {429, 792}}, {{250800, 299200}, {2508, 2992}}}},
-      {x_data.get(), y_data.get()}, error_spec_);
+      {{{{1300.0f, 2400.0f}, {13.0f, 24.0f}},
+        {{11400.0f, 13600.0f}, {114.0f, 136.0f}}},
+       {{{42900.0f, 79200.0f}, {429.0f, 792.0f}},
+        {{250800.0f, 299200.0f}, {2508.0f, 2992.0f}}}},
+      {x_data.get(), y_data.get()}, this->error_spec_);
 }
 
-XLA_TEST_F(DotOperationTest, GeneralMatMul) {
-  ComputationBuilder builder(client_, TestName());
-  auto x = builder.Parameter(0, ShapeUtil::MakeShape(F32, {2, 2, 2}), "x");
-  auto y = builder.Parameter(1, ShapeUtil::MakeShape(F32, {2, 2, 2}), "y");
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, GeneralMatMul) {
+  using T = TypeParam;
+
+  ComputationBuilder builder(this->client_, this->TestName());
+  auto x =
+      builder.Parameter(0, ShapeUtil::MakeShapeWithType<T>({2, 2, 2}), "x");
+  auto y =
+      builder.Parameter(1, ShapeUtil::MakeShapeWithType<T>({2, 2, 2}), "y");
 
   DotDimensionNumbers dnums;
   dnums.add_lhs_contracting_dimensions(2);
@@ -539,31 +603,34 @@ XLA_TEST_F(DotOperationTest, GeneralMatMul) {
 
   auto out = builder.DotGeneral(x, y, dnums);
 
-  auto x_data = client_
-                    ->TransferToServer(*Literal::CreateR3<float>(
-                        {{{1.0, 2.0}, {3.0, 4.0}}, {{5.0, 6.0}, {7.0, 8.0}}}))
-                    .ConsumeValueOrDie();
+  auto x_data =
+      this->client_
+          ->TransferToServer(*Literal::CreateR3FromArray3D<T>(
+              {{{1.0f, 2.0f}, {3.0f, 4.0f}}, {{5.0f, 6.0f}, {7.0f, 8.0f}}}))
+          .ConsumeValueOrDie();
 
-  auto y_data = client_
-                    ->TransferToServer(*Literal::CreateR3<float>(
-                        {{{1.0, 0.0}, {0.0, 1.0}}, {{1.0, 0.0}, {0.0, 1.0}}}))
-                    .ConsumeValueOrDie();
+  auto y_data =
+      this->client_
+          ->TransferToServer(*Literal::CreateR3FromArray3D<T>(
+              {{{1.0f, 0.0f}, {0.0f, 1.0f}}, {{1.0f, 0.0f}, {0.0f, 1.0f}}}))
+          .ConsumeValueOrDie();
 
-  ComputeAndCompareR3<float>(
+  this->template ComputeAndCompareR3<T>(
       &builder,
       /*expected=*/
-      {{{1.0, 2.0}, {3.0, 4.0}}, {{5.0, 6.0}, {7.0, 8.0}}},
-      {x_data.get(), y_data.get()}, error_spec_);
+      {{{1.0f, 2.0f}, {3.0f, 4.0f}}, {{5.0f, 6.0f}, {7.0f, 8.0f}}},
+      {x_data.get(), y_data.get()}, this->error_spec_);
 }
 
-TEST_F(DotOperationTest, TransposeFolding) {
+XLA_TYPED_TEST(DotOperationTest_F16F32F64, TransposeFolding) {
+  using T = TypeParam;
   for (bool transpose_lhs : {false, true}) {
     for (bool transpose_rhs : {false, true}) {
       for (bool row_major : {false, true}) {
-        std::unique_ptr<Array2D<float>> lhs(
-            new Array2D<float>({{1.0, 2.0, 3.0}, {3.0, -4.0, -1.0}}));
-        std::unique_ptr<Array2D<float>> rhs(
-            new Array2D<float>({{1.0, 6.0}, {2.0, 3.0}, {7.0, -4.0}}));
+        std::unique_ptr<Array2D<T>> lhs(
+            new Array2D<T>({{1.0f, 2.0f, 3.0f}, {3.0f, -4.0f, -1.0f}}));
+        std::unique_ptr<Array2D<T>> rhs(
+            new Array2D<T>({{1.0f, 6.0f}, {2.0f, 3.0f}, {7.0f, -4.0f}}));
 
         if (transpose_lhs) {
           lhs = ReferenceUtil::TransposeArray2D(*lhs);
@@ -572,22 +639,20 @@ TEST_F(DotOperationTest, TransposeFolding) {
           rhs = ReferenceUtil::TransposeArray2D(*rhs);
         }
         auto lhs_handle =
-            client_
-                ->TransferToServer(
-                    *Literal::CreateR2FromArray2DWithLayout<float>(
-                        *lhs, LayoutUtil::MakeLayout(
-                                  MinorToMajorForIsRowMajor(row_major))))
+            this->client_
+                ->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
+                    *lhs, LayoutUtil::MakeLayout(
+                              MinorToMajorForIsRowMajor(row_major))))
                 .ConsumeValueOrDie();
         auto rhs_handle =
-            client_
-                ->TransferToServer(
-                    *Literal::CreateR2FromArray2DWithLayout<float>(
-                        *rhs, LayoutUtil::MakeLayout(
-                                  MinorToMajorForIsRowMajor(row_major))))
+            this->client_
+                ->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
+                    *rhs, LayoutUtil::MakeLayout(
+                              MinorToMajorForIsRowMajor(row_major))))
                 .ConsumeValueOrDie();
 
-        ComputationBuilder builder(client_, TestName());
-        auto prim_type = primitive_util::NativeToPrimitiveType<float>();
+        ComputationBuilder builder(this->client_, this->TestName());
+        auto prim_type = primitive_util::NativeToPrimitiveType<T>();
         auto lhs_arg = builder.Parameter(
             0, ShapeUtil::MakeShape(prim_type, {lhs->height(), lhs->width()}),
             "lhs");
@@ -602,24 +667,27 @@ TEST_F(DotOperationTest, TransposeFolding) {
         }
         auto result = builder.Dot(lhs_arg, rhs_arg);
 
-        Array2D<float> expected({{26.0, 0.0}, {-12.0, 10.0}});
+        Array2D<T> expected({{26.0f, 0.0f}, {-12.0f, 10.0f}});
         VLOG(1) << "TestTransposeFolding " << transpose_lhs << " "
                 << transpose_rhs << " " << row_major;
-        ComputeAndCompareR2<float>(&builder, expected,
-                                   {lhs_handle.get(), rhs_handle.get()},
-                                   error_spec_);
+        this->template ComputeAndCompareR2<T>(
+            &builder, expected, {lhs_handle.get(), rhs_handle.get()},
+            this->error_spec_);
       }
     }
   }
 }
 
-TEST_F(DotOperationTest, DotOfConcatOptimizationWithConstLHS) {
-  auto prim_type = primitive_util::NativeToPrimitiveType<float>();
+XLA_TYPED_TEST(DotOperationTest_F16F32F64,
+               DotOfConcatOptimizationWithConstLHS) {
+  using T = TypeParam;
+  auto prim_type = primitive_util::NativeToPrimitiveType<T>();
 
-  std::unique_ptr<Array2D<float>> constant_lhs_array(new Array2D<float>(
-      {{1.0, 2.0, 3.0, 4.0, 5.0, 6.0}, {6.0, 5.0, 4.0, 3.0, 2.0, 1.0}}));
+  std::unique_ptr<Array2D<T>> constant_lhs_array(
+      new Array2D<T>({{1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f},
+                      {6.0f, 5.0f, 4.0f, 3.0f, 2.0f, 1.0f}}));
 
-  ComputationBuilder builder(client_, TestName());
+  ComputationBuilder builder(this->client_, this->TestName());
   auto lhs_constant = builder.ConstantR2FromArray2D(*constant_lhs_array);
   auto rhs_arg_0 = builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 2}),
                                      "rhs_arg_0");
@@ -630,78 +698,80 @@ TEST_F(DotOperationTest, DotOfConcatOptimizationWithConstLHS) {
   auto result = builder.Dot(
       lhs_constant, builder.ConcatInDim({rhs_arg_0, rhs_arg_1, rhs_arg_2}, 0));
 
-  std::unique_ptr<Array2D<float>> arg_0_value_array(
-      new Array2D<float>({{1.0, 2.0}, {3.0, 4.0}}));
-  std::unique_ptr<Array2D<float>> arg_1_value_array(
-      new Array2D<float>({{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}}));
-  std::unique_ptr<Array2D<float>> arg_2_value_array(
-      new Array2D<float>({{1.0, 2.0}}));
+  std::unique_ptr<Array2D<T>> arg_0_value_array(
+      new Array2D<T>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
+  std::unique_ptr<Array2D<T>> arg_1_value_array(
+      new Array2D<T>({{1.0f, 2.0f}, {3.0f, 4.0f}, {5.0f, 6.0f}}));
+  std::unique_ptr<Array2D<T>> arg_2_value_array(new Array2D<T>({{1.0f, 2.0f}}));
 
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_0_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_0_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_0_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_1_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_1_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_1_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_2_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_2_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_2_value_array)));
 
-  Array2D<float> expected({{53.0, 74.0}, {45.0, 66.0}});
-  ComputeAndCompareR2<float>(
+  Array2D<T> expected({{53.0f, 74.0f}, {45.0f, 66.0f}});
+  this->template ComputeAndCompareR2<T>(
       &builder, expected,
-      {arg_0_value.get(), arg_1_value.get(), arg_2_value.get()}, error_spec_);
-}
-
-TEST_F(DotOperationTest, DotOfConcatOptimizationWithConstRHS) {
-  auto prim_type = primitive_util::NativeToPrimitiveType<float>();
-
-  std::unique_ptr<Array2D<float>> constant_rhs_array(
-      new Array2D<float>({{1.0, 2.0},
-                          {3.0, 4.0},
-                          {5.0, 6.0},
-                          {6.0, 5.0},
-                          {4.0, 3.0},
-                          {2.0, 1.0}}));
-
-  ComputationBuilder builder(client_, TestName());
+      {arg_0_value.get(), arg_1_value.get(), arg_2_value.get()},
+      this->error_spec_);
+}
+
+XLA_TYPED_TEST(DotOperationTest_F16F32F64,
+               DotOfConcatOptimizationWithConstRHS) {
+  using T = TypeParam;
+  std::unique_ptr<Array2D<T>> constant_rhs_array(
+      new Array2D<T>({{1.0f, 2.0f},
+                      {3.0f, 4.0f},
+                      {5.0f, 6.0f},
+                      {6.0f, 5.0f},
+                      {4.0f, 3.0f},
+                      {2.0f, 1.0f}}));
+
+  ComputationBuilder builder(this->client_, this->TestName());
   auto rhs_constant = builder.ConstantR2FromArray2D(*constant_rhs_array);
-  auto lhs_arg_0 = builder.Parameter(0, ShapeUtil::MakeShape(prim_type, {2, 2}),
+  auto lhs_arg_0 = builder.Parameter(0, ShapeUtil::MakeShapeWithType<T>({2, 2}),
                                      "lhs_arg_0");
-  auto lhs_arg_1 = builder.Parameter(1, ShapeUtil::MakeShape(prim_type, {2, 3}),
+  auto lhs_arg_1 = builder.Parameter(1, ShapeUtil::MakeShapeWithType<T>({2, 3}),
                                      "lhs_arg_1");
-  auto lhs_arg_2 = builder.Parameter(2, ShapeUtil::MakeShape(prim_type, {2, 1}),
+  auto lhs_arg_2 = builder.Parameter(2, ShapeUtil::MakeShapeWithType<T>({2, 1}),
                                      "lhs_arg_2");
   auto result = builder.Dot(
       builder.ConcatInDim({lhs_arg_0, lhs_arg_1, lhs_arg_2}, 1), rhs_constant);
 
-  std::unique_ptr<Array2D<float>> arg_0_value_array(
-      new Array2D<float>({{1.0, 2.0}, {3.0, 4.0}}));
-  std::unique_ptr<Array2D<float>> arg_1_value_array(
-      new Array2D<float>({{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}}));
-  std::unique_ptr<Array2D<float>> arg_2_value_array(
-      new Array2D<float>({{1.0}, {2.0}}));
+  std::unique_ptr<Array2D<T>> arg_0_value_array(
+      new Array2D<T>({{1.0f, 2.0f}, {3.0f, 4.0f}}));
+  std::unique_ptr<Array2D<T>> arg_1_value_array(
+      new Array2D<T>({{1.0f, 2.0f, 3.0f}, {4.0f, 5.0f, 6.0f}}));
+  std::unique_ptr<Array2D<T>> arg_2_value_array(
+      new Array2D<T>({{1.0f}, {2.0f}}));
 
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_0_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_0_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_0_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_1_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_1_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_1_value_array)));
   TF_ASSERT_OK_AND_ASSIGN(
       auto arg_2_value,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2D<float>(*arg_2_value_array)));
+      this->client_->TransferToServer(
+          *Literal::CreateR2FromArray2D<T>(*arg_2_value_array)));
 
-  Array2D<float> expected({{38.0, 36.0}, {93.0, 91.0}});
-  ComputeAndCompareR2<float>(
+  Array2D<T> expected({{38.0f, 36.0f}, {93.0f, 91.0f}});
+  this->template ComputeAndCompareR2<T>(
       &builder, expected,
-      {arg_0_value.get(), arg_1_value.get(), arg_2_value.get()}, error_spec_);
+      {arg_0_value.get(), arg_1_value.get(), arg_2_value.get()},
+      this->error_spec_);
 }
+
 }  // namespace
 }  // namespace xla
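Review note: the expected matrices in the two `DotOfConcatOptimization` tests above can be checked by hand. The sketch below is a plain-C++ stand-in that does not use XLA; `MatMul`, `ConcatRows`, `ConcatCols`, `ConstLhsCase` and `ConstRhsCase` are hypothetical helper names introduced only for this illustration. It concatenates the same argument values along the same dimensions and multiplies by the same constants, under the assumption that `ConcatInDim(..., 0)` stacks rows and `ConcatInDim(..., 1)` appends columns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Naive matrix product; assumes lhs column count equals rhs row count.
Matrix MatMul(const Matrix& lhs, const Matrix& rhs) {
  Matrix out(lhs.size(), std::vector<float>(rhs[0].size(), 0.0f));
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    for (std::size_t k = 0; k < rhs.size(); ++k) {
      for (std::size_t j = 0; j < rhs[0].size(); ++j) {
        out[i][j] += lhs[i][k] * rhs[k][j];
      }
    }
  }
  return out;
}

// Concatenation along dimension 0: stack the rows of each part.
Matrix ConcatRows(const std::vector<Matrix>& parts) {
  Matrix out;
  for (const Matrix& p : parts) out.insert(out.end(), p.begin(), p.end());
  return out;
}

// Concatenation along dimension 1: append columns; assumes equal row counts.
Matrix ConcatCols(const std::vector<Matrix>& parts) {
  Matrix out(parts[0].size());
  for (const Matrix& p : parts) {
    for (std::size_t r = 0; r < p.size(); ++r) {
      out[r].insert(out[r].end(), p[r].begin(), p[r].end());
    }
  }
  return out;
}

// Same values as DotOfConcatOptimizationWithConstLHS.
Matrix ConstLhsCase() {
  const Matrix lhs = {{1, 2, 3, 4, 5, 6}, {6, 5, 4, 3, 2, 1}};
  const Matrix rhs = ConcatRows({{{1, 2}, {3, 4}},
                                 {{1, 2}, {3, 4}, {5, 6}},
                                 {{1, 2}}});
  return MatMul(lhs, rhs);
}

// Same values as DotOfConcatOptimizationWithConstRHS.
Matrix ConstRhsCase() {
  const Matrix lhs = ConcatCols({{{1, 2}, {3, 4}},
                                 {{1, 2, 3}, {4, 5, 6}},
                                 {{1}, {2}}});
  const Matrix rhs = {{1, 2}, {3, 4}, {5, 6}, {6, 5}, {4, 3}, {2, 1}};
  return MatMul(lhs, rhs);
}
```

All inputs and results are small integers, so they are exactly representable in `float` and can be compared with `==` in this sketch; the tests themselves compare against `error_spec_`.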
diff --git a/tensorflow/compiler/xla/tests/hlo_test_base.cc b/tensorflow/compiler/xla/tests/hlo_test_base.cc
index 9f5806c5e16c30cf198027cffab5f78c315cb957..6723c99edb945492abfbac159bed1959d551ec57 100644
--- a/tensorflow/compiler/xla/tests/hlo_test_base.cc
+++ b/tensorflow/compiler/xla/tests/hlo_test_base.cc
@@ -267,6 +267,28 @@ template <typename LiteralPtr>
                                   reference_preprocessor);
 }
 
+HloComputation* HloTestBase::FindComputation(HloModule* module,
+                                             tensorflow::StringPiece name) {
+  auto it = c_find_if(module->computations(),
+                      [&](HloComputation* c) { return c->name() == name; });
+  if (it == module->computations().end()) {
+    return nullptr;
+  }
+  return *it;
+}
+
+HloInstruction* HloTestBase::FindInstruction(HloModule* module,
+                                             tensorflow::StringPiece name) {
+  for (const HloComputation* c : module->computations()) {
+    auto it = c_find_if(c->instructions(),
+                        [&](HloInstruction* i) { return i->name() == name; });
+    if (it != c->instructions().end()) {
+      return *it;
+    }
+  }
+  return nullptr;
+}
+
 Backend& HloTestBase::backend() { return test_runner_.backend(); }
 
 /* static */
diff --git a/tensorflow/compiler/xla/tests/hlo_test_base.h b/tensorflow/compiler/xla/tests/hlo_test_base.h
index 4aea9fc9fd027231106e529eb16bcd43f23fbe1c..413bb213fdcb1303f396308d13d9d0b96b47b71f 100644
--- a/tensorflow/compiler/xla/tests/hlo_test_base.h
+++ b/tensorflow/compiler/xla/tests/hlo_test_base.h
@@ -197,6 +197,15 @@ class HloTestBase : public ::testing::Test {
         ->Clear();
   }
 
+  // Gets the computation/instruction from the given module with the given name.
+  // Returns nullptr if the module has no computation/instruction of that name.
+  // This is useful for tests which create HLOs from a string and then want to
+  // inspect a particular computation or instruction.
+  HloComputation* FindComputation(HloModule* module,
+                                  tensorflow::StringPiece name);
+  HloInstruction* FindInstruction(HloModule* module,
+                                  tensorflow::StringPiece name);
+
   // Return an HLO verifier constructed for the test backend.
   HloVerifier& verifier() const { return *hlo_verifier_; }
 
diff --git a/tensorflow/compiler/xla/tests/literal_test_util.cc b/tensorflow/compiler/xla/tests/literal_test_util.cc
index 5aa71a9261dbd414d1499f15c9b83cd63b634b49..81630df34c58526b6d41492b2b4b3892a02a21c2 100644
--- a/tensorflow/compiler/xla/tests/literal_test_util.cc
+++ b/tensorflow/compiler/xla/tests/literal_test_util.cc
@@ -209,6 +209,11 @@ template <>
   return CompareFloatsBitwiseEqual<bfloat16, uint16>(lhs, rhs);
 }
 template <>
+::testing::AssertionResult CompareEqual<Eigen::half>(Eigen::half lhs,
+                                                     Eigen::half rhs) {
+  return CompareFloatsBitwiseEqual<Eigen::half, uint16>(lhs, rhs);
+}
+template <>
 ::testing::AssertionResult CompareEqual<float>(float lhs, float rhs) {
   return CompareFloatsBitwiseEqual<float, uint32>(lhs, rhs);
 }
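Review note: the new `CompareEqual<Eigen::half>` specialization delegates to `CompareFloatsBitwiseEqual`, whose body is not part of this hunk. The sketch below only illustrates what a bitwise comparison generally means for `float`, assuming the helper compares raw bit patterns. `BitwiseEqual` is a hypothetical name, not the XLA function.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical stand-in: compares the raw 32-bit patterns of two floats.
bool BitwiseEqual(float lhs, float rhs) {
  static_assert(sizeof(float) == sizeof(std::uint32_t),
                "float is assumed to be 32 bits wide");
  std::uint32_t lhs_bits;
  std::uint32_t rhs_bits;
  // memcpy is the portable way to reinterpret the object representation.
  std::memcpy(&lhs_bits, &lhs, sizeof(lhs_bits));
  std::memcpy(&rhs_bits, &rhs, sizeof(rhs_bits));
  return lhs_bits == rhs_bits;
}
```

A comparison of this kind differs from `operator==` in two cases: identical NaN bit patterns compare equal, and `0.0f` and `-0.0f` compare unequal.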
diff --git a/tensorflow/compiler/xla/tests/map_test.cc b/tensorflow/compiler/xla/tests/map_test.cc
index 2b0f7e6e80c48435ca55432a2afa3b6d69162625..0cd812fd1b4bc69c34b70d3ca0fd0aa6cf57fa4c 100644
--- a/tensorflow/compiler/xla/tests/map_test.cc
+++ b/tensorflow/compiler/xla/tests/map_test.cc
@@ -531,7 +531,7 @@ TEST_F(MapTest, MapOperantionWithBuildError) {
   ASSERT_TRUE(!computation_status.ok());
   EXPECT_THAT(
       computation_status.status().ToString(),
-      ::testing::HasSubstr("error from: ErrorAdd: binary op BINOP_ADD with "
+      ::testing::HasSubstr("error from: ErrorAdd: Binary op BINOP_ADD with "
                            "different element types: f32[] and u16[]"));
 }
 
diff --git a/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc b/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc
index 6c86dd5b9ef673c9facffafa37e00a859ce82010..c42f71388baba73e08a361d817e41b03e03bf133 100644
--- a/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc
+++ b/tensorflow/compiler/xla/tests/matrix_ops_simple_test.cc
@@ -29,6 +29,8 @@ limitations under the License.
 #include "tensorflow/compiler/xla/test_helpers.h"
 #include "tensorflow/compiler/xla/tests/client_library_test_base.h"
 #include "tensorflow/compiler/xla/tests/literal_test_util.h"
+#include "tensorflow/compiler/xla/tests/test_macros.h"
+#include "tensorflow/compiler/xla/tests/test_utils.h"
 #include "tensorflow/compiler/xla/xla_data.pb.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
 #include "tensorflow/core/platform/logging.h"
@@ -38,258 +40,223 @@ limitations under the License.
 namespace xla {
 namespace {
 
-class MatOpsSimpleTest : public ClientLibraryTestBase {
- protected:
-  Computation BuildSum() {
-    // sum(x, y) = x + y
-    ComputationBuilder builder(client_, "sum");
-    auto x_value =
-        builder.Parameter(0, ShapeUtil::MakeShape(F32, {}), "x_value");
-    auto y_value =
-        builder.Parameter(1, ShapeUtil::MakeShape(F32, {}), "y_value");
-    builder.Add(x_value, y_value);
-    auto computation_status = builder.Build();
-    TF_CHECK_OK(computation_status.status());
-    return computation_status.ConsumeValueOrDie();
-  }
-
-  void TestLinspaceMax(int64 rows, int64 cols) {
-    float from = -128.0, to = 256.0;
-    std::unique_ptr<Array2D<float>> alhs =
-        MakeLinspaceArray2D(from, to, rows, cols);
-    auto arhs = MakeUnique<Array2D<float>>(rows, cols, 1.0);
-
-    ComputationBuilder builder(
-        client_,
-        tensorflow::strings::Printf("max_%lldx%lld_linspace", rows, cols));
-    auto lhs = builder.ConstantR2FromArray2D<float>(*alhs);
-    auto rhs = builder.ConstantR2FromArray2D<float>(*arhs);
-    auto max = builder.Max(lhs, rhs);
-
-    Array2D<float> aexpected(rows, cols);
-    for (int row = 0; row < rows; ++row) {
-      for (int col = 0; col < cols; ++col) {
-        aexpected(row, col) = std::max((*alhs)(row, col), (*arhs)(row, col));
-      }
-    }
-
-    ComputeAndCompareR2<float>(&builder, aexpected, {}, ErrorSpec(1e-6));
-  }
-};
-
-TEST_F(MatOpsSimpleTest, ExpTwoByTwoValues) {
-  ComputationBuilder builder(client_, "exp_2x2");
-  auto data = builder.ConstantR2<float>({
-      {1.0, 0.0},   // row 0
-      {-1.0, 0.5},  // row 1
+#ifdef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
+using TypesF16F32 = ::testing::Types<float>;
+#else
+using TypesF16F32 = ::testing::Types<Eigen::half, float>;
+#endif
+
+class MatOpsSimpleTest : public ClientLibraryTestBase {};
+
+template <typename T>
+class MatOpsSimpleTest_F16F32 : public MatOpsSimpleTest {};
+
+// TODO(bixia): This test for F16 failed on GPU 02-25-2018.
+#ifdef XLA_TEST_BACKEND_GPU
+TYPED_TEST_CASE(MatOpsSimpleTest_F16F32, ::testing::Types<float>);
+#else
+TYPED_TEST_CASE(MatOpsSimpleTest_F16F32, TypesF16F32);
+#endif
+
+XLA_TYPED_TEST(MatOpsSimpleTest_F16F32, ExpTwoByTwoValues) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, "exp_2x2");
+  auto data = builder.ConstantR2FromArray2D<T>({
+      {1.0f, 0.0f},   // row 0
+      {-1.0f, 0.5f},  // row 1
   });
   builder.Exp(data);
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR2<float>({{2.71828, 1.00000},    // row 0
-                                {0.36788, 1.64872}});  // row 1
+      Literal::CreateR2FromArray2D<T>({{2.71828f, 1.00000f},    // row 0
+                                       {0.36788f, 1.64872f}});  // row 1
 
-  ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-5));
+  this->template ComputeAndCompareLiteral(&builder, *expected, {},
+                                          ErrorSpec(1e-5));
 }
 
-TEST_F(MatOpsSimpleTest, MapTwoByTwo) {
+XLA_TYPED_TEST(MatOpsSimpleTest_F16F32, MapTwoByTwo) {
+  using T = TypeParam;
   Computation add_half;
   {
     // add_half(x) = x + 0.5
-    ComputationBuilder builder(client_, "add_half");
+    ComputationBuilder builder(this->client_, "add_half");
     auto x_value =
-        builder.Parameter(0, ShapeUtil::MakeShape(F32, {}), "x_value");
-    auto half = builder.ConstantR0<float>(0.5);
+        builder.Parameter(0, ShapeUtil::MakeShapeWithType<T>({}), "x_value");
+    auto half = builder.ConstantR0<T>(static_cast<T>(0.5));
     builder.Add(x_value, half);
     auto computation_status = builder.Build();
     ASSERT_IS_OK(computation_status.status());
     add_half = computation_status.ConsumeValueOrDie();
   }
 
-  ComputationBuilder builder(client_, "map_2x2");
-  auto data = builder.ConstantR2<float>({
-      {1.0, 0.0},   // row 0
-      {-1.0, 0.5},  // row 1
+  ComputationBuilder builder(this->client_, "map_2x2");
+  auto data = builder.ConstantR2FromArray2D<T>({
+      {1.0f, 0.0f},   // row 0
+      {-1.0f, 0.5f},  // row 1
   });
   auto map = builder.Map({data}, add_half, {0, 1});
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR2<float>({{1.5, 0.5},     // row 0
-                                {-0.5, 1.0}});  // row 1
-  ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-5));
+      Literal::CreateR2FromArray2D<T>({{1.5f, 0.5f},     // row 0
+                                       {-0.5f, 1.0f}});  // row 1
+  this->template ComputeAndCompareLiteral(&builder, *expected, {},
+                                          ErrorSpec(1e-5));
 }
 
-TEST_F(MatOpsSimpleTest, MaxTwoByTwoValues) {
-  ComputationBuilder builder(client_, "max_2x2");
-  auto lhs = builder.ConstantR2<float>({
-      {7.0, 2.0},   // row 0
-      {3.0, -4.0},  // row 1
+XLA_TYPED_TEST(MatOpsSimpleTest_F16F32, MaxTwoByTwoValues) {
+  using T = TypeParam;
+  ComputationBuilder builder(this->client_, "max_2x2");
+  auto lhs = builder.ConstantR2FromArray2D<T>({
+      {7.0f, 2.0f},   // row 0
+      {3.0f, -4.0f},  // row 1
   });
-  auto rhs = builder.ConstantR2<float>({
-      {5.0, 6.0},   // row 0
-      {1.0, -8.0},  // row 1
+  auto rhs = builder.ConstantR2FromArray2D<T>({
+      {5.0f, 6.0f},   // row 0
+      {1.0f, -8.0f},  // row 1
   });
   auto max = builder.Max(lhs, rhs);
 
   std::unique_ptr<Literal> expected =
-      Literal::CreateR2<float>({{7.0, 6.0},     // row 0
-                                {3.0, -4.0}});  // row 1
-  ComputeAndCompareLiteral(&builder, *expected, {}, ErrorSpec(1e-6));
+      Literal::CreateR2FromArray2D<T>({{7.0f, 6.0f},     // row 0
+                                       {3.0f, -4.0f}});  // row 1
+  this->template ComputeAndCompareLiteral(&builder, *expected, {},
+                                          ErrorSpec(1e-6));
 }
 
-TEST_F(MatOpsSimpleTest, Max1x1Linspace) { TestLinspaceMax(1, 1); }
-
-TEST_F(MatOpsSimpleTest, Max2x2Linspace) { TestLinspaceMax(2, 2); }
-
-TEST_F(MatOpsSimpleTest, Max3x3Linspace) { TestLinspaceMax(3, 3); }
-
-TEST_F(MatOpsSimpleTest, Max4x4Linspace) { TestLinspaceMax(4, 4); }
-
-TEST_F(MatOpsSimpleTest, Max6x6Linspace) { TestLinspaceMax(6, 6); }
-
-TEST_F(MatOpsSimpleTest, Max8x8Linspace) { TestLinspaceMax(8, 8); }
-
-TEST_F(MatOpsSimpleTest, Max12x12Linspace) { TestLinspaceMax(12, 12); }
-
-TEST_F(MatOpsSimpleTest, Max16x16Linspace) { TestLinspaceMax(16, 16); }
+struct TestLinspaceMaxParam {
+  int64 rows;
+  int64 cols;
+};
 
-TEST_F(MatOpsSimpleTest, Max32x8Linspace) { TestLinspaceMax(32, 8); }
+class TestLinspaceMaxParametric
+    : public MatOpsSimpleTest,
+      public ::testing::WithParamInterface<TestLinspaceMaxParam> {
+ public:
+  template <typename T>
+  void TestImpl() {
+    TestLinspaceMaxParam param = GetParam();
+    int64 rows = param.rows;
+    int64 cols = param.cols;
+    float from = -128.0, to = 256.0;
+    std::unique_ptr<Array2D<T>> alhs =
+        MakeLinspaceArray2D<T>(from, to, rows, cols);
+    auto arhs = MakeUnique<Array2D<T>>(rows, cols, static_cast<T>(1.0f));
 
-TEST_F(MatOpsSimpleTest, Max64x8Linspace) { TestLinspaceMax(64, 8); }
+    ComputationBuilder builder(
+        client_,
+        tensorflow::strings::Printf("max_%lldx%lld_linspace", rows, cols));
+    auto lhs = builder.ConstantR2FromArray2D<T>(*alhs);
+    auto rhs = builder.ConstantR2FromArray2D<T>(*arhs);
+    auto max = builder.Max(lhs, rhs);
 
-class MatOpsDotAddTest
-    : public ClientLibraryTestBase,
-      public ::testing::WithParamInterface<std::tuple<bool, bool, bool>> {};
-
-TEST_P(MatOpsDotAddTest, Dot_Add_2x2_2x2) {
-  bool row_major = std::get<0>(GetParam());
-  bool add_lhs = std::get<1>(GetParam());
-  bool transpose = std::get<2>(GetParam());
-  Array2D<float> lhs({{1.0, 2.0}, {3.0, 4.0}});
-  Array2D<float> rhs({{10.0, 11.0}, {12.0, 13.0}});
-
-  auto minor_to_major = [](bool row_major) -> std::vector<int64> {
-    return {row_major ? 1 : 0, row_major ? 0 : 1};
-  };
-
-  auto prim_type = primitive_util::NativeToPrimitiveType<float>();
-  Shape lhs_shape =
-      ShapeUtil::MakeShape(prim_type, {lhs.height(), lhs.width()});
-  Shape rhs_shape =
-      ShapeUtil::MakeShape(prim_type, {rhs.height(), rhs.width()});
-
-  TF_ASSERT_OK_AND_ASSIGN(
-      auto lhs_handle,
-      client_->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<float>(
-          lhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
-  TF_ASSERT_OK_AND_ASSIGN(
-      auto rhs_handle,
-      client_->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<float>(
-          rhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
-
-  ComputationBuilder builder(client_, TestName());
-  auto lhs_arg = builder.Parameter(0, lhs_shape, "lhs");
-  auto lhs_mat_arg = lhs_arg;
-  if (transpose) {
-    lhs_mat_arg = builder.Transpose(lhs_mat_arg, {1, 0});
-  }
-  auto rhs_arg = builder.Parameter(1, rhs_shape, "rhs");
-  auto result = builder.Dot(lhs_mat_arg, rhs_arg);
-  Array2D<float> expected;
-  if (add_lhs) {
-    result = builder.Add(result, lhs_arg);
-    if (transpose) {
-      expected = Array2D<float>({{47, 52}, {71, 78}});
-    } else {
-      expected = Array2D<float>({{35, 39}, {81, 89}});
+    Array2D<T> expected(rows, cols);
+    for (int row = 0; row < rows; ++row) {
+      for (int col = 0; col < cols; ++col) {
+        expected(row, col) = std::max<T>((*alhs)(row, col), (*arhs)(row, col));
+      }
     }
-  } else {
-    result = builder.Add(result, rhs_arg);
-    if (transpose) {
-      expected = Array2D<float>({{56, 61}, {80, 87}});
-    } else {
-      expected = Array2D<float>({{44, 48}, {90, 98}});
+    ErrorSpec error_spec(1e-6);
+    if (std::is_same<Eigen::half, T>::value) {
+      error_spec = ErrorSpec(1e-6, 2e-4);
     }
+    ComputeAndCompareR2<T>(&builder, expected, {}, error_spec);
   }
+};
 
-  ComputeAndCompareR2<float>(&builder, expected,
-                             {lhs_handle.get(), rhs_handle.get()},
-                             ErrorSpec(1e-6));
+string PrintTestLinspaceMaxParam(
+    const ::testing::TestParamInfo<TestLinspaceMaxParam>& test_param) {
+  const TestLinspaceMaxParam& param = test_param.param;
+  return tensorflow::strings::StrCat(param.rows, "r", param.cols, "c");
 }
 
-INSTANTIATE_TEST_CASE_P(MatOpsDotAddTestInstances, MatOpsDotAddTest,
-                        ::testing::Combine(::testing::Bool(), ::testing::Bool(),
-                                           ::testing::Bool()));
+#ifndef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
+// TODO(bixia): This test failed on GPU 02-25-2018
+#ifdef XLA_TEST_BACKEND_CPU
+XLA_TEST_P(TestLinspaceMaxParametric, TestF16) { TestImpl<Eigen::half>(); }
+#endif
+#endif
+XLA_TEST_P(TestLinspaceMaxParametric, TestF32) { TestImpl<float>(); }
+
+INSTANTIATE_TEST_CASE_P(
+    TestLinspaceMax, TestLinspaceMaxParametric,
+    ::testing::Values(TestLinspaceMaxParam{1, 1}, TestLinspaceMaxParam{2, 2},
+                      TestLinspaceMaxParam{3, 3}, TestLinspaceMaxParam{4, 4},
+                      TestLinspaceMaxParam{6, 6}, TestLinspaceMaxParam{8, 8},
+                      TestLinspaceMaxParam{12, 12},
+                      TestLinspaceMaxParam{16, 16}, TestLinspaceMaxParam{32, 8},
+                      TestLinspaceMaxParam{64, 8}),
+    PrintTestLinspaceMaxParam);
 
-class MatOpsDotAddTest_bf16
+class MatOpsDotAddTest
     : public ClientLibraryTestBase,
-      public ::testing::WithParamInterface<std::tuple<bool, bool, bool>> {};
-
-TEST_P(MatOpsDotAddTest_bf16, Dot_Add_2x2_2x2) {
-  bool row_major = std::get<0>(GetParam());
-  bool add_lhs = std::get<1>(GetParam());
-  bool transpose = std::get<2>(GetParam());
-  Array2D<bfloat16> lhs(
-      {{bfloat16(1.0f), bfloat16(2.0f)}, {bfloat16(3.0), bfloat16(4.0)}});
-  Array2D<bfloat16> rhs(
-      {{bfloat16(10.0f), bfloat16(11.0f)}, {bfloat16(12.0f), bfloat16(13.0f)}});
-
-  auto minor_to_major = [](bool row_major) -> std::vector<int64> {
-    return {row_major ? 1 : 0, row_major ? 0 : 1};
-  };
-
-  auto prim_type = primitive_util::NativeToPrimitiveType<bfloat16>();
-  Shape lhs_shape =
-      ShapeUtil::MakeShape(prim_type, {lhs.height(), lhs.width()});
-  Shape rhs_shape =
-      ShapeUtil::MakeShape(prim_type, {rhs.height(), rhs.width()});
-
-  TF_ASSERT_OK_AND_ASSIGN(
-      auto lhs_handle,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2DWithLayout<bfloat16>(
-              lhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
-  TF_ASSERT_OK_AND_ASSIGN(
-      auto rhs_handle,
-      client_->TransferToServer(
-          *Literal::CreateR2FromArray2DWithLayout<bfloat16>(
-              rhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
-
-  ComputationBuilder builder(client_, TestName());
-  auto lhs_arg = builder.Parameter(0, lhs_shape, "lhs");
-  auto lhs_mat_arg = lhs_arg;
-  if (transpose) {
-    lhs_mat_arg = builder.Transpose(lhs_mat_arg, {1, 0});
-  }
-  auto rhs_arg = builder.Parameter(1, rhs_shape, "rhs");
-  auto result = builder.Dot(lhs_mat_arg, rhs_arg);
-  Array2D<bfloat16> expected;
-  if (add_lhs) {
-    result = builder.Add(result, lhs_arg);
+      public ::testing::WithParamInterface<std::tuple<bool, bool, bool>> {
+ public:
+  template <typename T>
+  void TestImpl() {
+    bool row_major = std::get<0>(GetParam());
+    bool add_lhs = std::get<1>(GetParam());
+    bool transpose = std::get<2>(GetParam());
+    Array2D<T> lhs({{1.0f, 2.0f}, {3.0f, 4.0f}});
+    Array2D<T> rhs({{10.0f, 11.0f}, {12.0f, 13.0f}});
+
+    auto minor_to_major = [](bool row_major) -> std::vector<int64> {
+      return {row_major ? 1 : 0, row_major ? 0 : 1};
+    };
+
+    auto prim_type = primitive_util::NativeToPrimitiveType<T>();
+    Shape lhs_shape =
+        ShapeUtil::MakeShape(prim_type, {lhs.height(), lhs.width()});
+    Shape rhs_shape =
+        ShapeUtil::MakeShape(prim_type, {rhs.height(), rhs.width()});
+
+    TF_ASSERT_OK_AND_ASSIGN(
+        auto lhs_handle,
+        client_->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
+            lhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
+    TF_ASSERT_OK_AND_ASSIGN(
+        auto rhs_handle,
+        client_->TransferToServer(*Literal::CreateR2FromArray2DWithLayout<T>(
+            rhs, LayoutUtil::MakeLayout(minor_to_major(row_major)))));
+
+    ComputationBuilder builder(client_, TestName());
+    auto lhs_arg = builder.Parameter(0, lhs_shape, "lhs");
+    auto lhs_mat_arg = lhs_arg;
     if (transpose) {
-      expected = Array2D<bfloat16>(
-          {{bfloat16(47), bfloat16(52)}, {bfloat16(71), bfloat16(78)}});
-    } else {
-      expected = Array2D<bfloat16>(
-          {{bfloat16(35), bfloat16(39)}, {bfloat16(81), bfloat16(89)}});
+      lhs_mat_arg = builder.Transpose(lhs_mat_arg, {1, 0});
     }
-  } else {
-    result = builder.Add(result, rhs_arg);
-    if (transpose) {
-      expected = Array2D<bfloat16>(
-          {{bfloat16(56), bfloat16(61)}, {bfloat16(80), bfloat16(87)}});
+    auto rhs_arg = builder.Parameter(1, rhs_shape, "rhs");
+    auto result = builder.Dot(lhs_mat_arg, rhs_arg);
+    Array2D<T> expected;
+    if (add_lhs) {
+      result = builder.Add(result, lhs_arg);
+      if (transpose) {
+        expected = Array2D<T>({{47.0f, 52.0f}, {71.0f, 78.0f}});
+      } else {
+        expected = Array2D<T>({{35.0f, 39.0f}, {81.0f, 89.0f}});
+      }
     } else {
-      expected = Array2D<bfloat16>(
-          {{bfloat16(44), bfloat16(48)}, {bfloat16(90), bfloat16(98)}});
+      result = builder.Add(result, rhs_arg);
+      if (transpose) {
+        expected = Array2D<T>({{56.0f, 61.0f}, {80.0f, 87.0f}});
+      } else {
+        expected = Array2D<T>({{44.0f, 48.0f}, {90.0f, 98.0f}});
+      }
     }
+
+    ComputeAndCompareR2<T>(&builder, expected,
+                           {lhs_handle.get(), rhs_handle.get()},
+                           ErrorSpec(1e-6));
   }
+};
 
-  ComputeAndCompareR2<bfloat16>(&builder, expected,
-                                {lhs_handle.get(), rhs_handle.get()},
-                                ErrorSpec(1e-6));
-}
+XLA_TEST_P(MatOpsDotAddTest, Dot_Add_2x2_2x2BF16) { TestImpl<bfloat16>(); }
+#ifndef XLA_BACKEND_DOES_NOT_SUPPORT_FLOAT16
+XLA_TEST_P(MatOpsDotAddTest, Dot_Add_2x2_2x2F16) { TestImpl<Eigen::half>(); }
+#endif
+XLA_TEST_P(MatOpsDotAddTest, Dot_Add_2x2_2x2F32) { TestImpl<float>(); }
 
-INSTANTIATE_TEST_CASE_P(MatOpsDotAddTestInstances, MatOpsDotAddTest_bf16,
+INSTANTIATE_TEST_CASE_P(MatOpsDotAddTestInstances, MatOpsDotAddTest,
                         ::testing::Combine(::testing::Bool(), ::testing::Bool(),
                                            ::testing::Bool()));
 
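Review note: the four `expected` matrices in `MatOpsDotAddTest::TestImpl` follow directly from the 2x2 inputs. The sketch below recomputes them in plain C++ without XLA; `Mat2`, `Dot`, `Add`, `Transpose` and `DotAdd` are hypothetical names used only here. It mirrors the test's structure: the left operand is optionally transposed before the product, and the untransposed `lhs` or `rhs` is then added.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Mat2 = std::array<std::array<float, 2>, 2>;

// 2x2 matrix product.
Mat2 Dot(const Mat2& a, const Mat2& b) {
  Mat2 out{};  // value-initialized to zeros
  for (std::size_t i = 0; i < 2; ++i) {
    for (std::size_t j = 0; j < 2; ++j) {
      for (std::size_t k = 0; k < 2; ++k) {
        out[i][j] += a[i][k] * b[k][j];
      }
    }
  }
  return out;
}

// Element-wise sum.
Mat2 Add(const Mat2& a, const Mat2& b) {
  Mat2 out{};
  for (std::size_t i = 0; i < 2; ++i) {
    for (std::size_t j = 0; j < 2; ++j) {
      out[i][j] = a[i][j] + b[i][j];
    }
  }
  return out;
}

Mat2 Transpose(const Mat2& a) {
  Mat2 out{};
  for (std::size_t i = 0; i < 2; ++i) {
    for (std::size_t j = 0; j < 2; ++j) {
      out[i][j] = a[j][i];
    }
  }
  return out;
}

// Mirrors the (add_lhs, transpose) branches of the test with its input values.
Mat2 DotAdd(bool add_lhs, bool transpose) {
  const Mat2 lhs = {{{1.0f, 2.0f}, {3.0f, 4.0f}}};
  const Mat2 rhs = {{{10.0f, 11.0f}, {12.0f, 13.0f}}};
  const Mat2 product = Dot(transpose ? Transpose(lhs) : lhs, rhs);
  // The addend is always the original (untransposed) parameter.
  return Add(product, add_lhs ? lhs : rhs);
}
```

The `row_major` parameter does not appear in the sketch because it only selects the memory layout of the transferred literals, not the logical values.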
diff --git a/tensorflow/compiler/xla/tests/scalar_computations_test.cc b/tensorflow/compiler/xla/tests/scalar_computations_test.cc
index 4da6ee91607941b395b00befc98a10e7c17746ed..0c88bef69dfc522fef52422b0bd3a825fa173d44 100644
--- a/tensorflow/compiler/xla/tests/scalar_computations_test.cc
+++ b/tensorflow/compiler/xla/tests/scalar_computations_test.cc
@@ -163,7 +163,7 @@ XLA_TEST_F(ScalarComputationsTest, CastS64ToF32) {
   auto a = builder.Parameter(0, ShapeUtil::MakeShape(S64, {}), "a");
   builder.ConvertElementType(a, F32);
 
-  int64 value = 3LL << 32;
+  int64 value = 3LL << 35;
   std::unique_ptr<Literal> a_literal = Literal::CreateR0<int64>(value);
   std::unique_ptr<GlobalData> a_data =
       client_->TransferToServer(*a_literal).ConsumeValueOrDie();
@@ -860,6 +860,12 @@ XLA_TEST_F(ScalarComputationsTest, MinF32Below) {
   TestMinMax<float>(-100.1f, 3.1f, -100.1f, &ComputationBuilder::Min);
 }
 
+XLA_TEST_F(ScalarComputationsTest, MinPropagatesNan) {
+  SetFastMathDisabled(true);
+  TestMinMax<float>(NAN, 3.1f, NAN, &ComputationBuilder::Min);
+  TestMinMax<float>(-3.1f, NAN, NAN, &ComputationBuilder::Min);
+}
+
 XLA_TEST_F(ScalarComputationsTest, MaxF32Above) {
   TestMinMax<float>(10.1f, 3.1f, 10.1f, &ComputationBuilder::Max);
 }
@@ -868,6 +874,12 @@ XLA_TEST_F(ScalarComputationsTest, MaxF32Below) {
   TestMinMax<float>(-100.1f, 3.1f, 3.1f, &ComputationBuilder::Max);
 }
 
+XLA_TEST_F(ScalarComputationsTest, MaxPropagatesNan) {
+  SetFastMathDisabled(true);
+  TestMinMax<float>(NAN, 3.1f, NAN, &ComputationBuilder::Max);
+  TestMinMax<float>(-3.1f, NAN, NAN, &ComputationBuilder::Max);
+}
+
 XLA_TEST_F(ScalarComputationsTest, ComplicatedArithmeticExpressionF32) {
   // Compute the expression (1 * (3 - 1) * (7 + 0) - 4) / 20.
   ComputationBuilder b(client_, TestName());
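Review note: the new `MinPropagatesNan` and `MaxPropagatesNan` tests expect NaN whenever either operand is NaN. That is stricter than `std::min` and `std::max`, whose result depends on operand order. The sketch below shows the expected semantics in plain C++; `MinPropagatingNan` and `MaxPropagatingNan` are hypothetical names and say nothing about how any XLA backend implements the operation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Returns NaN if either operand is NaN, otherwise the smaller operand.
float MinPropagatingNan(float a, float b) {
  if (std::isnan(a) || std::isnan(b)) {
    return std::numeric_limits<float>::quiet_NaN();
  }
  return std::min(a, b);
}

// Returns NaN if either operand is NaN, otherwise the larger operand.
float MaxPropagatingNan(float a, float b) {
  if (std::isnan(a) || std::isnan(b)) {
    return std::numeric_limits<float>::quiet_NaN();
  }
  return std::max(a, b);
}
```

For contrast, `std::min(-3.1f, NAN)` returns `-3.1f`, because every ordered comparison against NaN is false and the first argument is kept. Compiler fast-math modes may assume NaN never occurs, which is presumably why the tests call `SetFastMathDisabled(true)` first.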
diff --git a/tensorflow/compiler/xla/tests/slice_test.cc b/tensorflow/compiler/xla/tests/slice_test.cc
index ac163df127e0087c02777fa3d5ce7970c51b97b9..fe36df160daacc4fdfbdb0b75f8304f91e1a4245 100644
--- a/tensorflow/compiler/xla/tests/slice_test.cc
+++ b/tensorflow/compiler/xla/tests/slice_test.cc
@@ -237,6 +237,12 @@ INSTANTIATE_TEST_CASE_P(
     SliceR1TestInstantiation,
     SliceR1Test,
     ::testing::Values(
+// TODO(b/69425338): This uses too much memory on GPU.
+#ifndef XLA_TEST_BACKEND_GPU
+        R1Spec{16 * 1024 * 1024, 4 * 1024 * 1024, 12 * 1024 * 1024, 1},
+        R1Spec{16 * 1024 * 1024, 4 * 1024 * 1024 + 1, 12 * 1024 * 1024 - 1, 1},
+        R1Spec{16 * 1024 * 1024, 4 * 1024 * 1024 - 1, 12 * 1024 * 1024 + 1, 1},
+#endif
         R1Spec{10, 0, 0, 1},
         R1Spec{10, 7, 7, 1},
         R1Spec{10, 0, 5, 1},
@@ -267,13 +273,15 @@ INSTANTIATE_TEST_CASE_P(
         R1Spec{64 * 1024, 1024 + 1, 63 * 1024 - 1, 1},
         R1Spec{64 * 1024, 32 * 1024, 33 * 1024, 1},
         R1Spec{64 * 1024, 32 * 1024 + 1, 33 * 1024 - 1, 1},
-        R1Spec{64 * 1024, 32 * 1024 - 17, 36 * 1024 - 18, 1},
-// TODO(b/69425338): This uses too much memory on GPU.
-#ifndef XLA_TEST_BACKEND_GPU
-        R1Spec{16 * 1024 * 1024, 4 * 1024 * 1024, 12 * 1024 * 1024, 1},
-        R1Spec{16 * 1024 * 1024, 4 * 1024 * 1024 + 1, 12 * 1024 * 1024 - 1, 1},
-        R1Spec{16 * 1024 * 1024, 4 * 1024 * 1024 - 1, 12 * 1024 * 1024 + 1, 1},
-#endif
+        R1Spec{64 * 1024, 32 * 1024 - 17, 36 * 1024 - 18, 1}
+    ),
+    SliceR1TestDataToString
+);
+
+INSTANTIATE_TEST_CASE_P(
+    SliceStridedR1TestInstantiation,
+    SliceR1Test,
+    ::testing::Values(
         R1Spec{10, 2, 4, 2},
         R1Spec{10, 0, 10, 2},
         R1Spec{10, 0, 10, 3},
@@ -285,8 +293,24 @@ INSTANTIATE_TEST_CASE_P(
         R1Spec{2047, 1024 - 24, 1024 + 160, 31},
         R1Spec{2047, 1, 2046, 3 * 128},
         R1Spec{4096, 1024 + 3, 4095, 500},
-        R1Spec{8192, 0, 8192, 1024 * 3 + 400}
-        ),
+        R1Spec{8192, 0, 8192, 1024 * 3 + 400},
+        R1Spec{1024 * 1024, 0, 1024 * 1024, 2},
+        R1Spec{1024 * 1024, 0, 1024 * 1024, 8},
+        R1Spec{1024 * 1024, 0, 1024 * 1024, 7},
+        R1Spec{1024 * 1024, 0, 1024 * 1024, 125},
+        R1Spec{1024 * 1024, 3, 1024 - 9, 2},
+        R1Spec{1024 * 1024, 3, 1024 - 9, 8},
+        R1Spec{1024 * 1024, 3, 1024 - 9, 7},
+        R1Spec{1024 * 1024, 3, 1024 - 9, 125},
+        R1Spec{1024 * 1024, 3, 1024 * 512 - 9, 2},
+        R1Spec{1024 * 1024, 3, 1024 * 512 - 9, 8},
+        R1Spec{1024 * 1024, 3, 1024 * 512 - 9, 7},
+        R1Spec{1024 * 1024, 3, 1024 * 512 - 9, 125},
+        R1Spec{1024 * 1024 + 71, 3, 1024 * 512 - 9, 2},
+        R1Spec{1024 * 1024 + 71, 3, 1024 * 512 - 9, 8},
+        R1Spec{1024 * 1024 + 71, 3, 1024 * 512 - 9, 7},
+        R1Spec{1024 * 1024 + 71, 3, 1024 * 512 - 9, 125}
+    ),
     SliceR1TestDataToString
 );
 // clang-format on
diff --git a/tensorflow/compiler/xla/tests/test_utils.cc b/tensorflow/compiler/xla/tests/test_utils.cc
index b060fb13b1451aab30cfca73bea0a4a598a9fa3a..0bc7df2a65b44a76f877b6513e6bf93b99fbc1a3 100644
--- a/tensorflow/compiler/xla/tests/test_utils.cc
+++ b/tensorflow/compiler/xla/tests/test_utils.cc
@@ -287,7 +287,7 @@ StatusOr<std::unique_ptr<Literal>> MakeFakeLiteral(const Shape& shape) {
 
 StatusOr<std::vector<std::unique_ptr<Literal>>> MakeFakeArguments(
     HloModule* const module) {
-  TF_ASSIGN_OR_RETURN(auto dataflow, HloDataflowAnalysis::Run(module));
+  TF_ASSIGN_OR_RETURN(auto dataflow, HloDataflowAnalysis::Run(*module));
   const auto params = module->entry_computation()->parameter_instructions();
   std::minstd_rand0 engine;
   std::vector<std::unique_ptr<Literal>> arguments(params.size());
diff --git a/tensorflow/compiler/xla/tests/while_test.cc b/tensorflow/compiler/xla/tests/while_test.cc
index 52157b837c383205f77a030ef98b2fd03a41aff5..33d457c70bac84c2da10e3cf9302c2c952cf1bc2 100644
--- a/tensorflow/compiler/xla/tests/while_test.cc
+++ b/tensorflow/compiler/xla/tests/while_test.cc
@@ -910,7 +910,7 @@ XLA_TEST_F(WhileTest, WhileWithDynamicUpdateSlice) {
 // Per backend the values generated can be different as the different backends
 // use different random number generators.
 // TODO(b/32240857): Extend test to verify outputs.
-TEST_F(WhileTest, WhileWithPrngScalarResult) {
+TEST_F(WhileTest, DISABLED_ON_INTERPRETER(WhileWithPrngScalarResult)) {
   auto v6s32 = ShapeUtil::MakeShape(S32, {6});
 
   // Create a computation for the condition: repeat for count iterations.
@@ -1166,7 +1166,7 @@ XLA_TEST_F(WhileTest, NestedWhileWithScalarResult) {
 // while (f(result).get<0>()) {
 //   result = result + 1;
 // }
-TEST_F(WhileTest, WhileWithCallInsideCondition) {
+TEST_F(WhileTest, DISABLED_ON_INTERPRETER(WhileWithCallInsideCondition)) {
   auto result_shape = ShapeUtil::MakeShape(S32, {});
 
   // Create a computation for the condition: repeat for 5 iterations.
diff --git a/tensorflow/compiler/xla/tools/parser/hlo_parser.cc b/tensorflow/compiler/xla/tools/parser/hlo_parser.cc
index 89def5d5610cb9522a69297668b443b8c4e03fb5..e60a5a4919f2207939821e787c3c59a08ff3ba4e 100644
--- a/tensorflow/compiler/xla/tools/parser/hlo_parser.cc
+++ b/tensorflow/compiler/xla/tools/parser/hlo_parser.cc
@@ -994,6 +994,20 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
           shape, operands, *custom_call_target));
       break;
     }
+    case HloOpcode::kHostCompute: {
+      optional<string> channel_name;
+      optional<int64> cost_estimate_ns;
+      attrs["channel_name"] = {/*required=*/true, AttrTy::kString,
+                               &channel_name};
+      attrs["cost_estimate_ns"] = {/*required=*/true, AttrTy::kInt64,
+                                   &cost_estimate_ns};
+      if (!ParseOperands(&operands) || !ParseAttributes(attrs)) {
+        return false;
+      }
+      instruction = builder->AddInstruction(HloInstruction::CreateHostCompute(
+          shape, operands, *channel_name, *cost_estimate_ns));
+      break;
+    }
     case HloOpcode::kDot: {
       optional<std::vector<int64>> lhs_contracting_dims;
       attrs["lhs_contracting_dims"] = {
@@ -1035,6 +1049,40 @@ bool HloParser::ParseInstruction(HloComputation::Builder* builder,
           HloInstruction::CreateDot(shape, operands[0], operands[1], dnum));
       break;
     }
+    case HloOpcode::kGather: {
+      optional<std::vector<int64>> output_window_dims;
+      attrs["output_window_dims"] = {
+          /*required=*/true, AttrTy::kBracedInt64List, &output_window_dims};
+      optional<std::vector<int64>> elided_window_dims;
+      attrs["elided_window_dims"] = {
+          /*required=*/true, AttrTy::kBracedInt64List, &elided_window_dims};
+      optional<std::vector<int64>> gather_dims_to_operand_dims;
+      attrs["gather_dims_to_operand_dims"] = {/*required=*/true,
+                                              AttrTy::kBracedInt64List,
+                                              &gather_dims_to_operand_dims};
+      optional<int64> index_vector_dim;
+      attrs["index_vector_dim"] = {/*required=*/true, AttrTy::kInt64,
+                                   &index_vector_dim};
+      optional<std::vector<int64>> window_bounds;
+      attrs["window_bounds"] = {/*required=*/true, AttrTy::kBracedInt64List,
+                                &window_bounds};
+
+      if (!ParseOperands(&operands, /*expected_size=*/2) ||
+          !ParseAttributes(attrs)) {
+        return false;
+      }
+
+      GatherDimensionNumbers dim_numbers = HloInstruction::MakeGatherDimNumbers(
+          /*output_window_dims=*/*output_window_dims,
+          /*elided_window_dims=*/*elided_window_dims,
+          /*gather_dims_to_operand_dims=*/*gather_dims_to_operand_dims,
+          /*index_vector_dim=*/*index_vector_dim);
+
+      instruction = builder->AddInstruction(HloInstruction::CreateGather(
+          shape, /*operand=*/operands[0], /*gather_indices=*/operands[1],
+          dim_numbers, *window_bounds));
+      break;
+    }
     case HloOpcode::kTrace:
       return TokenError(StrCat("parsing not yet implemented for op: ",
                                HloOpcodeString(opcode)));
diff --git a/tensorflow/compiler/xla/tools/parser/hlo_parser_test.cc b/tensorflow/compiler/xla/tools/parser/hlo_parser_test.cc
index b8c6b59204f897c7dc07b846370b5b776a19a808..863081d654390440aa6506bab4576b3cc5c1cbd1 100644
--- a/tensorflow/compiler/xla/tools/parser/hlo_parser_test.cc
+++ b/tensorflow/compiler/xla/tools/parser/hlo_parser_test.cc
@@ -716,6 +716,18 @@ ENTRY %sparse_f32_r1 () -> f32[9] {
   ROOT %foo = f32[9]sparse{10} constant(f32[9]{1: 2, 3: 4, 5: 6})
 }
 
+)"
+},
+{
+"gather",
+R"(HloModule StringifyGather
+
+ENTRY %Gather (input_tensor: f32[50,49,48,47,46], gather_indices: s64[10,9,8,7,5]) -> f32[10,9,8,7,30,29,28,27,26] {
+  %input_tensor = f32[50,49,48,47,46]{4,3,2,1,0} parameter(0)
+  %gather_indices = s64[10,9,8,7,5]{4,3,2,1,0} parameter(1)
+  ROOT %gather = f32[10,9,8,7,30,29,28,27,26]{8,7,6,5,4,3,2,1,0} gather(f32[50,49,48,47,46]{4,3,2,1,0} %input_tensor, s64[10,9,8,7,5]{4,3,2,1,0} %gather_indices), output_window_dims={4,5,6,7,8}, elided_window_dims={}, gather_dims_to_operand_dims={0,1,2,3,4}, index_vector_dim=4, window_bounds={30,29,28,27,26}
+}
+
 )"
 },
   });
@@ -860,6 +872,18 @@ ENTRY dot {
   ROOT dot = f32[2,3]{1,0} dot(a, b), lhs_batch_dims={0}, lhs_contracting_dims={1}, rhs_contracting_dims={0}
 }
 
+)"
+},
+{
+"gather",
+R"(HloModule gather
+
+ENTRY Gather {
+  input_tensor = f32[50,49,48,47,46]{4,3,2,1,0} parameter(0)
+  gather_indices = s64[10,9,8,7,5]{4,3,2,1,0} parameter(1)
+  ROOT gather = f32[10,9,8,7,30,29,28,27,26]{8,7,6,5,4,3,2,1,0} gather(input_tensor, gather_indices), output_window_dims={4,5,6,7,8}, elided_window_dims={}, gather_dims_to_operand_dims={0,1,2,3,4}, index_vector_dim=4, window_bounds={30,29,28,27,26}
+}
+
 )"
 },
   });
diff --git a/tensorflow/compiler/xla/util.h b/tensorflow/compiler/xla/util.h
index 08df5b12b3a53a138f56705531baa3333b23c5d8..82e5a59da0dcbb7f6302522ea4a66e12801ec809 100644
--- a/tensorflow/compiler/xla/util.h
+++ b/tensorflow/compiler/xla/util.h
@@ -427,32 +427,73 @@ std::vector<std::pair<int64, int64>> CommonFactors(
 string SanitizeFileName(string file_name);
 
 template <typename Container, typename Predicate>
-bool c_all_of(Container container, Predicate predicate) {
-  return std::all_of(std::begin(container), std::end(container), predicate);
+bool c_all_of(const Container& container, Predicate&& predicate) {
+  return std::all_of(std::begin(container), std::end(container),
+                     std::forward<Predicate>(predicate));
+}
+
+template <typename Container, typename Predicate>
+bool c_any_of(const Container& container, Predicate&& predicate) {
+  return std::any_of(std::begin(container), std::end(container),
+                     std::forward<Predicate>(predicate));
 }
 
 template <typename InputContainer, typename OutputIterator,
           typename UnaryOperation>
-OutputIterator c_transform(InputContainer input_container,
+OutputIterator c_transform(const InputContainer& input_container,
                            OutputIterator output_iterator,
-                           UnaryOperation unary_op) {
+                           UnaryOperation&& unary_op) {
   return std::transform(std::begin(input_container), std::end(input_container),
-                        output_iterator, unary_op);
+                        output_iterator,
+                        std::forward<UnaryOperation>(unary_op));
 }
 
 template <class InputContainer, class OutputIterator, class UnaryPredicate>
-OutputIterator c_copy_if(InputContainer input_container,
+OutputIterator c_copy_if(const InputContainer& input_container,
                          OutputIterator output_iterator,
-                         UnaryPredicate predicate) {
+                         UnaryPredicate&& predicate) {
   return std::copy_if(std::begin(input_container), std::end(input_container),
-                      output_iterator, predicate);
+                      output_iterator, std::forward<UnaryPredicate>(predicate));
+}
+
+template <class InputContainer, class OutputIterator>
+OutputIterator c_copy(const InputContainer& input_container,
+                      OutputIterator output_iterator) {
+  return std::copy(std::begin(input_container), std::end(input_container),
+                   output_iterator);
+}
+
+template <class InputContainer>
+void c_sort(InputContainer& input_container) {
+  std::sort(std::begin(input_container), std::end(input_container));
 }
 
 template <class InputContainer, class Comparator>
-void c_sort(InputContainer& input_container, Comparator comparator) {
-  std::sort(input_container.begin(), input_container.end(), comparator);
+void c_sort(InputContainer& input_container, Comparator&& comparator) {
+  std::sort(std::begin(input_container), std::end(input_container),
+            std::forward<Comparator>(comparator));
 }
 
+template <typename Sequence, typename T>
+bool c_binary_search(const Sequence& sequence, T&& value) {
+  return std::binary_search(std::begin(sequence), std::end(sequence),
+                            std::forward<T>(value));
+}
+
+template <typename C>
+bool c_is_sorted(const C& c) {
+  return std::is_sorted(std::begin(c), std::end(c));
+}
+
+template <typename C>
+auto c_adjacent_find(const C& c) -> decltype(std::begin(c)) {
+  return std::adjacent_find(std::begin(c), std::end(c));
+}
+
+template <typename C, typename Pred>
+auto c_find_if(const C& c, Pred&& pred) -> decltype(std::begin(c)) {
+  return std::find_if(std::begin(c), std::end(c), std::forward<Pred>(pred));
+}
 }  // namespace xla
 
 #define XLA_LOG_LINES(SEV, STRING) \
diff --git a/tensorflow/compiler/xla/xla_data.proto b/tensorflow/compiler/xla/xla_data.proto
index 3aea0217539b89b5d60ecfaf2605eee4b69af728..1f16e6d25178fd9c10a30b0c500e090ee2e08117 100644
--- a/tensorflow/compiler/xla/xla_data.proto
+++ b/tensorflow/compiler/xla/xla_data.proto
@@ -393,6 +393,37 @@ message Window {
   repeated WindowDimension dimensions = 1;
 }
 
+// Describes the dimension numbers for a gather operation.
+//
+// See https://www.tensorflow.org/performance/xla/operation_semantics#gather for
+// more details.
+message GatherDimensionNumbers {
+  // "Window indices" is a term for a set of indices that index into the
+  // interior of a dynamic-slice from the input tensor, the starting indices for
+  // which were computed from output_gather_dims (see the operation semantics
+  // for how this is defined) and the gather_indices tensor.
+  //
+  // The window indices for a specific output index Out are computed as:
+  //
+  //  i = 0
+  //  for (k : [0, input_tensor_shape.rank))
+  //    window_indices[k] =
+  //      if k in elided_window_dims
+  //      then 0
+  //      else Out[output_window_dims[i++]]
+  repeated int64 output_window_dims = 1;
+  repeated int64 elided_window_dims = 2;
+
+  // This is interpreted as a map from i to gather_dims_to_operand_dims[i]. It
+  // transforms the gather index looked up from the gather_indices tensor into
+  // the starting index in the input space.
+  repeated int64 gather_dims_to_operand_dims = 3;
+
+  // The dimension in the gather_indices input that contains the starting
+  // indices.
+  int64 index_vector_dim = 4;
+}
+
 // Operation requests that are all collected as a tagged union with a oneof
 // field in OpRequest.
 
@@ -519,6 +550,20 @@ message CustomCallRequest {
   Shape shape = 4;
 }
 
+message HostComputeRequest {
+  // Operand to the HostCompute. Supports tuple.
+  repeated ComputationDataHandle operands = 1;
+
+  // Name used to identify HostSend/Recv channels.
+  string channel_name = 2;
+
+  // Cost estimate in nanoseconds.
+  int64 cost_estimate_ns = 3;
+
+  // The shape of any data returned by host.
+  Shape shape = 4;
+}
+
 message DotDimensionNumbers {
   // The dimension numbers that represent the 'lhs' contracting dimensions.
   repeated int64 lhs_contracting_dimensions = 1;
@@ -880,6 +925,13 @@ message RecvRequest {
   ChannelHandle channel_handle = 2;
 }
 
+message GatherRequest {
+  ComputationDataHandle input = 1;
+  ComputationDataHandle gather_indices = 2;
+  GatherDimensionNumbers dimension_numbers = 3;
+  repeated int64 window_bounds = 4;
+}
+
 message OpSharding {
   enum Type {
     // This sharding is replicated across all devices (implies maximal,
@@ -957,7 +1009,9 @@ message OpRequest {
     FftRequest fft_request = 41;
     ConvertRequest bitcast_convert_request = 42;
     ConditionalRequest conditional_request = 44;
-    // Next: 45
+    HostComputeRequest host_compute_request = 45;
+    GatherRequest gather_request = 46;
+    // Next: 47
   }
 }
 
diff --git a/tensorflow/contrib/bayesflow/BUILD b/tensorflow/contrib/bayesflow/BUILD
index 74712aeb67c3f0a31def78f25a0298f9c02c9590..3592cff90bdf1817f2cecc8be1aaca28bb772486 100644
--- a/tensorflow/contrib/bayesflow/BUILD
+++ b/tensorflow/contrib/bayesflow/BUILD
@@ -39,7 +39,7 @@ py_library(
 
 cuda_py_test(
     name = "metropolis_hastings_test",
-    size = "medium",
+    size = "large",
     srcs = ["python/kernel_tests/metropolis_hastings_test.py"],
     additional_deps = [
         ":bayesflow_py",
@@ -99,6 +99,16 @@ cuda_py_test(
     ],
 )
 
+cuda_py_test(
+    name = "docstring_util_test",
+    size = "small",
+    srcs = ["python/kernel_tests/docstring_util_test.py"],
+    additional_deps = [
+        ":bayesflow_py",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
 cuda_py_test(
     name = "layers_conv_variational_test",
     size = "small",
@@ -200,7 +210,7 @@ cuda_py_test(
 
 cuda_py_test(
     name = "hmc_test",
-    size = "medium",
+    size = "large",
     srcs = ["python/kernel_tests/hmc_test.py"],
     additional_deps = [
         ":bayesflow_py",
@@ -241,23 +251,6 @@ cuda_py_test(
     tags = ["notsan"],
 )
 
-cuda_py_test(
-    name = "variable_utils_test",
-    size = "small",
-    srcs = ["python/kernel_tests/variable_utils_test.py"],
-    additional_deps = [
-        ":bayesflow_py",
-        "//third_party/py/numpy",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:platform_test",
-    ],
-)
-
 cuda_py_test(
     name = "variational_sgd_optimizer_test",
     size = "small",
diff --git a/tensorflow/contrib/bayesflow/__init__.py b/tensorflow/contrib/bayesflow/__init__.py
index 528c4fbacd06c7b0defa0e32bd24a98b2bc07b64..c41102634656e39f0e28242b681e890007b6e89f 100644
--- a/tensorflow/contrib/bayesflow/__init__.py
+++ b/tensorflow/contrib/bayesflow/__init__.py
@@ -30,7 +30,6 @@ from tensorflow.contrib.bayesflow.python.ops import mcmc_diagnostics
 from tensorflow.contrib.bayesflow.python.ops import metropolis_hastings
 from tensorflow.contrib.bayesflow.python.ops import monte_carlo
 from tensorflow.contrib.bayesflow.python.ops import optimizers
-from tensorflow.contrib.bayesflow.python.ops import variable_utils
 # pylint: enable=unused-import,line-too-long
 
 from tensorflow.python.util.all_util import remove_undocumented
@@ -49,7 +48,6 @@ _allowed_symbols = [
     'optimizers',
     'special_math',
     'stochastic_variables',
-    'variable_utils',
     'variational_inference',
 ]
 
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/docstring_util_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/docstring_util_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..8ed500b19d8dd72795758a2920119e3680576697
--- /dev/null
+++ b/tensorflow/contrib/bayesflow/python/kernel_tests/docstring_util_test.py
@@ -0,0 +1,87 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for docstring utilities."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.bayesflow.python.ops import docstring_util
+from tensorflow.python.platform import test
+
+
+class DocstringUtil(test.TestCase):
+
+  def _testFunction(self):
+    doc_args = """x: Input to return as output.
+  y: Baz."""
+    @docstring_util.expand_docstring(args=doc_args)
+    def foo(x):
+      # pylint: disable=g-doc-args
+      """Hello world.
+
+      Args:
+        @{args}
+
+      Returns:
+        x.
+      """
+      # pylint: enable=g-doc-args
+      return x
+
+    true_docstring = """Hello world.
+
+    Args:
+      x: Input to return as output.
+      y: Baz.
+
+    Returns:
+      x.
+    """
+    self.assertEqual(foo.__doc__, true_docstring)
+
+  def _testClassInit(self):
+    doc_args = """x: Input to return as output.
+  y: Baz."""
+
+    class Foo(object):
+
+      @docstring_util.expand_docstring(args=doc_args)
+      def __init__(self, x, y):
+        # pylint: disable=g-doc-args
+        """Hello world.
+
+        Args:
+          @{args}
+
+        Bar.
+        """
+        # pylint: enable=g-doc-args
+        pass
+
+    true_docstring = """Hello world.
+
+    Args:
+      x: Input to return as output.
+      y: Baz.
+
+    Bar.
+    """
+    self.assertEqual(Foo.__doc__, true_docstring)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/hmc_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/hmc_test.py
index 5bd834e56245ab4d874544cfd014fe59ae521ea8..819095a060b5f4cf18df6e7e4e4556e50ae44dd3 100644
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/hmc_test.py
+++ b/tensorflow/contrib/bayesflow/python/kernel_tests/hmc_test.py
@@ -224,12 +224,13 @@ class HMCTest(test.TestCase):
 
     expected_exp_x = self._shape_param / self._rate_param
 
-    acceptance_probs_, samples_, expected_x_ = sess.run(
-        [kernel_results.acceptance_probs, samples, expected_x],
+    log_accept_ratio_, samples_, expected_x_ = sess.run(
+        [kernel_results.log_accept_ratio, samples, expected_x],
         feed_dict)
 
     actual_x = samples_.mean()
     actual_exp_x = np.exp(samples_).mean()
+    acceptance_probs = np.exp(np.minimum(log_accept_ratio_, 0.))
 
     logging_ops.vlog(1, "True      E[x, exp(x)]: {}\t{}".format(
         expected_x_, expected_exp_x))
@@ -237,10 +238,10 @@ class HMCTest(test.TestCase):
         actual_x, actual_exp_x))
     self.assertNear(actual_x, expected_x_, 2e-2)
     self.assertNear(actual_exp_x, expected_exp_x, 2e-2)
-    self.assertAllEqual(np.ones_like(acceptance_probs_, np.bool),
-                        acceptance_probs_ > 0.5)
-    self.assertAllEqual(np.ones_like(acceptance_probs_, np.bool),
-                        acceptance_probs_ <= 1.)
+    self.assertAllEqual(np.ones_like(acceptance_probs, np.bool),
+                        acceptance_probs > 0.5)
+    self.assertAllEqual(np.ones_like(acceptance_probs, np.bool),
+                        acceptance_probs <= 1.)
 
   def _chain_gets_correct_expectations_wrapper(self, independent_chain_ndims):
     with self.test_session(graph=ops.Graph()) as sess:
@@ -265,7 +266,7 @@ class HMCTest(test.TestCase):
           -x - x**2,  # Non-constant gradient.
           array_ops.fill(x.shape, math_ops.cast(-np.inf, x.dtype)))
     # This log_prob has the property that it is likely to attract
-    # the HMC flow toward, and below, zero...but for x <=0,
+    # the flow toward, and below, zero...but for x <=0,
     # log_prob(x) = -inf, which should result in rejection, as well
     # as a non-finite log_prob.  Thus, this distribution gives us an opportunity
     # to test out the kernel results ability to correctly capture rejections due
@@ -305,11 +306,10 @@ class HMCTest(test.TestCase):
       self.assertLess(0, neg_inf_mask.sum())
       # We better have some rejections due to something other than -inf.
       self.assertLess(neg_inf_mask.sum(), (~kernel_results_.is_accepted).sum())
-      # We better have been accepted a decent amount, even near the end of the
-      # chain, or else this HMC run just got stuck at some point.
+      # We better have accepted a decent amount, even near the end of the chain.
       self.assertLess(
           0.1, kernel_results_.is_accepted[int(0.9 * num_results):].mean())
-      # We better not have any NaNs in proposed state or log_prob.
+      # We better not have any NaNs in states or log_prob.
       # We may have some NaN in grads, which involve multiplication/addition due
       # to gradient rules.  This is the known "NaN grad issue with tf.where."
       self.assertAllEqual(np.zeros_like(states_),
@@ -333,9 +333,11 @@ class HMCTest(test.TestCase):
       np.testing.assert_array_less(0., pstates_[~neg_inf_mask])
 
       # Acceptance probs are zero whenever proposed state is negative.
+      acceptance_probs = np.exp(np.minimum(
+          kernel_results_.log_accept_ratio, 0.))
       self.assertAllEqual(
           np.zeros_like(pstates_[neg_inf_mask]),
-          kernel_results_.acceptance_probs[neg_inf_mask])
+          acceptance_probs[neg_inf_mask])
 
       # The move is accepted ==> state = proposed state.
       self.assertAllEqual(
@@ -383,26 +385,28 @@ class HMCTest(test.TestCase):
         seed=44)
 
     [
-        acceptance_probs_,
-        bad_acceptance_probs_,
+        log_accept_ratio_,
+        bad_log_accept_ratio_,
         initial_draws_,
         updated_draws_,
         fake_draws_,
     ] = sess.run([
-        kernel_results.acceptance_probs,
-        bad_kernel_results.acceptance_probs,
+        kernel_results.log_accept_ratio,
+        bad_kernel_results.log_accept_ratio,
         initial_draws,
         sample,
         bad_sample,
     ], feed_dict)
 
     # Confirm step size is small enough that we usually accept.
-    self.assertGreater(acceptance_probs_.mean(), 0.5)
-    self.assertGreater(bad_acceptance_probs_.mean(), 0.5)
+    acceptance_probs = np.exp(np.minimum(log_accept_ratio_, 0.))
+    bad_acceptance_probs = np.exp(np.minimum(bad_log_accept_ratio_, 0.))
+    self.assertGreater(acceptance_probs.mean(), 0.5)
+    self.assertGreater(bad_acceptance_probs.mean(), 0.5)
 
     # Confirm step size is large enough that we sometimes reject.
-    self.assertLess(acceptance_probs_.mean(), 0.99)
-    self.assertLess(bad_acceptance_probs_.mean(), 0.99)
+    self.assertLess(acceptance_probs.mean(), 0.99)
+    self.assertLess(bad_acceptance_probs.mean(), 0.99)
 
     _, ks_p_value_true = stats.ks_2samp(initial_draws_.flatten(),
                                         updated_draws_.flatten())
@@ -410,9 +414,9 @@ class HMCTest(test.TestCase):
                                         fake_draws_.flatten())
 
     logging_ops.vlog(1, "acceptance rate for true target: {}".format(
-        acceptance_probs_.mean()))
+        acceptance_probs.mean()))
     logging_ops.vlog(1, "acceptance rate for fake target: {}".format(
-        bad_acceptance_probs_.mean()))
+        bad_acceptance_probs.mean()))
     logging_ops.vlog(1, "K-S p-value for true target: {}".format(
         ks_p_value_true))
     logging_ops.vlog(1, "K-S p-value for fake target: {}".format(
@@ -615,15 +619,16 @@ class HMCTest(test.TestCase):
           step_size=2.,
           num_leapfrog_steps=5,
           seed=46)
-      initial_x_, updated_x_, acceptance_probs_ = sess.run(
-          [initial_x, updated_x, kernel_results.acceptance_probs])
+      initial_x_, updated_x_, log_accept_ratio_ = sess.run(
+          [initial_x, updated_x, kernel_results.log_accept_ratio])
+      acceptance_probs = np.exp(np.minimum(log_accept_ratio_, 0.))
 
       logging_ops.vlog(1, "initial_x = {}".format(initial_x_))
       logging_ops.vlog(1, "updated_x = {}".format(updated_x_))
-      logging_ops.vlog(1, "acceptance_probs = {}".format(acceptance_probs_))
+      logging_ops.vlog(1, "log_accept_ratio = {}".format(log_accept_ratio_))
 
       self.assertAllEqual(initial_x_, updated_x_)
-      self.assertEqual(acceptance_probs_, 0.)
+      self.assertEqual(acceptance_probs, 0.)
 
   def testNanFromGradsDontPropagate(self):
     """Test that update with NaN gradients does not cause NaN in results."""
@@ -638,15 +643,16 @@ class HMCTest(test.TestCase):
           step_size=2.,
           num_leapfrog_steps=5,
           seed=47)
-      initial_x_, updated_x_, acceptance_probs_ = sess.run(
-          [initial_x, updated_x, kernel_results.acceptance_probs])
+      initial_x_, updated_x_, log_accept_ratio_ = sess.run(
+          [initial_x, updated_x, kernel_results.log_accept_ratio])
+      acceptance_probs = np.exp(np.minimum(log_accept_ratio_, 0.))
 
       logging_ops.vlog(1, "initial_x = {}".format(initial_x_))
       logging_ops.vlog(1, "updated_x = {}".format(updated_x_))
-      logging_ops.vlog(1, "acceptance_probs = {}".format(acceptance_probs_))
+      logging_ops.vlog(1, "log_accept_ratio = {}".format(log_accept_ratio_))
 
       self.assertAllEqual(initial_x_, updated_x_)
-      self.assertEqual(acceptance_probs_, 0.)
+      self.assertEqual(acceptance_probs, 0.)
 
       self.assertAllFinite(
           gradients_ops.gradients(updated_x, initial_x)[0].eval())
@@ -671,10 +677,10 @@ class HMCTest(test.TestCase):
           step_size=0.01,
           num_leapfrog_steps=10,
           seed=48)
-      states_, acceptance_probs_ = sess.run(
-          [states, kernel_results.acceptance_probs])
+      states_, log_accept_ratio_ = sess.run(
+          [states, kernel_results.log_accept_ratio])
       self.assertEqual(dtype, states_.dtype)
-      self.assertEqual(dtype, acceptance_probs_.dtype)
+      self.assertEqual(dtype, log_accept_ratio_.dtype)
 
   def testChainWorksIn64Bit(self):
     self._testChainWorksDtype(np.float64)
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/metropolis_hastings_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/metropolis_hastings_test.py
index 63d93fad64d077aa385b72428665e841b6784b90..f508e5b114a55fc1aeb07212595fda45fc308c7b 100644
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/metropolis_hastings_test.py
+++ b/tensorflow/contrib/bayesflow/python/kernel_tests/metropolis_hastings_test.py
@@ -12,34 +12,195 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tests for metropolis_hastings.py."""
+"""Tests for Metropolis-Hastings."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 import numpy as np
+
 from tensorflow.contrib.bayesflow.python.ops import metropolis_hastings_impl as mh
+from tensorflow.contrib.distributions.python.ops import mvn_tril as mvn_tril_lib
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
+from tensorflow.python.ops.distributions import normal as normal_lib
 from tensorflow.python.platform import test
 
 
-class McmcStepTest(test.TestCase):
+class MetropolisHastingsTest(test.TestCase):
+
+  def testKernelStateTensor(self):
+    """Test that transition kernel works with tensor input to `state`."""
+    loc = variable_scope.get_variable("loc", initializer=0.)
+
+    def target_log_prob_fn(loc):
+      return normal_lib.Normal(loc=0.0, scale=0.1).log_prob(loc)
+
+    new_state, _ = mh.kernel(
+        target_log_prob_fn=target_log_prob_fn,
+        proposal_fn=mh.proposal_normal(scale=0.05),
+        current_state=loc,
+        seed=231251)
+    loc_update = loc.assign(new_state)
+
+    init = variables.global_variables_initializer()
+    with self.test_session() as sess:
+      sess.run(init)
+      loc_samples = []
+      for _ in range(2500):
+        loc_sample = sess.run(loc_update)
+        loc_samples.append(loc_sample)
+    loc_samples = loc_samples[500:]  # drop samples for burn-in
+
+    self.assertAllClose(np.mean(loc_samples), 0.0, rtol=1e-5, atol=1e-1)
+    self.assertAllClose(np.std(loc_samples), 0.1, rtol=1e-5, atol=1e-1)
+
+  def testKernelStateList(self):
+    """Test that transition kernel works with list input to `state`."""
+    num_chains = 2
+    loc_one = variable_scope.get_variable(
+        "loc_one", [num_chains],
+        initializer=init_ops.zeros_initializer())
+    loc_two = variable_scope.get_variable(
+        "loc_two", [num_chains], initializer=init_ops.zeros_initializer())
+
+    def target_log_prob_fn(loc_one, loc_two):
+      loc = array_ops.stack([loc_one, loc_two])
+      log_prob = mvn_tril_lib.MultivariateNormalTriL(
+          loc=constant_op.constant([0., 0.]),
+          scale_tril=constant_op.constant([[0.1, 0.1], [0.0, 0.1]])).log_prob(
+              loc)
+      return math_ops.reduce_sum(log_prob, 0)
+
+    def proposal_fn(loc_one, loc_two):
+      loc_one_proposal = mh.proposal_normal(scale=0.05)
+      loc_two_proposal = mh.proposal_normal(scale=0.05)
+      loc_one_sample, _ = loc_one_proposal(loc_one)
+      loc_two_sample, _ = loc_two_proposal(loc_two)
+      return [loc_one_sample, loc_two_sample], None
+
+    new_state, _ = mh.kernel(
+        target_log_prob_fn=target_log_prob_fn,
+        proposal_fn=proposal_fn,
+        current_state=[loc_one, loc_two],
+        seed=12415)
+    loc_one_update = loc_one.assign(new_state[0])
+    loc_two_update = loc_two.assign(new_state[1])
+
+    init = variables.global_variables_initializer()
+    with self.test_session() as sess:
+      sess.run(init)
+      loc_one_samples = []
+      loc_two_samples = []
+      for _ in range(10000):
+        loc_one_sample, loc_two_sample = sess.run(
+            [loc_one_update, loc_two_update])
+        loc_one_samples.append(loc_one_sample)
+        loc_two_samples.append(loc_two_sample)
+
+    loc_one_samples = np.array(loc_one_samples)
+    loc_two_samples = np.array(loc_two_samples)
+    loc_one_samples = loc_one_samples[1000:]  # drop samples for burn-in
+    loc_two_samples = loc_two_samples[1000:]  # drop samples for burn-in
+
+    self.assertAllClose(np.mean(loc_one_samples, 0),
+                        np.array([0.] * num_chains),
+                        rtol=1e-5, atol=1e-1)
+    self.assertAllClose(np.mean(loc_two_samples, 0),
+                        np.array([0.] * num_chains),
+                        rtol=1e-5, atol=1e-1)
+    self.assertAllClose(np.std(loc_one_samples, 0),
+                        np.array([0.1] * num_chains),
+                        rtol=1e-5, atol=1e-1)
+    self.assertAllClose(np.std(loc_two_samples, 0),
+                        np.array([0.1] * num_chains),
+                        rtol=1e-5, atol=1e-1)
+
+  def testKernelResultsUsingTruncatedDistribution(self):
+    def log_prob(x):
+      return array_ops.where(
+          x >= 0.,
+          -x - x**2,
+          array_ops.fill(x.shape, math_ops.cast(-np.inf, x.dtype)))
+    # The truncated distribution has the property that it is likely to attract
+    # the flow toward, and below, zero...but for x < 0,
+    # log_prob(x) = -inf, which should result in rejection, as well
+    # as a non-finite log_prob.  Thus, this distribution gives us an opportunity
+    # to test out the kernel results ability to correctly capture rejections due
+    # to finite AND non-finite reasons.
+
+    num_results = 1000
+    # Large step size, will give rejections due to going into a region of
+    # log_prob = -inf.
+    step_size = 0.3
+    num_chains = 2
+
+    with self.test_session(graph=ops.Graph()) as sess:
+
+      # Start multiple independent chains.
+      initial_state = ops.convert_to_tensor([0.1] * num_chains)
 
-  def test_density_increasing_step_accepted(self):
+      states = []
+      is_accepted = []
+      proposed_states = []
+      current_state = initial_state
+      for _ in range(num_results):
+        current_state, kernel_results = mh.kernel(
+            target_log_prob_fn=log_prob,
+            proposal_fn=mh.proposal_uniform(step_size=step_size),
+            current_state=current_state,
+            seed=42)
+        states.append(current_state)
+        proposed_states.append(kernel_results.proposed_state)
+        is_accepted.append(kernel_results.is_accepted)
+
+      states = array_ops.stack(states)
+      proposed_states = array_ops.stack(proposed_states)
+      is_accepted = array_ops.stack(is_accepted)
+      states_, pstates_, is_accepted_ = sess.run(
+          [states, proposed_states, is_accepted])
+
+      # We better have accepted a decent amount, even near end of the chain.
+      self.assertLess(
+          0.1, is_accepted_[int(0.9 * num_results):].mean())
+      # We better not have any NaNs in states.
+      self.assertAllEqual(np.zeros_like(states_),
+                          np.isnan(states_))
+      # We better not have any +inf in states.
+      self.assertAllEqual(np.zeros_like(states_),
+                          np.isposinf(states_))
+
+      # The move is accepted ==> state = proposed state.
+      self.assertAllEqual(
+          states_[is_accepted_],
+          pstates_[is_accepted_],
+      )
+
+      # The move was rejected <==> state[t] == state[t - 1].
+      for t in range(1, num_results):
+        for i in range(num_chains):
+          if is_accepted_[t, i]:
+            self.assertNotEqual(states_[t, i], states_[t - 1, i])
+          else:
+            self.assertEqual(states_[t, i], states_[t - 1, i])
+
+  def testDensityIncreasingStepAccepted(self):
     """Tests that if a transition increases density, it is always accepted."""
     target_log_density = lambda x: - x * x
-    state = variable_scope.get_variable('state', initializer=10.)
+    state = variable_scope.get_variable("state", initializer=10.)
     state_log_density = variable_scope.get_variable(
-        'state_log_density',
+        "state_log_density",
         initializer=target_log_density(state.initialized_value()))
     log_accept_ratio = variable_scope.get_variable(
-        'log_accept_ratio', initializer=0.)
+        "log_accept_ratio", initializer=0.)
 
     get_next_proposal = lambda x: (x - 1., None)
     step = mh.evolve(state, state_log_density, log_accept_ratio,
@@ -54,7 +215,7 @@ class McmcStepTest(test.TestCase):
         self.assertAlmostEqual(sample, 9 - j)
         self.assertAlmostEqual(sample_log_density, - (9 - j) * (9 - j))
 
-  def test_sample_properties(self):
+  def testSampleProperties(self):
     """Tests that the samples converge to the target distribution."""
 
     def target_log_density(x):
@@ -62,16 +223,16 @@ class McmcStepTest(test.TestCase):
       return - (x - 2.0) * (x - 2.0) * 0.5
 
     # Use the uniform random walker to generate proposals.
-    proposal_fn = mh.uniform_random_proposal(
+    proposal_fn = mh.proposal_uniform(
         step_size=1.0, seed=1234)
 
-    state = variable_scope.get_variable('state', initializer=0.0)
+    state = variable_scope.get_variable("state", initializer=0.0)
     state_log_density = variable_scope.get_variable(
-        'state_log_density',
+        "state_log_density",
         initializer=target_log_density(state.initialized_value()))
-
     log_accept_ratio = variable_scope.get_variable(
-        'log_accept_ratio', initializer=0.)
+        "log_accept_ratio", initializer=0.)
+
     # Random walk MCMC converges slowly so need to put in enough iterations.
     num_iterations = 5000
     step = mh.evolve(state, state_log_density, log_accept_ratio,
@@ -98,11 +259,11 @@ class McmcStepTest(test.TestCase):
     self.assertAlmostEqual(sample_mean, 2.0, delta=0.1)
     self.assertAlmostEqual(sample_variance, 1.0, delta=0.1)
 
-  def test_normal_proposals(self):
+  def testProposalNormal(self):
     """Tests that the normal proposals are correctly distributed."""
 
     initial_points = array_ops.ones([10000], dtype=dtypes.float32)
-    proposal_fn = mh.normal_random_proposal(
+    proposal_fn = mh.proposal_normal(
         scale=2.0, seed=1234)
     proposal_points, _ = proposal_fn(initial_points)
 
@@ -115,7 +276,7 @@ class McmcStepTest(test.TestCase):
     self.assertAlmostEqual(np.mean(sample), 1.0, delta=0.1)
     self.assertAlmostEqual(np.std(sample), 2.0, delta=0.1)
 
-  def test_docstring_example(self):
+  def testDocstringExample(self):
     """Tests the simplified docstring example with multiple chains."""
 
     n = 2  # dimension of the problem
@@ -123,7 +284,7 @@ class McmcStepTest(test.TestCase):
     # Generate 300 initial values randomly. Each of these would be an
     # independent starting point for a Markov chain.
     state = variable_scope.get_variable(
-        'state', initializer=random_ops.random_normal(
+        "state", initializer=random_ops.random_normal(
             [300, n], mean=3.0, dtype=dtypes.float32, seed=42))
 
     # Computes the log(p(x)) for the unit normal density and ignores the
@@ -133,12 +294,12 @@ class McmcStepTest(test.TestCase):
 
     # Initial log-density value
     state_log_density = variable_scope.get_variable(
-        'state_log_density',
+        "state_log_density",
         initializer=log_density(state.initialized_value()))
 
     # A variable to store the log_acceptance_ratio:
     log_acceptance_ratio = variable_scope.get_variable(
-        'log_acceptance_ratio',
+        "log_acceptance_ratio",
         initializer=array_ops.zeros([300], dtype=dtypes.float32))
 
     # Generates random proposals by moving each coordinate uniformly and
@@ -175,5 +336,5 @@ class McmcStepTest(test.TestCase):
                         - np.reshape(covariance, [n**2]))), 0,
           delta=0.2)
 
-if __name__ == '__main__':
+if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/bayesflow/python/kernel_tests/variable_utils_test.py b/tensorflow/contrib/bayesflow/python/kernel_tests/variable_utils_test.py
deleted file mode 100644
index f978cf86417dc5ff5412a3eee584330a266e0964..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/kernel_tests/variable_utils_test.py
+++ /dev/null
@@ -1,135 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for utility functions related to managing `tf.Variable`s."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import warnings
-
-import numpy as np
-
-from tensorflow.contrib.bayesflow.python.ops import variable_utils
-
-from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import variable_scope as varscope_ops
-from tensorflow.python.ops import variables as variables_ops
-from tensorflow.python.platform import test
-
-
-def test_fn(x):
-  x = ops.convert_to_tensor(x, name="x")
-  dtype = x.dtype.as_numpy_dtype
-  s = x.shape.as_list()
-  z = varscope_ops.get_variable(
-      name="z",
-      dtype=dtype,
-      initializer=np.arange(np.prod(s)).reshape(s).astype(dtype))
-  y = varscope_ops.get_variable(
-      name="y",
-      dtype=dtype,
-      initializer=np.arange(np.prod(s)).reshape(s).astype(dtype)**2)
-  return x + y + z
-
-
-class _WrapCallableTest(object):
-
-  def testDefaultArgsWorkCorrectly(self):
-    with self.test_session():
-      x = constant_op.constant(self.dtype([0.1, 0.2]))
-      wrapped_fn, vars_args = variable_utils.externalize_variables_as_args(
-          test_fn, [x])
-
-      varscope_ops.get_variable_scope().reuse_variables()
-
-      result = wrapped_fn(self.dtype(2), [3, 4, 5], 0.5)
-
-      y_actual = varscope_ops.get_variable("y", dtype=self.dtype)
-      z_actual = varscope_ops.get_variable("z", dtype=self.dtype)
-
-      variables_ops.global_variables_initializer().run()
-      result_ = result.eval()
-
-      self.assertEqual(self.dtype, result_.dtype)
-      self.assertAllEqual([5.5, 6.5, 7.5], result_)
-      self.assertAllEqual([y_actual, z_actual], vars_args)
-
-  def testNonDefaultArgsWorkCorrectly(self):
-    with self.test_session():
-      x = constant_op.constant(self.dtype([0.1, 0.2]))
-
-      _ = test_fn(self.dtype([0., 0.]))   # Needed to create vars.
-      varscope_ops.get_variable_scope().reuse_variables()
-
-      y_actual = varscope_ops.get_variable("y", dtype=self.dtype)
-
-      wrapped_fn, vars_args = variable_utils.externalize_variables_as_args(
-          test_fn, [x], possible_ancestor_vars=[y_actual])
-
-      result = wrapped_fn(self.dtype([2, 3]), 0.5)  # x, y
-
-      variables_ops.global_variables_initializer().run()
-      result_ = result.eval()
-
-      self.assertEqual(self.dtype, result_.dtype)
-      self.assertAllEqual([2.5, 4.5], result_)
-      self.assertAllEqual([y_actual], vars_args)
-
-  def testWarnings(self):
-    with self.test_session():
-      x = constant_op.constant(self.dtype([0.1, 0.2]))
-      wrapped_fn, _ = variable_utils.externalize_variables_as_args(
-          test_fn, [x], possible_ancestor_vars=[])
-      varscope_ops.get_variable_scope().reuse_variables()
-      with warnings.catch_warnings(record=True) as w:
-        wrapped_fn(self.dtype(2))
-      w = sorted(w, key=lambda w: str(w.message))
-      self.assertEqual(2, len(w))
-      self.assertRegexpMatches(
-          str(w[0].message),
-          r"Variable .* 'y:0' .* not found in bypass dict.")
-      self.assertRegexpMatches(
-          str(w[1].message),
-          r"Variable .* 'z:0' .* not found in bypass dict.")
-
-  def testExceptions(self):
-    with self.test_session():
-      x = constant_op.constant(self.dtype([0.1, 0.2]))
-      wrapped_fn, _ = variable_utils.externalize_variables_as_args(
-          test_fn,
-          [x],
-          possible_ancestor_vars=[],
-          assert_variable_override=True)
-      varscope_ops.get_variable_scope().reuse_variables()
-      with self.assertRaisesRegexp(ValueError, r"not found"):
-        wrapped_fn(self.dtype(2))
-
-
-class WrapCallableTest16(test.TestCase, _WrapCallableTest):
-  dtype = np.float16
-
-
-class WrapCallableTest32(test.TestCase, _WrapCallableTest):
-  dtype = np.float32
-
-
-class WrapCallableTest64(test.TestCase, _WrapCallableTest):
-  dtype = np.float64
-
-
-if __name__ == "__main__":
-  test.main()
diff --git a/tensorflow/contrib/bayesflow/python/ops/docstring_util.py b/tensorflow/contrib/bayesflow/python/ops/docstring_util.py
new file mode 100644
index 0000000000000000000000000000000000000000..081f2d5a8bfd437fd173f63b4226fb7df6ca921c
--- /dev/null
+++ b/tensorflow/contrib/bayesflow/python/ops/docstring_util.py
@@ -0,0 +1,88 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Utilities for programmable docstrings.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import re
+import six
+
+
+def expand_docstring(**kwargs):
+  """Decorator to programmatically expand the docstring.
+
+  Args:
+    **kwargs: Keyword arguments to set. For each key-value pair `k` and `v`,
+      the key is found as `@{k}` in the docstring and replaced with `v`.
+
+  Returns:
+    Decorated function.
+  """
+  def _fn_wrapped(fn):
+    """Original function with modified `__doc__` attribute."""
+    doc = _trim(fn.__doc__)
+    for k, v in six.iteritems(kwargs):
+      # Capture each @{k} reference to replace with v.
+      # We wrap the replacement in a function so no backslash escapes
+      # are processed.
+      pattern = r'@\{' + str(k) + r'\}'
+      doc = re.sub(pattern, lambda match: v, doc)  # pylint: disable=cell-var-from-loop
+    fn.__doc__ = doc
+    return fn
+  return _fn_wrapped
+
+
+def _trim(docstring):
+  """Trims docstring indentation.
+
+  In general, multi-line docstrings carry their level of indentation when
+  defined under a function or class method. This function standardizes
+  indentation levels by removing them. Taken from PEP 257 docs.
+
+  Args:
+    docstring: Python string to trim indentation.
+
+  Returns:
+    Trimmed docstring.
+  """
+  if not docstring:
+    return ''
+  # Convert tabs to spaces (following the normal Python rules)
+  # and split into a list of lines:
+  lines = docstring.expandtabs().splitlines()
+  # Determine minimum indentation (first line doesn't count):
+  indent = None
+  for line in lines[1:]:
+    stripped = line.lstrip()
+    if stripped:
+      if indent is None:
+        indent = len(line) - len(stripped)
+      else:
+        indent = min(indent, len(line) - len(stripped))
+  # Remove indentation (first line is special):
+  trimmed = [lines[0].strip()]
+  if indent is not None:
+    for line in lines[1:]:
+      trimmed.append(line[indent:].rstrip())
+  # Strip off trailing and leading blank lines:
+  while trimmed and not trimmed[-1]:
+    trimmed.pop()
+  while trimmed and not trimmed[0]:
+    trimmed.pop(0)
+  # Return a single string:
+  return '\n'.join(trimmed)
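The placeholder substitution performed by `expand_docstring` above can be tried in isolation. The following is a minimal sketch, not the patch's implementation: `expand_placeholders`, `SHARED_ARGS`, and `make_layer` are hypothetical names, `inspect.cleandoc` stands in for the private `_trim` helper, and `re.escape` on the key is an added assumption. It shows why the replacement is passed as a callable: a value containing a backslash is inserted literally instead of being parsed as a regex escape.

```python
import inspect
import re


def expand_placeholders(**kwargs):
  """Hypothetical stand-in for `expand_docstring`: fills `@{key}` markers."""
  def decorator(fn):
    # Normalize indentation first (PEP 257 style), as the patch does with _trim.
    doc = inspect.cleandoc(fn.__doc__ or "")
    for key, value in kwargs.items():
      pattern = r"@\{" + re.escape(str(key)) + r"\}"
      # A callable replacement is inserted literally, so backslashes in
      # `value` are not interpreted as escapes or group references.
      doc = re.sub(pattern, lambda match, value=value: value, doc)
    fn.__doc__ = doc
    return fn
  return decorator


# Continuation lines carry their own indentation, because substitution
# happens after the docstring has been trimmed.
SHARED_ARGS = (
    "scale: Positive float. Written `\\sigma` in the paper.\n"
    "  name: A string, the name of the op.")


@expand_placeholders(args=SHARED_ARGS)
def make_layer(units, scale=1.0, name=None):
  """Builds a toy layer.

  Args:
    units: Integer number of outputs.
    @{args}
  """
  return units, scale, name
```

Calling `help(make_layer)` after decoration shows the shared argument text in place of the `@{args}` marker.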
diff --git a/tensorflow/contrib/bayesflow/python/ops/hmc_impl.py b/tensorflow/contrib/bayesflow/python/ops/hmc_impl.py
index f724910c59315867a42a56fab3deb36f5d3adb7a..82693c2b7bcdbca9f6f4a1d799be5728bb5d36bf 100644
--- a/tensorflow/contrib/bayesflow/python/ops/hmc_impl.py
+++ b/tensorflow/contrib/bayesflow/python/ops/hmc_impl.py
@@ -46,15 +46,13 @@ __all__ = [
 KernelResults = collections.namedtuple(
     "KernelResults",
     [
-        "acceptance_probs",
+        "log_accept_ratio",
         "current_grads_target_log_prob",  # "Current result" means "accepted".
         "current_target_log_prob",  # "Current result" means "accepted".
-        "energy_change",
         "is_accepted",
         "proposed_grads_target_log_prob",
         "proposed_state",
         "proposed_target_log_prob",
-        "random_positive",
     ])
 
 
@@ -63,15 +61,13 @@ def _make_dummy_kernel_results(
     dummy_target_log_prob,
     dummy_grads_target_log_prob):
   return KernelResults(
-      acceptance_probs=dummy_target_log_prob,
+      log_accept_ratio=dummy_target_log_prob,
       current_grads_target_log_prob=dummy_grads_target_log_prob,
       current_target_log_prob=dummy_target_log_prob,
-      energy_change=dummy_target_log_prob,
       is_accepted=array_ops.ones_like(dummy_target_log_prob, dtypes.bool),
       proposed_grads_target_log_prob=dummy_grads_target_log_prob,
       proposed_state=dummy_state,
       proposed_target_log_prob=dummy_target_log_prob,
-      random_positive=dummy_target_log_prob,
   )
 
 
@@ -109,10 +105,13 @@ def sample_chain(
 
   Note: `target_log_prob_fn` is called exactly twice.
 
-  Only one out of every `num_steps_between_samples + 1` steps is included in the
-  returned results. This "thinning" comes at a cost of reduced statistical
-  power, while reducing memory requirements and autocorrelation. For more
-  discussion see [1].
+  Since HMC states are correlated, it is sometimes desirable to produce
+  additional intermediate states, and then discard them, ending up with a set of
+  states with decreased autocorrelation.  See [1].  Such "thinning" is made
+  possible by setting `num_steps_between_results > 0`.  The chain then takes
+  `num_steps_between_results` extra steps between the steps that make it into
+  the results.  The extra steps are never materialized (in calls to `sess.run`),
+  and thus do not increase memory requirements.
 
   [1]: "Statistically efficient thinning of a Markov chain sampler."
        Art B. Owen. April 2017.
@@ -225,10 +224,8 @@ def sample_chain(
       Default value: 0 (i.e., no burn-in).
     num_steps_between_results: Integer number of chain steps between collecting
       a result. Only one out of every `num_steps_between_samples + 1` steps is
-      included in the returned results. This "thinning" comes at a cost of
-      reduced statistical power, while reducing memory requirements and
-      autocorrelation. For more discussion see [1].
-      Default value: 0 (i.e., no subsampling).
+      included in the returned results.  The number of returned chain states is
+      still equal to `num_results`.  Default value: 0 (i.e., no thinning).
     seed: Python integer to seed the random number generator.
     current_target_log_prob: (Optional) `Tensor` representing the value of
       `target_log_prob_fn` at the `current_state`. The only reason to specify
@@ -243,7 +240,7 @@ def sample_chain(
       Default value: `None` (i.e., "hmc_sample_chain").
 
   Returns:
-    accepted_states: Tensor or Python list of `Tensor`s representing the
+    next_states: Tensor or Python list of `Tensor`s representing the
       state(s) of the Markov chain(s) at each result step. Has same shape as
       input `current_state` but with a prepended `num_results`-size dimension.
     kernel_results: `collections.namedtuple` of internal calculations used to
@@ -469,7 +466,7 @@ def sample_annealed_importance_chain(
       Default value: `None` (i.e., "hmc_sample_annealed_importance_chain").
 
   Returns:
-    accepted_state: `Tensor` or Python list of `Tensor`s representing the
+    next_state: `Tensor` or Python list of `Tensor`s representing the
       state(s) of the Markov chain(s) at the final iteration. Has same shape as
       input `current_state`.
     ais_weights: Tensor with the estimated weight(s). Has shape matching
@@ -590,18 +587,19 @@ def kernel(target_log_prob_fn,
 
   target = tfd.Normal(loc=dtype(0), scale=dtype(1))
 
-  new_x, other_results = hmc.kernel(
+  next_x, other_results = hmc.kernel(
       target_log_prob_fn=target.log_prob,
       current_state=x,
       step_size=step_size,
       num_leapfrog_steps=3)[:4]
 
-  x_update = x.assign(new_x)
+  x_update = x.assign(next_x)
 
   step_size_update = step_size.assign_add(
       step_size * tf.where(
-        other_results.acceptance_probs > target_accept_rate,
-        0.01, -0.01))
+          tf.exp(tf.minimum(other_results.log_accept_ratio, 0.)) >
+              target_accept_rate,
+          0.01, -0.01))
 
   warmup = tf.group([x_update, step_size_update])
 
@@ -752,7 +750,7 @@ def kernel(target_log_prob_fn,
       Default value: `None` (i.e., "hmc_kernel").
 
   Returns:
-    accepted_state: Tensor or Python list of `Tensor`s representing the state(s)
+    next_state: Tensor or Python list of `Tensor`s representing the state(s)
       of the Markov chain(s) at each result step. Has same shape as
       `current_state`.
     kernel_results: `collections.namedtuple` of internal calculations used to
@@ -805,30 +803,27 @@ def kernel(target_log_prob_fn,
                                            proposed_target_log_prob,
                                            proposed_momentums,
                                            independent_chain_ndims)
+    log_accept_ratio = -energy_change
 
-    # u < exp(min(-energy, 0)),  where u~Uniform[0,1)
-    # ==> -log(u) >= max(e, 0)
-    # ==> -log(u) >= e
-    # (Perhaps surprisingly, we don't have a better way to obtain a random
-    # uniform from positive reals, i.e., `tf.random_uniform(minval=0,
-    # maxval=np.inf)` won't work.)
-    random_uniform = random_ops.random_uniform(
+    # u < exp(log_accept_ratio),  where u~Uniform[0,1)
+    # ==> log(u) < log_accept_ratio
+    random_value = random_ops.random_uniform(
         shape=array_ops.shape(energy_change),
         dtype=energy_change.dtype,
         seed=seed)
-    random_positive = -math_ops.log(random_uniform)
-    is_accepted = random_positive >= energy_change
+    random_negative = math_ops.log(random_value)
+    is_accepted = random_negative < log_accept_ratio
 
     accepted_target_log_prob = array_ops.where(is_accepted,
                                                proposed_target_log_prob,
                                                current_target_log_prob)
 
-    accepted_state_parts = [_choose(is_accepted,
-                                    proposed_state_part,
-                                    current_state_part,
-                                    independent_chain_ndims)
-                            for current_state_part, proposed_state_part
-                            in zip(current_state_parts, proposed_state_parts)]
+    next_state_parts = [_choose(is_accepted,
+                                proposed_state_part,
+                                current_state_part,
+                                independent_chain_ndims)
+                        for current_state_part, proposed_state_part
+                        in zip(current_state_parts, proposed_state_parts)]
 
     accepted_grads_target_log_prob = [
         _choose(is_accepted,
@@ -840,17 +835,15 @@ def kernel(target_log_prob_fn,
 
     maybe_flatten = lambda x: x if _is_list_like(current_state) else x[0]
     return [
-        maybe_flatten(accepted_state_parts),
+        maybe_flatten(next_state_parts),
         KernelResults(
-            acceptance_probs=math_ops.exp(math_ops.minimum(-energy_change, 0.)),
+            log_accept_ratio=log_accept_ratio,
             current_grads_target_log_prob=accepted_grads_target_log_prob,
             current_target_log_prob=accepted_target_log_prob,
-            energy_change=energy_change,
             is_accepted=is_accepted,
             proposed_grads_target_log_prob=proposed_grads_target_log_prob,
             proposed_state=maybe_flatten(proposed_state_parts),
             proposed_target_log_prob=proposed_target_log_prob,
-            random_positive=random_positive,
         ),
     ]
 
@@ -882,8 +875,8 @@ def _leapfrog_integrator(current_momentums,
   momentum = tf.placeholder(np.float32)
 
   [
-      new_momentums,
-      new_positions,
+      next_momentums,
+      next_positions,
   ] = hmc._leapfrog_integrator(
       current_momentums=[momentum],
       target_log_prob_fn=tfd.MultivariateNormalDiag(
@@ -900,7 +893,7 @@ def _leapfrog_integrator(current_momentums,
   positions = np.zeros([num_iter, dims], dtype)
   for i in xrange(num_iter):
     position_, momentum_ = sess.run(
-        [new_momentums[0], new_position[0]],
+        [next_positions[0], next_momentums[0]],
         feed_dict={position: position_, momentum: momentum_})
     positions[i] = position_
 
@@ -943,9 +936,9 @@ def _leapfrog_integrator(current_momentums,
       state(s) of the Markov chain(s) at each result step. Has same shape as
       input `current_state_parts`.
     proposed_target_log_prob: `Tensor` representing the value of
-      `target_log_prob_fn` at `accepted_state`.
+      `target_log_prob_fn` at `next_state`.
     proposed_grads_target_log_prob: Gradient of `proposed_target_log_prob` wrt
-      `accepted_state`.
+      `next_state`.
 
   Raises:
     ValueError: if `len(momentums) != len(state_parts)`.
@@ -1065,8 +1058,8 @@ def _compute_energy_change(current_target_log_prob,
                                                   axis=-1)
     lk1 = -np.log(2.) + math_ops.reduce_logsumexp(array_ops.stack(lk1, axis=-1),
                                                   axis=-1)
-    lp0 = -current_target_log_prob   # log_potential
-    lp1 = -proposed_target_log_prob  # proposed_log_potential
+    lp0 = -current_target_log_prob   # potential
+    lp1 = -proposed_target_log_prob  # proposed_potential
     x = array_ops.stack([lp1, math_ops.exp(lk1), -lp0, -math_ops.exp(lk0)],
                         axis=-1)
 
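The acceptance test introduced in `hmc_impl.py` above compares `log(u)` against `log_accept_ratio` instead of comparing `u` against `exp(log_accept_ratio)`. The sketch below illustrates that rule with the standard library only. It is an illustration under stated assumptions, not the TensorFlow code: `accept_in_log_space` and `accept_probability` are hypothetical helper names, and the explicit handling of `u == 0` is an added assumption, since `math.log(0.)` raises in plain Python.

```python
import math
import random


def accept_in_log_space(log_accept_ratio, rng):
  """Returns True when a proposal with this log-ratio is accepted."""
  u = rng.random()  # u ~ Uniform[0, 1)
  # log(0) is undefined in pure Python; treat u == 0 as log(u) = -inf.
  log_u = math.log(u) if u > 0. else -math.inf
  # Equivalent to u < exp(log_accept_ratio), without exponentiating.
  return log_u < log_accept_ratio


def accept_probability(log_accept_ratio):
  """Recovers min(1, exp(log_accept_ratio)) from the stored log-ratio."""
  return math.exp(min(log_accept_ratio, 0.))
```

A proposal with `log_accept_ratio = -inf` (for example, one that lands where the target log-probability is `-inf`) is never accepted, and any non-negative log-ratio is always accepted, because `log(u)` is strictly negative for `u` in `(0, 1)`.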
diff --git a/tensorflow/contrib/bayesflow/python/ops/layers_conv_variational.py b/tensorflow/contrib/bayesflow/python/ops/layers_conv_variational.py
index 7723cfb442712626ff415f1412e3362f2392ce9f..cb80718f719ff31fb8ba5066170342fc69630780 100644
--- a/tensorflow/contrib/bayesflow/python/ops/layers_conv_variational.py
+++ b/tensorflow/contrib/bayesflow/python/ops/layers_conv_variational.py
@@ -19,6 +19,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.contrib.bayesflow.python.ops import docstring_util
 from tensorflow.contrib.bayesflow.python.ops import layers_util
 from tensorflow.contrib.distributions.python.ops import independent as independent_lib
 from tensorflow.python.framework import dtypes
@@ -34,6 +35,45 @@ from tensorflow.python.ops.distributions import kullback_leibler as kl_lib
 from tensorflow.python.ops.distributions import normal as normal_lib
 from tensorflow.python.ops.distributions import util as distribution_util
 
+doc_args = """activation: Activation function. Set it to None to maintain a
+      linear activation.
+  activity_regularizer: Optional regularizer function for the output.
+  trainable: Boolean, if `True` also add variables to the graph collection
+    `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
+  kernel_posterior_fn: Python `callable` which creates
+    `tf.distributions.Distribution` instance representing the surrogate
+    posterior of the `kernel` parameter. Default value:
+    `default_mean_field_normal_fn()`.
+  kernel_posterior_tensor_fn: Python `callable` which takes a
+    `tf.distributions.Distribution` instance and returns a representative
+    value. Default value: `lambda d: d.sample()`.
+  kernel_prior_fn: Python `callable` which creates `tf.distributions`
+    instance. See `default_mean_field_normal_fn` docstring for required
+    parameter signature.
+    Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
+  kernel_divergence_fn: Python `callable` which takes the surrogate posterior
+    distribution, prior distribution and random variate sample(s) from the
+    surrogate posterior and computes or approximates the KL divergence. The
+    distributions are `tf.distributions.Distribution`-like instances and the
+    sample is a `Tensor`.
+  bias_posterior_fn: Python `callable` which creates
+    `tf.distributions.Distribution` instance representing the surrogate
+    posterior of the `bias` parameter. Default value:
+    `default_mean_field_normal_fn(is_singular=True)` (which creates an
+    instance of `tf.distributions.Deterministic`).
+  bias_posterior_tensor_fn: Python `callable` which takes a
+    `tf.distributions.Distribution` instance and returns a representative
+    value. Default value: `lambda d: d.sample()`.
+  bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
+    See `default_mean_field_normal_fn` docstring for required parameter
+    signature. Default value: `None` (no prior, no variational inference).
+  bias_divergence_fn: Python `callable` which takes the surrogate posterior
+    distribution, prior distribution and random variate sample(s) from the
+    surrogate posterior and computes or approximates the KL divergence. The
+    distributions are `tf.distributions.Distribution`-like instances and the
+    sample is a `Tensor`.
+  name: A string, the name of the layer."""
+
 
 class _ConvVariational(layers_lib.Layer):
   """Abstract nD convolution layer (private, used as implementation base).
@@ -55,65 +95,6 @@ class _ConvVariational(layers_lib.Layer):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
-    rank: An integer, the rank of the convolution, e.g. "2" for 2D convolution.
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of n integers, specifying the
-      length of the convolution window.
-    strides: An integer or tuple/list of n integers,
-      specifying the stride length of the convolution.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, ..., channels)` while `channels_first` corresponds to
-      inputs with shape `(batch, channels, ...)`.
-    dilation_rate: An integer or tuple/list of n integers, specifying
-      the dilation rate to use for dilated convolution.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any `strides` value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    name: A string, the name of the layer.
-
   Properties:
     rank: Python integer, dimensionality of convolution.
     filters: Python integer, dimensionality of the output space.
@@ -134,6 +115,7 @@ class _ConvVariational(layers_lib.Layer):
     bias_divergence_fn: `callable` returning divergence.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       rank,
@@ -157,6 +139,33 @@ class _ConvVariational(layers_lib.Layer):
       bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      rank: An integer, the rank of the convolution, e.g. "2" for 2D
+        convolution.
+      filters: Integer, the dimensionality of the output space (i.e. the number
+        of filters in the convolution).
+      kernel_size: An integer or tuple/list of n integers, specifying the
+        length of the convolution window.
+      strides: An integer or tuple/list of n integers,
+        specifying the stride length of the convolution.
+        Specifying any stride value != 1 is incompatible with specifying
+        any `dilation_rate` value != 1.
+      padding: One of `"valid"` or `"same"` (case-insensitive).
+      data_format: A string, one of `channels_last` (default) or
+        `channels_first`. The ordering of the dimensions in the inputs.
+        `channels_last` corresponds to inputs with shape `(batch, ...,
+        channels)` while `channels_first` corresponds to inputs with shape
+        `(batch, channels, ...)`.
+      dilation_rate: An integer or tuple/list of n integers, specifying
+        the dilation rate to use for dilated convolution.
+        Currently, specifying any `dilation_rate` value != 1 is
+        incompatible with specifying any `strides` value != 1.
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(_ConvVariational, self).__init__(
         trainable=trainable,
         name=name,
@@ -371,65 +380,6 @@ class _ConvReparameterization(_ConvVariational):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
-    rank: An integer, the rank of the convolution, e.g. "2" for 2D convolution.
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of n integers, specifying the
-      length of the convolution window.
-    strides: An integer or tuple/list of n integers,
-      specifying the stride length of the convolution.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, ..., channels)` while `channels_first` corresponds to
-      inputs with shape `(batch, channels, ...)`.
-    dilation_rate: An integer or tuple/list of n integers, specifying
-      the dilation rate to use for dilated convolution.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any `strides` value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    name: A string, the name of the layer.
-
   Properties:
     rank: Python integer, dimensionality of convolution.
     filters: Python integer, dimensionality of the output space.
@@ -454,6 +404,7 @@ class _ConvReparameterization(_ConvVariational):
         International Conference on Learning Representations, 2014.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       rank,
@@ -477,6 +428,33 @@ class _ConvReparameterization(_ConvVariational):
       bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      rank: An integer, the rank of the convolution, e.g. "2" for 2D
+        convolution.
+      filters: Integer, the dimensionality of the output space (i.e. the number
+        of filters in the convolution).
+      kernel_size: An integer or tuple/list of n integers, specifying the
+        length of the convolution window.
+      strides: An integer or tuple/list of n integers,
+        specifying the stride length of the convolution.
+        Specifying any stride value != 1 is incompatible with specifying
+        any `dilation_rate` value != 1.
+      padding: One of `"valid"` or `"same"` (case-insensitive).
+      data_format: A string, one of `channels_last` (default) or
+        `channels_first`. The ordering of the dimensions in the inputs.
+        `channels_last` corresponds to inputs with shape `(batch, ...,
+        channels)` while `channels_first` corresponds to inputs with shape
+        `(batch, channels, ...)`.
+      dilation_rate: An integer or tuple/list of n integers, specifying
+        the dilation rate to use for dilated convolution.
+        Currently, specifying any `dilation_rate` value != 1 is
+        incompatible with specifying any `strides` value != 1.
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(_ConvReparameterization, self).__init__(
         rank=rank,
         filters=filters,
@@ -529,63 +507,6 @@ class Conv1DReparameterization(_ConvReparameterization):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of a single integer, specifying the
-      length of the 1D convolution window.
-    strides: An integer or tuple/list of a single integer,
-      specifying the stride length of the convolution.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, length, channels)` while `channels_first` corresponds to
-      inputs with shape `(batch, channels, length)`.
-    dilation_rate: An integer or tuple/list of a single integer, specifying
-      the dilation rate to use for dilated convolution.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any `strides` value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    name: A string, the name of the layer.
-
   Properties:
     filters: Python integer, dimensionality of the output space.
     kernel_size: Size of the convolution window.
@@ -639,6 +560,7 @@ class Conv1DReparameterization(_ConvReparameterization):
         International Conference on Learning Representations, 2014.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       filters,
@@ -661,6 +583,31 @@ class Conv1DReparameterization(_ConvReparameterization):
       bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      filters: Integer, the dimensionality of the output space (i.e. the number
+        of filters in the convolution).
+      kernel_size: An integer or tuple/list of a single integer, specifying the
+        length of the 1D convolution window.
+      strides: An integer or tuple/list of a single integer,
+        specifying the stride length of the convolution.
+        Specifying any stride value != 1 is incompatible with specifying
+        any `dilation_rate` value != 1.
+      padding: One of `"valid"` or `"same"` (case-insensitive).
+      data_format: A string, one of `channels_last` (default) or
+        `channels_first`. The ordering of the dimensions in the inputs.
+        `channels_last` corresponds to inputs with shape `(batch, length,
+        channels)` while `channels_first` corresponds to inputs with shape
+        `(batch, channels, length)`.
+      dilation_rate: An integer or tuple/list of a single integer, specifying
+        the dilation rate to use for dilated convolution.
+        Currently, specifying any `dilation_rate` value != 1 is
+        incompatible with specifying any `strides` value != 1.
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(Conv1DReparameterization, self).__init__(
         rank=1,
         filters=filters,
@@ -683,6 +630,7 @@ class Conv1DReparameterization(_ConvReparameterization):
         name=name, **kwargs)
 
 
+@docstring_util.expand_docstring(args=doc_args)
 def conv1d_reparameterization(
     inputs,
     filters,
@@ -705,6 +653,7 @@ def conv1d_reparameterization(
     bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
     name=None,
     reuse=None):
+  # pylint: disable=g-doc-args
   """Functional interface for 1D convolution layer (e.g. temporal convolution).
 
   This layer creates a convolution kernel that is convolved
@@ -726,7 +675,7 @@ def conv1d_reparameterization(
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
+  Args:
     inputs: Tensor input.
     filters: Integer, the dimensionality of the output space (i.e. the number
       of filters in the convolution).
@@ -746,43 +695,7 @@ def conv1d_reparameterization(
       the dilation rate to use for dilated convolution.
       Currently, specifying any `dilation_rate` value != 1 is
       incompatible with specifying any `strides` value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    name: A string, the name of the layer.
+    @{args}
     reuse: Boolean, whether to reuse the weights of a previous layer
       by the same name.
 
@@ -827,6 +740,7 @@ def conv1d_reparameterization(
         Diederik P. Kingma, Max Welling.
         International Conference on Learning Representations, 2014.
   """
+  # pylint: enable=g-doc-args
   layer = Conv1DReparameterization(
       filters=filters,
       kernel_size=kernel_size,
@@ -874,70 +788,6 @@ class Conv2DReparameterization(_ConvReparameterization):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of 2 integers, specifying the
-      height and width of the 2D convolution window.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-    strides: An integer or tuple/list of 2 integers,
-      specifying the strides of the convolution along the height and width.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, height, width, channels)` while `channels_first` corresponds to
-      inputs with shape `(batch, channels, height, width)`.
-
-    dilation_rate: An integer or tuple/list of 2 integers, specifying
-      the dilation rate to use for dilated convolution.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any stride value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    name: A string, the name of the layer.
-
   Properties:
     filters: Python integer, dimensionality of the output space.
     kernel_size: Size of the convolution window.
@@ -994,6 +844,7 @@ class Conv2DReparameterization(_ConvReparameterization):
         International Conference on Learning Representations, 2014.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       filters,
@@ -1016,6 +867,37 @@ class Conv2DReparameterization(_ConvReparameterization):
       bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      filters: Integer, the dimensionality of the output space (i.e. the number
+        of filters in the convolution).
+      kernel_size: An integer or tuple/list of 2 integers, specifying the
+        height and width of the 2D convolution window.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+      strides: An integer or tuple/list of 2 integers,
+        specifying the strides of the convolution along the height and width.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+        Specifying any stride value != 1 is incompatible with specifying
+        any `dilation_rate` value != 1.
+      padding: One of `"valid"` or `"same"` (case-insensitive).
+      data_format: A string, one of `channels_last` (default) or
+        `channels_first`. The ordering of the dimensions in the inputs.
+        `channels_last` corresponds to inputs with shape `(batch, height,
+        width, channels)` while `channels_first` corresponds to inputs with
+        shape `(batch, channels, height, width)`.
+      dilation_rate: An integer or tuple/list of 2 integers, specifying
+        the dilation rate to use for dilated convolution.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+        Currently, specifying any `dilation_rate` value != 1 is
+        incompatible with specifying any stride value != 1.
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(Conv2DReparameterization, self).__init__(
         rank=2,
         filters=filters,
@@ -1038,6 +920,7 @@ class Conv2DReparameterization(_ConvReparameterization):
         name=name, **kwargs)
 
 
+@docstring_util.expand_docstring(args=doc_args)
 def conv2d_reparameterization(
     inputs,
     filters,
@@ -1060,6 +943,7 @@ def conv2d_reparameterization(
     bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
     name=None,
     reuse=None):
+  # pylint: disable=g-doc-args
   """Functional interface for the 2D convolution layer.
 
   This layer creates a convolution kernel that is convolved
@@ -1081,7 +965,7 @@ def conv2d_reparameterization(
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
+  Args:
     inputs: Tensor input.
     filters: Integer, the dimensionality of the output space (i.e. the number
       of filters in the convolution).
@@ -1101,50 +985,13 @@ def conv2d_reparameterization(
       `channels_last` corresponds to inputs with shape
       `(batch, height, width, channels)` while `channels_first` corresponds to
       inputs with shape `(batch, channels, height, width)`.
-
     dilation_rate: An integer or tuple/list of 2 integers, specifying
       the dilation rate to use for dilated convolution.
       Can be a single integer to specify the same value for
       all spatial dimensions.
       Currently, specifying any `dilation_rate` value != 1 is
       incompatible with specifying any stride value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    name: A string, the name of the layer.
+    @{args}
     reuse: Boolean, whether to reuse the weights of a previous layer
       by the same name.
 
@@ -1193,6 +1040,7 @@ def conv2d_reparameterization(
         Diederik P. Kingma, Max Welling.
         International Conference on Learning Representations, 2014.
   """
+  # pylint: enable=g-doc-args
   layer = Conv2DReparameterization(
       filters=filters,
       kernel_size=kernel_size,
@@ -1240,71 +1088,6 @@ class Conv3DReparameterization(_ConvReparameterization):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of 3 integers, specifying the
-      depth, height and width of the 3D convolution window.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-    strides: An integer or tuple/list of 3 integers,
-      specifying the strides of the convolution along the depth,
-      height and width.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, depth, height, width, channels)` while `channels_first`
-      corresponds to inputs with shape
-      `(batch, channels, depth, height, width)`.
-    dilation_rate: An integer or tuple/list of 3 integers, specifying
-      the dilation rate to use for dilated convolution.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any stride value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    name: A string, the name of the layer.
-
   Properties:
     filters: Python integer, dimensionality of the output space.
     kernel_size: Size of the convolution window.
@@ -1361,6 +1144,7 @@ class Conv3DReparameterization(_ConvReparameterization):
         International Conference on Learning Representations, 2014.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       filters,
@@ -1383,6 +1167,38 @@ class Conv3DReparameterization(_ConvReparameterization):
       bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      filters: Integer, the dimensionality of the output space (i.e. the number
+        of filters in the convolution).
+      kernel_size: An integer or tuple/list of 3 integers, specifying the
+        depth, height and width of the 3D convolution window.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+      strides: An integer or tuple/list of 3 integers,
+        specifying the strides of the convolution along the depth,
+        height and width.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+        Specifying any stride value != 1 is incompatible with specifying
+        any `dilation_rate` value != 1.
+      padding: One of `"valid"` or `"same"` (case-insensitive).
+      data_format: A string, one of `channels_last` (default) or
+        `channels_first`. The ordering of the dimensions in the inputs.
+        `channels_last` corresponds to inputs with shape `(batch, depth,
+        height, width, channels)` while `channels_first` corresponds to inputs
+        with shape `(batch, channels, depth, height, width)`.
+      dilation_rate: An integer or tuple/list of 3 integers, specifying
+        the dilation rate to use for dilated convolution.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+        Currently, specifying any `dilation_rate` value != 1 is
+        incompatible with specifying any `strides` value != 1.
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(Conv3DReparameterization, self).__init__(
         rank=3,
         filters=filters,
@@ -1405,6 +1221,7 @@ class Conv3DReparameterization(_ConvReparameterization):
         name=name, **kwargs)
 
 
+@docstring_util.expand_docstring(args=doc_args)
 def conv3d_reparameterization(
     inputs,
     filters,
@@ -1427,6 +1244,7 @@ def conv3d_reparameterization(
     bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
     name=None,
     reuse=None):
+  # pylint: disable=g-doc-args
   """Functional interface for the 3D convolution layer.
 
   This layer creates a convolution kernel that is convolved
@@ -1448,7 +1266,7 @@ def conv3d_reparameterization(
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
+  Args:
     inputs: Tensor input.
     filters: Integer, the dimensionality of the output space (i.e. the number
       of filters in the convolution).
@@ -1476,43 +1294,7 @@ def conv3d_reparameterization(
       all spatial dimensions.
       Currently, specifying any `dilation_rate` value != 1 is
       incompatible with specifying any stride value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    name: A string, the name of the layer.
+    @{args}
     reuse: Boolean, whether to reuse the weights of a previous layer
       by the same name.
 
@@ -1561,6 +1343,7 @@ def conv3d_reparameterization(
         Diederik P. Kingma, Max Welling.
         International Conference on Learning Representations, 2014.
   """
+  # pylint: enable=g-doc-args
   layer = Conv3DReparameterization(
       filters=filters,
       kernel_size=kernel_size,
@@ -1611,67 +1394,6 @@ class _ConvFlipout(_ConvVariational):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
-    rank: An integer, the rank of the convolution, e.g. "2" for 2D convolution.
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of n integers, specifying the
-      length of the convolution window.
-    strides: An integer or tuple/list of n integers,
-      specifying the stride length of the convolution.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, ..., channels)` while `channels_first` corresponds to
-      inputs with shape `(batch, channels, ...)`.
-    dilation_rate: An integer or tuple/list of n integers, specifying
-      the dilation rate to use for dilated convolution.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any `strides` value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    seed: Python scalar `int` which initializes the random number
-      generator. Default value: `None` (i.e., use global seed).
-    name: A string, the name of the layer.
-
   Properties:
     rank: Python integer, dimensionality of convolution.
     filters: Python integer, dimensionality of the output space.
@@ -1694,10 +1416,11 @@ class _ConvFlipout(_ConvVariational):
 
   [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
         Mini-Batches."
-        Anonymous. OpenReview, 2017.
-        https://openreview.net/forum?id=rJnpifWAb
+        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
+        International Conference on Learning Representations, 2018.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       rank,
@@ -1722,6 +1445,33 @@ class _ConvFlipout(_ConvVariational):
       seed=None,
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      rank: An integer, the rank of the convolution, e.g. "2" for 2D
+        convolution.
+      filters: Integer, the dimensionality of the output space (i.e. the number
+        of filters in the convolution).
+      kernel_size: An integer or tuple/list of n integers, specifying the
+        length of the convolution window.
+      strides: An integer or tuple/list of n integers,
+        specifying the stride length of the convolution.
+        Specifying any stride value != 1 is incompatible with specifying
+        any `dilation_rate` value != 1.
+      padding: One of `"valid"` or `"same"` (case-insensitive).
+      data_format: A string, one of `channels_last` (default) or
+        `channels_first`. The ordering of the dimensions in the inputs.
+        `channels_last` corresponds to inputs with shape `(batch, ...,
+        channels)` while `channels_first` corresponds to inputs with shape
+        `(batch, channels, ...)`.
+      dilation_rate: An integer or tuple/list of n integers, specifying
+        the dilation rate to use for dilated convolution.
+        Currently, specifying any `dilation_rate` value != 1 is
+        incompatible with specifying any `strides` value != 1.
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(_ConvFlipout, self).__init__(
         rank=rank,
         filters=filters,
@@ -1822,65 +1572,6 @@ class Conv1DFlipout(_ConvFlipout):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of a single integer, specifying the
-      length of the 1D convolution window.
-    strides: An integer or tuple/list of a single integer,
-      specifying the stride length of the convolution.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, length, channels)` while `channels_first` corresponds to
-      inputs with shape `(batch, channels, length)`.
-    dilation_rate: An integer or tuple/list of a single integer, specifying
-      the dilation rate to use for dilated convolution.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any `strides` value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    seed: Python scalar `int` which initializes the random number
-      generator. Default value: `None` (i.e., use global seed).
-    name: A string, the name of the layer.
-
   Properties:
     filters: Python integer, dimensionality of the output space.
     kernel_size: Size of the convolution window.
@@ -1932,10 +1623,11 @@ class Conv1DFlipout(_ConvFlipout):
 
   [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
         Mini-Batches."
-        Anonymous. OpenReview, 2017.
-        https://openreview.net/forum?id=rJnpifWAb
+        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
+        International Conference on Learning Representations, 2018.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       filters,
@@ -1959,6 +1651,31 @@ class Conv1DFlipout(_ConvFlipout):
       seed=None,
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      filters: Integer, the dimensionality of the output space (i.e. the number
+        of filters in the convolution).
+      kernel_size: An integer or tuple/list of a single integer, specifying the
+        length of the 1D convolution window.
+      strides: An integer or tuple/list of a single integer,
+        specifying the stride length of the convolution.
+        Specifying any stride value != 1 is incompatible with specifying
+        any `dilation_rate` value != 1.
+      padding: One of `"valid"` or `"same"` (case-insensitive).
+      data_format: A string, one of `channels_last` (default) or
+        `channels_first`. The ordering of the dimensions in the inputs.
+        `channels_last` corresponds to inputs with shape `(batch, length,
+        channels)` while `channels_first` corresponds to inputs with shape
+        `(batch, channels, length)`.
+      dilation_rate: An integer or tuple/list of a single integer, specifying
+        the dilation rate to use for dilated convolution.
+        Currently, specifying any `dilation_rate` value != 1 is
+        incompatible with specifying any `strides` value != 1.
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(Conv1DFlipout, self).__init__(
         rank=1,
         filters=filters,
@@ -1982,6 +1699,7 @@ class Conv1DFlipout(_ConvFlipout):
         name=name, **kwargs)
 
 
+@docstring_util.expand_docstring(args=doc_args)
 def conv1d_flipout(
     inputs,
     filters,
@@ -2005,6 +1723,7 @@ def conv1d_flipout(
     seed=None,
     name=None,
     reuse=None):
+  # pylint: disable=g-doc-args
   """Functional interface for 1D convolution layer (e.g. temporal convolution).
 
   This layer creates a convolution kernel that is convolved
@@ -2029,7 +1748,7 @@ def conv1d_flipout(
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
+  Args:
     inputs: Tensor input.
     filters: Integer, the dimensionality of the output space (i.e. the number
       of filters in the convolution).
@@ -2049,45 +1768,7 @@ def conv1d_flipout(
       the dilation rate to use for dilated convolution.
       Currently, specifying any `dilation_rate` value != 1 is
       incompatible with specifying any `strides` value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    seed: Python scalar `int` which initializes the random number
-      generator. Default value: `None` (i.e., use global seed).
-    name: A string, the name of the layer.
+    @{args}
     reuse: Boolean, whether to reuse the weights of a previous layer
       by the same name.
 
@@ -2130,9 +1811,10 @@ def conv1d_flipout(
 
   [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
         Mini-Batches."
-        Anonymous. OpenReview, 2017.
-        https://openreview.net/forum?id=rJnpifWAb
+        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
+        International Conference on Learning Representations, 2018.
   """
+  # pylint: enable=g-doc-args
   layer = Conv1DFlipout(
       filters=filters,
       kernel_size=kernel_size,
@@ -2184,72 +1866,6 @@ class Conv2DFlipout(_ConvFlipout):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of 2 integers, specifying the
-      height and width of the 2D convolution window.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-    strides: An integer or tuple/list of 2 integers,
-      specifying the strides of the convolution along the height and width.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, height, width, channels)` while `channels_first` corresponds to
-      inputs with shape `(batch, channels, height, width)`.
-
-    dilation_rate: An integer or tuple/list of 2 integers, specifying
-      the dilation rate to use for dilated convolution.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any stride value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    seed: Python scalar `int` which initializes the random number
-      generator. Default value: `None` (i.e., use global seed).
-    name: A string, the name of the layer.
-
   Properties:
     filters: Python integer, dimensionality of the output space.
     kernel_size: Size of the convolution window.
@@ -2304,10 +1920,11 @@ class Conv2DFlipout(_ConvFlipout):
 
   [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
         Mini-Batches."
-        Anonymous. OpenReview, 2017.
-        https://openreview.net/forum?id=rJnpifWAb
+        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
+        International Conference on Learning Representations, 2018.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       filters,
@@ -2331,6 +1948,37 @@ class Conv2DFlipout(_ConvFlipout):
       seed=None,
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      filters: Integer, the dimensionality of the output space (i.e. the number
+        of filters in the convolution).
+      kernel_size: An integer or tuple/list of 2 integers, specifying the
+        height and width of the 2D convolution window.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+      strides: An integer or tuple/list of 2 integers,
+        specifying the strides of the convolution along the height and width.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+        Specifying any stride value != 1 is incompatible with specifying
+        any `dilation_rate` value != 1.
+      padding: One of `"valid"` or `"same"` (case-insensitive).
+      data_format: A string, one of `channels_last` (default) or
+        `channels_first`. The ordering of the dimensions in the inputs.
+        `channels_last` corresponds to inputs with shape `(batch, height,
+        width, channels)` while `channels_first` corresponds to inputs with
+        shape `(batch, channels, height, width)`.
+      dilation_rate: An integer or tuple/list of 2 integers, specifying
+        the dilation rate to use for dilated convolution.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+        Currently, specifying any `dilation_rate` value != 1 is
+        incompatible with specifying any `strides` value != 1.
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(Conv2DFlipout, self).__init__(
         rank=2,
         filters=filters,
@@ -2354,6 +2002,7 @@ class Conv2DFlipout(_ConvFlipout):
         name=name, **kwargs)
 
 
+@docstring_util.expand_docstring(args=doc_args)
 def conv2d_flipout(
     inputs,
     filters,
@@ -2377,6 +2026,7 @@ def conv2d_flipout(
     seed=None,
     name=None,
     reuse=None):
+  # pylint: disable=g-doc-args
   """Functional interface for the 2D convolution layer.
 
   This layer creates a convolution kernel that is convolved
@@ -2401,7 +2051,7 @@ def conv2d_flipout(
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
+  Args:
     inputs: Tensor input.
     filters: Integer, the dimensionality of the output space (i.e. the number
       of filters in the convolution).
@@ -2421,52 +2071,13 @@ def conv2d_flipout(
       `channels_last` corresponds to inputs with shape
       `(batch, height, width, channels)` while `channels_first` corresponds to
       inputs with shape `(batch, channels, height, width)`.
-
     dilation_rate: An integer or tuple/list of 2 integers, specifying
       the dilation rate to use for dilated convolution.
       Can be a single integer to specify the same value for
       all spatial dimensions.
       Currently, specifying any `dilation_rate` value != 1 is
       incompatible with specifying any stride value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    seed: Python scalar `int` which initializes the random number
-      generator. Default value: `None` (i.e., use global seed).
-    name: A string, the name of the layer.
+    @{args}
     reuse: Boolean, whether to reuse the weights of a previous layer
       by the same name.
 
@@ -2513,9 +2124,10 @@ def conv2d_flipout(
 
   [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
         Mini-Batches."
-        Anonymous. OpenReview, 2017.
-        https://openreview.net/forum?id=rJnpifWAb
+        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
+        International Conference on Learning Representations, 2018.
   """
+  # pylint: enable=g-doc-args
   layer = Conv2DFlipout(
       filters=filters,
       kernel_size=kernel_size,
@@ -2567,73 +2179,6 @@ class Conv3DFlipout(_ConvFlipout):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
-    filters: Integer, the dimensionality of the output space (i.e. the number
-      of filters in the convolution).
-    kernel_size: An integer or tuple/list of 3 integers, specifying the
-      depth, height and width of the 3D convolution window.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-    strides: An integer or tuple/list of 3 integers,
-      specifying the strides of the convolution along the depth,
-      height and width.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Specifying any stride value != 1 is incompatible with specifying
-      any `dilation_rate` value != 1.
-    padding: One of `"valid"` or `"same"` (case-insensitive).
-    data_format: A string, one of `channels_last` (default) or `channels_first`.
-      The ordering of the dimensions in the inputs.
-      `channels_last` corresponds to inputs with shape
-      `(batch, depth, height, width, channels)` while `channels_first`
-      corresponds to inputs with shape
-      `(batch, channels, depth, height, width)`.
-    dilation_rate: An integer or tuple/list of 3 integers, specifying
-      the dilation rate to use for dilated convolution.
-      Can be a single integer to specify the same value for
-      all spatial dimensions.
-      Currently, specifying any `dilation_rate` value != 1 is
-      incompatible with specifying any stride value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    seed: Python scalar `int` which initializes the random number
-      generator. Default value: `None` (i.e., use global seed).
-    name: A string, the name of the layer.
-
   Properties:
     filters: Python integer, dimensionality of the output space.
     kernel_size: Size of the convolution window.
@@ -2688,10 +2233,11 @@ class Conv3DFlipout(_ConvFlipout):
 
   [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
         Mini-Batches."
-        Anonymous. OpenReview, 2017.
-        https://openreview.net/forum?id=rJnpifWAb
+        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
+        International Conference on Learning Representations, 2018.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       filters,
@@ -2715,6 +2261,38 @@ class Conv3DFlipout(_ConvFlipout):
       seed=None,
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      filters: Integer, the dimensionality of the output space (i.e. the number
+        of filters in the convolution).
+      kernel_size: An integer or tuple/list of 3 integers, specifying the
+        depth, height and width of the 3D convolution window.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+      strides: An integer or tuple/list of 3 integers,
+        specifying the strides of the convolution along the depth,
+        height and width.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+        Specifying any stride value != 1 is incompatible with specifying
+        any `dilation_rate` value != 1.
+      padding: One of `"valid"` or `"same"` (case-insensitive).
+      data_format: A string, one of `channels_last` (default) or
+        `channels_first`. The ordering of the dimensions in the inputs.
+        `channels_last` corresponds to inputs with shape `(batch, depth,
+        height, width, channels)` while `channels_first` corresponds to inputs
+        with shape `(batch, channels, depth, height, width)`.
+      dilation_rate: An integer or tuple/list of 3 integers, specifying
+        the dilation rate to use for dilated convolution.
+        Can be a single integer to specify the same value for
+        all spatial dimensions.
+        Currently, specifying any `dilation_rate` value != 1 is
+        incompatible with specifying any stride value != 1.
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(Conv3DFlipout, self).__init__(
         rank=3,
         filters=filters,
@@ -2738,6 +2316,7 @@ class Conv3DFlipout(_ConvFlipout):
         name=name, **kwargs)
 
 
+@docstring_util.expand_docstring(args=doc_args)
 def conv3d_flipout(
     inputs,
     filters,
@@ -2761,6 +2340,7 @@ def conv3d_flipout(
     seed=None,
     name=None,
     reuse=None):
+  # pylint: disable=g-doc-args
   """Functional interface for the 3D convolution layer.
 
   This layer creates a convolution kernel that is convolved
@@ -2785,7 +2365,7 @@ def conv3d_flipout(
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Arguments:
+  Args:
     inputs: Tensor input.
     filters: Integer, the dimensionality of the output space (i.e. the number
       of filters in the convolution).
@@ -2813,45 +2393,7 @@ def conv3d_flipout(
       all spatial dimensions.
       Currently, specifying any `dilation_rate` value != 1 is
       incompatible with specifying any stride value != 1.
-    activation: Activation function. Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Optional regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-    seed: Python scalar `int` which initializes the random number
-      generator. Default value: `None` (i.e., use global seed).
-    name: A string, the name of the layer.
+    @{args}
     reuse: Boolean, whether to reuse the weights of a previous layer
       by the same name.
 
@@ -2898,9 +2440,10 @@ def conv3d_flipout(
 
   [1]: "Flipout: Efficient Pseudo-Independent Weight Perturbations on
         Mini-Batches."
-        Anonymous. OpenReview, 2017.
-        https://openreview.net/forum?id=rJnpifWAb
+        Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse.
+        International Conference on Learning Representations, 2018.
   """
+  # pylint: enable=g-doc-args
   layer = Conv3DFlipout(
       filters=filters,
       kernel_size=kernel_size,
diff --git a/tensorflow/contrib/bayesflow/python/ops/layers_dense_variational.py b/tensorflow/contrib/bayesflow/python/ops/layers_dense_variational.py
index 591a8e553de0c194786c7ee8693665f762711b2d..1f1d8fda2a5db4db33a2b6e5d7f027c4b509011a 100644
--- a/tensorflow/contrib/bayesflow/python/ops/layers_dense_variational.py
+++ b/tensorflow/contrib/bayesflow/python/ops/layers_dense_variational.py
@@ -19,6 +19,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.contrib.bayesflow.python.ops import docstring_util
 from tensorflow.contrib.bayesflow.python.ops import layers_util
 from tensorflow.contrib.distributions.python.ops import independent as independent_lib
 from tensorflow.python.framework import dtypes
@@ -33,6 +34,53 @@ from tensorflow.python.ops.distributions import normal as normal_lib
 from tensorflow.python.ops.distributions import util as distribution_util
 
 
+doc_args = """units: Integer or Long, dimensionality of the output space.
+  activation: Activation function (`callable`). Set it to None to maintain a
+    linear activation.
+  activity_regularizer: Regularizer function for the output.
+  trainable: Boolean, if `True` also add variables to the graph collection
+    `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
+  kernel_posterior_fn: Python `callable` which creates
+    `tf.distributions.Distribution` instance representing the surrogate
+    posterior of the `kernel` parameter. Default value:
+    `default_mean_field_normal_fn()`.
+  kernel_posterior_tensor_fn: Python `callable` which takes a
+    `tf.distributions.Distribution` instance and returns a representative
+    value. Default value: `lambda d: d.sample()`.
+  kernel_prior_fn: Python `callable` which creates `tf.distributions`
+    instance. See `default_mean_field_normal_fn` docstring for required
+    parameter signature.
+    Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
+  kernel_divergence_fn: Python `callable` which takes the surrogate posterior
+    distribution, prior distribution and random variate sample(s) from the
+    surrogate posterior and computes or approximates the KL divergence. The
+    distributions are `tf.distributions.Distribution`-like instances and the
+    sample is a `Tensor`.
+  bias_posterior_fn: Python `callable` which creates
+    `tf.distributions.Distribution` instance representing the surrogate
+    posterior of the `bias` parameter. Default value:
+    `default_mean_field_normal_fn(is_singular=True)` (which creates an
+    instance of `tf.distributions.Deterministic`).
+  bias_posterior_tensor_fn: Python `callable` which takes a
+    `tf.distributions.Distribution` instance and returns a representative
+    value. Default value: `lambda d: d.sample()`.
+  bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
+    See `default_mean_field_normal_fn` docstring for required parameter
+    signature. Default value: `None` (no prior, no variational inference).
+  bias_divergence_fn: Python `callable` which takes the surrogate posterior
+    distribution, prior distribution and random variate sample(s) from the
+    surrogate posterior and computes or approximates the KL divergence. The
+    distributions are `tf.distributions.Distribution`-like instances and the
+    sample is a `Tensor`.
+  seed: Python scalar `int` which initializes the random number
+    generator. Default value: `None` (i.e., use global seed).
+  name: Python `str`, the name of the layer. Layers with the same name will
+    share `tf.Variable`s, but to avoid mistakes we require `reuse=True` in
+    such cases.
+  reuse: Python `bool`, whether to reuse the `tf.Variable`s of a previous
+    layer by the same name."""
+
+
 class _DenseVariational(layers_lib.Layer):
   """Abstract densely-connected class (private, used as implementation base).
 
@@ -50,51 +98,6 @@ class _DenseVariational(layers_lib.Layer):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Args:
-    units: Integer or Long, dimensionality of the output space.
-    activation: Activation function (`callable`). Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    name: Python `str`, the name of the layer. Layers with the same name will
-      share `tf.Variable`s, but to avoid mistakes we require `reuse=True` in
-      such cases.
-    reuse: Python `bool`, whether to reuse the `tf.Variable`s of a previous
-      layer by the same name.
-
   Properties:
     units: Python integer, dimensionality of the output space.
     activation: Activation function (`callable`).
@@ -109,6 +112,7 @@ class _DenseVariational(layers_lib.Layer):
     bias_divergence_fn: `callable` returning divergence.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       units,
@@ -126,6 +130,13 @@ class _DenseVariational(layers_lib.Layer):
       bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(_DenseVariational, self).__init__(
         trainable=trainable,
         name=name,
@@ -274,51 +285,6 @@ class DenseReparameterization(_DenseVariational):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Args:
-    units: Integer or Long, dimensionality of the output space.
-    activation: Activation function (`callable`). Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    name: Python `str`, the name of the layer. Layers with the same name will
-      share `tf.Variable`s, but to avoid mistakes we require `reuse=True` in
-      such cases.
-    reuse: Python `bool`, whether to reuse the `tf.Variable`s of a previous
-      layer by the same name.
-
   Properties:
     units: Python integer, dimensionality of the output space.
     activation: Activation function (`callable`).
@@ -363,6 +329,7 @@ class DenseReparameterization(_DenseVariational):
         International Conference on Learning Representations, 2014.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       units,
@@ -381,6 +348,13 @@ class DenseReparameterization(_DenseVariational):
       bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(DenseReparameterization, self).__init__(
         units=units,
         activation=activation,
@@ -405,6 +379,7 @@ class DenseReparameterization(_DenseVariational):
     return self._matmul(inputs, self.kernel_posterior_tensor)
 
 
+@docstring_util.expand_docstring(args=doc_args)
 def dense_reparameterization(
     inputs,
     units,
@@ -422,6 +397,7 @@ def dense_reparameterization(
     bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
     name=None,
     reuse=None):
+  # pylint: disable=g-doc-args
   """Densely-connected layer with reparameterization estimator.
 
   This layer implements the Bayesian variational inference analogue to
@@ -444,49 +420,7 @@ def dense_reparameterization(
 
   Args:
     inputs: Tensor input.
-    units: Integer or Long, dimensionality of the output space.
-    activation: Activation function (`callable`). Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    name: Python `str`, the name of the layer. Layers with the same name will
-      share `tf.Variable`s, but to avoid mistakes we require `reuse=True` in
-      such cases.
-    reuse: Python `bool`, whether to reuse the `tf.Variable`s of a previous
-      layer by the same name.
+    @{args}
 
   Returns:
     output: `Tensor` representing a the affine transformed input under a random
@@ -522,6 +456,7 @@ def dense_reparameterization(
         Diederik P. Kingma, Max Welling.
         International Conference on Learning Representations, 2014.
   """
+  # pylint: enable=g-doc-args
   layer = DenseReparameterization(
       units,
       activation=activation,
@@ -563,51 +498,6 @@ class DenseLocalReparameterization(_DenseVariational):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Args:
-    units: Integer or Long, dimensionality of the output space.
-    activation: Activation function (`callable`). Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    name: Python `str`, the name of the layer. Layers with the same name will
-      share `tf.Variable`s, but to avoid mistakes we require `reuse=True` in
-      such cases.
-    reuse: Python `bool`, whether to reuse the `tf.Variable`s of a previous
-      layer by the same name.
-
   Properties:
     units: Python integer, dimensionality of the output space.
     activation: Activation function (`callable`).
@@ -652,6 +542,7 @@ class DenseLocalReparameterization(_DenseVariational):
         Neural Information Processing Systems, 2015.
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       units,
@@ -670,6 +561,13 @@ class DenseLocalReparameterization(_DenseVariational):
       bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(DenseLocalReparameterization, self).__init__(
         units=units,
         activation=activation,
@@ -705,6 +603,7 @@ class DenseLocalReparameterization(_DenseVariational):
     return self.kernel_posterior_affine_tensor
 
 
+@docstring_util.expand_docstring(args=doc_args)
 def dense_local_reparameterization(
     inputs,
     units,
@@ -723,6 +622,7 @@ def dense_local_reparameterization(
     bias_divergence_fn=lambda q, p, ignore: kl_lib.kl_divergence(q, p),
     name=None,
     reuse=None):
+  # pylint: disable=g-doc-args
   """Densely-connected layer with local reparameterization estimator.
 
   This layer implements the Bayesian variational inference analogue to
@@ -745,49 +645,7 @@ def dense_local_reparameterization(
 
   Args:
     inputs: Tensor input.
-    units: Integer or Long, dimensionality of the output space.
-    activation: Activation function (`callable`). Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    name: Python `str`, the name of the layer. Layers with the same name will
-      share `tf.Variable`s, but to avoid mistakes we require `reuse=True` in
-      such cases.
-    reuse: Python `bool`, whether to reuse the `tf.Variable`s of a previous
-      layer by the same name.
+    @{args}
 
   Returns:
     output: `Tensor` representing a the affine transformed input under a random
@@ -823,6 +681,7 @@ def dense_local_reparameterization(
         Diederik P. Kingma, Tim Salimans, Max Welling.
         Neural Information Processing Systems, 2015.
   """
+  # pylint: enable=g-doc-args
   layer = DenseLocalReparameterization(
       units,
       activation=activation,
@@ -866,53 +725,6 @@ class DenseFlipout(_DenseVariational):
   (`q(W|x)`), prior (`p(W)`), and divergence for both the `kernel` and `bias`
   distributions.
 
-  Args:
-    units: Integer or Long, dimensionality of the output space.
-    activation: Activation function (`callable`). Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    seed: Python scalar `int` which initializes the random number
-      generator. Default value: `None` (i.e., use global seed).
-    name: Python `str`, the name of the layer. Layers with the same name will
-      share `tf.Variable`s, but to avoid mistakes we require `reuse=True` in
-      such cases.
-    reuse: Python `bool`, whether to reuse the `tf.Variable`s of a previous
-      layer by the same name.
-
   Properties:
     units: Python integer, dimensionality of the output space.
     activation: Activation function (`callable`).
@@ -959,6 +771,7 @@ class DenseFlipout(_DenseVariational):
         https://openreview.net/forum?id=rJnpifWAb
   """
 
+  @docstring_util.expand_docstring(args=doc_args)
   def __init__(
       self,
       units,
@@ -978,6 +791,13 @@ class DenseFlipout(_DenseVariational):
       seed=None,
       name=None,
       **kwargs):
+    # pylint: disable=g-doc-args
+    """Construct layer.
+
+    Args:
+      @{args}
+    """
+    # pylint: enable=g-doc-args
     super(DenseFlipout, self).__init__(
         units=units,
         activation=activation,
@@ -1031,6 +851,7 @@ class DenseFlipout(_DenseVariational):
     return outputs
 
 
+@docstring_util.expand_docstring(args=doc_args)
 def dense_flipout(
     inputs,
     units,
@@ -1050,6 +871,7 @@ def dense_flipout(
     seed=None,
     name=None,
     reuse=None):
+  # pylint: disable=g-doc-args
   """Densely-connected layer with Flipout estimator.
 
   This layer implements the Bayesian variational inference analogue to
@@ -1074,51 +896,7 @@ def dense_flipout(
 
   Args:
     inputs: Tensor input.
-    units: Integer or Long, dimensionality of the output space.
-    activation: Activation function (`callable`). Set it to None to maintain a
-      linear activation.
-    activity_regularizer: Regularizer function for the output.
-    trainable: Boolean, if `True` also add variables to the graph collection
-      `GraphKeys.TRAINABLE_VARIABLES` (see `tf.Variable`).
-    kernel_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `kernel` parameter. Default value:
-      `default_mean_field_normal_fn()`.
-    kernel_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    kernel_prior_fn: Python `callable` which creates `tf.distributions`
-      instance. See `default_mean_field_normal_fn` docstring for required
-      parameter signature.
-      Default value: `tf.distributions.Normal(loc=0., scale=1.)`.
-    kernel_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    bias_posterior_fn: Python `callable` which creates
-      `tf.distributions.Distribution` instance representing the surrogate
-      posterior of the `bias` parameter. Default value:
-      `default_mean_field_normal_fn(is_singular=True)` (which creates an
-      instance of `tf.distributions.Deterministic`).
-    bias_posterior_tensor_fn: Python `callable` which takes a
-      `tf.distributions.Distribution` instance and returns a representative
-      value. Default value: `lambda d: d.sample()`.
-    bias_prior_fn: Python `callable` which creates `tf.distributions` instance.
-      See `default_mean_field_normal_fn` docstring for required parameter
-      signature. Default value: `None` (no prior, no variational inference)
-    bias_divergence_fn: Python `callable` which takes the surrogate posterior
-      distribution, prior distribution and random variate sample(s) from the
-      surrogate posterior and computes or approximates the KL divergence. The
-      distributions are `tf.distributions.Distribution`-like instances and the
-      sample is a `Tensor`.
-    seed: Python scalar `int` which initializes the random number
-      generator. Default value: `None` (i.e., use global seed).
-    name: Python `str`, the name of the layer. Layers with the same name will
-      share `tf.Variable`s, but to avoid mistakes we require `reuse=True` in
-      such cases.
-    reuse: Python `bool`, whether to reuse the `tf.Variable`s of a previous
-      layer by the same name.
+    @{args}
 
   Returns:
     output: `Tensor` representing a the affine transformed input under a random
@@ -1155,6 +933,7 @@ def dense_flipout(
         Anonymous. OpenReview, 2017.
         https://openreview.net/forum?id=rJnpifWAb
   """
+  # pylint: enable=g-doc-args
   layer = DenseFlipout(
       units,
       activation=activation,
diff --git a/tensorflow/contrib/bayesflow/python/ops/metropolis_hastings.py b/tensorflow/contrib/bayesflow/python/ops/metropolis_hastings.py
index 7bdeaa862d5bb64fa8940df453c7aa2b66023eda..e7fcbc65ef379e84a140a06e020549f74f905a99 100644
--- a/tensorflow/contrib/bayesflow/python/ops/metropolis_hastings.py
+++ b/tensorflow/contrib/bayesflow/python/ops/metropolis_hastings.py
@@ -25,9 +25,10 @@ from tensorflow.contrib.bayesflow.python.ops.metropolis_hastings_impl import *
 from tensorflow.python.util.all_util import remove_undocumented
 
 _allowed_symbols = [
+    'kernel',
     'evolve',
-    'uniform_random_proposal',
-    'normal_random_proposal',
+    'proposal_uniform',
+    'proposal_normal',
 ]
 
 remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/bayesflow/python/ops/metropolis_hastings_impl.py b/tensorflow/contrib/bayesflow/python/ops/metropolis_hastings_impl.py
index dc1ac68ce009fa46d6c05a3200a29d9fdf245707..05aa134ed5c11092316af5f3e45ba07fdb491e90 100644
--- a/tensorflow/contrib/bayesflow/python/ops/metropolis_hastings_impl.py
+++ b/tensorflow/contrib/bayesflow/python/ops/metropolis_hastings_impl.py
@@ -12,17 +12,20 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Functions to create a Markov Chain Monte Carlo Metropolis step.
+"""Metropolis-Hastings and proposal distributions.
 
+@@kernel
 @@evolve
-@@uniform_random_proposal
-@@normal_random_proposal
+@@proposal_uniform
+@@proposal_normal
 """
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import collections
+
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
@@ -31,123 +34,198 @@ from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import state_ops
 
 __all__ = [
-    'evolve',
-    'uniform_random_proposal',
-    'normal_random_proposal',
+    "kernel",
+    "evolve",
+    "proposal_uniform",
+    "proposal_normal",
 ]
 
 
-def _single_iteration(current_state, current_log_density,
-                      log_unnormalized_prob_fn, proposal_fn, seed=None,
-                      name='None'):
-  """Performs a single Metropolis-Hastings step.
+KernelResults = collections.namedtuple(
+    "KernelResults",
+    [
+        "log_accept_ratio",
+        "current_target_log_prob",  # "Current result" means "accepted".
+        "is_accepted",
+        "proposed_state",
+    ])
+
+
+def kernel(target_log_prob_fn,
+           proposal_fn,
+           current_state,
+           seed=None,
+           current_target_log_prob=None,
+           name=None):
+  """Runs the Metropolis-Hastings transition kernel.
+
+  This function can update multiple chains in parallel. It assumes that all
+  leftmost dimensions of `current_state` index independent chain states (and are
+  therefore updated independently). The output of `target_log_prob_fn()` should
+  sum log-probabilities across all event dimensions. Slices along the rightmost
+  dimensions may have different target distributions; for example,
+  `current_state[0, :]` could have a different target distribution from
+  `current_state[1, :]`. This is up to `target_log_prob_fn()`. (The number of
+  independent chains is `tf.size(target_log_prob_fn(*current_state))`.)
 
   Args:
-    current_state: Float-like `Tensor` (i.e., `dtype` is either
-      `tf.float16`, `tf.float32` or `tf.float64`) of any shape that can
-      be consumed by the `log_unnormalized_prob_fn` and `proposal_fn`
-      callables.
-    current_log_density: Float-like `Tensor` with `dtype` and shape equivalent
-      to `log_unnormalized_prob_fn(current_state)`, i.e., matching the result of
-      `log_unnormalized_prob_fn` invoked at `current_state`.
-    log_unnormalized_prob_fn: A Python callable evaluated at
-      `current_state` and returning a float-like `Tensor` of log target-density
-      up to a normalizing constant. In other words,
-      `log_unnormalized_prob_fn(x) = log(g(x))`, where
-      `target_density = g(x)/Z` for some constant `A`. The shape of the input
-      tensor is the same as the shape of the `current_state`. The shape of the
-      output tensor is either
-        (a). Same as the input shape if the density being sampled is one
-          dimensional, or
-        (b). If the density is defined for `events` of shape
-          `event_shape = [E1, E2, ... Ee]`, then the input tensor should be of
-          shape `batch_shape + event_shape`, where `batch_shape = [B1, ..., Bb]`
-          and the result must be of shape [B1, ..., Bb]. For example, if the
-          distribution that is being sampled is a 10 dimensional normal,
-          then the input tensor may be of shape [100, 10] or [30, 20, 10]. The
-          last dimension will then be 'consumed' by `log_unnormalized_prob_fn`
-          and it should return tensors of shape [100] and [30, 20] respectively.
-    proposal_fn: A callable accepting a real valued `Tensor` of current sample
-      points and returning a tuple of two `Tensors`. The first element of the
-      pair is a `Tensor` containing the proposal state and should have
-      the same shape as the input `Tensor`. The second element of the pair gives
-      the log of the ratio of the probability of transitioning from the
-      proposal points to the input points and the probability of transitioning
-      from the input points to the proposal points. If the proposal is
-      symmetric (e.g., random walk, where the proposal is either
-      normal or uniform centered at `current_state`), i.e.,
-      Probability(Proposal -> Current) = Probability(Current -> Proposal)
-      the second value should be set to `None` instead of explicitly supplying a
-      tensor of zeros. In addition to being convenient, this also leads to a
-      more efficient graph.
-    seed: `int` or None. The random seed for this `Op`. If `None`, no seed is
-      applied.
-    name: Python `str` name prefix for ops managed by this function.
+    target_log_prob_fn: Python callable which takes an argument like
+      `current_state` (or `*current_state` if it's a list) and returns its
+      (possibly unnormalized) log-density under the target distribution.
+    proposal_fn: Python callable which takes an argument like `current_state`
+      (or `*current_state` if it's a list) and returns a tuple of proposed
+      states of the same shape as `current_state`, and a log ratio `Tensor` of
+      the same shape as `current_target_log_prob`. The log ratio is the
+      log-probability of `current_state` given the proposed states minus the
+      log-probability of the proposed states given `current_state`. If the
+      proposal is symmetric, set the second value to `None`: this enables more
+      efficient computation than explicitly supplying a tensor of zeros.
+    current_state: `Tensor` or Python `list` of `Tensor`s representing the
+      current state(s) of the Markov chain(s). The first `r` dimensions index
+      independent chains, `r = tf.rank(target_log_prob_fn(*current_state))`.
+    seed: Python integer to seed the random number generator.
+    current_target_log_prob: (Optional) `Tensor` representing the value of
+      `target_log_prob_fn` at the `current_state`. The only reason to
+      specify this argument is to reduce TF graph size.
+      Default value: `None` (i.e., compute as needed).
+    name: A name of the operation (optional).
 
   Returns:
-    next_state: `Tensor` with `dtype` and shape matching `current_state`.
-      Created by propagating the chain by one step, starting from
+    next_state: `Tensor` or Python `list` of `Tensor`s representing the
+      state(s) of the Markov chain(s) after this step. Has the same shape as
       `current_state`.
-    next_log_density: `Tensor` with `dtype` and shape matching
-      `current_log_density`, which is equal to the value of the unnormalized
-      `log_unnormalized_prob_fn` computed at `next_state`.
-    log_accept_ratio: `Tensor` with `dtype` and shape matching
-      `current_log_density`. Stands for the log of Metropolis-Hastings
-      acceptance ratio used in generating the `next_state`.
-  """
+    kernel_results: `collections.namedtuple` of internal calculations used to
+      advance the chain.
 
-  with ops.name_scope(name, 'single_iteration', [current_state]):
-    # The proposed state and the log of the corresponding Hastings ratio.
-    proposal_state, log_transit_ratio = proposal_fn(current_state)
-
-    # If the log ratio is None, assume that the transitions are symmetric,
-    # i.e., Prob(Current -> Proposed) = Prob(Proposed -> Current).
-    if log_transit_ratio is None:
-      log_transit_ratio = 0.
-
-    # Log-density of the proposal state.
-    proposal_log_density = log_unnormalized_prob_fn(proposal_state)
-
-    # Ops to compute the log of the acceptance ratio. Recall that the
-    # acceptance ratio is: [Prob(Proposed) / Prob(Current)] *
-    # [Prob(Proposed -> Current) / Prob(Current -> Proposed)]. The log of the
-    # second term is the log_transit_ratio.
-    with ops.name_scope('accept_reject'):
-      # The log of the acceptance ratio.
-      log_accept_ratio = (proposal_log_density - current_log_density
-                          + log_transit_ratio)
-
-      # A proposal is accepted or rejected depending on the acceptance ratio.
-      # If the acceptance ratio is greater than 1 then it is always accepted.
-      # If the acceptance ratio is less than 1 then the proposal is accepted
-      # with probability = acceptance ratio. As we are working in log space to
-      # prevent over/underflows, this logic is expressed in log terms below.
-      # If a proposal is accepted we place a True in the acceptance state
-      # tensor and if it is to be rejected we place a False.
-      # The log_draws below have to be compared to the log_accept_ratio so we
-      # make sure that they have the same data type.
-      log_draws = math_ops.log(random_ops.random_uniform(
-          array_ops.shape(current_log_density), seed=seed,
-          dtype=log_accept_ratio.dtype))
-      is_proposal_accepted = log_draws < log_accept_ratio
-
-    # The acceptance state decides which elements of the current state are to
-    # be replaced with the corresponding elements in the proposal state.
-    with ops.name_scope(name, 'metropolis_single_step',
-                        [current_state, current_log_density]):
-      next_log_density = array_ops.where(is_proposal_accepted,
-                                         proposal_log_density,
-                                         current_log_density)
-      next_state = array_ops.where(is_proposal_accepted, proposal_state,
-                                   current_state)
-
-    return next_state, next_log_density, log_accept_ratio
+  #### Examples
+
+  We illustrate Metropolis-Hastings on a Normal likelihood with
+  unknown mean.
+
+  ```python
+  tfd = tf.contrib.distributions
+  tfp = tf.contrib.bayesflow
+
+  loc = tf.get_variable("loc", initializer=1.)
+  x = tf.constant([0.0] * 50)
+
+  def make_target_log_prob_fn(x):
+    def target_log_prob_fn(loc):
+      prior = tfd.Normal(loc=0., scale=1.)
+      likelihood = tfd.Independent(
+        tfd.Normal(loc=loc, scale=0.1),
+        reinterpreted_batch_ndims=1)
+      return prior.log_prob(loc) + likelihood.log_prob(x)
+    return target_log_prob_fn
+
+  next_state, kernel_results = tfp.metropolis_hastings.kernel(
+      target_log_prob_fn=make_target_log_prob_fn(x),
+      proposal_fn=tfp.metropolis_hastings.proposal_normal(),
+      current_state=loc)
+  loc_update = loc.assign(next_state)
+  ```
+
+  We illustrate Metropolis-Hastings on a Normal likelihood with
+  unknown mean and variance. We run 4 chains in parallel.
+
+  ```python
+  tfd = tf.contrib.distributions
+  tfp = tf.contrib.bayesflow
+
+  num_chains = 4
+  loc = tf.get_variable("loc", shape=[num_chains],
+                        initializer=tf.random_normal_initializer())
+  scale = tf.get_variable("scale", shape=[num_chains],
+                          initializer=tf.ones_initializer())
+  x = tf.constant([0.0] * 50)
+
+  def make_target_log_prob_fn(x):
+    data = tf.reshape(x, shape=[-1, 1])
+    def target_log_prob_fn(loc, scale):
+      prior_loc = tfd.Normal(loc=0., scale=1.)
+      prior_scale = tfd.InverseGamma(concentration=1., rate=1.)
+      likelihood = tfd.Independent(
+        tfd.Normal(loc=loc, scale=scale),
+        reinterpreted_batch_ndims=1)
+      return (prior_loc.log_prob(loc) +
+              prior_scale.log_prob(scale) +
+              likelihood.log_prob(data))
+    return target_log_prob_fn
+
+  def proposal_fn(loc, scale):
+    loc_proposal = tfp.metropolis_hastings.proposal_normal()
+    scale_proposal = tfp.metropolis_hastings.proposal_uniform(minval=-1.)
+    proposed_loc, _ = loc_proposal(loc)
+    proposed_scale, _ = scale_proposal(scale)
+    proposed_scale = tf.maximum(proposed_scale, 0.01)
+    return [proposed_loc, proposed_scale], None
+
+  next_state, kernel_results = tfp.metropolis_hastings.kernel(
+      target_log_prob_fn=make_target_log_prob_fn(x),
+      proposal_fn=proposal_fn,
+      current_state=[loc, scale])
+  train_op = tf.group(loc.assign(next_state[0]),
+                      scale.assign(next_state[1]))
+  ```
+
+  """
+  with ops.name_scope(
+      name, "metropolis_hastings_kernel",
+      [current_state, seed, current_target_log_prob]):
+    with ops.name_scope("initialize"):
+      maybe_expand = lambda x: list(x) if _is_list_like(x) else [x]
+      current_state_parts = maybe_expand(current_state)
+      if current_target_log_prob is None:
+        current_target_log_prob = target_log_prob_fn(*current_state_parts)
+
+    proposed_state, log_transit_ratio = proposal_fn(*current_state_parts)
+    proposed_state_parts = maybe_expand(proposed_state)
+
+    proposed_target_log_prob = target_log_prob_fn(*proposed_state_parts)
+
+    with ops.name_scope(
+        "accept_reject",
+        [current_state_parts, proposed_state_parts,
+         current_target_log_prob, proposed_target_log_prob]):
+      log_accept_ratio = proposed_target_log_prob - current_target_log_prob
+      if log_transit_ratio is not None:
+        # If the log_transit_ratio is None, then assume the proposal is
+        # symmetric, i.e.,
+        #   log p(old | new) - log p(new | old) = 0.
+        log_accept_ratio += log_transit_ratio
+
+      # u < exp(log_accept_ratio),  where u~Uniform[0,1)
+      # ==> log(u) < log_accept_ratio
+      random_value = random_ops.random_uniform(
+          array_ops.shape(log_accept_ratio),
+          dtype=log_accept_ratio.dtype,
+          seed=seed)
+      random_negative = math_ops.log(random_value)
+      is_accepted = random_negative < log_accept_ratio
+      next_state_parts = [array_ops.where(is_accepted,
+                                          proposed_state_part,
+                                          current_state_part)
+                          for proposed_state_part, current_state_part in
+                          zip(proposed_state_parts, current_state_parts)]
+      accepted_log_prob = array_ops.where(is_accepted,
+                                          proposed_target_log_prob,
+                                          current_target_log_prob)
+    maybe_flatten = lambda x: x if _is_list_like(current_state) else x[0]
+    return [
+        maybe_flatten(next_state_parts),
+        KernelResults(
+            log_accept_ratio=log_accept_ratio,
+            current_target_log_prob=accepted_log_prob,
+            is_accepted=is_accepted,
+            proposed_state=maybe_flatten(proposed_state_parts),
+        ),
+    ]
 
 
 def evolve(initial_sample,
            initial_log_density,
            initial_log_accept_ratio,
-           log_unnormalized_prob_fn,
+           target_log_prob_fn,
            proposal_fn,
            n_steps=1,
            seed=None,
@@ -162,9 +240,11 @@ def evolve(initial_sample,
 
   The probability distribution may have an unknown normalization constan.
   We parameterize the probability density as follows:
-    ```
-      f(x) = exp(L(x) + constant)
-    ```
+
+  ```none
+  f(x) = exp(L(x) + constant)
+  ```
+
   Here `L(x)` is any continuous function with an (possibly unknown but finite)
   upper bound, i.e. there exists a number beta such that
   `L(x)< beta < infinity` for all x. The constant is the normalization needed
@@ -188,72 +268,77 @@ def evolve(initial_sample,
 
   The following example, demonstrates the use to generate a 1000 uniform random
   walk Metropolis samplers run in parallel for the normal target distribution.
+
   ```python
-    n = 3  # dimension of the problem
-
-    # Generate 1000 initial values randomly. Each of these would be an
-    # independent starting point for a Markov chain.
-    state = tf.get_variable(
-        'state',initializer=tf.random_normal([1000, n], mean=3.0,
-                                             dtype=tf.float64, seed=42))
-
-    # Computes the log(p(x)) for the unit normal density and ignores the
-    # normalization constant.
-    def log_density(x):
-      return  - tf.reduce_sum(x * x, reduction_indices=-1) / 2.0
-
-    # Initial log-density value
-    state_log_density = tf.get_variable(
-        'state_log_density', initializer=log_density(state.initialized_value()))
-
-    # A variable to store the log_acceptance_ratio:
-    log_acceptance_ratio = tf.get_variable(
-        'log_acceptance_ratio', initializer=tf.zeros([1000], dtype=tf.float64))
-
-    # Generates random proposals by moving each coordinate uniformly and
-    # independently in a box of size 2 centered around the current value.
-    # Returns the new point and also the log of the Hastings ratio (the
-    # ratio of the probability of going from the proposal to origin and the
-    # probability of the reverse transition). When this ratio is 1, the value
-    # may be omitted and replaced by None.
-    def random_proposal(x):
-      return (x + tf.random_uniform(tf.shape(x), minval=-1, maxval=1,
-                                    dtype=x.dtype, seed=12)), None
-
-    #  Create the op to propagate the chain for 100 steps.
-    stepper = mh.evolve(
-        state, state_log_density, log_acceptance_ratio,
-        log_density, random_proposal, n_steps=100, seed=123)
-    init = tf.initialize_all_variables()
-    with tf.Session() as sess:
-      sess.run(init)
-      # Run the chains for a total of 1000 steps and print out the mean across
-      # the chains every 100 iterations.
-      for n_iter in range(10):
-        # Executing the stepper advances the chain to the next state.
-        sess.run(stepper)
-        # Print out the current value of the mean(sample) for every dimension.
-        print(np.mean(sess.run(state), 0))
-      # Estimated covariance matrix
-      samples = sess.run(state)
-      print('')
-      print(np.cov(samples, rowvar=False))
+  n = 3  # dimension of the problem
+
+  # Generate 1000 initial values randomly. Each of these would be an
+  # independent starting point for a Markov chain.
+  state = tf.get_variable(
+      "state",
+      initializer=tf.random_normal([1000, n],
+                                   mean=3.0,
+                                   dtype=tf.float64,
+                                   seed=42))
+
+  # Computes the log(p(x)) for the unit normal density and ignores the
+  # normalization constant.
+  def log_density(x):
+    return -tf.reduce_sum(x * x, reduction_indices=-1) / 2.0
+
+  # Initial log-density value
+  state_log_density = tf.get_variable(
+      "state_log_density",
+      initializer=log_density(state.initialized_value()))
+
+  # A variable to store the log_acceptance_ratio:
+  log_acceptance_ratio = tf.get_variable(
+      "log_acceptance_ratio",
+      initializer=tf.zeros([1000], dtype=tf.float64))
+
+  # Generates random proposals by moving each coordinate uniformly and
+  # independently in a box of size 2 centered around the current value.
+  # Returns the new point and also the log of the Hastings ratio (the
+  # ratio of the probability of going from the proposal to origin and the
+  # probability of the reverse transition). When this ratio is 1, the value
+  # may be omitted and replaced by None.
+  def random_proposal(x):
+    return (x + tf.random_uniform(tf.shape(x), minval=-1, maxval=1,
+                                  dtype=x.dtype, seed=12)), None
+
+  # Create the op to propagate the chain for 100 steps.
+  stepper = mh.evolve(
+      state, state_log_density, log_acceptance_ratio,
+      log_density, random_proposal, n_steps=100, seed=123)
+  init = tf.global_variables_initializer()
+  with tf.Session() as sess:
+    sess.run(init)
+    # Run the chains for a total of 1000 steps and print out the mean across
+    # the chains every 100 iterations.
+    for n_iter in range(10):
+      # Executing the stepper advances the chain to the next state.
+      sess.run(stepper)
+      # Print out the current value of the mean(sample) for every dimension.
+      print(np.mean(sess.run(state), 0))
+    # Estimated covariance matrix
+    samples = sess.run(state)
+    print(np.cov(samples, rowvar=False))
   ```
 
   Args:
     initial_sample: A float-like `tf.Variable` of any shape that can
-      be consumed by the `log_unnormalized_prob_fn` and `proposal_fn`
+      be consumed by the `target_log_prob_fn` and `proposal_fn`
       callables.
     initial_log_density: Float-like `tf.Variable` with `dtype` and shape
-      equivalent  to `log_unnormalized_prob_fn(initial_sample)`, i.e., matching
-        the result of `log_unnormalized_prob_fn` invoked at `current_state`.
+      equivalent to `target_log_prob_fn(initial_sample)`, i.e., matching
+      the result of `target_log_prob_fn` invoked at `initial_sample`.
     initial_log_accept_ratio: A `tf.Variable` with `dtype` and shape matching
       `initial_log_density`. Stands for the log of Metropolis-Hastings
       acceptance ratio after propagating the chain for `n_steps`.
-    log_unnormalized_prob_fn: A Python callable evaluated at
+    target_log_prob_fn: A Python callable evaluated at
       `current_state` and returning a float-like `Tensor` of log target-density
       up to a normalizing constant. In other words,
-      `log_unnormalized_prob_fn(x) = log(g(x))`, where
+      `target_log_prob_fn(x) = log(g(x))`, where
       `target_density = g(x)/Z` for some constant `Z`. The shape of the input
       tensor is the same as the shape of the `current_state`. The shape of the
       output tensor is either
@@ -265,7 +350,7 @@ def evolve(initial_sample,
           and the result must be of shape [B1, ..., Bb]. For example, if the
           distribution that is being sampled is a 10 dimensional normal,
           then the input tensor may be of shape [100, 10] or [30, 20, 10]. The
-          last dimension will then be 'consumed' by `log_unnormalized_prob_fn`
+          last dimension will then be 'consumed' by `target_log_prob_fn`
           and it should return tensors of shape [100] and [30, 20] respectively.
     proposal_fn: A callable accepting a real valued `Tensor` of current sample
       points and returning a tuple of two `Tensors`. The first element of the
@@ -289,42 +374,48 @@ def evolve(initial_sample,
     forward_step: an `Op` to step the Markov chain forward for `n_steps`.
   """
 
-  with ops.name_scope(name, 'metropolis_hastings', [initial_sample]):
+  with ops.name_scope(name, "metropolis_hastings", [initial_sample]):
     current_state = initial_sample
-    current_log_density = initial_log_density
+    current_target_log_prob = initial_log_density
     log_accept_ratio = initial_log_accept_ratio
 
-    # Stop condition for the while_loop
-    def stop_condition(i, _):
-      return i < n_steps
-
-    def step(i, loop_vars):
-      """Wrap `_single_iteration` for `while_loop`."""
-      state = loop_vars[0]
-      state_log_density = loop_vars[1]
-      return i + 1, list(_single_iteration(state, state_log_density,
-                                           log_unnormalized_prob_fn,
-                                           proposal_fn, seed=seed))
-
-    loop_vars = [current_state, current_log_density, log_accept_ratio]
-    # Build an `Op` to evolve the Markov chain for `n_steps`
-    (_, [end_state, end_log_density, end_log_acceptance]) = (
+    def step(i, current_state, current_target_log_prob, log_accept_ratio):
+      """Wrap single Markov chain iteration in `while_loop`."""
+      next_state, kernel_results = kernel(
+          target_log_prob_fn=target_log_prob_fn,
+          proposal_fn=proposal_fn,
+          current_state=current_state,
+          current_target_log_prob=current_target_log_prob,
+          seed=seed)
+      accepted_log_prob = kernel_results.current_target_log_prob
+      log_accept_ratio = kernel_results.log_accept_ratio
+      return i + 1, next_state, accepted_log_prob, log_accept_ratio
+
+    (_, accepted_state, accepted_target_log_prob, accepted_log_accept_ratio) = (
         control_flow_ops.while_loop(
-            stop_condition, step,
-            (0, loop_vars),
-            parallel_iterations=1, swap_memory=1))
+            cond=lambda i, *ignored_args: i < n_steps,
+            body=step,
+            loop_vars=[
+                0,  # i
+                current_state,
+                current_target_log_prob,
+                log_accept_ratio,
+            ],
+            parallel_iterations=1 if seed is not None else 10,
+            # TODO(b/73775595): Confirm optimal setting of swap_memory.
+            swap_memory=1))
 
     forward_step = control_flow_ops.group(
-        state_ops.assign(current_log_density, end_log_density),
-        state_ops.assign(current_state, end_state),
-        state_ops.assign(log_accept_ratio, end_log_acceptance))
+        state_ops.assign(current_target_log_prob, accepted_target_log_prob),
+        state_ops.assign(current_state, accepted_state),
+        state_ops.assign(log_accept_ratio, accepted_log_accept_ratio))
 
     return forward_step
 
 
-def uniform_random_proposal(step_size=1.,
-                            seed=None,
-                            name=None):
+def proposal_uniform(step_size=1.,
+                     seed=None,
+                     name=None):
   """Returns a callable that adds a random uniform tensor to the input.
 
   This function returns a callable that accepts one `Tensor` argument of any
@@ -346,11 +437,13 @@ def uniform_random_proposal(step_size=1.,
 
   Returns:
     proposal_fn:  A callable accepting one float-like `Tensor` and returning a
-    2-tuple. The first value in the tuple is a `Tensor` of the same shape and
-    dtype as the input argument and the second element of the tuple is None.
+      2-tuple. The first value in the tuple is a `Tensor` of the same shape and
+      dtype as the input argument and the second element of the tuple is None.
   """
 
-  with ops.name_scope(name, 'uniform_random_proposal', [step_size]):
+  with ops.name_scope(name, "proposal_uniform", [step_size]):
+    step_size = ops.convert_to_tensor(step_size, name="step_size")
+
     def proposal_fn(input_state, name=None):
       """Adds a uniform perturbation to the input state.
 
@@ -359,12 +452,12 @@ def uniform_random_proposal(step_size=1.,
         name: A string that sets the name for this `Op`.
 
       Returns:
-        proposal_state:  A float-like `Tensot` with `dtype` and shape matching
+        proposal_state:  A float-like `Tensor` with `dtype` and shape matching
           `input_state`.
         log_transit_ratio: `None`. Proposal is symmetric.
       """
-      with ops.name_scope(name, 'proposer', [input_state]):
-        input_state = ops.convert_to_tensor(input_state, name='input_state')
+      with ops.name_scope(name, "proposer", [input_state]):
+        input_state = ops.convert_to_tensor(input_state, name="input_state")
         return input_state + random_ops.random_uniform(
             array_ops.shape(input_state),
             minval=-step_size,
@@ -373,9 +466,9 @@ def uniform_random_proposal(step_size=1.,
     return proposal_fn
 
 
-def normal_random_proposal(scale=1.,
-                           seed=None,
-                           name=None):
+def proposal_normal(scale=1.,
+                    seed=None,
+                    name=None):
   """Returns a callable that adds a random normal tensor to the input.
 
   This function returns a callable that accepts one `Tensor` argument of any
@@ -398,11 +491,13 @@ def normal_random_proposal(scale=1.,
 
   Returns:
     proposal_fn: A callable accepting one float-like `Tensor` and returning a
-    2-tuple. The first value in the tuple is a `Tensor` of the same shape and
-    dtype as the input argument and the second element of the tuple is None.
+      2-tuple. The first value in the tuple is a `Tensor` of the same shape and
+      dtype as the input argument and the second element of the tuple is None.
   """
 
-  with ops.name_scope(name, 'normal_random_proposal', [scale]):
+  with ops.name_scope(name, "proposal_normal", [scale]):
+    scale = ops.convert_to_tensor(scale, name="scale")
+
     def proposal_fn(input_state, name=None):
       """Adds a normal perturbation to the input state.
 
@@ -411,16 +506,22 @@ def normal_random_proposal(scale=1.,
         name: A string that sets the name for this `Op`.
 
       Returns:
-        proposal_state:  A float-like `Tensot` with `dtype` and shape matching
+        proposal_state:  A float-like `Tensor` with `dtype` and shape matching
           `input_state`.
         log_transit_ratio: `None`. Proposal is symmetric.
       """
 
-      with ops.name_scope(name, 'proposer', [input_state]):
-        input_state = ops.convert_to_tensor(input_state, name='input_state')
+      with ops.name_scope(name, "proposer", [input_state]):
+        input_state = ops.convert_to_tensor(input_state, name="input_state")
         return input_state + random_ops.random_normal(
             array_ops.shape(input_state),
             mean=0.,
             stddev=scale,
+            dtype=scale.dtype,
             seed=seed), None
     return proposal_fn
+
+
+def _is_list_like(x):
+  """Helper which returns `True` if input is `list`-like."""
+  return isinstance(x, (tuple, list))
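Reviewer note: the loop that `evolve` builds with `while_loop` can be hard to follow from the graph code alone. The following is a minimal NumPy sketch of the same idea (propose, compute the log acceptance ratio, accept or keep), not the TensorFlow implementation. The names `metropolis_hastings` and `log_standard_normal` are hypothetical and chosen only for illustration. It assumes a symmetric uniform proposal, so the log Hastings ratio is zero, which corresponds to the `None` returned by the proposal callables above.

```python
import numpy as np


def metropolis_hastings(log_density, x0, n_steps, step_size=1.0, seed=0):
    """Runs independent chains; the last axis of `x0` is the event dimension."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=np.float64)
    log_p = np.asarray(log_density(x))
    for _ in range(n_steps):
        # Symmetric uniform proposal in a box of half-width `step_size`.
        proposal = x + rng.uniform(-step_size, step_size, size=x.shape)
        log_p_new = np.asarray(log_density(proposal))
        # Symmetric proposal: the log Hastings ratio is 0 and is omitted.
        log_accept_ratio = log_p_new - log_p
        accept = np.log(rng.uniform(size=log_p.shape)) < log_accept_ratio
        # Keep the proposal where accepted, otherwise keep the current state.
        x = np.where(accept[..., None], proposal, x)
        log_p = np.where(accept, log_p_new, log_p)
    return x, log_p


def log_standard_normal(x):
    # Log density up to a normalizing constant; consumes the last dimension.
    return -0.5 * np.sum(x ** 2, axis=-1)


# 500 chains over a 3-dimensional standard normal target.
samples, log_p = metropolis_hastings(
    log_standard_normal, np.zeros((500, 3)), n_steps=200)
```

As in the docstring, an input of shape `[500, 3]` yields log densities of shape `[500]`: the last dimension is consumed by the target log-density callable.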
diff --git a/tensorflow/contrib/bayesflow/python/ops/variable_utils_impl.py b/tensorflow/contrib/bayesflow/python/ops/variable_utils_impl.py
deleted file mode 100644
index ca3d75b5bfee093449026c7d1d62e3bdeff6b096..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/bayesflow/python/ops/variable_utils_impl.py
+++ /dev/null
@@ -1,157 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Utility functions related to managing `tf.Variable`s.
-
-@@externalize_variables_as_args
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import warnings
-
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import gradients_impl as gradients_ops
-from tensorflow.python.ops import variable_scope as varscope_ops
-from tensorflow.python.ops import variables as variables_ops
-
-__all__ = [
-    "externalize_variables_as_args",
-]
-
-
-# Cause all warnings to always be triggered.
-# Not having this means subsequent calls wont trigger the warning.
-warnings.simplefilter("always")
-
-
-def externalize_variables_as_args(fn,
-                                  fn_args=(),
-                                  ancestor_variables=None,
-                                  possible_ancestor_vars=None,
-                                  assert_variable_override=False,
-                                  name=None):
-  """"Converts variables within a callable into explicit args.
-
-  Makes a new callable from `fn` which has arguments `list(fn_args) +
-  list(ancestor_variables)`. If `ancestor_variables` is not specified, it is
-  inferred by checking which of `possible_ancestor_vars` actually influences the
-  return value of `fn` (concretely, gradient of `fn(*fn_args)` is not `None`).
-  By default `possible_ancestor_vars` is `tf.trainable_variables() +
-  tf.get_collection(tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES)`.
-
-  #### Examples:
-
-  ```python
-  num_samples = 2
-  num_dims = 1
-  dtype = np.float32
-
-  def foo(x):
-    x = tf.convert_to_tensor(x, dtype=dtype, name="x")
-    s = x.shape.as_list()
-    y = tf.get_variable(
-        name="y",
-        dtype=dtype,
-        initializer=np.arange(np.prod(s)).reshape(s).astype(dtype))
-    return x + y
-
-  x = tf.constant(dtype([0.1, 0.2]))
-
-  wrapped_foo, discovered_ancestor_variables = (
-      externalize_variables_as_args(foo, [x]))
-
-  new_x = dtype([[1.], [2.]])
-  new_y = dtype([[3.], [4.]])
-  new_result = wrapped_foo(new_x, new_y)
-  # ==> [[4.], [6.]]
-
-  discovered_ancestor_variables == [tf.get_variable("y", dtype)]
-  # ==> [True]
-  ```
-
-  Args:
-    fn: Python callable which returns a `Tensor` and accepts `*fn_args`.
-    fn_args: Python list of args to `fn`. Represents dummy arguments passed to
-      `fn` to trace its execution; actual values are unimportant. These args are
-      only used to construct the output of `fn` and to resolve the ancestor
-      `tf.Variable`s.
-      Default value: `()` (i.e., `fn` takes no args).
-    ancestor_variables: Python list of `tf.Variable`s. When `None` the list is
-      expanded to non-`None` gradients of `fn(*fn_args)`. By directly providing
-      the `ancestor_variables` the internal call to `fn` is avoided.
-      Default value: `None` (i.e., `tf.Variable` dependencies are discovered).
-    possible_ancestor_vars: Python list of possible `tf.Variable`s which might
-      be a dependency of computing `fn(*fn_args)`.
-      Default value: `None` (i.e., expanded as described above).
-    assert_variable_override: Python `bool` indicating that not finding a
-      `tf.Variable` in the override list is an exception.
-      Default value: `False` (i.e., missing a `Variable` triggers a `warning`).
-    name: Python `str` name prefixed to Ops created by this function.
-      Default value: `None` (i.e., "externalize_variables_as_args").
-
-  Returns:
-    wrapped_fn: Python callable taking arguments like
-      `*(list(fn_args) + discovered_ancestor_variables)`.
-    discovered_ancestor_variables: Python list of `tf.Variable`s known to be a
-      dependency of `fn(*fn_args)`.
-
-  Raises:
-    ValueError: if `assert_variable_override` is `True` and `Variable` is
-      requested but not overridden.
-  """
-  def _make_bypassing_custom_getter_fn(new_var_dict):
-    """Return dict value rather than what would otherwise be dict key."""
-    def _custom_getter(getter, *args, **kwargs):
-      v = getter(*args, **kwargs)
-      new_v = new_var_dict.get(v, None)
-      if new_v is None:
-        msg = "Variable \"{}\" not found in bypass dict.".format(v)
-        if assert_variable_override:
-          raise ValueError(msg)
-        warnings.warn(msg)
-        return v
-      return new_v
-    return _custom_getter
-
-  with ops.name_scope(name, "externalize_variables_as_args"):
-    if ancestor_variables is not None and not ancestor_variables:
-      return fn, ()
-    if ancestor_variables is None:
-      y = fn(*fn_args)  # Side-effect: adds trainable vars.
-      if possible_ancestor_vars is None:
-        possible_ancestor_vars = (
-            variables_ops.trainable_variables() +
-            ops.get_collection(ops.GraphKeys.TRAINABLE_RESOURCE_VARIABLES))
-      # TODO(b/72873296): Add a dedicated op for identifying ancestors.
-      ancestors = [v for g, v
-                   in zip(gradients_ops.gradients(y, possible_ancestor_vars),
-                          possible_ancestor_vars)
-                   if g is not None]
-      ancestor_variables = sorted(ancestors, key=lambda v: v.name)
-  n = len(fn_args)
-  def _fn(*args):
-    with ops.name_scope("wrapped_fn"):
-      vars_dict = dict(
-          (k, ops.convert_to_tensor(
-              v, dtype=k.dtype.base_dtype, name=k.op.name))
-          for k, v in zip(ancestor_variables, args[n:]))
-      with varscope_ops.variable_scope(
-          varscope_ops.get_variable_scope(),
-          reuse=True,
-          custom_getter=_make_bypassing_custom_getter_fn(vars_dict)):
-        return fn(*args[:n])
-  return _fn, ancestor_variables
diff --git a/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py b/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py
index 31f5c444817b9b82723c86bea3504d4934e57eb8..23ba76210b3b68d0d0b2eef9d4040882654bdad9 100644
--- a/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py
+++ b/tensorflow/contrib/boosted_trees/estimator_batch/custom_export_strategy.py
@@ -93,7 +93,9 @@ def make_custom_export_strategy(name,
                          "w") as f:
           f.write("\n".join("%s, %f" % (k, v) for k, v in sorted_by_importance))
     return result_dir
-  return export_strategy.ExportStrategy(name, export_fn)
+
+  return export_strategy.ExportStrategy(
+      name, export_fn, strip_default_attrs=True)
 
 
 def convert_to_universal_format(dtec, sorted_feature_names,
diff --git a/tensorflow/contrib/boosted_trees/kernels/model_ops.cc b/tensorflow/contrib/boosted_trees/kernels/model_ops.cc
index 754b7bc3270d647fc381033b769eadd7b791771e..3bf33186ec13f5ff991db938d59849c0124a30a0 100644
--- a/tensorflow/contrib/boosted_trees/kernels/model_ops.cc
+++ b/tensorflow/contrib/boosted_trees/kernels/model_ops.cc
@@ -137,6 +137,61 @@ class TreeEnsembleDeserializeOp : public OpKernel {
   }
 };
 
+class TreeEnsembleUsedHandlersOp : public OpKernel {
+ public:
+  explicit TreeEnsembleUsedHandlersOp(OpKernelConstruction* context)
+      : OpKernel(context) {
+    OP_REQUIRES_OK(context,
+                   context->GetAttr("num_all_handlers", &num_handlers_));
+  }
+
+  void Compute(OpKernelContext* context) override {
+    boosted_trees::models::DecisionTreeEnsembleResource* ensemble_resource;
+
+    OP_REQUIRES_OK(context, LookupResource(context, HandleFromInput(context, 0),
+                                           &ensemble_resource));
+    tf_shared_lock l(*ensemble_resource->get_mutex());
+    core::ScopedUnref unref_me(ensemble_resource);
+
+    // Get the stamp token.
+    const Tensor* stamp_token_t;
+    OP_REQUIRES_OK(context, context->input("stamp_token", &stamp_token_t));
+    int64 stamp_token = stamp_token_t->scalar<int64>()();
+
+    // Only the Chief should run this Op and it is guaranteed to be in
+    // a consistent state so the stamps must always match.
+    CHECK(ensemble_resource->is_stamp_valid(stamp_token));
+
+    Tensor* output_used_handlers_t = nullptr;
+    OP_REQUIRES_OK(
+        context, context->allocate_output("used_handlers_mask", {num_handlers_},
+                                          &output_used_handlers_t));
+    auto output_used_handlers = output_used_handlers_t->vec<bool>();
+
+    Tensor* output_num_used_handlers_t = nullptr;
+    OP_REQUIRES_OK(context,
+                   context->allocate_output("num_used_handlers", {},
+                                            &output_num_used_handlers_t));
+    int handler_idx = 0;
+    std::vector<int64> used_handlers = ensemble_resource->GetUsedHandlers();
+    output_num_used_handlers_t->scalar<int64>()() = used_handlers.size();
+    for (int64 i = 0; i < num_handlers_; ++i) {
+      if (handler_idx >= used_handlers.size() ||
+          used_handlers[handler_idx] > i) {
+        output_used_handlers(i) = false;
+      } else {
+        OP_REQUIRES(context, used_handlers[handler_idx] == i,
+                    errors::InvalidArgument("Handler IDs should be sorted."));
+        ++handler_idx;
+        output_used_handlers(i) = true;
+      }
+    }
+  }
+
+ private:
+  int64 num_handlers_;
+};
+
 REGISTER_RESOURCE_HANDLE_KERNEL(DecisionTreeEnsembleResource);
 
 REGISTER_KERNEL_BUILDER(
@@ -155,5 +210,7 @@ REGISTER_KERNEL_BUILDER(Name("TreeEnsembleSerialize").Device(DEVICE_CPU),
 REGISTER_KERNEL_BUILDER(Name("TreeEnsembleDeserialize").Device(DEVICE_CPU),
                         TreeEnsembleDeserializeOp);
 
+REGISTER_KERNEL_BUILDER(Name("TreeEnsembleUsedHandlers").Device(DEVICE_CPU),
+                        TreeEnsembleUsedHandlersOp);
 }  // namespace boosted_trees
 }  // namespace tensorflow
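Reviewer note: the mask construction in `TreeEnsembleUsedHandlersOp::Compute` can be read as a single pass over a sorted list of handler IDs. Below is a small Python sketch of that loop for illustration only. The function name `used_handlers_mask` is hypothetical and is not part of the boosted_trees API. It assumes, as the kernel does, that the IDs arrive sorted in increasing order.

```python
def used_handlers_mask(used_handler_ids, num_all_handlers):
    """Returns (num_used_handlers, mask) for a sorted list of handler IDs."""
    mask = [False] * num_all_handlers
    handler_idx = 0
    for i in range(num_all_handlers):
        # Either all used IDs are consumed, or the next used ID is further on.
        if (handler_idx >= len(used_handler_ids)
                or used_handler_ids[handler_idx] > i):
            continue
        # The next used ID must be exactly `i`; a smaller one means bad order.
        if used_handler_ids[handler_idx] != i:
            raise ValueError("Handler IDs should be sorted.")
        mask[i] = True
        handler_idx += 1
    return len(used_handler_ids), mask


num_used, mask = used_handlers_mask([1, 5], num_all_handlers=6)
```

The inputs mirror `testUsedHandlers` in `model_ops_test.py`: handlers 1 and 5 are used out of 6.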
diff --git a/tensorflow/contrib/boosted_trees/kernels/training_ops.cc b/tensorflow/contrib/boosted_trees/kernels/training_ops.cc
index 7f8dea1d3c2a04b725843f6e2932a0cdfbc7733c..1bfeed306641111718984b2097512e5ec3fa8630 100644
--- a/tensorflow/contrib/boosted_trees/kernels/training_ops.cc
+++ b/tensorflow/contrib/boosted_trees/kernels/training_ops.cc
@@ -361,27 +361,10 @@ class GrowTreeEnsembleOp : public OpKernel {
     // Increment attempt stats.
     ensemble_resource->IncrementAttempts();
 
-    // In case we want to do feature selection and we have reached the limit,
-    // build a list of handlers used so far to avoid adding new features.
-    std::vector<int64> allowed_handlers;
-    if (learner_config_.constraints().max_number_of_unique_feature_columns() >
-        0) {
-      allowed_handlers = ensemble_resource->GetUsedHandlers();
-      // TODO(soroush): We can disable handlers that are not going to be used to
-      // avoid unnecessary computations.
-      if (allowed_handlers.size() <
-          learner_config_.constraints()
-              .max_number_of_unique_feature_columns()) {
-        // We have not reached the limit yet. Empty the list of allow features
-        // which means we can keep adding new features.
-        allowed_handlers.clear();
-      }
-    }
-
     // Find best splits for each active partition.
     std::map<int32, SplitCandidate> best_splits;
-    FindBestSplitsPerPartition(context, allowed_handlers, partition_ids_list,
-                               gains_list, splits_list, &best_splits);
+    FindBestSplitsPerPartition(context, partition_ids_list, gains_list,
+                               splits_list, &best_splits);
 
     // No-op if no new splits can be considered.
     if (best_splits.empty()) {
@@ -422,19 +405,12 @@ class GrowTreeEnsembleOp : public OpKernel {
   // and finds the best split for each partition.
   void FindBestSplitsPerPartition(
       OpKernelContext* const context,
-      const std::vector<int64>& allowed_handlers,  // Empty means all handlers.
       const OpInputList& partition_ids_list, const OpInputList& gains_list,
       const OpInputList& splits_list,
       std::map<int32, SplitCandidate>* best_splits) {
     // Find best split per partition going through every feature candidate.
     // TODO(salehay): Is this worth parallelizing?
     for (int64 handler_id = 0; handler_id < num_handlers_; ++handler_id) {
-      if (!allowed_handlers.empty()) {
-        if (!std::binary_search(allowed_handlers.begin(),
-                                allowed_handlers.end(), handler_id)) {
-          continue;
-        }
-      }
       const auto& partition_ids = partition_ids_list[handler_id].vec<int32>();
       const auto& gains = gains_list[handler_id].vec<float>();
       const auto& splits = splits_list[handler_id].vec<string>();
diff --git a/tensorflow/contrib/boosted_trees/ops/model_ops.cc b/tensorflow/contrib/boosted_trees/ops/model_ops.cc
index 0786c4166410720e8d4d70960e5747ff111076d8..9d6343c7e80f369bf6a5465821c5f4bacb984cd0 100644
--- a/tensorflow/contrib/boosted_trees/ops/model_ops.cc
+++ b/tensorflow/contrib/boosted_trees/ops/model_ops.cc
@@ -110,5 +110,32 @@ stamp_token: Token to use as the new value of the resource stamp.
 tree_ensemble_config: Serialized proto of the ensemble.
 )doc");
 
+REGISTER_OP("TreeEnsembleUsedHandlers")
+    .Attr("num_all_handlers: int >= 0")
+    .Input("tree_ensemble_handle: resource")
+    .Input("stamp_token: int64")
+    .Output("num_used_handlers: int64")
+    .Output("used_handlers_mask: bool")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      shape_inference::ShapeHandle unused_input;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 0, &unused_input));
+      c->set_output(0, c->Scalar());
+      int num_all_handlers;
+      c->GetAttr("num_all_handlers", &num_all_handlers).IgnoreError();
+      c->set_output(1, {c->Vector(num_all_handlers)});
+
+      return Status::OK();
+    })
+    .Doc(R"doc(
+Returns the mask of used handlers along with the number of non-zero elements in
+this mask. Used in feature selection.
+
+tree_ensemble_handle: Handle to the tree ensemble.
+stamp_token: Token used to validate the current stamp of the resource.
+num_used_handlers: number of feature column handlers used in the model.
+used_handlers_mask: A boolean vector showing which handlers are used in the
+                    model.
+)doc");
+
 }  // namespace boosted_trees
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/boosted_trees/python/kernel_tests/model_ops_test.py b/tensorflow/contrib/boosted_trees/python/kernel_tests/model_ops_test.py
index 27c288bbf78b3b593d0807e92ac7fd9afc4d2725..63b9c5fddf0d9967d53077608664b59d9ae00481 100644
--- a/tensorflow/contrib/boosted_trees/python/kernel_tests/model_ops_test.py
+++ b/tensorflow/contrib/boosted_trees/python/kernel_tests/model_ops_test.py
@@ -310,6 +310,22 @@ class ModelOpsTest(test_util.TensorFlowTestCase):
         # The third tree was added after the save.
         self.assertAllClose(result.eval(), [[-1.1], [-1.1]])
 
+  def testUsedHandlers(self):
+    with self.test_session():
+      tree_ensemble_config = tree_config_pb2.DecisionTreeEnsembleConfig()
+      tree_ensemble_config.growing_metadata.used_handler_ids.append(1)
+      tree_ensemble_config.growing_metadata.used_handler_ids.append(5)
+      stamp_token = 3
+      tree_ensemble_handle = model_ops.tree_ensemble_variable(
+          stamp_token=stamp_token,
+          tree_ensemble_config=tree_ensemble_config.SerializeToString(),
+          name="create_tree")
+      resources.initialize_resources(resources.shared_resources()).run()
+      result = model_ops.tree_ensemble_used_handlers(
+          tree_ensemble_handle, stamp_token, num_all_handlers=6)
+      self.assertAllEqual([0, 1, 0, 0, 0, 1], result.used_handlers_mask.eval())
+      self.assertEqual(2, result.num_used_handlers.eval())
+
 
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py b/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py
index 8ca1aabacaf53b66aaba184962922294427d6803..3e524efbeac74ff754d63cae92b3e194411cb2de 100644
--- a/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py
+++ b/tensorflow/contrib/boosted_trees/python/kernel_tests/training_ops_test.py
@@ -1588,7 +1588,7 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
       self.assertEqual(
           2, tree_ensemble_config.tree_metadata[2].num_tree_weight_updates)
 
-  def testGrowExistingEnsembleTreeWithFeatureSelectionCanStillGrow(self):
+  def testGrowExistingEnsembleTreeWithFeatureSelectionUsedHandlers(self):
     """Test growing a tree with feature selection."""
     with self.test_session() as session:
       # Create existing ensemble with one root split and one bias tree.
@@ -1649,7 +1649,6 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           num_trees_attempted: 2
           num_layers_attempted: 2
           used_handler_ids: 2
-          used_handler_ids: 5
         }
       """, tree_ensemble_config)
       tree_ensemble_handle = model_ops.tree_ensemble_variable(
@@ -1668,183 +1667,8 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
           min_node_weight=0,
           pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
           growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
-      # There are 2 handler_ids in used_handler_ids already but one of them
-      # is handler 2, so we can still grow trees.
-      learner_config.constraints.max_number_of_unique_feature_columns = 2
-      learner_config = learner_config.SerializeToString()
-      # Prepare handler inputs.
-      handler1_partitions = np.array([0], dtype=np.int32)
-      handler1_gains = np.array([7.62], dtype=np.float32)
-      handler1_split = [_gen_dense_split_info(5, 0.52, -4.375, 7.143)]
-      handler2_partitions = np.array([0], dtype=np.int32)
-      handler2_gains = np.array([0.63], dtype=np.float32)
-      handler2_split = [_gen_dense_split_info(2, 0.23, -0.6, 0.24)]
-      handler3_partitions = np.array([0], dtype=np.int32)
-      handler3_gains = np.array([7.62], dtype=np.float32)
-      handler3_split = [_gen_categorical_split_info(8, 7, -4.375, 7.143)]
-
-      # Grow tree ensemble.
-      grow_op = training_ops.grow_tree_ensemble(
-          tree_ensemble_handle,
-          stamp_token=0,
-          next_stamp_token=1,
-          learning_rate=1,
-          partition_ids=[
-              handler1_partitions, handler2_partitions, handler3_partitions
-          ],
-          gains=[handler1_gains, handler2_gains, handler3_gains],
-          splits=[handler1_split, handler2_split, handler3_split],
-          learner_config=learner_config,
-          dropout_seed=123,
-          center_bias=True)
-      session.run(grow_op)
-
-      # Expect a new tree to be added with the split from handler 1.
-      _, serialized = session.run(
-          model_ops.tree_ensemble_serialize(tree_ensemble_handle))
-      tree_ensemble_config.ParseFromString(serialized)
-      self.assertEqual(3, len(tree_ensemble_config.trees))
-      self.assertEqual(
-          2, len(tree_ensemble_config.growing_metadata.used_handler_ids))
-
-  def testGrowExistingEnsembleTreeWithFeatureSelectionEmptyEnsemble(self):
-    """Test growing a tree with feature selection with empty ensemble."""
-    with self.test_session() as session:
-      # Create existing ensemble with one root split and one bias tree.
-      tree_ensemble_config = tree_config_pb2.DecisionTreeEnsembleConfig()
-      tree_ensemble_handle = model_ops.tree_ensemble_variable(
-          stamp_token=0,
-          tree_ensemble_config=tree_ensemble_config.SerializeToString(),
-          name="tree_ensemble")
-      resources.initialize_resources(resources.shared_resources()).run()
-
-      # Prepare learner config.
-      learner_config = _gen_learner_config(
-          num_classes=2,
-          l1_reg=0,
-          l2_reg=0,
-          tree_complexity=0,
-          max_depth=1,
-          min_node_weight=0,
-          pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
-          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
-      learner_config.constraints.max_number_of_unique_feature_columns = 2
-      learner_config = learner_config.SerializeToString()
-      # Prepare handler inputs.
-      handler1_partitions = np.array([0], dtype=np.int32)
-      handler1_gains = np.array([7.62], dtype=np.float32)
-      handler1_split = [_gen_dense_split_info(5, 0.52, -4.375, 7.143)]
-      handler2_partitions = np.array([0], dtype=np.int32)
-      handler2_gains = np.array([0.63], dtype=np.float32)
-      handler2_split = [_gen_dense_split_info(2, 0.23, -0.6, 0.24)]
-      handler3_partitions = np.array([0], dtype=np.int32)
-      handler3_gains = np.array([7.62], dtype=np.float32)
-      handler3_split = [_gen_categorical_split_info(8, 7, -4.375, 7.143)]
-
-      # Grow tree ensemble.
-      grow_op = training_ops.grow_tree_ensemble(
-          tree_ensemble_handle,
-          stamp_token=0,
-          next_stamp_token=1,
-          learning_rate=1,
-          partition_ids=[
-              handler1_partitions, handler2_partitions, handler3_partitions
-          ],
-          gains=[handler1_gains, handler2_gains, handler3_gains],
-          splits=[handler1_split, handler2_split, handler3_split],
-          learner_config=learner_config,
-          dropout_seed=123,
-          center_bias=True)
-      session.run(grow_op)
-
-      _, serialized = session.run(
-          model_ops.tree_ensemble_serialize(tree_ensemble_handle))
-      tree_ensemble_config.ParseFromString(serialized)
-      self.assertEqual(1, len(tree_ensemble_config.trees))
-      self.assertEqual(
-          1, len(tree_ensemble_config.growing_metadata.used_handler_ids))
-
-  def testGrowExistingEnsembleTreeWithFeatureSelectionCantGrow(self):
-    """Test growing a tree with feature selection with empty ensemble."""
-    with self.test_session() as session:
-      # Create existing ensemble with one root split and one bias tree.
-      tree_ensemble_config = tree_config_pb2.DecisionTreeEnsembleConfig()
-      text_format.Merge("""
-        trees {
-          nodes {
-            leaf {
-              vector {
-                value: -0.32
-                value: 0.28
-              }
-            }
-          }
-        }
-        trees {
-          nodes {
-            categorical_id_binary_split {
-              feature_column: 3
-              feature_id: 7
-              left_id: 1
-              right_id: 2
-            }
-            node_metadata {
-              gain: 1.3
-            }
-          }
-          nodes {
-            leaf {
-              sparse_vector {
-                index: 0
-                value: 2.3
-              }
-            }
-          }
-          nodes {
-            leaf {
-              sparse_vector {
-                index: 0
-                value: -0.9
-              }
-            }
-          }
-        }
-        tree_weights: 0.7
-        tree_weights: 1
-        tree_metadata {
-          num_tree_weight_updates: 1
-          num_layers_grown: 1
-          is_finalized: true
-        }
-        tree_metadata {
-          num_tree_weight_updates: 5
-          num_layers_grown: 1
-          is_finalized: true
-        }
-        growing_metadata {
-          num_trees_attempted: 2
-          num_layers_attempted: 2
-          used_handler_ids: 4
-          used_handler_ids: 5
-        }
-      """, tree_ensemble_config)
-      tree_ensemble_handle = model_ops.tree_ensemble_variable(
-          stamp_token=0,
-          tree_ensemble_config=tree_ensemble_config.SerializeToString(),
-          name="tree_ensemble")
-      resources.initialize_resources(resources.shared_resources()).run()
 
-      # Prepare learner config.
-      learner_config = _gen_learner_config(
-          num_classes=2,
-          l1_reg=0,
-          l2_reg=0,
-          tree_complexity=0,
-          max_depth=1,
-          min_node_weight=0,
-          pruning_mode=learner_pb2.LearnerConfig.PRE_PRUNE,
-          growing_mode=learner_pb2.LearnerConfig.WHOLE_TREE)
-      learner_config.constraints.max_number_of_unique_feature_columns = 2
+      learner_config.constraints.max_number_of_unique_feature_columns = 3
       learner_config = learner_config.SerializeToString()
       # Prepare handler inputs.
       handler1_partitions = np.array([0], dtype=np.int32)
@@ -1876,12 +1700,10 @@ class GrowTreeEnsembleOpTest(test_util.TensorFlowTestCase):
       _, serialized = session.run(
           model_ops.tree_ensemble_serialize(tree_ensemble_handle))
       tree_ensemble_config.ParseFromString(serialized)
-      # We can't grow a tree since we have reached the limit of 2 unique
-      # features [4, 5] and the only available splits are from
-      # handlers [0, 1, 2].
-      self.assertEqual(2, len(tree_ensemble_config.trees))
-      self.assertEqual(
-          2, len(tree_ensemble_config.growing_metadata.used_handler_ids))
+      self.assertEqual(3, len(tree_ensemble_config.trees))
+      # Handler 2 was already used; handler 0 is being added in this tree.
+      self.assertAllEqual(
+          [0, 2], tree_ensemble_config.growing_metadata.used_handler_ids)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/contrib/boosted_trees/python/ops/model_ops.py b/tensorflow/contrib/boosted_trees/python/ops/model_ops.py
index 7a5f509047d46549ba81039a23d29ec987ca7920..25b2c9e2fd72bd018717e8a87fce726f26bad968 100644
--- a/tensorflow/contrib/boosted_trees/python/ops/model_ops.py
+++ b/tensorflow/contrib/boosted_trees/python/ops/model_ops.py
@@ -25,6 +25,7 @@ from tensorflow.contrib.boosted_trees.python.ops.gen_model_ops import tree_ensem
 from tensorflow.contrib.boosted_trees.python.ops.gen_model_ops import tree_ensemble_serialize
 # pylint: disable=unused-import
 from tensorflow.contrib.boosted_trees.python.ops.gen_model_ops import tree_ensemble_stamp_token
+from tensorflow.contrib.boosted_trees.python.ops.gen_model_ops import tree_ensemble_used_handlers
 # pylint: enable=unused-import
 
 from tensorflow.python.framework import ops
diff --git a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
index f0b66dcbbe1c5167b9993e66b30b1dc8a839c380..233e21f1cf286a51c27810f3b42511e698e23281 100644
--- a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
+++ b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch.py
@@ -57,6 +57,8 @@ PREDICTIONS = "predictions"
 PARTITION_IDS = "partition_ids"
 NUM_LAYERS_ATTEMPTED = "num_layers"
 NUM_TREES_ATTEMPTED = "num_trees"
+NUM_USED_HANDLERS = "num_used_handlers"
+USED_HANDLERS_MASK = "used_handlers_mask"
 _FEATURE_NAME_TEMPLATE = "%s_%d"
 
 
@@ -70,7 +72,8 @@ def _get_column_by_index(tensor, indices):
   return array_ops.reshape(array_ops.gather(p_flat, i_flat), [shape[0], -1])
 
 
-def _make_predictions_dict(stamp, logits, partition_ids, ensemble_stats):
+def _make_predictions_dict(stamp, logits, partition_ids, ensemble_stats,
+                           used_handlers):
   """Returns predictions for the given logits and n_classes.
 
   Args:
@@ -79,6 +82,8 @@ def _make_predictions_dict(stamp, logits, partition_ids, ensemble_stats):
         that contains predictions when no dropout was applied.
     partition_ids: A rank 1 `Tensor` with shape [batch_size].
     ensemble_stats: A TreeEnsembleStatsOp result tuple.
+    used_handlers: A TreeEnsembleUsedHandlerOp result tuple of an int and a
+        boolean mask.
 
   Returns:
     A dict of predictions.
@@ -89,6 +94,8 @@ def _make_predictions_dict(stamp, logits, partition_ids, ensemble_stats):
   result[PARTITION_IDS] = partition_ids
   result[NUM_LAYERS_ATTEMPTED] = ensemble_stats.attempted_layers
   result[NUM_TREES_ATTEMPTED] = ensemble_stats.attempted_trees
+  result[NUM_USED_HANDLERS] = used_handlers.num_used_handlers
+  result[USED_HANDLERS_MASK] = used_handlers.used_handlers_mask
   return result
 
 
@@ -361,6 +368,13 @@ class GradientBoostedDecisionTreeModel(object):
     """
     ensemble_stats = training_ops.tree_ensemble_stats(ensemble_handle,
                                                       ensemble_stamp)
+    num_handlers = (
+        len(self._dense_floats) + len(self._sparse_float_shapes) +
+        len(self._sparse_int_shapes))
+    # Used during feature selection.
+    used_handlers = model_ops.tree_ensemble_used_handlers(
+        ensemble_handle, ensemble_stamp, num_all_handlers=num_handlers)
+
     # We don't need dropout info - we can always restore it based on the
     # seed.
     apply_dropout, seed = _dropout_params(mode, ensemble_stats)
@@ -395,7 +409,7 @@ class GradientBoostedDecisionTreeModel(object):
           use_locking=True)
 
     return _make_predictions_dict(ensemble_stamp, predictions, partition_ids,
-                                  ensemble_stats)
+                                  ensemble_stats, used_handlers)
 
   def predict(self, mode):
     """Returns predictions given the features and mode.
@@ -716,6 +730,22 @@ class GradientBoostedDecisionTreeModel(object):
     else:
       active_handlers = array_ops.ones([len(handlers), 2], dtype=dtypes.bool)
 
+    if self._learner_config.constraints.max_number_of_unique_feature_columns:
+      target = (
+          self._learner_config.constraints.max_number_of_unique_feature_columns)
+
+      def _feature_selection_active_handlers():
+        # The active list for the current and the next iteration.
+        used_handlers = array_ops.reshape(predictions_dict[USED_HANDLERS_MASK],
+                                          [-1, 1])
+        used_handlers = array_ops.concat([used_handlers, used_handlers], axis=1)
+        return math_ops.logical_and(used_handlers, active_handlers)
+
+      active_handlers = (
+          control_flow_ops.cond(predictions_dict[NUM_USED_HANDLERS] >= target,
+                                _feature_selection_active_handlers,
+                                lambda: active_handlers))
+
     # Prepare empty gradients and hessians when handlers are not ready.
     empty_hess_shape = [1] + hessian_shape.as_list()
     empty_grad_shape = [1] + gradient_shape.as_list()
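Note on the hunk above: the feature-selection branch can be sketched outside the graph. The snippet below is a minimal NumPy illustration, not the TensorFlow implementation. The function name `select_active_handlers` and the sample values are hypothetical, and it assumes `active_handlers` has shape [num_handlers, 2], as the `array_ops.ones([len(handlers), 2])` context line suggests.

```python
import numpy as np


def select_active_handlers(active_handlers, used_handlers_mask,
                           num_used_handlers, max_unique_feature_columns):
    """Hypothetical NumPy analogue of the cond added in the hunk above."""
    active_handlers = np.asarray(active_handlers, dtype=bool)
    # A limit of 0 is treated as "no constraint", mirroring the truthiness
    # check on max_number_of_unique_feature_columns.
    if not max_unique_feature_columns:
        return active_handlers
    # Below the limit, every currently active handler stays active.
    if num_used_handlers < max_unique_feature_columns:
        return active_handlers
    # Limit reached: repeat the per-handler mask over both columns
    # (current and next iteration) and keep only already-used handlers.
    mask = np.asarray(used_handlers_mask, dtype=bool).reshape(-1, 1)
    mask = np.concatenate([mask, mask], axis=1)
    return np.logical_and(mask, active_handlers)


# Two handlers, limit of one feature column, handler 0 already used.
active = np.ones([2, 2], dtype=bool)
print(select_active_handlers(active, [True, False], 1, 1))
```

With the limit reached, only the row for handler 0 stays enabled, which is the situation the `ReachedLimitNoGoodSplit` test sets up.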
diff --git a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py
index dba51d4f527792d2a8dedc693f74c07119fd231d..6411f57a5419123e799af9231a04fce8ae7724d4 100644
--- a/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py
+++ b/tensorflow/contrib/boosted_trees/python/training/functions/gbdt_batch_test.py
@@ -47,6 +47,38 @@ def _squared_loss(label, unused_weights, predictions):
   return loss
 
 
+def _append_to_leaf(leaf, c_id, w):
+  """Helper method for building tree leaves.
+
+  Appends weight contributions for the given class index to a leaf node.
+
+  Args:
+    leaf: leaf node to append to.
+    c_id: class Id for the weight update.
+    w: weight contribution value.
+  """
+  leaf.sparse_vector.index.append(c_id)
+  leaf.sparse_vector.value.append(w)
+
+
+def _set_float_split(split, feat_col, thresh, l_id, r_id):
+  """Helper method for building tree float splits.
+
+  Sets split feature column, threshold and children.
+
+  Args:
+    split: split node to update.
+    feat_col: feature column for the split.
+    thresh: threshold to split on forming rule x <= thresh.
+    l_id: left child Id.
+    r_id: right child Id.
+  """
+  split.feature_column = feat_col
+  split.threshold = thresh
+  split.left_id = l_id
+  split.right_id = r_id
+
+
 class GbdtTest(test_util.TensorFlowTestCase):
 
   def setUp(self):
@@ -917,6 +949,350 @@ class GbdtTest(test_util.TensorFlowTestCase):
           output.trees[0].nodes[2].leaf.sparse_vector.value[0],
           atol=1e-4, rtol=1e-4)
 
+  def testTrainFnChiefFeatureSelectionReachedLimitNoGoodSplit(self):
+    """Tests the train function running on chief with feature selection."""
+    with self.test_session() as sess:
+      ensemble_handle = model_ops.tree_ensemble_variable(
+          stamp_token=0, tree_ensemble_config="", name="tree_ensemble")
+      learner_config = learner_pb2.LearnerConfig()
+      learner_config.learning_rate_tuner.fixed.learning_rate = 0.1
+      learner_config.num_classes = 2
+      learner_config.regularization.l1 = 0
+      learner_config.regularization.l2 = 0
+      learner_config.constraints.max_tree_depth = 1
+      learner_config.constraints.max_number_of_unique_feature_columns = 1
+      learner_config.constraints.min_node_weight = 0
+      features = {}
+      features["dense_float_0"] = array_ops.ones([4, 1], dtypes.float32)
+      # Feature 1 is predictive but it won't be used because we have reached the
+      # limit of num_used_handlers >= max_number_of_unique_feature_columns
+      features["dense_float_1"] = array_ops.constant([0, 0, 1, 1],
+                                                     dtypes.float32)
+
+      gbdt_model = gbdt_batch.GradientBoostedDecisionTreeModel(
+          is_chief=True,
+          num_ps_replicas=0,
+          center_bias=False,
+          ensemble_handle=ensemble_handle,
+          examples_per_layer=1,
+          learner_config=learner_config,
+          logits_dimension=1,
+          features=features)
+
+      predictions = array_ops.constant(
+          [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
+      partition_ids = array_ops.zeros([4], dtypes.int32)
+      ensemble_stamp = variables.Variable(
+          initial_value=0,
+          name="ensemble_stamp",
+          trainable=False,
+          dtype=dtypes.int64)
+
+      predictions_dict = {
+          "predictions":
+              predictions,
+          "predictions_no_dropout":
+              predictions,
+          "partition_ids":
+              partition_ids,
+          "ensemble_stamp":
+              ensemble_stamp,
+          "num_trees":
+              12,
+          "num_used_handlers":
+              array_ops.constant(1, dtype=dtypes.int64),
+          "used_handlers_mask":
+              array_ops.constant([True, False], dtype=dtypes.bool),
+      }
+
+      labels = array_ops.constant([0, 0, 1, 1], dtypes.float32)
+      weights = array_ops.ones([4, 1], dtypes.float32)
+      # Create train op.
+      train_op = gbdt_model.train(
+          loss=math_ops.reduce_mean(
+              _squared_loss(labels, weights, predictions)),
+          predictions_dict=predictions_dict,
+          labels=labels)
+      variables.global_variables_initializer().run()
+      resources.initialize_resources(resources.shared_resources()).run()
+
+      # On first run, expect no splits to be chosen because the quantile
+      # buckets will not be ready.
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      self.assertEquals(len(output.trees), 0)
+      self.assertEquals(len(output.tree_weights), 0)
+      self.assertEquals(stamp_token.eval(), 1)
+
+      # Update the stamp to be able to run a second time.
+      sess.run([ensemble_stamp.assign_add(1)])
+
+      # On second run, expect a trivial split to be chosen to basically
+      # predict the average.
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      self.assertEquals(len(output.trees), 1)
+      self.assertAllClose(output.tree_weights, [0.1])
+      self.assertEquals(stamp_token.eval(), 2)
+      expected_tree = """
+          nodes {
+            dense_float_binary_split {
+              feature_column: 0
+              threshold: 1.0
+              left_id: 1
+              right_id: 2
+            }
+            node_metadata {
+              gain: 0
+            }
+          }
+          nodes {
+            leaf {
+              vector {
+                value: -0.25
+              }
+            }
+          }
+          nodes {
+            leaf {
+              vector {
+                value: 0.0
+              }
+            }
+          }"""
+      self.assertProtoEquals(expected_tree, output.trees[0])
+
+  def testTrainFnChiefFeatureSelectionWithGoodSplits(self):
+    """Tests the train function running on chief with feature selection."""
+    with self.test_session() as sess:
+      ensemble_handle = model_ops.tree_ensemble_variable(
+          stamp_token=0, tree_ensemble_config="", name="tree_ensemble")
+      learner_config = learner_pb2.LearnerConfig()
+      learner_config.learning_rate_tuner.fixed.learning_rate = 0.1
+      learner_config.num_classes = 2
+      learner_config.regularization.l1 = 0
+      learner_config.regularization.l2 = 0
+      learner_config.constraints.max_tree_depth = 1
+      learner_config.constraints.max_number_of_unique_feature_columns = 1
+      learner_config.constraints.min_node_weight = 0
+      features = {}
+      features["dense_float_0"] = array_ops.ones([4, 1], dtypes.float32)
+      # Feature 1 is predictive and is in our selected features so it will be
+      # used even when we're at the limit.
+      features["dense_float_1"] = array_ops.constant([0, 0, 1, 1],
+                                                     dtypes.float32)
+
+      gbdt_model = gbdt_batch.GradientBoostedDecisionTreeModel(
+          is_chief=True,
+          num_ps_replicas=0,
+          center_bias=False,
+          ensemble_handle=ensemble_handle,
+          examples_per_layer=1,
+          learner_config=learner_config,
+          logits_dimension=1,
+          features=features)
+
+      predictions = array_ops.constant(
+          [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
+      partition_ids = array_ops.zeros([4], dtypes.int32)
+      ensemble_stamp = variables.Variable(
+          initial_value=0,
+          name="ensemble_stamp",
+          trainable=False,
+          dtype=dtypes.int64)
+
+      predictions_dict = {
+          "predictions":
+              predictions,
+          "predictions_no_dropout":
+              predictions,
+          "partition_ids":
+              partition_ids,
+          "ensemble_stamp":
+              ensemble_stamp,
+          "num_trees":
+              12,
+          "num_used_handlers":
+              array_ops.constant(1, dtype=dtypes.int64),
+          "used_handlers_mask":
+              array_ops.constant([False, True], dtype=dtypes.bool),
+      }
+
+      labels = array_ops.constant([0, 0, 1, 1], dtypes.float32)
+      weights = array_ops.ones([4, 1], dtypes.float32)
+      # Create train op.
+      train_op = gbdt_model.train(
+          loss=math_ops.reduce_mean(
+              _squared_loss(labels, weights, predictions)),
+          predictions_dict=predictions_dict,
+          labels=labels)
+      variables.global_variables_initializer().run()
+      resources.initialize_resources(resources.shared_resources()).run()
+
+      # On first run, expect no splits to be chosen because the quantile
+      # buckets will not be ready.
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      self.assertEquals(len(output.trees), 0)
+      self.assertEquals(len(output.tree_weights), 0)
+      self.assertEquals(stamp_token.eval(), 1)
+
+      # Update the stamp to be able to run a second time.
+      sess.run([ensemble_stamp.assign_add(1)])
+
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+
+      self.assertEquals(len(output.trees), 1)
+      self.assertAllClose(output.tree_weights, [0.1])
+      self.assertEquals(stamp_token.eval(), 2)
+      expected_tree = """
+          nodes {
+            dense_float_binary_split {
+              feature_column: 1
+              left_id: 1
+              right_id: 2
+            }
+            node_metadata {
+              gain: 0.5
+            }
+          }
+          nodes {
+            leaf {
+              vector {
+                value: 0.0
+              }
+            }
+          }
+          nodes {
+            leaf {
+              vector {
+                value: -0.5
+              }
+            }
+          }"""
+      self.assertProtoEquals(expected_tree, output.trees[0])
+
+  def testTrainFnChiefFeatureSelectionReachedLimitIncrementAttemptedLayer(self):
+    """Tests the train function running on chief with feature selection."""
+    with self.test_session() as sess:
+      tree_ensemble_config = tree_config_pb2.DecisionTreeEnsembleConfig()
+      tree = tree_ensemble_config.trees.add()
+
+      _set_float_split(tree.nodes.add()
+                       .sparse_float_binary_split_default_right.split, 2, 4.0,
+                       1, 2)
+      _append_to_leaf(tree.nodes.add().leaf, 0, 0.5)
+      _append_to_leaf(tree.nodes.add().leaf, 1, 1.2)
+      tree_ensemble_config.tree_weights.append(1.0)
+      metadata = tree_ensemble_config.tree_metadata.add()
+      metadata.is_finalized = False
+      metadata.num_layers_grown = 1
+      tree_ensemble_config = tree_ensemble_config.SerializeToString()
+      ensemble_handle = model_ops.tree_ensemble_variable(
+          stamp_token=0, tree_ensemble_config=tree_ensemble_config,
+          name="tree_ensemble")
+      learner_config = learner_pb2.LearnerConfig()
+      learner_config.learning_rate_tuner.fixed.learning_rate = 0.1
+      learner_config.num_classes = 2
+      learner_config.regularization.l1 = 0
+      learner_config.regularization.l2 = 0
+      learner_config.constraints.max_tree_depth = 1
+      learner_config.constraints.max_number_of_unique_feature_columns = 1
+      learner_config.constraints.min_node_weight = 0
+      features = {}
+      # Both features will be disabled since the feature selection limit is
+      # already reached.
+      features["dense_float_0"] = array_ops.ones([4, 1], dtypes.float32)
+      features["dense_float_1"] = array_ops.constant([0, 0, 1, 1],
+                                                     dtypes.float32)
+
+      gbdt_model = gbdt_batch.GradientBoostedDecisionTreeModel(
+          is_chief=True,
+          num_ps_replicas=0,
+          center_bias=False,
+          ensemble_handle=ensemble_handle,
+          examples_per_layer=1,
+          learner_config=learner_config,
+          logits_dimension=1,
+          features=features)
+
+      predictions = array_ops.constant(
+          [[0.0], [1.0], [0.0], [2.0]], dtype=dtypes.float32)
+      partition_ids = array_ops.zeros([4], dtypes.int32)
+      ensemble_stamp = variables.Variable(
+          initial_value=0,
+          name="ensemble_stamp",
+          trainable=False,
+          dtype=dtypes.int64)
+
+      predictions_dict = {
+          "predictions":
+              predictions,
+          "predictions_no_dropout":
+              predictions,
+          "partition_ids":
+              partition_ids,
+          "ensemble_stamp":
+              ensemble_stamp,
+          "num_trees":
+              12,
+          # We have somehow reached our limit of 1. Both of the handlers will
+          # be disabled.
+          "num_used_handlers":
+              array_ops.constant(1, dtype=dtypes.int64),
+          "used_handlers_mask":
+              array_ops.constant([False, False], dtype=dtypes.bool),
+      }
+
+      labels = array_ops.constant([0, 0, 1, 1], dtypes.float32)
+      weights = array_ops.ones([4, 1], dtypes.float32)
+      # Create train op.
+      train_op = gbdt_model.train(
+          loss=math_ops.reduce_mean(
+              _squared_loss(labels, weights, predictions)),
+          predictions_dict=predictions_dict,
+          labels=labels)
+      variables.global_variables_initializer().run()
+      resources.initialize_resources(resources.shared_resources()).run()
+
+      # On first run, expect no splits to be chosen because the quantile
+      # buckets will not be ready.
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      self.assertEquals(len(output.trees), 1)
+      self.assertEquals(output.growing_metadata.num_layers_attempted, 1)
+      self.assertEquals(stamp_token.eval(), 1)
+
+      # Update the stamp to be able to run a second time.
+      sess.run([ensemble_stamp.assign_add(1)])
+
+      train_op.run()
+      stamp_token, serialized = model_ops.tree_ensemble_serialize(
+          ensemble_handle)
+      output = tree_config_pb2.DecisionTreeEnsembleConfig()
+      output.ParseFromString(serialized.eval())
+      # Make sure the trees are not modified, but the num_layers_attempted is
+      # incremented so that eventually the training stops.
+      self.assertEquals(len(output.trees), 1)
+      self.assertEquals(len(output.trees[0].nodes), 3)
+
+      self.assertEquals(output.growing_metadata.num_layers_attempted, 2)
 
 if __name__ == "__main__":
   googletest.main()
diff --git a/tensorflow/contrib/boosted_trees/resources/decision_tree_ensemble_resource.h b/tensorflow/contrib/boosted_trees/resources/decision_tree_ensemble_resource.h
index 3ebf28ea442edf87815c39971ae9e01a2a8aae9a..94aeb2c7bb48c6eddb6c7894f8bf6f1567470113 100644
--- a/tensorflow/contrib/boosted_trees/resources/decision_tree_ensemble_resource.h
+++ b/tensorflow/contrib/boosted_trees/resources/decision_tree_ensemble_resource.h
@@ -126,7 +126,8 @@ class DecisionTreeEnsembleResource : public StampedResource {
       return;
     }
     used_ids->Add(handler_id);
-    std::rotate(first, used_ids->end() - 1, used_ids->end());
+    // Keep the list of used handlers sorted.
+    std::sort(used_ids->begin(), used_ids->end());
   }
 
   std::vector<int64> GetUsedHandlers() const {
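Note on the hunk above: the change keeps the list of used handler ids sorted after each insertion. A rough Python sketch of that behaviour follows. The function name `add_used_handler` is hypothetical, and the duplicate check is an assumption, since the C++ code preceding the early `return` is not shown in this hunk.

```python
import bisect


def add_used_handler(used_ids, handler_id):
    """Hypothetical sketch: insert handler_id once, keeping used_ids sorted."""
    # Assumed duplicate check: skip ids that are already recorded.
    pos = bisect.bisect_left(used_ids, handler_id)
    if pos < len(used_ids) and used_ids[pos] == handler_id:
        return
    used_ids.append(handler_id)
    # Keep the list of used handlers sorted, as the new std::sort call does.
    used_ids.sort()


used_ids = [2]
add_used_handler(used_ids, 0)
add_used_handler(used_ids, 2)  # Already present, so nothing changes.
print(used_ids)
```

A sorted list is what lets the updated test compare `used_handler_ids` against `[0, 2]` directly.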
diff --git a/tensorflow/contrib/cluster_resolver/BUILD b/tensorflow/contrib/cluster_resolver/BUILD
index 80e18a43a71cc9d6c9e2ccf5836e50c6427a30f6..1a124eca364424b651de86bfaac6f33ad131804b 100644
--- a/tensorflow/contrib/cluster_resolver/BUILD
+++ b/tensorflow/contrib/cluster_resolver/BUILD
@@ -30,6 +30,7 @@ py_library(
         "python/training/__init__.py",
     ],
     srcs_version = "PY2AND3",
+    visibility = ["//visibility:public"],
     deps = [
         ":cluster_resolver_py",
         ":gce_cluster_resolver_py",
@@ -109,5 +110,6 @@ tf_py_test(
         "//tensorflow/python:platform_test",
         "//tensorflow/python:training",
     ],
+    grpc_enabled = True,
     main = "python/training/tpu_cluster_resolver_test.py",
 )
diff --git a/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver.py b/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver.py
index b04822fa9d66465e34a545d3b00c399bbb196514..1c480b25134b1e54200e0ddb780bd7bb0f122341 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver.py
@@ -53,11 +53,16 @@ class ClusterResolver(object):
     raise NotImplementedError(
         'cluster_spec is not implemented for {}.'.format(self))
 
+  @abc.abstractmethod
+  def master(self):
+    """..."""
+    raise NotImplementedError('master is not implemented for {}.'.format(self))
+
 
 class SimpleClusterResolver(ClusterResolver):
   """Simple implementation of ClusterResolver that accepts a ClusterSpec."""
 
-  def __init__(self, cluster_spec):
+  def __init__(self, cluster_spec, master=''):
     """Creates a SimpleClusterResolver from a ClusterSpec."""
     super(SimpleClusterResolver, self).__init__()
 
@@ -65,10 +70,18 @@ class SimpleClusterResolver(ClusterResolver):
       raise TypeError('cluster_spec must be a ClusterSpec.')
     self._cluster_spec = cluster_spec
 
+    if not isinstance(master, str):
+      raise TypeError('master must be a string.')
+    self._master = master
+
   def cluster_spec(self):
     """Returns the ClusterSpec passed into the constructor."""
     return self._cluster_spec
 
+  def master(self):
+    """Returns the master address to use when creating a session."""
+    return self._master
+
 
 class UnionClusterResolver(ClusterResolver):
   """Performs a union on underlying ClusterResolvers.
@@ -87,9 +100,13 @@ class UnionClusterResolver(ClusterResolver):
 
     Raises:
       TypeError: If any argument is not a subclass of `ClusterResolvers`.
+      ValueError: If there are no arguments passed.
     """
     super(UnionClusterResolver, self).__init__()
 
+    if not args:
+      raise ValueError('At least one ClusterResolver is required.')
+
     for cluster_resolver in args:
       if not isinstance(cluster_resolver, ClusterResolver):
         raise TypeError('All arguments must be a sub-class of '
@@ -169,3 +186,7 @@ class UnionClusterResolver(ClusterResolver):
           merged_cluster[job_name].update(task_dict)
 
     return ClusterSpec(merged_cluster)
+
+  def master(self):
+    """Returns the master address from the first cluster resolver."""
+    return self._cluster_resolvers[0].master()
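Note on the hunks above: the new `master()` contract can be illustrated without TensorFlow. The classes below are hypothetical stand-ins (`SketchResolver`, `SketchUnionResolver`) that model only the master handling shown in this diff; the ClusterSpec logic is omitted and the addresses are made-up examples.

```python
class SketchResolver(object):
    """Hypothetical stand-in for SimpleClusterResolver (master only)."""

    def __init__(self, master=''):
        if not isinstance(master, str):
            raise TypeError('master must be a string.')
        self._master = master

    def master(self):
        return self._master


class SketchUnionResolver(object):
    """Hypothetical stand-in for UnionClusterResolver (master only)."""

    def __init__(self, *args):
        # Mirrors the new ValueError for an empty argument list.
        if not args:
            raise ValueError('At least one ClusterResolver is required.')
        self._cluster_resolvers = list(args)

    def master(self):
        # The union defers to the first resolver it was given.
        return self._cluster_resolvers[0].master()


union = SketchUnionResolver(SketchResolver('grpc://10.0.0.1:8470'),
                            SketchResolver('grpc://10.0.0.2:8470'))
print(union.master())
```

Requiring at least one resolver is what makes indexing `self._cluster_resolvers[0]` safe in the union's `master()`.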
diff --git a/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver_test.py b/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver_test.py
index dbfb77723cdaab66e29bb41b764593bb5fd61b35..d9c97d53eb3663f6ab2f7b40395592dc7638b896 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver_test.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/cluster_resolver_test.py
@@ -234,5 +234,7 @@ class UnionClusterResolverTest(test.TestCase):
     self._verifyClusterSpecEquality(cluster_spec, expected_proto)
 
 
+# TODO(saeta): Include tests for master resolution
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/cluster_resolver/python/training/gce_cluster_resolver.py b/tensorflow/contrib/cluster_resolver/python/training/gce_cluster_resolver.py
index d6f2eced93ba4fda5ac27f9412b6f729981f4f40..3f5824128948453634bc5e5a7d6fdeedae60f5bd 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/gce_cluster_resolver.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/gce_cluster_resolver.py
@@ -134,3 +134,6 @@ class GceClusterResolver(ClusterResolver):
 
     worker_list.sort()
     return ClusterSpec({self._job_name: worker_list})
+
+  def master(self):
+    return ''
diff --git a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
index a6a6e642e4e4c721b94821a70d55d6fe931347d6..aeccf4c06bb57a03ac79e20a5e001935d847b2a7 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py
@@ -23,7 +23,8 @@ from six.moves.urllib.request import Request
 from six.moves.urllib.request import urlopen
 
 from tensorflow.contrib.cluster_resolver.python.training.cluster_resolver import ClusterResolver
-from tensorflow.python.training.server_lib import ClusterSpec
+from tensorflow.python.training import server_lib
+from tensorflow.python.util import compat
 
 _GOOGLE_API_CLIENT_INSTALLED = True
 try:
@@ -46,13 +47,23 @@ class TPUClusterResolver(ClusterResolver):
     req = Request('http://metadata/computeMetadata/v1/%s' % path,
                   headers={'Metadata-Flavor': 'Google'})
     resp = urlopen(req)
-    return resp.read()
+    return compat.as_bytes(resp.read())
+
+  def _shouldResolve(self):
+    if (self._tpu == compat.as_bytes('') or
+        self._tpu == compat.as_bytes('local') or
+        self._tpu.startswith(compat.as_bytes('/bns')) or
+        self._tpu.startswith(compat.as_bytes('grpc://'))):
+      return False
+    return True
 
   def __init__(self,
-               tpu_names,
+               tpu,
                zone=None,
                project=None,
-               job_name='tpu_worker',
+               job_name='worker',
+               coordinator_name='coordinator',
+               coordinator_address=None,
                credentials='default',
                service=None):
     """Creates a new TPUClusterResolver object.
@@ -61,7 +72,11 @@ class TPUClusterResolver(ClusterResolver):
     for the IP addresses and ports of each Cloud TPU listed.
 
     Args:
-      tpu_names: A list of names of the target Cloud TPUs.
+      tpu: Either a string, or a list of strings corresponding to the TPUs to
+        use. If the single string is the empty string, the string 'local', or a
+        string that begins with 'grpc://' or '/bns', then it is assumed to not
+        correspond with a Cloud TPU and will instead be passed as the session
+        master and no ClusterSpec propagation will be done.
       zone: Zone where the TPUs are located. If omitted or empty, we will assume
         that the zone of the TPU is the same as the zone of the GCE VM, which we
         will try to discover from the GCE metadata service.
@@ -69,6 +84,12 @@ class TPUClusterResolver(ClusterResolver):
         empty, we will try to discover the project name of the GCE VM from the
         GCE metadata service.
       job_name: Name of the TensorFlow job the TPUs belong to.
+      coordinator_name: The name to use for the coordinator. Set to None if the
+        coordinator should not be included in the computed ClusterSpec.
+      coordinator_address: The address of the coordinator (typically an ip:port
+        pair). If set to None, a TF server will be started. If coordinator_name
+        is None, a TF server will not be started even if coordinator_address is
+        None.
       credentials: GCE Credentials. If None, then we use default credentials
         from the oauth2client
       service: The GCE API object returned by the googleapiclient.discovery
@@ -77,26 +98,36 @@ class TPUClusterResolver(ClusterResolver):
 
     Raises:
       ImportError: If the googleapiclient is not installed.
+      ValueError: If no TPUs are specified.
     """
+    if isinstance(tpu, list):
+      if not tpu:
+        raise ValueError('At least one TPU must be specified.')
+      if len(tpu) != 1:
+        raise NotImplementedError(
+            'Using multiple TPUs in a single session is not yet implemented')
+      tpu = tpu[0]
+    self._tpu = compat.as_bytes(tpu)  # self._tpu is always bytes
+    self._job_name = job_name
+    self._credentials = credentials
 
-    if not project:
-      project = self._requestComputeMetadata('/project/project-id')
+    should_resolve = self._shouldResolve()
 
-    if not zone:
-      zone_path = self._requestComputeMetadata('/instance/zone')
+    if not project and should_resolve:
+      project = self._requestComputeMetadata('project/project-id')
+
+    if not zone and should_resolve:
+      zone_path = self._requestComputeMetadata('instance/zone')
       zone = zone_path.split('/')[-1]
 
     self._project = project
     self._zone = zone
-    self._tpu_names = tpu_names
-    self._job_name = job_name
-    self._credentials = credentials
 
-    if credentials == 'default':
+    if credentials == 'default' and should_resolve:
       if _GOOGLE_API_CLIENT_INSTALLED:
         self._credentials = GoogleCredentials.get_application_default()
 
-    if service is None:
+    if service is None and should_resolve:
       if not _GOOGLE_API_CLIENT_INSTALLED:
         raise ImportError('googleapiclient must be installed before using the '
                           'TPU cluster resolver')
@@ -107,25 +138,41 @@ class TPUClusterResolver(ClusterResolver):
     else:
       self._service = service
 
-  def get_master(self):
-    """Get the ClusterSpec grpc master path.
+    self._coordinator_name = coordinator_name
+    if coordinator_name and not coordinator_address and should_resolve:
+      self._start_local_server()
+    else:
+      self._coordinator_address = coordinator_address
+
+  def master(self):
+    """Get the Master string to be used for the session.
+
+    In the normal case, this returns the grpc path (grpc://1.2.3.4:8470) of
+    the first instance in the ClusterSpec returned by the cluster_spec function.
 
-    This returns the grpc path (grpc://1.2.3.4:8470) of first instance in the
-    ClusterSpec returned by the cluster_spec function. This is suitable for use
-    for the `master` argument in tf.Session() when you are using one TPU.
+    If a non-TPU name is used when constructing a TPUClusterResolver, that name
+    is returned instead (e.g. if the tpu argument's value when constructing
+    this TPUClusterResolver was 'grpc://10.240.1.2:8470',
+    'grpc://10.240.1.2:8470' will be returned).
 
     Returns:
-      string, the grpc path of the first instance in the ClusterSpec.
+      string, the connection string to use when creating a session.
 
     Raises:
       ValueError: If none of the TPUs specified exists.
     """
+    if not self._shouldResolve():
+      return self._tpu
+
     job_tasks = self.cluster_spec().job_tasks(self._job_name)
     if not job_tasks:
       raise ValueError('No TPUs exists with the specified names exist.')
 
     return 'grpc://' + job_tasks[0]
 
+  def get_master(self):
+    return self.master()
+
   def cluster_spec(self):
     """Returns a ClusterSpec object based on the latest TPU information.
 
@@ -134,17 +181,54 @@ class TPUClusterResolver(ClusterResolver):
 
     Returns:
       A ClusterSpec containing host information returned from Cloud TPUs.
-    """
-    worker_list = []
-
-    for tpu_name in self._tpu_names:
-      full_name = 'projects/%s/locations/%s/nodes/%s' % (
-          self._project, self._zone, tpu_name)
-      request = self._service.projects().locations().nodes().get(name=full_name)
-      response = request.execute()
 
-      if 'health' in response and response['health'] == 'HEALTHY':
-        instance_url = '%s:%s' % (response['ipAddress'], response['port'])
-        worker_list.append(instance_url)
-
-    return ClusterSpec({self._job_name: worker_list})
+    Raises:
+      RuntimeError: If the provided TPU is not healthy.
+    """
+    if not self._shouldResolve():
+      return server_lib.ClusterSpec({})
+
+    full_name = 'projects/%s/locations/%s/nodes/%s' % (
+        self._project, self._zone, compat.as_text(self._tpu))
+    request = self._service.projects().locations().nodes().get(name=full_name)
+    response = request.execute()
+
+    if 'health' in response and response['health'] != 'HEALTHY':
+      raise RuntimeError('TPU "%s" is unhealthy: "%s"' % (self._tpu,
+                                                          response['health']))
+
+    if 'networkEndpoints' in response:
+      worker_list = [
+          '%s:%s' % (endpoint['ipAddress'], endpoint['port'])
+          for endpoint in response['networkEndpoints']
+      ]
+    else:
+      # Fall back to the deprecated response format
+      instance_url = '%s:%s' % (response['ipAddress'], response['port'])
+      worker_list = [instance_url]
+
+    cluster_spec = {self._job_name: worker_list}
+
+    if self._coordinator_address:
+      cluster_spec[self._coordinator_name] = [self._coordinator_address]
+
+    return server_lib.ClusterSpec(cluster_spec)
+
+  def _start_local_server(self):
+    address = self._requestComputeMetadata('instance/network-interfaces/0/ip')
+    self._server = server_lib.Server(
+        {
+            'local': ['0.0.0.0:0']
+        }, protocol='grpc', config=None, start=True)
+    # self._server.target is of the form: grpc://ipaddress:port
+    target = compat.as_bytes(self._server.target)
+    splits = target.split(compat.as_bytes(':'))
+    assert len(splits) == 3, self._server.target
+    assert splits[0] == compat.as_bytes('grpc'), self._server.target
+    self._coordinator_port = compat.as_text(splits[2])
+    self._coordinator_address = '%s:%s' % (
+        address, compat.as_text(self._coordinator_port))
+
+  def __deepcopy__(self, memo):
+    # TODO(b/73668574): Remove this once RunConfig avoids performing deepcopy.
+    return self
diff --git a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py
index 4fd34629cf74f90869c77b8cb098d3c585a49404..6b4a15515262b35e3cf8d7d2943e06d86b870ca9 100644
--- a/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py
+++ b/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver_test.py
@@ -21,7 +21,7 @@ from __future__ import print_function
 from tensorflow.contrib.cluster_resolver.python.training.tpu_cluster_resolver import TPUClusterResolver
 from tensorflow.python.platform import test
 from tensorflow.python.training import server_lib
-
+from tensorflow.python.util import compat
 
 mock = test.mock
 
@@ -50,10 +50,12 @@ class MockNodeClass(object):
 
 def mock_request_compute_metadata(cls, *args, **kwargs):
   del cls, kwargs  # Unused.
-  if args[0] == '/project/project-id':
+  if args[0] == 'project/project-id':
     return 'test-project'
-  elif args[0] == '/instance/zone':
+  elif args[0] == 'instance/zone':
     return 'projects/test-project/locations/us-central1-c'
+  elif args[0] == 'instance/network-interfaces/0/ip':
+    return '10.128.1.2'
   return ''
 
 
@@ -113,17 +115,26 @@ class TPUClusterResolverTest(test.TestCase):
     tpu_cluster_resolver = TPUClusterResolver(
         project=None,
         zone=None,
-        tpu_names=['test-tpu-1'],
+        tpu=['test-tpu-1'],
         credentials=None,
         service=self.mock_service_client(tpu_map=tpu_map))
 
     actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
     expected_proto = """
-    job { name: 'tpu_worker' tasks { key: 0 value: '10.1.2.3:8470' } }
-    """
-    self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
+    job {
+      name: 'coordinator'
+      tasks { key: 0 value: '10.128.1.2:%s' }
+    }
+    job {
+      name: 'worker'
+      tasks { key: 0 value: '10.1.2.3:8470' }
+    }
+    """ % tpu_cluster_resolver._coordinator_port
+    self._verifyClusterSpecEquality(actual_cluster_spec, str(expected_proto))
 
-  def testSimpleSuccessfulRetrieval(self):
+  @mock.patch.object(TPUClusterResolver, '_requestComputeMetadata',
+                     mock_request_compute_metadata)
+  def testRetrieveProjectAndZoneFromMetadataNoCoordinator(self):
     tpu_map = {
         'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
             'ipAddress': '10.1.2.3',
@@ -133,116 +144,217 @@ class TPUClusterResolverTest(test.TestCase):
     }
 
     tpu_cluster_resolver = TPUClusterResolver(
-        project='test-project',
-        zone='us-central1-c',
-        tpu_names=['test-tpu-1'],
+        project=None,
+        zone=None,
+        tpu=['test-tpu-1'],
+        coordinator_name=None,
         credentials=None,
         service=self.mock_service_client(tpu_map=tpu_map))
 
     actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
     expected_proto = """
-    job { name: 'tpu_worker' tasks { key: 0 value: '10.1.2.3:8470' } }
+    job { name: 'worker' tasks { key: 0 value: '10.1.2.3:8470' } }
     """
     self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
 
-  def testMultipleSuccessfulRetrieval(self):
+  def testSimpleSuccessfulRetrieval(self):
     tpu_map = {
         'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
             'ipAddress': '10.1.2.3',
             'port': '8470',
             'health': 'HEALTHY'
-        },
-        'projects/test-project/locations/us-central1-c/nodes/test-tpu-2': {
-            'ipAddress': '10.4.5.6',
-            'port': '8470',
-            'health': 'HEALTHY'
         }
     }
 
     tpu_cluster_resolver = TPUClusterResolver(
         project='test-project',
         zone='us-central1-c',
-        tpu_names=['test-tpu-2', 'test-tpu-1'],
+        tpu=['test-tpu-1'],
+        coordinator_address='10.128.1.5:10203',
         credentials=None,
         service=self.mock_service_client(tpu_map=tpu_map))
 
     actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
     expected_proto = """
-    job { name: 'tpu_worker' tasks { key: 0 value: '10.4.5.6:8470' }
-                             tasks { key: 1 value: '10.1.2.3:8470' } }
+    job { name: 'coordinator' tasks { key: 0 value: '10.128.1.5:10203' } }
+    job { name: 'worker' tasks { key: 0 value: '10.1.2.3:8470' } }
     """
     self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
 
-  def testHealthyTpuNodeRetrieval(self):
+  def testNewNetworkEndpointFormat(self):
     tpu_map = {
         'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
-            'ipAddress': '10.1.2.3',
-            'port': '8470',
-            'health': 'HEALTHY'
-        },
-        'projects/test-project/locations/us-central1-c/nodes/test-tpu-2': {
-            'ipAddress': '10.4.5.6',
-            'port': '8470',
-        },
-        'projects/test-project/locations/us-central1-c/nodes/test-tpu-3': {
-            'ipAddress': '10.7.8.9',
-            'port': '8470',
-            'health': 'UNHEALTHY'
+            'health': 'HEALTHY',
+            'networkEndpoints': [{
+                'ipAddress': '10.2.3.4',
+                'port': 8470,
+            }]
         }
     }
 
     tpu_cluster_resolver = TPUClusterResolver(
         project='test-project',
         zone='us-central1-c',
-        tpu_names=['test-tpu-2', 'test-tpu-1', 'test-tpu-3'],
+        tpu='test-tpu-1',
+        coordinator_address='10.128.1.5:10203',
         credentials=None,
         service=self.mock_service_client(tpu_map=tpu_map))
 
     actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
     expected_proto = """
-    job {
-      name: 'tpu_worker'
-      tasks {
-        key: 0
-        value: '10.1.2.3:8470'
-      }
-    }
+    job { name: 'coordinator' tasks { key: 0 value: '10.128.1.5:10203' } }
+    job { name: 'worker' tasks { key: 0 value: '10.2.3.4:8470' } }
     """
     self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
+    self.assertEqual('grpc://10.2.3.4:8470', tpu_cluster_resolver.master())
 
-  def testGetMasterMultipleEntries(self):
+  @mock.patch.object(TPUClusterResolver, '_requestComputeMetadata',
+                     mock_request_compute_metadata)
+  def testPodResolution(self):
     tpu_map = {
         'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
-            'ipAddress': '10.1.2.3',
-            'port': '8470',
-            'health': 'HEALTHY'
-        },
-        'projects/test-project/locations/us-central1-c/nodes/test-tpu-2': {
-            'ipAddress': '10.4.5.6',
-            'port': '8470',
-            'health': 'HEALTHY'
+            'health':
+                'HEALTHY',
+            'networkEndpoints': [
+                {
+                    'ipAddress': '10.2.3.4',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.5',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.6',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.7',
+                    'port': 8470,
+                },
+            ]
+        }
+    }
+
+    tpu_cluster_resolver = TPUClusterResolver(
+        tpu='test-tpu-1',
+        credentials=None,
+        service=self.mock_service_client(tpu_map=tpu_map))
+
+    actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
+    expected_proto = """
+    job {
+      name: 'coordinator',
+      tasks { key: 0 value: '10.128.1.2:%s'}
+    }
+    job {
+      name: 'worker'
+      tasks { key: 0 value: '10.2.3.4:8470' }
+      tasks { key: 1 value: '10.2.3.5:8470' }
+      tasks { key: 2 value: '10.2.3.6:8470' }
+      tasks { key: 3 value: '10.2.3.7:8470' }
+    }
+    """ % tpu_cluster_resolver._coordinator_port
+    self._verifyClusterSpecEquality(actual_cluster_spec, str(expected_proto))
+
+  def testPodResolutionNoCoordinator(self):
+    tpu_map = {
+        'projects/test-project/locations/us-central1-c/nodes/test-tpu-1': {
+            'health':
+                'HEALTHY',
+            'networkEndpoints': [
+                {
+                    'ipAddress': '10.2.3.4',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.5',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.6',
+                    'port': 8470,
+                },
+                {
+                    'ipAddress': '10.2.3.7',
+                    'port': 8470,
+                },
+            ]
         }
     }
 
     tpu_cluster_resolver = TPUClusterResolver(
         project='test-project',
         zone='us-central1-c',
-        tpu_names=['test-tpu-2', 'test-tpu-1'],
+        tpu='test-tpu-1',
+        coordinator_name=None,
         credentials=None,
         service=self.mock_service_client(tpu_map=tpu_map))
-    self.assertEqual('grpc://10.4.5.6:8470', tpu_cluster_resolver.get_master())
+
+    actual_cluster_spec = tpu_cluster_resolver.cluster_spec()
+    expected_proto = """
+    job {
+      name: 'worker'
+      tasks { key: 0 value: '10.2.3.4:8470' }
+      tasks { key: 1 value: '10.2.3.5:8470' }
+      tasks { key: 2 value: '10.2.3.6:8470' }
+      tasks { key: 3 value: '10.2.3.7:8470' }
+    }
+    """
+    self._verifyClusterSpecEquality(actual_cluster_spec, expected_proto)
 
   def testGetMasterNoEntries(self):
     tpu_map = {}
 
+    with self.assertRaises(ValueError):
+      TPUClusterResolver(
+          project='test-project',
+          zone='us-central1-c',
+          tpu=[],
+          coordinator_name=None,
+          credentials=None,
+          service=self.mock_service_client(tpu_map=tpu_map))
+
+  # TODO(saeta): Convert to parameterized test when included in OSS TF.
+  def verifyShouldResolve(self, tpu, should_resolve):
     tpu_cluster_resolver = TPUClusterResolver(
         project='test-project',
         zone='us-central1-c',
-        tpu_names=[],
+        tpu=tpu,
+        coordinator_name=None,
         credentials=None,
-        service=self.mock_service_client(tpu_map=tpu_map))
-    with self.assertRaises(ValueError):
-      tpu_cluster_resolver.get_master()
+        service=self.mock_service_client(tpu_map={}))
+    self.assertEqual(should_resolve, tpu_cluster_resolver._shouldResolve(),
+                     "TPU: '%s'" % tpu)
+
+  def testShouldResolveNoName(self):
+    self.verifyShouldResolve('', False)
+
+  def testShouldResolveLocal(self):
+    self.verifyShouldResolve('local', False)
+
+  def testShouldResolveGrpc(self):
+    self.verifyShouldResolve('grpc://10.1.2.3:8470', False)
+
+  def testShouldResolveBns(self):
+    self.verifyShouldResolve('/bns/foo/bar', False)
+
+  def testShouldResolveName(self):
+    self.verifyShouldResolve('mytpu', True)
+
+  def testShouldResolveList(self):
+    self.verifyShouldResolve(['myothertpu'], True)
+
+  def testShouldResolveGrpcPrefix(self):
+    self.verifyShouldResolve('grpctpu', True)
+
+  def testNoCallComputeMetadata(self):
+    tpu_cluster_resolver = TPUClusterResolver(tpu='/bns/foo/bar')
+    self.assertEqual(compat.as_bytes('/bns/foo/bar'),
+                     tpu_cluster_resolver.master())
+    self.assertEqual(
+        server_lib.ClusterSpec({}), tpu_cluster_resolver.cluster_spec())
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/cmake/external/cub.cmake b/tensorflow/contrib/cmake/external/cub.cmake
index 836889895567f679d9960e29ece1600d1a7a58eb..98a8c7e736e5c8c407b90e8eac440cdc7ab21579 100644
--- a/tensorflow/contrib/cmake/external/cub.cmake
+++ b/tensorflow/contrib/cmake/external/cub.cmake
@@ -14,8 +14,8 @@
 # ==============================================================================
 include (ExternalProject)
 
-set(cub_URL https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.7.4.zip)
-set(cub_HASH SHA256=20a1a39fd97e5da7f40f5f2e7fd73fd2ea59f9dc4bb8a6c5f228aa543e727e31)
+set(cub_URL https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.8.0.zip)
+set(cub_HASH SHA256=6bfa06ab52a650ae7ee6963143a0bbc667d6504822cbd9670369b598f18c58c3)
 set(cub_BUILD ${CMAKE_CURRENT_BINARY_DIR}/cub/src/cub)
 set(cub_INCLUDE_DIR ${CMAKE_CURRENT_BINARY_DIR}/cub/src/cub)
 set(cub_ARCHIVE_DIR ${CMAKE_CURRENT_BINARY_DIR}/external/cub_archive)
diff --git a/tensorflow/contrib/cmake/python_modules.txt b/tensorflow/contrib/cmake/python_modules.txt
index bfe53c01b3b5fb9db8a5d8fa280d1d7f98974882..0d2a6a23db26af2fb9498849aa93e74379915fe3 100644
--- a/tensorflow/contrib/cmake/python_modules.txt
+++ b/tensorflow/contrib/cmake/python_modules.txt
@@ -165,6 +165,7 @@ tensorflow/contrib/distributions/python
 tensorflow/contrib/distributions/python/ops
 tensorflow/contrib/distributions/python/ops/bijectors
 tensorflow/contrib/eager
+tensorflow/contrib/eager/proto
 tensorflow/contrib/eager/python
 tensorflow/contrib/estimator
 tensorflow/contrib/estimator/python
diff --git a/tensorflow/contrib/cmake/python_protos.txt b/tensorflow/contrib/cmake/python_protos.txt
index 8a9c406d8b118c10ddcaafb0e4fc242aa79cdb57..c03c0c80fe62a4f95d0fcf240ee25725a19d86f0 100644
--- a/tensorflow/contrib/cmake/python_protos.txt
+++ b/tensorflow/contrib/cmake/python_protos.txt
@@ -4,6 +4,7 @@ tensorflow/python
 tensorflow/contrib/boosted_trees/proto
 tensorflow/contrib/cloud/kernels
 tensorflow/contrib/decision_trees/proto
+tensorflow/contrib/eager/proto
 tensorflow/contrib/gdr
 tensorflow/contrib/lite/toco
 tensorflow/contrib/mpi
diff --git a/tensorflow/contrib/cmake/tests/cuda/compatibility_test.c b/tensorflow/contrib/cmake/tests/cuda/compatibility_test.c
index 968ab13a0c43793341431248713f81010c87f148..9e355da33a7258119b6086216f5487d7ea94716c 100644
--- a/tensorflow/contrib/cmake/tests/cuda/compatibility_test.c
+++ b/tensorflow/contrib/cmake/tests/cuda/compatibility_test.c
@@ -1,3 +1,18 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
 // This is a program to test if compiler is compatible with CUDA.
 #define __CUDACC__
 #include "crt/host_config.h"
diff --git a/tensorflow/contrib/cmake/tests/cuda/compatibility_test.cc b/tensorflow/contrib/cmake/tests/cuda/compatibility_test.cc
index 968ab13a0c43793341431248713f81010c87f148..beb574061bea8d04af8386223749677ae36a5d9b 100644
--- a/tensorflow/contrib/cmake/tests/cuda/compatibility_test.cc
+++ b/tensorflow/contrib/cmake/tests/cuda/compatibility_test.cc
@@ -1,3 +1,18 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+============================================================================*/
+
 // This is a program to test if compiler is compatible with CUDA.
 #define __CUDACC__
 #include "crt/host_config.h"
diff --git a/tensorflow/contrib/cmake/tf_core_cpu.cmake b/tensorflow/contrib/cmake/tf_core_cpu.cmake
index 96ac60d095dbc84470ff1be92f4bf52bb420fc52..a54cbff33b66d63d7229fa2f50b8a4ca962111ed 100644
--- a/tensorflow/contrib/cmake/tf_core_cpu.cmake
+++ b/tensorflow/contrib/cmake/tf_core_cpu.cmake
@@ -63,6 +63,12 @@ file(GLOB_RECURSE tf_core_cpu_exclude_srcs
     "${tensorflow_source_dir}/tensorflow/core/grappler/inputs/trivial_test_graph_input_yielder.h"
     "${tensorflow_source_dir}/tensorflow/core/grappler/inputs/trivial_test_graph_input_yielder.cc"
 )
+file(GLOB_RECURSE tf_core_cpu_whitelisted_srcs
+    "${tensorflow_source_dir}/tensorflow/core/common_runtime/gpu/gpu_id.h"
+    "${tensorflow_source_dir}/tensorflow/core/common_runtime/gpu/gpu_id.cc"
+    "${tensorflow_source_dir}/tensorflow/core/common_runtime/gpu/gpu_id_manager.cc"
+)
+list(REMOVE_ITEM tf_core_cpu_exclude_srcs ${tf_core_cpu_whitelisted_srcs})
 list(REMOVE_ITEM tf_core_cpu_srcs ${tf_core_cpu_exclude_srcs})
 
 if (tensorflow_ENABLE_GPU)
@@ -79,6 +85,7 @@ if (tensorflow_ENABLE_GPU)
      "${tensorflow_source_dir}/tensorflow/core/*test*.cc"
   )
   list(REMOVE_ITEM tf_core_gpu_srcs ${tf_core_gpu_exclude_srcs})
+  list(REMOVE_ITEM tf_core_gpu_srcs ${tf_core_cpu_whitelisted_srcs})
   list(APPEND tf_core_cpu_srcs ${tf_core_gpu_srcs})
 endif()
 
diff --git a/tensorflow/contrib/cmake/tf_core_kernels.cmake b/tensorflow/contrib/cmake/tf_core_kernels.cmake
index f219d5eb577afa9edaadca09aef9869c81d2bd87..998f99ecc19f88921dce14fde892912fb699ad08 100644
--- a/tensorflow/contrib/cmake/tf_core_kernels.cmake
+++ b/tensorflow/contrib/cmake/tf_core_kernels.cmake
@@ -71,6 +71,8 @@ if(tensorflow_BUILD_CONTRIB_KERNELS)
       "${tensorflow_source_dir}/tensorflow/contrib/cudnn_rnn/ops/cudnn_rnn_ops.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/data/kernels/ignore_errors_dataset_op.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/data/kernels/prefetching_kernels.cc"
+      "${tensorflow_source_dir}/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc"
+      "${tensorflow_source_dir}/tensorflow/contrib/data/kernels/unique_dataset_op.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/data/ops/dataset_ops.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/factorization/kernels/clustering_ops.cc"
       "${tensorflow_source_dir}/tensorflow/contrib/factorization/kernels/masked_matmul_ops.cc"
diff --git a/tensorflow/contrib/crf/python/ops/crf.py b/tensorflow/contrib/crf/python/ops/crf.py
index faa78769b98699af59047aed2865771120110fc2..1233c8f251c404c57d9e2b38993e7a386b1e6ceb 100644
--- a/tensorflow/contrib/crf/python/ops/crf.py
+++ b/tensorflow/contrib/crf/python/ops/crf.py
@@ -105,8 +105,8 @@ def crf_sequence_score(inputs, tag_indices, sequence_lengths,
   return utils.smart_cond(
       pred=math_ops.equal(inputs.shape[1].value or array_ops.shape(inputs)[1],
                           1),
-      fn1=_single_seq_fn,
-      fn2=_multi_seq_fn)
+      true_fn=_single_seq_fn,
+      false_fn=_multi_seq_fn)
 
 
 def crf_log_norm(inputs, sequence_lengths, transition_params):
@@ -511,7 +511,7 @@ def crf_decode(potentials, transition_params, sequence_length):
     return decode_tags, best_score
 
   return utils.smart_cond(
-      pred=math_ops.equal(
-          potentials.shape[1].value or array_ops.shape(potentials)[1], 1),
-      fn1=_single_seq_fn,
-      fn2=_multi_seq_fn)
+      pred=math_ops.equal(potentials.shape[1].value or
+                          array_ops.shape(potentials)[1], 1),
+      true_fn=_single_seq_fn,
+      false_fn=_multi_seq_fn)
diff --git a/tensorflow/contrib/data/__init__.py b/tensorflow/contrib/data/__init__.py
index fcdccdd26ca1824bf13f1fd0cfd80b20ca8a10c3..1777727de8720face9acacdaee9865a8475f44cc 100644
--- a/tensorflow/contrib/data/__init__.py
+++ b/tensorflow/contrib/data/__init__.py
@@ -25,6 +25,7 @@ See the @{$datasets$Importing Data} Programmer's Guide for an overview.
 @@Counter
 
 @@batch_and_drop_remainder
+@@bucket_by_sequence_length
 @@dense_to_sparse_batch
 @@enumerate_dataset
 @@group_by_window
@@ -58,6 +59,7 @@ from tensorflow.contrib.data.python.ops.counter import Counter
 from tensorflow.contrib.data.python.ops.enumerate_ops import enumerate_dataset
 from tensorflow.contrib.data.python.ops.error_ops import ignore_errors
 from tensorflow.contrib.data.python.ops.get_single_element import get_single_element
+from tensorflow.contrib.data.python.ops.grouping import bucket_by_sequence_length
 from tensorflow.contrib.data.python.ops.grouping import group_by_window
 from tensorflow.contrib.data.python.ops.interleave_ops import parallel_interleave
 from tensorflow.contrib.data.python.ops.interleave_ops import sloppy_interleave
diff --git a/tensorflow/contrib/data/kernels/BUILD b/tensorflow/contrib/data/kernels/BUILD
index 56471911c5c0d1c1825955c67997b5bbc0786463..9bd6a42da2d93263e84a759cffdc5a9e8f9742fd 100644
--- a/tensorflow/contrib/data/kernels/BUILD
+++ b/tensorflow/contrib/data/kernels/BUILD
@@ -28,11 +28,33 @@ cc_library(
     alwayslink = 1,
 )
 
+cc_library(
+    name = "threadpool_dataset_op",
+    srcs = ["threadpool_dataset_op.cc"],
+    deps = [
+        "//tensorflow/core:framework_headers_lib",
+        "//third_party/eigen3",
+        "@protobuf_archive//:protobuf_headers",
+    ],
+)
+
+cc_library(
+    name = "unique_dataset_op",
+    srcs = ["unique_dataset_op.cc"],
+    deps = [
+        "//tensorflow/core:framework_headers_lib",
+        "//third_party/eigen3",
+        "@protobuf_archive//:protobuf_headers",
+    ],
+)
+
 cc_library(
     name = "dataset_kernels",
     deps = [
         ":ignore_errors_dataset_op",
         ":prefetching_kernels",
+        ":threadpool_dataset_op",
+        ":unique_dataset_op",
         "//tensorflow/core:framework_headers_lib",
         "//third_party/eigen3",
         "@protobuf_archive//:protobuf_headers",
diff --git a/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc b/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4b3edde85fc755f1c7694a555b867317e81f149d
--- /dev/null
+++ b/tensorflow/contrib/data/kernels/threadpool_dataset_op.cc
@@ -0,0 +1,197 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/dataset.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/resource_mgr.h"
+#include "tensorflow/core/lib/core/threadpool.h"
+
+namespace tensorflow {
+namespace {
+
+class ThreadPoolResource : public ResourceBase {
+ public:
+  ThreadPoolResource(Env* env, const ThreadOptions& thread_options,
+                     const string& name, int num_threads, bool low_latency_hint)
+      : thread_pool_(env, thread_options, name, num_threads, low_latency_hint) {
+  }
+
+  // Schedules fn() for execution in the pool of threads.
+  void Schedule(std::function<void()> fn) {
+    thread_pool_.Schedule(std::move(fn));
+  }
+
+  string DebugString() override { return "ThreadPoolResource"; }
+
+ private:
+  thread::ThreadPool thread_pool_;
+};
+
+// Creates a handle to a ThreadPool resource. Note that we don't use
+// ResourceOpKernel here because the ThreadPoolResource constructor requires
+// access to `OpKernelContext::env()`, which isn't provided by
+// `ResourceOpKernel<T>::CreateResource()`.
+class ThreadPoolHandleOp : public OpKernel {
+ public:
+  explicit ThreadPoolHandleOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("display_name", &display_name_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("num_threads", &num_threads_));
+    OP_REQUIRES(
+        ctx, num_threads_ > 0,
+        errors::InvalidArgument("`num_threads` must be greater than zero."));
+  }
+
+  // The resource is deleted from the resource manager only when it is private
+  // to the kernel. Ideally the resource should be deleted when it is no longer
+  // held by anyone, but that would break backward compatibility.
+  ~ThreadPoolHandleOp() override {
+    if (cinfo_.resource_is_private_to_kernel()) {
+      if (!cinfo_.resource_manager()
+               ->Delete<ThreadPoolResource>(cinfo_.container(), cinfo_.name())
+               .ok()) {
+        // Do nothing; the resource may have been deleted by a session reset.
+      }
+    }
+  }
+
+  void Compute(OpKernelContext* ctx) override LOCKS_EXCLUDED(mu_) {
+    mutex_lock l(mu_);
+    if (!initialized_) {
+      ResourceMgr* mgr = ctx->resource_manager();
+      OP_REQUIRES_OK(ctx, cinfo_.Init(mgr, def()));
+      ThreadPoolResource* resource;
+      OP_REQUIRES_OK(ctx, mgr->LookupOrCreate<ThreadPoolResource>(
+                              cinfo_.container(), cinfo_.name(), &resource,
+                              [this, ctx](ThreadPoolResource** ret)
+                                  EXCLUSIVE_LOCKS_REQUIRED(mu_) {
+                                    *ret = new ThreadPoolResource(
+                                        ctx->env(), {}, display_name_,
+                                        num_threads_,
+                                        false /* low_latency_hint */);
+                                    return Status::OK();
+                                  }));
+      initialized_ = true;
+    }
+    OP_REQUIRES_OK(ctx, MakeResourceHandleToOutput(
+                            ctx, 0, cinfo_.container(), cinfo_.name(),
+                            MakeTypeIndex<ThreadPoolResource>()));
+  }
+
+ private:
+  mutex mu_;
+  ContainerInfo cinfo_ GUARDED_BY(mu_);
+  bool initialized_ GUARDED_BY(mu_) = false;
+  string display_name_;
+  int num_threads_;
+};
+
+class ThreadPoolDatasetOp : public UnaryDatasetOpKernel {
+ public:
+  explicit ThreadPoolDatasetOp(OpKernelConstruction* ctx)
+      : UnaryDatasetOpKernel(ctx) {}
+
+  void MakeDataset(OpKernelContext* ctx, DatasetBase* input,
+                   DatasetBase** output) override {
+    ThreadPoolResource* threadpool_resource;
+    OP_REQUIRES_OK(ctx, LookupResource(ctx, HandleFromInput(ctx, 1),
+                                       &threadpool_resource));
+    core::ScopedUnref unref_iterator(threadpool_resource);
+
+    *output = new Dataset(ctx, input, threadpool_resource);
+  }
+
+ private:
+  class Dataset : public GraphDatasetBase {
+   public:
+    Dataset(OpKernelContext* ctx, const DatasetBase* input,
+            ThreadPoolResource* threadpool)
+        : GraphDatasetBase(ctx), input_(input), threadpool_(threadpool) {
+      input_->Ref();
+      threadpool_->Ref();
+    }
+
+    ~Dataset() override {
+      input_->Unref();
+      threadpool_->Unref();
+    }
+
+    std::unique_ptr<IteratorBase> MakeIterator(
+        const string& prefix) const override {
+      return std::unique_ptr<IteratorBase>(
+          new Iterator({this, strings::StrCat(prefix, "::ThreadPool")}));
+    }
+
+    const DataTypeVector& output_dtypes() const override {
+      return input_->output_dtypes();
+    }
+    const std::vector<PartialTensorShape>& output_shapes() const override {
+      return input_->output_shapes();
+    }
+
+    string DebugString() override { return "ThreadPoolDatasetOp::Dataset"; }
+
+   protected:
+    Status AsGraphDefInternal(OpKernelContext* ctx, DatasetGraphDefBuilder* b,
+                              Node** output) const override {
+      return errors::Unimplemented(
+          "Cannot currently serialize the thread pool for a "
+          "ThreadPoolDataset.");
+    }
+
+   private:
+    class Iterator : public DatasetIterator<Dataset> {
+     public:
+      explicit Iterator(const Params& params)
+          : DatasetIterator<Dataset>(params),
+            input_impl_(params.dataset->input_->MakeIterator(params.prefix)) {}
+
+      Status GetNextInternal(IteratorContext* ctx,
+                             std::vector<Tensor>* out_tensors,
+                             bool* end_of_sequence) override {
+        ThreadPoolResource* pool = dataset()->threadpool_;
+        IteratorContext::Params params;
+        params.env = ctx->env();
+        params.runner = [pool](std::function<void()> c) {
+          pool->Schedule(std::move(c));
+        };
+        params.stats_aggregator_getter = [ctx]() {
+          return ctx->stats_aggregator();
+        };
+        params.lib = ctx->lib();
+        params.function_library = ctx->function_library();
+        params.allocator_getter = [ctx](AllocatorAttributes attrs) {
+          return ctx->allocator(attrs);
+        };
+        IteratorContext threadpool_ctx(params);
+        return input_impl_->GetNext(&threadpool_ctx, out_tensors,
+                                    end_of_sequence);
+      }
+
+     private:
+      std::unique_ptr<IteratorBase> input_impl_;
+    };
+
+    const DatasetBase* const input_;
+    ThreadPoolResource* const threadpool_;
+  };
+};
+
+REGISTER_KERNEL_BUILDER(Name("ThreadPoolHandle").Device(DEVICE_CPU),
+                        ThreadPoolHandleOp);
+REGISTER_KERNEL_BUILDER(Name("ThreadPoolDataset").Device(DEVICE_CPU),
+                        ThreadPoolDatasetOp);
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/unique_dataset_op.cc b/tensorflow/contrib/data/kernels/unique_dataset_op.cc
similarity index 99%
rename from tensorflow/core/kernels/data/unique_dataset_op.cc
rename to tensorflow/contrib/data/kernels/unique_dataset_op.cc
index 7726ee0edf71b34cb65fe5fceb2b60dd30bb58e2..69fbb0fcdcce87951d2c9b84210fda378081b103 100644
--- a/tensorflow/core/kernels/data/unique_dataset_op.cc
+++ b/tensorflow/contrib/data/kernels/unique_dataset_op.cc
@@ -12,9 +12,9 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
+#include "tensorflow/core/framework/dataset.h"
 #include "tensorflow/core/framework/partial_tensor_shape.h"
 #include "tensorflow/core/framework/tensor.h"
-#include "tensorflow/core/kernels/data/dataset.h"
 #include "tensorflow/core/lib/hash/hash.h"
 
 namespace tensorflow {
diff --git a/tensorflow/contrib/data/ops/dataset_ops.cc b/tensorflow/contrib/data/ops/dataset_ops.cc
index 289ffa1d9c29092cdf434e86ed5553ff9644d43e..a4c1212da11a2410461a120ed5f7116e80e4b903 100644
--- a/tensorflow/contrib/data/ops/dataset_ops.cc
+++ b/tensorflow/contrib/data/ops/dataset_ops.cc
@@ -27,6 +27,16 @@ REGISTER_OP("IgnoreErrorsDataset")
 Creates a dataset that contains the elements of `input_dataset` ignoring errors.
 )doc");
 
+REGISTER_OP("UniqueDataset")
+    .Input("input_dataset: variant")
+    .Output("handle: variant")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn(shape_inference::ScalarShape)
+    .Doc(R"doc(
+Creates a dataset that contains the unique elements of `input_dataset`.
+)doc");
+
 REGISTER_OP("FunctionBufferingResource")
     .Input("string_arg: string")
     .Input("target_device: string")
@@ -65,4 +75,33 @@ output: A list of return values.
 output_types: The type list for the return values.
 )doc");
 
+REGISTER_OP("ThreadPoolDataset")
+    .Input("input_dataset: variant")
+    .Input("thread_pool: resource")
+    .Output("handle: variant")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetShapeFn(shape_inference::ScalarShape)
+    .Doc(R"doc(
+Creates a dataset that uses a custom thread pool to compute `input_dataset`.
+
+thread_pool: A resource produced by the ThreadPoolHandle op.
+)doc");
+
+REGISTER_OP("ThreadPoolHandle")
+    .Output("handle: resource")
+    .SetShapeFn(shape_inference::ScalarShape)
+    .Attr("num_threads: int")
+    .Attr("display_name: string")
+    .Attr("container: string = ''")
+    .Attr("shared_name: string = ''")
+    .Doc(R"doc(
+Creates a custom thread pool with the given number of threads.
+
+handle: A resource that can be consumed by one or more ThreadPoolDataset ops.
+num_threads: The number of threads in the thread pool.
+display_name: A human-readable name for the threads that may be visible in
+  some visualizations.
+)doc");
+
 }  // namespace tensorflow
diff --git a/tensorflow/contrib/data/python/kernel_tests/BUILD b/tensorflow/contrib/data/python/kernel_tests/BUILD
index e51d57cc896dc32d8e11912cd89f34a04a858c78..10cb05ece1b3dd59527160ba6857df27c57711d1 100644
--- a/tensorflow/contrib/data/python/kernel_tests/BUILD
+++ b/tensorflow/contrib/data/python/kernel_tests/BUILD
@@ -168,6 +168,7 @@ py_test(
     srcs = ["interleave_dataset_op_test.py"],
     srcs_version = "PY2AND3",
     tags = [
+        "no_cuda_on_cpu_tap",
         "no_oss",
         "no_pip",
     ],
@@ -419,6 +420,20 @@ py_test(
     ],
 )
 
+py_test(
+    name = "threadpool_dataset_ops_test",
+    size = "small",
+    srcs = ["threadpool_dataset_ops_test.py"],
+    srcs_version = "PY2AND3",
+    tags = ["no_pip"],
+    deps = [
+        "//tensorflow/contrib/data/python/ops:dataset_ops",
+        "//tensorflow/contrib/data/python/ops:transformation_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:errors",
+    ],
+)
+
 py_test(
     name = "unique_dataset_op_test",
     size = "small",
diff --git a/tensorflow/contrib/data/python/kernel_tests/bucketing_test.py b/tensorflow/contrib/data/python/kernel_tests/bucketing_test.py
index f1b494e1a620992365ed75613b508e32f94b40a4..94f800e8a58bc34eef3034cd976b931528c01940 100644
--- a/tensorflow/contrib/data/python/kernel_tests/bucketing_test.py
+++ b/tensorflow/contrib/data/python/kernel_tests/bucketing_test.py
@@ -17,6 +17,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import random
+
 import numpy as np
 
 from tensorflow.contrib.data.python.kernel_tests import dataset_serialization_test_base
@@ -379,5 +381,93 @@ class BucketTest(test.TestCase):
       self.assertEqual(batches, 15)
 
 
+class BucketBySequenceLength(test.TestCase):
+
+  def testBucket(self):
+
+    boundaries = [10, 20, 30]
+    batch_sizes = [10, 8, 4, 2]
+    lengths = [8, 13, 25, 35]
+
+    def element_gen():
+      # Produce 1 batch for each bucket
+      elements = []
+      for batch_size, length in zip(batch_sizes, lengths):
+        for _ in range(batch_size):
+          elements.append([1] * length)
+      random.shuffle(elements)
+      for el in elements:
+        yield (el,)
+
+    element_len = lambda el: array_ops.shape(el)[0]
+    dataset = dataset_ops.Dataset.from_generator(
+        element_gen, (dtypes.int64,), ([None],)).apply(
+            grouping.bucket_by_sequence_length(
+                element_len, boundaries, batch_sizes))
+    batch, = dataset.make_one_shot_iterator().get_next()
+
+    with self.test_session() as sess:
+      batches = []
+      for _ in range(4):
+        batches.append(sess.run(batch))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(batch)
+    batch_sizes_val = []
+    lengths_val = []
+    for batch in batches:
+      batch_size = batch.shape[0]
+      length = batch.shape[1]
+      batch_sizes_val.append(batch_size)
+      lengths_val.append(length)
+    self.assertEqual(sum(batch_sizes_val), sum(batch_sizes))
+    self.assertEqual(sorted(batch_sizes), sorted(batch_sizes_val))
+    self.assertEqual(sorted(lengths), sorted(lengths_val))
+
+  def testPadToBoundary(self):
+
+    boundaries = [10, 20, 30]
+    batch_sizes = [10, 8, 4, 2]
+    lengths = [8, 13, 25]
+
+    def element_gen():
+      # Produce 1 batch for each bucket
+      elements = []
+      for batch_size, length in zip(batch_sizes[:-1], lengths):
+        for _ in range(batch_size):
+          elements.append([1] * length)
+      random.shuffle(elements)
+      for el in elements:
+        yield (el,)
+      for _ in range(batch_sizes[-1]):
+        el = [1] * (boundaries[-1] + 5)
+        yield (el,)
+
+    element_len = lambda el: array_ops.shape(el)[0]
+    dataset = dataset_ops.Dataset.from_generator(
+        element_gen, (dtypes.int64,), ([None],)).apply(
+            grouping.bucket_by_sequence_length(
+                element_len, boundaries, batch_sizes,
+                pad_to_bucket_boundary=True))
+    batch, = dataset.make_one_shot_iterator().get_next()
+
+    with self.test_session() as sess:
+      batches = []
+      for _ in range(3):
+        batches.append(sess.run(batch))
+      with self.assertRaisesOpError("bucket_boundaries"):
+        sess.run(batch)
+    batch_sizes_val = []
+    lengths_val = []
+    for batch in batches:
+      batch_size = batch.shape[0]
+      length = batch.shape[1]
+      batch_sizes_val.append(batch_size)
+      lengths_val.append(length)
+    batch_sizes = batch_sizes[:-1]
+    self.assertEqual(sum(batch_sizes_val), sum(batch_sizes))
+    self.assertEqual(sorted(batch_sizes), sorted(batch_sizes_val))
+    self.assertEqual(sorted(boundaries), sorted(lengths_val))
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/data/python/kernel_tests/threadpool_dataset_ops_test.py b/tensorflow/contrib/data/python/kernel_tests/threadpool_dataset_ops_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..9167cb3379bba5cb1ba76a96549395c45dca9e35
--- /dev/null
+++ b/tensorflow/contrib/data/python/kernel_tests/threadpool_dataset_ops_test.py
@@ -0,0 +1,77 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the experimental thread pool dataset ops."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import threading
+
+import numpy as np
+
+from tensorflow.contrib.data.python.ops import threadpool
+from tensorflow.contrib.data.python.ops import unique
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+from tensorflow.python.ops import script_ops
+from tensorflow.python.platform import test
+
+
+class OverrideThreadpoolDatasetTest(test.TestCase):
+
+  def testNumThreads(self):
+
+    def get_thread_id(_):
+      # Python creates a dummy thread object to represent the current
+      # thread when called from an "alien" thread (such as a
+      # `PrivateThreadPool` thread in this case). It does not include
+      # the TensorFlow-given display name, but it has a unique
+      # identifier that maps one-to-one with the underlying OS thread.
+      return np.array(threading.current_thread().ident).astype(np.int64)
+
+    for num_threads in [1, 2, 4, 8, 16]:
+
+      dataset = (
+          dataset_ops.Dataset.range(1000).map(
+              lambda x: script_ops.py_func(get_thread_id, [x], dtypes.int64),
+              num_parallel_calls=32).apply(unique.unique()))
+
+      dataset = threadpool.override_threadpool(
+          dataset,
+          threadpool.PrivateThreadPool(
+              num_threads, display_name="private_thread_pool_%d" % num_threads))
+
+      iterator = dataset.make_initializable_iterator()
+      next_element = iterator.get_next()
+
+      with self.test_session() as sess:
+        sess.run(iterator.initializer)
+        thread_ids = []
+        try:
+          while True:
+            thread_ids.append(sess.run(next_element))
+        except errors.OutOfRangeError:
+          pass
+        self.assertEqual(len(thread_ids), len(set(thread_ids)))
+        self.assertGreater(len(thread_ids), 0)
+        # NOTE(mrry): We don't control the thread pool scheduling, and
+        # so cannot guarantee that all of the threads in the pool will
+        # perform work.
+        self.assertLessEqual(len(thread_ids), num_threads)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/data/python/ops/BUILD b/tensorflow/contrib/data/python/ops/BUILD
index b488357f226d0922bba3799cc1f4b5c75e2e8328..16fe31675f618e7a3e4b86491722267fe33e91e1 100644
--- a/tensorflow/contrib/data/python/ops/BUILD
+++ b/tensorflow/contrib/data/python/ops/BUILD
@@ -105,6 +105,7 @@ py_library(
         "resampling.py",
         "scan_ops.py",
         "stats_ops.py",
+        "threadpool.py",
         "unique.py",
     ],
     srcs_version = "PY2AND3",
@@ -120,10 +121,12 @@ py_library(
         "//tensorflow/python:logging_ops",
         "//tensorflow/python:math_ops",
         "//tensorflow/python:random_ops",
+        "//tensorflow/python:resource_variable_ops",
         "//tensorflow/python:tensor_shape",
         "//tensorflow/python:tensor_util",
         "//tensorflow/python:util",
         "//tensorflow/python/data/ops:dataset_ops",
+        "//tensorflow/python/data/ops:readers",
         "//tensorflow/python/data/util:convert",
         "//tensorflow/python/data/util:nest",
         "//tensorflow/python/data/util:sparse",
diff --git a/tensorflow/contrib/data/python/ops/dataset_ops.py b/tensorflow/contrib/data/python/ops/dataset_ops.py
deleted file mode 100644
index ff15c4451ad987bcd77dbdd022a1c070056c47e1..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/data/python/ops/dataset_ops.py
+++ /dev/null
@@ -1,691 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Python wrappers for Datasets and Iterators."""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.contrib.data.python.ops import batching
-from tensorflow.contrib.data.python.ops import enumerate_ops
-from tensorflow.contrib.data.python.ops import error_ops
-from tensorflow.contrib.data.python.ops import grouping
-from tensorflow.python.data.ops import dataset_ops
-from tensorflow.python.data.util import nest
-from tensorflow.python.ops import gen_dataset_ops
-from tensorflow.python.ops import gen_io_ops
-from tensorflow.python.util import deprecation
-
-
-class Dataset(dataset_ops.Dataset):
-  """Represents a potentially large set of elements.
-
-  A `Dataset` can be used to represent an input pipeline as a
-  collection of elements (nested structures of tensors) and a "logical
-  plan" of transformations that act on those elements.
-  """
-
-  def __init__(self, dataset):
-    super(Dataset, self).__init__()
-    self._dataset = dataset
-
-  @deprecation.deprecated(None, "Use `ds._as_variant_tensor()`.")
-  def make_dataset_resource(self):
-    return self._as_variant_tensor()
-
-  def _as_variant_tensor(self):
-    return self._dataset._as_variant_tensor()  # pylint: disable=protected-access
-
-  @property
-  def output_classes(self):
-    return self._dataset.output_classes
-
-  @property
-  def output_shapes(self):
-    return self._dataset.output_shapes
-
-  @property
-  def output_types(self):
-    return self._dataset.output_types
-
-  @staticmethod
-  @deprecation.deprecated(None, "Use `tf.data.Dataset.from_tensors()`.")
-  def from_tensors(tensors):
-    """Creates a `Dataset` with a single element, comprising the given tensors.
-
-    Args:
-      tensors: A nested structure of tensors.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.TensorDataset(tensors))
-
-  @staticmethod
-  @deprecation.deprecated(None, "Use `tf.data.Dataset.from_tensor_slices()`.")
-  def from_tensor_slices(tensors):
-    """Creates a `Dataset` whose elements are slices of the given tensors.
-
-    Args:
-      tensors: A nested structure of tensors, each having the same size in the
-        0th dimension.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.TensorSliceDataset(tensors))
-
-  @staticmethod
-  @deprecation.deprecated(None,
-                          "Use `tf.data.Dataset.from_sparse_tensor_slices()`.")
-  def from_sparse_tensor_slices(sparse_tensor):
-    """Splits each rank-N `tf.SparseTensor` in this dataset row-wise.
-
-    Args:
-      sparse_tensor: A `tf.SparseTensor`.
-
-    Returns:
-      A `Dataset` of rank-(N-1) sparse tensors.
-    """
-    return Dataset(dataset_ops.SparseTensorSliceDataset(sparse_tensor))
-
-  @staticmethod
-  @deprecation.deprecated(None, "Use `tf.data.Dataset.from_generator()`.")
-  def from_generator(generator, output_types, output_shapes=None):
-    """Creates a `Dataset` whose elements are generated by `generator`.
-
-    The `generator` argument must be a callable object that returns
-    an object that support the `iter()` protocol (e.g. a generator function).
-    The elements generated by `generator` must be compatible with the given
-    `output_types` and (optional) `output_shapes` arguments.
-
-    For example:
-
-    ```python
-    import itertools
-
-    def gen():
-      for i in itertools.count(1):
-        yield (i, [1] * i)
-
-    ds = Dataset.from_generator(
-        gen, (tf.int64, tf.int64), (tf.TensorShape([]), tf.TensorShape([None])))
-    value = ds.make_one_shot_iterator().get_next()
-
-    sess.run(value)  # (1, array([1]))
-    sess.run(value)  # (2, array([1, 1]))
-    ```
-
-    Args:
-      generator: A callable object that takes no arguments and returns an
-        object that supports the `iter()` protocol.
-      output_types: A nested structure of `tf.DType` objects corresponding to
-        each component of an element yielded by `generator`.
-      output_shapes: (Optional.) A nested structure of `tf.TensorShape`
-        objects corresponding to each component of an element yielded by
-        `generator`.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.Dataset.from_generator(
-        generator, output_types, output_shapes))
-
-  @staticmethod
-  @deprecation.deprecated(None, "Use `tf.data.Dataset.range()`.")
-  def range(*args):
-    """Creates a `Dataset` of a step-separated range of values.
-
-    For example:
-
-    ```python
-    Dataset.range(5) == [0, 1, 2, 3, 4]
-    Dataset.range(2, 5) == [2, 3, 4]
-    Dataset.range(1, 5, 2) == [1, 3]
-    Dataset.range(1, 5, -2) == []
-    Dataset.range(5, 1) == []
-    Dataset.range(5, 1, -2) == [5, 3]
-    ```
-
-    Args:
-      *args: follow same semantics as python's xrange.
-        len(args) == 1 -> start = 0, stop = args[0], step = 1
-        len(args) == 2 -> start = args[0], stop = args[1], step = 1
-        len(args) == 3 -> start = args[0], stop = args[1, stop = args[2]
-
-    Returns:
-      A `RangeDataset`.
-
-    Raises:
-      ValueError: if len(args) == 0.
-    """
-    return Dataset(dataset_ops.RangeDataset(*args))
-
-  @staticmethod
-  @deprecation.deprecated(None, "Use `tf.data.Dataset.zip()`.")
-  def zip(datasets):
-    """Creates a `Dataset` by zipping together the given datasets.
-
-    This method has similar semantics to the built-in `zip()` function
-    in Python, with the main difference being that the `datasets`
-    argument can be an arbitrary nested structure of `Dataset` objects.
-    For example:
-
-    ```python
-    # NOTE: The following examples use `{ ... }` to represent the
-    # contents of a dataset.
-    a = { 1, 2, 3 }
-    b = { 4, 5, 6 }
-    c = { (7, 8), (9, 10), (11, 12) }
-    d = { 13, 14 }
-
-    # The nested structure of the `datasets` argument determines the
-    # structure of elements in the resulting dataset.
-    Dataset.zip((a, b)) == { (1, 4), (2, 5), (3, 6) }
-    Dataset.zip((b, a)) == { (4, 1), (5, 2), (6, 3) }
-
-    # The `datasets` argument may contain an arbitrary number of
-    # datasets.
-    Dataset.zip((a, b, c)) == { (1, 4, (7, 8)),
-                                (2, 5, (9, 10)),
-                                (3, 6, (11, 12)) }
-
-    # The number of elements in the resulting dataset is the same as
-    # the size of the smallest dataset in `datasets`.
-    Dataset.zip((a, d)) == { (1, 13), (2, 14) }
-    ```
-
-    Args:
-      datasets: A nested structure of datasets.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.ZipDataset(datasets))
-
-  def concatenate(self, dataset):
-    """Creates a `Dataset` by concatenating given dataset with this dataset.
-
-    ```python
-    # NOTE: The following examples use `{ ... }` to represent the
-    # contents of a dataset.
-    a = { 1, 2, 3 }
-    b = { 4, 5, 6, 7 }
-
-    # Input dataset and dataset to be concatenated should have same
-    # nested structures and output types.
-    # c = { (8, 9), (10, 11), (12, 13) }
-    # d = { 14.0, 15.0, 16.0 }
-    # a.concatenate(c) and a.concatenate(d) would result in error.
-
-    a.concatenate(b) == { 1, 2, 3, 4, 5, 6, 7 }
-    ```
-
-    Args:
-      dataset: `Dataset` to be concatenated.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.ConcatenateDataset(self._dataset, dataset))
-
-  def prefetch(self, buffer_size):
-    """Creates a `Dataset` that prefetches elements from this dataset.
-
-    Args:
-      buffer_size: A `tf.int64` scalar `tf.Tensor`, representing the
-        maximum number elements that will be buffered when prefetching.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.PrefetchDataset(self._dataset, buffer_size))
-
-  @staticmethod
-  @deprecation.deprecated(None, "Use `tf.data.Dataset.list_files()`.")
-  def list_files(file_pattern):
-    """A dataset of all files matching a pattern.
-
-    Example:
-      If we had the following files on our filesystem:
-        - /path/to/dir/a.txt
-        - /path/to/dir/b.py
-        - /path/to/dir/c.py
-      If we pass "/path/to/dir/*.py" as the directory, the dataset would
-      produce:
-        - /path/to/dir/b.py
-        - /path/to/dir/c.py
-
-    Args:
-      file_pattern: A string or scalar string `tf.Tensor`, representing
-        the filename pattern that will be matched.
-
-    Returns:
-     A `Dataset` of strings corresponding to file names.
-    """
-    return Dataset.from_tensor_slices(gen_io_ops.matching_files(file_pattern))
-
-  def repeat(self, count=None):
-    """Repeats this dataset `count` times.
-
-    Args:
-      count: (Optional.) A `tf.int64` scalar `tf.Tensor`, representing the
-        number of times the elements of this dataset should be repeated. The
-        default behavior (if `count` is `None` or `-1`) is for the elements to
-        be repeated indefinitely.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.RepeatDataset(self._dataset, count))
-
-  @deprecation.deprecated(
-      None, "Use `ds.apply(tf.contrib.data.enumerate_dataset())`.")
-  def enumerate(self, start=0):
-    """Deprecated: Use `Dataset.apply(tf.contrib.data.enumerate_dataset(..)`."""
-
-    return self.apply(enumerate_ops.enumerate_dataset(start))
-
-  def shuffle(self, buffer_size, seed=None):
-    """Randomly shuffles the elements of this dataset.
-
-    Args:
-      buffer_size: A `tf.int64` scalar `tf.Tensor`, representing the
-        number of elements from this dataset from which the new
-        dataset will sample.
-      seed: (Optional.) A `tf.int64` scalar `tf.Tensor`, representing the
-        random seed that will be used to create the distribution. See
-        @{tf.set_random_seed} for behavior.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.ShuffleDataset(self._dataset, buffer_size, seed))
-
-  def cache(self, filename=""):
-    """Caches the elements in this dataset.
-
-    Args:
-      filename: A `tf.string` scalar `tf.Tensor`, representing the name of a
-        directory on the filesystem to use for caching tensors in this Dataset.
-        If a filename is not provided, the dataset will be cached in memory.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.CacheDataset(self._dataset, filename))
-
-  def take(self, count):
-    """Creates a `Dataset` with at most `count` elements from this dataset.
-
-    Args:
-      count: A `tf.int64` scalar `tf.Tensor`, representing the number of
-        elements of this dataset that should be taken to form the new dataset.
-        If `count` is -1, or if `count` is greater than the size of this
-        dataset, the new dataset will contain all elements of this dataset.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.TakeDataset(self._dataset, count))
-
-  def skip(self, count):
-    """Creates a `Dataset` that skips `count` elements from this dataset.
-
-    Args:
-      count: A `tf.int64` scalar `tf.Tensor`, representing the number
-        of elements of this dataset that should be skipped to form the
-        new dataset.  If `count` is greater than the size of this
-        dataset, the new dataset will contain no elements.  If `count`
-        is -1, skips the entire dataset.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.SkipDataset(self._dataset, count))
-
-  def shard(self, num_shards, index):
-    """Creates a `Dataset` that includes only 1/`num_shards` of this dataset.
-
-    This dataset operator is very useful when running distributed training, as
-    it allows each worker to read a unique subset.
-
-    When reading a single input file, you can skip elements as follows:
-
-    ```python
-    d = tf.data.TFRecordDataset(FLAGS.input_file)
-    d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
-    d = d.repeat(FLAGS.num_epochs)
-    d = d.shuffle(FLAGS.shuffle_buffer_size)
-    d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
-    ```
-
-    Important caveats:
-
-    - Be sure to shard before you use any randomizing operator (such as
-      shuffle).
-    - Generally it is best if the shard operator is used early in the dataset
-      pipeline. For example, when reading from a set of TFRecord files, shard
-      before converting the dataset to input samples. This avoids reading every
-      file on every worker. The following is an example of an efficient
-      sharding strategy within a complete pipeline:
-
-    ```python
-    d = tf.data.Dataset.list_files(FLAGS.pattern)
-    d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
-    d = d.repeat(FLAGS.num_epochs)
-    d = d.shuffle(FLAGS.shuffle_buffer_size)
-    d = d.interleave(tf.data.TFRecordDataset,
-                     cycle_length=FLAGS.num_readers, block_length=1)
-    d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
-    ```
-
-    Args:
-      num_shards: A `tf.int64` scalar `tf.Tensor`, representing the number of
-        shards operating in parallel.
-      index: A `tf.int64` scalar `tf.Tensor`, representing the worker index.
-
-    Returns:
-      A `Dataset`.
-
-    Raises:
-      ValueError: if `num_shards` or `index` are illegal values. Note: error
-        checking is done on a best-effort basis, and aren't guaranteed to be
-        caught upon dataset creation. (e.g. providing in a placeholder tensor
-        bypasses the early checking, and will instead result in an error during
-        a session.run call.)
-    """
-    return Dataset(self._dataset.shard(num_shards, index))
-
-  @deprecation.deprecated(
-      None, "Use `ds.apply(tf.contrib.data.ignore_errors())`.")
-  def ignore_errors(self):
-    """Deprecated: Use `Dataset.apply(tf.contrib.data.ignore_errors())`."""
-
-    return self.apply(error_ops.ignore_errors())
-
-  def batch(self, batch_size):
-    """Combines consecutive elements of this dataset into batches.
-
-    Args:
-      batch_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
-        consecutive elements of this dataset to combine in a single batch.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.BatchDataset(self._dataset, batch_size))
-
-  def padded_batch(self, batch_size, padded_shapes, padding_values=None):
-    """Combines consecutive elements of this dataset into padded batches.
-
-    Like `Dataset.dense_to_sparse_batch()`, this method combines
-    multiple consecutive elements of this dataset, which might have
-    different shapes, into a single element. The tensors in the
-    resulting element have an additional outer dimension, and are
-    padded to the respective shape in `padded_shapes`.
-
-    Args:
-      batch_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
-        consecutive elements of this dataset to combine in a single batch.
-      padded_shapes: A nested structure of `tf.TensorShape` or
-        `tf.int64` vector tensor-like objects representing the shape
-        to which the respective component of each input element should
-        be padded prior to batching. Any unknown dimensions
-        (e.g. `tf.Dimension(None)` in a `tf.TensorShape` or `-1` in a
-        tensor-like object) will be padded to the maximum size of that
-        dimension in each batch.
-      padding_values: (Optional.) A nested structure of scalar-shaped
-        `tf.Tensor`, representing the padding values to use for the
-        respective components.  Defaults are `0` for numeric types and
-        the empty string for string types.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(
-        dataset_ops.PaddedBatchDataset(self._dataset, batch_size, padded_shapes,
-                                       padding_values))
-
-  @deprecation.deprecated(
-      None, "Use `ds.apply(tf.contrib.data.dense_to_sparse_batch())`.")
-  def dense_to_sparse_batch(self, batch_size, row_shape):
-    """Use: `Dataset.apply(tf.contrib.data.dense_to_sparse_batch(...))`."""
-
-    return self.apply(batching.dense_to_sparse_batch(batch_size, row_shape))
-
-  @deprecation.deprecated(
-      None, "Use `ds.apply(tf.contrib.data.group_by_window())`.")
-  def group_by_window(self, key_func, reduce_func, window_size):
-    """Deprecated: Use `Dataset.apply(tf.contrib.data.group_by_window(...))`."""
-
-    return self.apply(
-        grouping.group_by_window(key_func, reduce_func, window_size))
-
-  @deprecation.deprecated_args(
-      None,
-      "Replace `num_threads=T` with `num_parallel_calls=T`. Replace "
-      "`output_buffer_size=N` with `ds.prefetch(N)` on the returned dataset.",
-      "num_threads", "output_buffer_size")
-  def map(self,
-          map_func,
-          num_threads=None,
-          output_buffer_size=None,
-          num_parallel_calls=None):
-    """Maps `map_func` across this dataset.
-
-    Args:
-      map_func: A function mapping a nested structure of tensors (having
-        shapes and types defined by `self.output_shapes` and
-       `self.output_types`) to another nested structure of tensors.
-      num_threads: (Optional.) Deprecated, use `num_parallel_calls` instead.
-      output_buffer_size: (Optional.) A `tf.int64` scalar `tf.Tensor`,
-        representing the maximum number of processed elements that will be
-        buffered.
-      num_parallel_calls: (Optional.) A `tf.int32` scalar `tf.Tensor`,
-        representing the number elements to process in parallel. If not
-        specified, elements will be processed sequentially.
-
-    Returns:
-      A `Dataset`.
-    """
-    if num_threads is None and num_parallel_calls is None:
-      ret = Dataset(dataset_ops.MapDataset(self._dataset, map_func))
-    else:
-      if num_threads is None:
-        ret = Dataset(
-            dataset_ops.ParallelMapDataset(self._dataset, map_func,
-                                           num_parallel_calls))
-      else:
-        ret = Dataset(
-            dataset_ops.ParallelMapDataset(self._dataset, map_func,
-                                           num_threads))
-    if output_buffer_size is not None:
-      ret = ret.prefetch(output_buffer_size)
-    return ret
-
-  def flat_map(self, map_func):
-    """Maps `map_func` across this dataset and flattens the result.
-
-    Args:
-      map_func: A function mapping a nested structure of tensors (having shapes
-        and types defined by `self.output_shapes` and `self.output_types`) to a
-        `Dataset`.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.FlatMapDataset(self._dataset, map_func))
-
-  def interleave(self, map_func, cycle_length, block_length=1):
-    """Maps `map_func` across this dataset, and interleaves the results.
-
-    For example, you can use `Dataset.interleave()` to process many input files
-    concurrently:
-
-    ```python
-    # Preprocess 4 files concurrently, and interleave blocks of 16 records from
-    # each file.
-    filenames = ["/var/data/file1.txt", "/var/data/file2.txt", ...]
-    dataset = (Dataset.from_tensor_slices(filenames)
-               .interleave(lambda x:
-                   TextLineDataset(x).map(parse_fn, num_parallel_calls=1),
-                   cycle_length=4, block_length=16))
-    ```
-
-    The `cycle_length` and `block_length` arguments control the order in which
-    elements are produced. `cycle_length` controls the number of input elements
-    that are processed concurrently. If you set `cycle_length` to 1, this
-    transformation will handle one input element at a time, and will produce
-    identical results = to @{tf.data.Dataset.flat_map}. In general,
-    this transformation will apply `map_func` to `cycle_length` input elements,
-    open iterators on the returned `Dataset` objects, and cycle through them
-    producing `block_length` consecutive elements from each iterator, and
-    consuming the next input element each time it reaches the end of an
-    iterator.
-
-    For example:
-
-    ```python
-    # NOTE: The following examples use `{ ... }` to represent the
-    # contents of a dataset.
-    a = { 1, 2, 3, 4, 5 }
-
-    # NOTE: New lines indicate "block" boundaries.
-    a.interleave(lambda x: Dataset.from_tensors(x).repeat(6),
-                 cycle_length=2, block_length=4) == {
-        1, 1, 1, 1,
-        2, 2, 2, 2,
-        1, 1,
-        2, 2,
-        3, 3, 3, 3,
-        4, 4, 4, 4,
-        3, 3,
-        4, 4,
-        5, 5, 5, 5,
-        5, 5,
-    }
-    ```
-
-    NOTE: The order of elements yielded by this transformation is
-    deterministic, as long as `map_func` is a pure function. If
-    `map_func` contains any stateful operations, the order in which
-    that state is accessed is undefined.
-
-    Args:
-      map_func: A function mapping a nested structure of tensors (having shapes
-        and types defined by `self.output_shapes` and `self.output_types`) to a
-        `Dataset`.
-      cycle_length: The number of elements from this dataset that will be
-        processed concurrently.
-      block_length: The number of consecutive elements to produce from each
-        input element before cycling to another input element.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(
-        dataset_ops.InterleaveDataset(self._dataset, map_func, cycle_length,
-                                      block_length))
-
-  @deprecation.deprecated(None, "Use `ds.apply(tf.contrib.data.unbatch())`.")
-  def unbatch(self):
-    """Deprecated: Use `Dataset.apply(tf.contrib.data.unbatch()`."""
-
-    return self.apply(batching.unbatch())
-
-  def filter(self, predicate):
-    """Filters this dataset according to `predicate`.
-
-    Args:
-      predicate: A function mapping a nested structure of tensors (having shapes
-        and types defined by `self.output_shapes` and `self.output_types`) to a
-        scalar `tf.bool` tensor.
-
-    Returns:
-      A `Dataset`.
-    """
-    return Dataset(dataset_ops.FilterDataset(self._dataset, predicate))
-
-  def apply(self, transformation_func):
-    """Apply a transformation function to this dataset.
-
-    `apply` enables chaining of custom `Dataset` transformations, which are
-    represented as functions that take one `Dataset` argument and return a
-    transformed `Dataset`.
-
-    For example:
-
-    ```
-    dataset = (dataset.map(lambda x: x ** 2)
-               .(group_by_window(key_func, reduce_func, window_size))
-               .map(lambda x: x ** 3))
-    ```
-
-    Args:
-      transformation_func: A function that takes one `Dataset` argument and
-        returns a `Dataset`.
-
-    Returns:
-      The `Dataset` returned by applying `transformation_func` to this dataset.
-    """
-    dataset = transformation_func(self)
-    if not isinstance(dataset, dataset_ops.Dataset):
-      raise TypeError("`transformation_func` must return a Dataset.")
-    return Dataset(dataset)
-
-
-def get_single_element(dataset):
-  """Returns the single element in `dataset` as a nested structure of tensors.
-
-  This function enables you to use a @{tf.data.Dataset} in a stateless
-  "tensor-in tensor-out" expression, without creating a @{tf.data.Iterator}.
-  This can be useful when your preprocessing transformations are expressed
-  as a `Dataset`, and you want to use the transformation at serving time.
-  For example:
-
-  ```python
-  input_batch = tf.placeholder(tf.string, shape=[BATCH_SIZE])
-
-  def preprocessing_fn(input_str):
-    # ...
-    return image, label
-
-  dataset = (tf.data.Dataset.from_tensor_slices(input_batch)
-             .map(preprocessing_fn, num_parallel_calls=BATCH_SIZE)
-             .batch(BATCH_SIZE))
-
-  image_batch, label_batch = tf.contrib.data.get_single_element(dataset)
-  ```
-
-  Args:
-    dataset: A @{tf.data.Dataset} object containing a single element.
-
-  Returns:
-    A nested structure of @{tf.Tensor} objects, corresponding to the single
-    element of `dataset`.
-
-  Raises:
-    TypeError: if `dataset` is not a `tf.data.Dataset` object.
-    InvalidArgumentError (at runtime): if `dataset` does not contain exactly
-      one element.
-  """
-  if not isinstance(dataset, dataset_ops.Dataset):
-    raise TypeError("`dataset` must be a `tf.data.Dataset` object.")
-  return nest.pack_sequence_as(
-      dataset.output_types,
-      gen_dataset_ops.dataset_to_single_element(
-          dataset._as_variant_tensor(),  # pylint: disable=protected-access
-          output_types=nest.flatten(dataset.output_types),
-          output_shapes=nest.flatten(dataset.output_shapes)))
diff --git a/tensorflow/contrib/data/python/ops/grouping.py b/tensorflow/contrib/data/python/ops/grouping.py
index 67b085002aa7797d858837fea4646fb968ad5d97..a19be222545ef0242502ec07badbdae5c7634a0c 100644
--- a/tensorflow/contrib/data/python/ops/grouping.py
+++ b/tensorflow/contrib/data/python/ops/grouping.py
@@ -17,13 +17,20 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import numpy as np
+
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import sparse
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import gen_dataset_ops
+from tensorflow.python.ops import math_ops
 
 
 def group_by_window(key_func,
@@ -85,6 +92,114 @@ def group_by_window(key_func,
   return _apply_fn
 
 
+def bucket_by_sequence_length(element_length_func,
+                              bucket_boundaries,
+                              bucket_batch_sizes,
+                              padded_shapes=None,
+                              padding_values=None,
+                              pad_to_bucket_boundary=False):
+  """A transformation that buckets elements in a `Dataset` by length.
+
+  Elements of the `Dataset` are grouped together by length and then are padded
+  and batched.
+
+  This is useful for sequence tasks in which the elements have variable length.
+  Grouping together elements that have similar lengths reduces the total
+  fraction of padding in a batch, which increases training step efficiency.
+
+  Args:
+    element_length_func: function from element in `Dataset` to `tf.int64`
+      that returns the length of the element. The length determines the
+      bucket the element goes into.
+    bucket_boundaries: `list<int>`, upper length boundaries of the buckets.
+    bucket_batch_sizes: `list<int>`, batch size per bucket. Length should be
+      `len(bucket_boundaries) + 1`.
+    padded_shapes: Nested structure of `tf.TensorShape` to pass to
+      @{tf.data.Dataset.padded_batch}. If not provided, will use
+      `dataset.output_shapes`, which will result in variable length dimensions
+      being padded out to the maximum length in each batch.
+    padding_values: Values to pad with, passed to
+      @{tf.data.Dataset.padded_batch}. Defaults to padding with 0.
+    pad_to_bucket_boundary: bool, if `False`, will pad dimensions with unknown
+      size to maximum length in batch. If `True`, will pad dimensions with
+      unknown size to bucket boundary, and caller must ensure that the source
+      `Dataset` does not contain any elements with length greater than or
+      equal to `max(bucket_boundaries)`.
+
+  Returns:
+    A `Dataset` transformation function, which can be passed to
+    @{tf.data.Dataset.apply}.
+
+  Raises:
+    ValueError: if `len(bucket_batch_sizes) != len(bucket_boundaries) + 1`.
+  """
+  with ops.name_scope("bucket_by_seq_length"):
+    if len(bucket_batch_sizes) != (len(bucket_boundaries) + 1):
+      raise ValueError(
+          "len(bucket_batch_sizes) must equal len(bucket_boundaries) + 1")
+
+    batch_sizes = constant_op.constant(bucket_batch_sizes, dtype=dtypes.int64)
+
+    def element_to_bucket_id(element):
+      """Return int64 id of the length bucket for this element."""
+      seq_length = element_length_func(element)
+
+      boundaries = list(bucket_boundaries)
+      buckets_min = [np.iinfo(np.int32).min] + boundaries
+      buckets_max = boundaries + [np.iinfo(np.int32).max]
+      conditions_c = math_ops.logical_and(
+          math_ops.less_equal(buckets_min, seq_length),
+          math_ops.less(seq_length, buckets_max))
+      bucket_id = math_ops.reduce_min(array_ops.where(conditions_c))
+
+      return bucket_id
+
+    def window_size_fn(bucket_id):
+      # The window size is set to the batch size for this bucket
+      window_size = batch_sizes[bucket_id]
+      return window_size
+
+    def make_padded_shapes(shapes, none_filler=None):
+      padded = []
+      for shape in nest.flatten(shapes):
+        shape = tensor_shape.TensorShape(shape)
+        shape = [
+            none_filler if d.value is None else d
+            for d in shape
+        ]
+        padded.append(shape)
+      return nest.pack_sequence_as(shapes, padded)
+
+    def batching_fn(bucket_id, grouped_dataset):
+      """Batch elements in dataset."""
+      batch_size = batch_sizes[bucket_id]
+      none_filler = None
+      if pad_to_bucket_boundary:
+        err_msg = ("When pad_to_bucket_boundary=True, elements must have "
+                   "length < max(bucket_boundaries).")
+        check = check_ops.assert_less(
+            bucket_id,
+            constant_op.constant(len(bucket_batch_sizes) - 1,
+                                 dtype=dtypes.int64),
+            message=err_msg)
+        with ops.control_dependencies([check]):
+          boundaries = constant_op.constant(bucket_boundaries,
+                                            dtype=dtypes.int64)
+          bucket_boundary = boundaries[bucket_id]
+          none_filler = bucket_boundary
+      shapes = make_padded_shapes(
+          padded_shapes or grouped_dataset.output_shapes,
+          none_filler=none_filler)
+      return grouped_dataset.padded_batch(batch_size, shapes, padding_values)
+
+    def _apply_fn(dataset):
+      return dataset.apply(
+          group_by_window(element_to_bucket_id, batching_fn,
+                          window_size_func=window_size_fn))
+
+    return _apply_fn
+
+
 class _VariantDataset(dataset_ops.Dataset):
   """A Dataset wrapper for a tf.variant-typed function argument."""
 
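Review note on `bucket_by_sequence_length` in the hunk above: the lookup in `element_to_bucket_id` and the per-bucket padded batching can be modelled outside TensorFlow. What follows is a minimal pure-Python sketch, not the TensorFlow implementation. The names `bucket_id` and `bucket_and_pad` are hypothetical, `bucket_boundaries` is assumed to be sorted in ascending order, and the order in which leftover partial batches are flushed at the end of input is an assumption of this sketch.

```python
import bisect


def bucket_id(length, bucket_boundaries):
    """Return k such that lower[k] <= length < upper[k].

    Mirrors the `buckets_min <= seq_length < buckets_max` test, assuming
    `bucket_boundaries` is sorted ascending (an assumption of this sketch).
    """
    return bisect.bisect_right(bucket_boundaries, length)


def bucket_and_pad(sequences, bucket_boundaries, bucket_batch_sizes, pad=0):
    """Group sequences by length bucket and pad each batch to its longest member."""
    if len(bucket_batch_sizes) != len(bucket_boundaries) + 1:
        raise ValueError(
            "len(bucket_batch_sizes) must equal len(bucket_boundaries) + 1")

    pending = {}  # bucket id -> sequences waiting for a full batch
    batches = []

    def flush(group):
        width = max(len(seq) for seq in group)
        batches.append([list(seq) + [pad] * (width - len(seq)) for seq in group])

    for seq in sequences:
        k = bucket_id(len(seq), bucket_boundaries)
        group = pending.setdefault(k, [])
        group.append(seq)
        # The window size for a bucket is that bucket's batch size.
        if len(group) == bucket_batch_sizes[k]:
            flush(group)
            pending[k] = []

    # Assumption: leftover partial batches are emitted in bucket order.
    for k in sorted(pending):
        if pending[k]:
            flush(pending[k])
    return batches


example = bucket_and_pad(
    [[1], [1, 2, 3], [2], [4, 5, 6, 7], [3, 3]],
    bucket_boundaries=[3],
    bucket_batch_sizes=[2, 2])
```

A length equal to a boundary lands in the next bucket, because each bucket covers `lower <= length < upper`. This is why a sequence whose length equals `max(bucket_boundaries)` ends up in the final, unbounded bucket.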
diff --git a/tensorflow/contrib/data/python/ops/interleave_ops.py b/tensorflow/contrib/data/python/ops/interleave_ops.py
index 3124ca1d1540e12d949dded88ce1c66181be3595..91f19da02d4a479820782822475d9121125fc38e 100644
--- a/tensorflow/contrib/data/python/ops/interleave_ops.py
+++ b/tensorflow/contrib/data/python/ops/interleave_ops.py
@@ -17,101 +17,10 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.python.data.ops import dataset_ops
-from tensorflow.python.data.util import convert
-from tensorflow.python.data.util import nest
-from tensorflow.python.data.util import sparse
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import function
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import gen_dataset_ops
+from tensorflow.python.data.ops import readers
 from tensorflow.python.util import deprecation
 
 
-class ParallelInterleaveDataset(dataset_ops.Dataset):
-  """A `Dataset` that maps a function over its input and flattens the result."""
-
-  def __init__(self, input_dataset, map_func, cycle_length, block_length,
-               sloppy, buffer_output_elements, prefetch_input_elements):
-    """See `tf.contrib.data.parallel_interleave()` for details."""
-    super(ParallelInterleaveDataset, self).__init__()
-    self._input_dataset = input_dataset
-
-    @function.Defun(*nest.flatten(
-        sparse.as_dense_types(input_dataset.output_types,
-                              input_dataset.output_classes)))
-    def tf_map_func(*args):
-      """A wrapper for Defun that facilitates shape inference."""
-      # Pass in shape information from the input_dataset.
-      dense_shapes = sparse.as_dense_shapes(input_dataset.output_shapes,
-                                            input_dataset.output_classes)
-      for arg, shape in zip(args, nest.flatten(dense_shapes)):
-        arg.set_shape(shape)
-
-      nested_args = nest.pack_sequence_as(input_dataset.output_types, args)
-      nested_args = sparse.deserialize_sparse_tensors(
-          nested_args, input_dataset.output_types, input_dataset.output_shapes,
-          input_dataset.output_classes)
-      if dataset_ops._should_unpack_args(nested_args):  # pylint: disable=protected-access
-        dataset = map_func(*nested_args)
-      else:
-        dataset = map_func(nested_args)
-
-      if not isinstance(dataset, dataset_ops.Dataset):
-        raise TypeError("`map_func` must return a `Dataset` object.")
-
-      self._output_classes = dataset.output_classes
-      self._output_types = dataset.output_types
-      self._output_shapes = dataset.output_shapes
-
-      return dataset._as_variant_tensor()  # pylint: disable=protected-access
-
-    self._map_func = tf_map_func
-    self._map_func.add_to_graph(ops.get_default_graph())
-
-    self._cycle_length = ops.convert_to_tensor(
-        cycle_length, dtype=dtypes.int64, name="cycle_length")
-    self._block_length = ops.convert_to_tensor(
-        block_length, dtype=dtypes.int64, name="block_length")
-    self._sloppy = ops.convert_to_tensor(
-        sloppy, dtype=dtypes.bool, name="sloppy")
-    self._buffer_output_elements = convert.optional_param_to_tensor(
-        "buffer_output_elements",
-        buffer_output_elements,
-        argument_default=2 * block_length)
-    self._prefetch_input_elements = convert.optional_param_to_tensor(
-        "prefetch_input_elements",
-        prefetch_input_elements,
-        argument_default=2 * cycle_length)
-
-  def _as_variant_tensor(self):
-    return gen_dataset_ops.parallel_interleave_dataset(
-        self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
-        self._map_func.captured_inputs,
-        self._cycle_length,
-        self._block_length,
-        self._sloppy,
-        self._buffer_output_elements,
-        self._prefetch_input_elements,
-        f=self._map_func,
-        output_types=nest.flatten(
-            sparse.as_dense_types(self.output_types, self.output_classes)),
-        output_shapes=nest.flatten(
-            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
-
-  @property
-  def output_classes(self):
-    return self._output_classes
-
-  @property
-  def output_shapes(self):
-    return self._output_shapes
-
-  @property
-  def output_types(self):
-    return self._output_types
-
-
 def parallel_interleave(map_func,
                         cycle_length,
                         block_length=1,
@@ -162,7 +71,7 @@ def parallel_interleave(map_func,
     @{tf.data.Dataset.apply}.
   """
   def _apply_fn(dataset):
-    return ParallelInterleaveDataset(
+    return readers.ParallelInterleaveDataset(
         dataset, map_func, cycle_length, block_length, sloppy,
         buffer_output_elements, prefetch_input_elements)
 
@@ -221,7 +130,7 @@ def sloppy_interleave(map_func, cycle_length, block_length=1):
     @{tf.data.Dataset.apply}.
   """
   def _apply_fn(dataset):
-    return ParallelInterleaveDataset(
+    return readers.ParallelInterleaveDataset(
         dataset,
         map_func,
         cycle_length,
diff --git a/tensorflow/contrib/data/python/ops/random_ops.py b/tensorflow/contrib/data/python/ops/random_ops.py
index 7d727165feabb101549567f28a2dfa07083de244..28ef5e50f39dd7d1b6f124e58e068fc968ddd6dc 100644
--- a/tensorflow/contrib/data/python/ops/random_ops.py
+++ b/tensorflow/contrib/data/python/ops/random_ops.py
@@ -19,11 +19,10 @@ from __future__ import print_function
 
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import random_seed
 from tensorflow.python.data.util import sparse
-from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
-from tensorflow.python.framework import random_seed
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import gen_dataset_ops
 
@@ -34,16 +33,7 @@ class RandomDataset(dataset_ops.Dataset):
   def __init__(self, seed=None):
     """A `Dataset` of pseudorandom values."""
     super(RandomDataset, self).__init__()
-    seed, seed2 = random_seed.get_seed(seed)
-    if seed is None:
-      self._seed = constant_op.constant(0, dtype=dtypes.int64, name="seed")
-    else:
-      self._seed = ops.convert_to_tensor(seed, dtype=dtypes.int64, name="seed")
-    if seed2 is None:
-      self._seed2 = constant_op.constant(0, dtype=dtypes.int64, name="seed2")
-    else:
-      self._seed2 = ops.convert_to_tensor(
-          seed2, dtype=dtypes.int64, name="seed2")
+    self._seed, self._seed2 = random_seed.get_seed(seed)
 
   def _as_variant_tensor(self):
     return gen_dataset_ops.random_dataset(
diff --git a/tensorflow/contrib/data/python/ops/shuffle_ops.py b/tensorflow/contrib/data/python/ops/shuffle_ops.py
index 99bb79bc06a421f811869ca9169aaa11deaca2f3..f35795abd38000b13cec0f08596e2ff66e86286c 100644
--- a/tensorflow/contrib/data/python/ops/shuffle_ops.py
+++ b/tensorflow/contrib/data/python/ops/shuffle_ops.py
@@ -19,11 +19,11 @@ from __future__ import print_function
 
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import random_seed
 from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
-from tensorflow.python.framework import random_seed
 from tensorflow.python.ops import gen_dataset_ops
 
 
@@ -45,17 +45,7 @@ class _ShuffleAndRepeatDataset(dataset_ops.Dataset):
     else:
       self._count = ops.convert_to_tensor(
           count, dtype=dtypes.int64, name="count")
-
-    seed, seed2 = random_seed.get_seed(seed)
-    if seed is None:
-      self._seed = constant_op.constant(0, dtype=dtypes.int64, name="seed")
-    else:
-      self._seed = ops.convert_to_tensor(seed, dtype=dtypes.int64, name="seed")
-    if seed2 is None:
-      self._seed2 = constant_op.constant(0, dtype=dtypes.int64, name="seed2")
-    else:
-      self._seed2 = ops.convert_to_tensor(
-          seed2, dtype=dtypes.int64, name="seed2")
+    self._seed, self._seed2 = random_seed.get_seed(seed)
 
   def _as_variant_tensor(self):
     # pylint: disable=protected-access
diff --git a/tensorflow/contrib/data/python/ops/threadpool.py b/tensorflow/contrib/data/python/ops/threadpool.py
new file mode 100644
index 0000000000000000000000000000000000000000..3f85aa84cd53fcf5e21480aac96e067766ad1b65
--- /dev/null
+++ b/tensorflow/contrib/data/python/ops/threadpool.py
@@ -0,0 +1,102 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Experimental API for controlling threading in `tf.data` pipelines."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import threading
+
+from tensorflow.contrib.data.python.ops import contrib_op_loader  # pylint: disable=unused-import
+from tensorflow.contrib.data.python.ops import gen_dataset_ops
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import sparse
+from tensorflow.python.eager import context
+from tensorflow.python.ops import resource_variable_ops
+
+_uid_counter = 0
+_uid_lock = threading.Lock()
+
+
+def _generate_shared_name(prefix):
+  with _uid_lock:
+    global _uid_counter
+    uid = _uid_counter
+    _uid_counter += 1
+  return "{}{}".format(prefix, uid)
+
+
+class PrivateThreadPool(object):
+  """A stateful resource that represents a private thread pool."""
+
+  def __init__(self, num_threads, display_name=None):
+    """Creates a `PrivateThreadPool` with the given number of threads."""
+    if context.in_eager_mode():
+      shared_name = _generate_shared_name("privatethreadpool")
+      self._resource = gen_dataset_ops.thread_pool_handle(
+          num_threads=num_threads,
+          display_name=display_name,
+          shared_name=shared_name)
+      self._resource_deleter = resource_variable_ops.EagerResourceDeleter(
+          handle=self._resource, handle_device=context.context().device_name)
+    else:
+      self._resource = gen_dataset_ops.thread_pool_handle(
+          num_threads=num_threads, display_name=display_name)
+
+
+class _ThreadPoolDataset(dataset_ops.Dataset):
+  """A `Dataset` that acts as an identity, and sets a custom threadpool."""
+
+  def __init__(self, input_dataset, thread_pool):
+    super(_ThreadPoolDataset, self).__init__()
+    self._input_dataset = input_dataset
+    self._thread_pool = thread_pool
+
+  def _as_variant_tensor(self):
+    return gen_dataset_ops.thread_pool_dataset(
+        self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
+        self._thread_pool._resource,  # pylint: disable=protected-access
+        output_shapes=nest.flatten(
+            sparse.as_dense_shapes(self.output_shapes, self.output_classes)),
+        output_types=nest.flatten(
+            sparse.as_dense_types(self.output_types, self.output_classes)))
+
+  @property
+  def output_shapes(self):
+    return self._input_dataset.output_shapes
+
+  @property
+  def output_types(self):
+    return self._input_dataset.output_types
+
+  @property
+  def output_classes(self):
+    return self._input_dataset.output_classes
+
+
+def override_threadpool(dataset, thread_pool):
+  """Returns a new dataset that uses the given thread pool for its operations.
+
+  Args:
+    dataset: A `tf.data.Dataset` object.
+    thread_pool: A `PrivateThreadPool` object.
+
+  Returns:
+    A dataset containing the same values as `dataset`, but which uses
+    `thread_pool` to compute any of its parallel operations (such as
+    @{tf.data.Dataset.map}).
+  """
+  return _ThreadPoolDataset(dataset, thread_pool)
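Review note on `_generate_shared_name` in the new `threadpool.py` above: the helper guards a module-level counter with a lock so that each eager-mode `PrivateThreadPool` gets a distinct `shared_name`. The sketch below reproduces that pattern with the standard library only. `names_from_threads` is a hypothetical helper added here to exercise the counter from several threads; it is not part of the change.

```python
import threading

_uid_counter = 0
_uid_lock = threading.Lock()


def generate_shared_name(prefix):
    """Return `prefix` followed by a process-wide unique integer."""
    global _uid_counter
    # The read and the increment happen under one lock, so no two callers
    # can observe the same counter value.
    with _uid_lock:
        uid = _uid_counter
        _uid_counter += 1
    return "{}{}".format(prefix, uid)


def names_from_threads(num_threads, names_per_thread):
    """Hypothetical helper: collect names generated concurrently."""
    results = []
    results_lock = threading.Lock()

    def worker():
        local = [generate_shared_name("privatethreadpool")
                 for _ in range(names_per_thread)]
        with results_lock:
            results.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Without the lock, two threads could read the same counter value before either increments it, and two pools would then share a resource name.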
diff --git a/tensorflow/contrib/data/python/ops/unique.py b/tensorflow/contrib/data/python/ops/unique.py
index 133e17d20d0fc4c8d52cef3c95c132374e927a0b..765ef3f9b6d42c9d7af3ce4916731d37d65c9260 100644
--- a/tensorflow/contrib/data/python/ops/unique.py
+++ b/tensorflow/contrib/data/python/ops/unique.py
@@ -17,11 +17,12 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.contrib.data.python.ops import contrib_op_loader  # pylint: disable=unused-import
+from tensorflow.contrib.data.python.ops import gen_dataset_ops
 from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import nest
 from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import dtypes
-from tensorflow.python.ops import gen_dataset_ops
 
 
 def unique():
diff --git a/tensorflow/contrib/distributions/BUILD b/tensorflow/contrib/distributions/BUILD
index 7f510c42215f48a9e795eb81bd9f66b0a2108335..1b4877c57fb4708b20860d6b64438ea717f63cf1 100644
--- a/tensorflow/contrib/distributions/BUILD
+++ b/tensorflow/contrib/distributions/BUILD
@@ -251,6 +251,21 @@ cuda_py_test(
     ],
 )
 
+cuda_py_test(
+    name = "kumaraswamy_test",
+    srcs = ["python/kernel_tests/kumaraswamy_test.py"],
+    additional_deps = [
+        ":distributions_py",
+        "//third_party/py/numpy",
+        "//tensorflow/python:client",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:nn_ops",
+        "//tensorflow/python:platform_test",
+    ],
+)
+
 cuda_py_test(
     name = "moving_stats_test",
     size = "small",
@@ -403,7 +418,7 @@ cuda_py_test(
 
 cuda_py_test(
     name = "poisson_lognormal_test",
-    size = "small",
+    size = "medium",
     srcs = ["python/kernel_tests/poisson_lognormal_test.py"],
     additional_deps = [
         ":distributions_py",
@@ -459,6 +474,19 @@ cuda_py_test(
     tags = ["nomsan"],  # disable to avoid false positives from scipy.
 )
 
+cuda_py_test(
+    name = "statistical_testing_test",
+    size = "medium",
+    srcs = [
+        "python/kernel_tests/statistical_testing_test.py",
+    ],
+    additional_deps = [
+        ":distributions_py",
+        "//third_party/py/numpy",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
 cuda_py_test(
     name = "vector_sinh_arcsinh_diag_test",
     size = "medium",
@@ -915,6 +943,25 @@ cuda_py_test(
     ],
 )
 
+cuda_py_test(
+    name = "kumaraswamy_bijector_test",
+    size = "small",
+    srcs = ["python/kernel_tests/bijectors/kumaraswamy_bijector_test.py"],
+    additional_deps = [
+        ":bijectors_py",
+        ":distributions_py",
+        "//third_party/py/numpy",
+        "@six_archive//:six",
+        "//tensorflow/contrib/linalg:linalg_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform_test",
+    ],
+)
+
 cuda_py_test(
     name = "masked_autoregressive_test",
     size = "small",
@@ -984,7 +1031,7 @@ cuda_py_test(
 
 cuda_py_test(
     name = "reshape_test",
-    size = "small",
+    size = "medium",
     srcs = ["python/kernel_tests/bijectors/reshape_test.py"],
     additional_deps = [
         ":bijectors_py",
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/bijectors/kumaraswamy_bijector_test.py b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/kumaraswamy_bijector_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..ad11d9f2484c4b08c67c5f82aec1320475d1d983
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/bijectors/kumaraswamy_bijector_test.py
@@ -0,0 +1,80 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for Kumaraswamy Bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.distributions.python.ops.bijectors.kumaraswamy import Kumaraswamy
+from tensorflow.python.ops.distributions.bijector_test_util import assert_bijective_and_finite
+from tensorflow.python.ops.distributions.bijector_test_util import assert_scalar_congruency
+from tensorflow.python.platform import test
+
+
+class KumaraswamyBijectorTest(test.TestCase):
+  """Tests correctness of the Kumaraswamy bijector."""
+
+  def testBijector(self):
+    with self.test_session():
+      a = 2.
+      b = 0.3
+      bijector = Kumaraswamy(
+          concentration1=a, concentration0=b,
+          event_ndims=0, validate_args=True)
+      self.assertEqual("kumaraswamy", bijector.name)
+      x = np.array([[[0.1], [0.2], [0.3], [0.4], [0.5]]], dtype=np.float32)
+      # Kumaraswamy cdf. This is the same as inverse(x).
+      y = 1. - (1. - x ** a) ** b
+      self.assertAllClose(y, bijector.inverse(x).eval())
+      self.assertAllClose(x, bijector.forward(y).eval())
+      kumaraswamy_log_pdf = (np.log(a) + np.log(b) + (a - 1) * np.log(x) +
+                             (b - 1) * np.log1p(-x ** a))
+
+      self.assertAllClose(
+          # We should lose a dimension from calculating the determinant of the
+          # jacobian.
+          kumaraswamy_log_pdf,
+          bijector.inverse_log_det_jacobian(x).eval())
+      self.assertAllClose(
+          -bijector.inverse_log_det_jacobian(x).eval(),
+          bijector.forward_log_det_jacobian(y).eval(),
+          rtol=1e-4,
+          atol=0.)
+
+  def testScalarCongruency(self):
+    with self.test_session():
+      assert_scalar_congruency(
+          Kumaraswamy(concentration1=0.5, concentration0=1.1),
+          lower_x=0., upper_x=1., n=int(10e3), rtol=0.02)
+
+  def testBijectiveAndFinite(self):
+    with self.test_session():
+      concentration1 = 1.2
+      concentration0 = 2.
+      bijector = Kumaraswamy(
+          concentration1=concentration1,
+          concentration0=concentration0, validate_args=True)
+      # Omitting the endpoints 0 and 1, since ildj will be infinity at these
+      # endpoints.
+      y = np.linspace(.01, 0.99, num=10).astype(np.float32)
+      x = 1 - (1 - y ** concentration1) ** concentration0
+      assert_bijective_and_finite(bijector, x, y, rtol=1e-3)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/kumaraswamy_test.py b/tensorflow/contrib/distributions/python/kernel_tests/kumaraswamy_test.py
index ea3c86b5c0f42b64fc6e4e362cbcc162bccf74a2..2980e2bfe93b2e2aa01d38fc9fa4650a015efc06 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/kumaraswamy_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/kumaraswamy_test.py
@@ -130,10 +130,8 @@ class KumaraswamyTest(test.TestCase):
       dist.prob([.1, .3, .6]).eval()
       dist.prob([.2, .3, .5]).eval()
       # Either condition can trigger.
-      with self.assertRaisesOpError("sample must be positive"):
+      with self.assertRaisesOpError("sample must be non-negative"):
         dist.prob([-1., 0.1, 0.5]).eval()
-      with self.assertRaisesOpError("sample must be positive"):
-        dist.prob([0., 0.1, 0.5]).eval()
       with self.assertRaisesOpError("sample must be no larger than `1`"):
         dist.prob([.1, .2, 1.2]).eval()
 
@@ -249,13 +247,13 @@ class KumaraswamyTest(test.TestCase):
       a = np.array([1., 2, 3])
       b = np.array([2., 4, 1.2])
       dist = kumaraswamy_lib.Kumaraswamy(a, b, allow_nan_stats=False)
-      with self.assertRaisesOpError("Condition x < y.*"):
+      with self.assertRaisesOpError("Mode undefined for concentration1 <= 1."):
         dist.mode().eval()
 
       a = np.array([2., 2, 3])
       b = np.array([1., 4, 1.2])
       dist = kumaraswamy_lib.Kumaraswamy(a, b, allow_nan_stats=False)
-      with self.assertRaisesOpError("Condition x < y.*"):
+      with self.assertRaisesOpError("Mode undefined for concentration0 <= 1."):
         dist.mode().eval()
 
   def testKumaraswamyModeEnableAllowNanStats(self):
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/poisson_test.py b/tensorflow/contrib/distributions/python/kernel_tests/poisson_test.py
index d9c9008417cdb20b62390630cf887d3bd888a0d3..19a7472d91758a2dbd00c4d918853d7bae33685d 100644
--- a/tensorflow/contrib/distributions/python/kernel_tests/poisson_test.py
+++ b/tensorflow/contrib/distributions/python/kernel_tests/poisson_test.py
@@ -18,6 +18,7 @@ from __future__ import division
 from __future__ import print_function
 
 import numpy as np
+from scipy import special
 from scipy import stats
 from tensorflow.contrib.distributions.python.ops import poisson as poisson_lib
 from tensorflow.python.framework import constant_op
@@ -110,7 +111,7 @@ class PoissonTest(test.TestCase):
       batch_size = 6
       lam = constant_op.constant([3.0] * batch_size)
       lam_v = 3.0
-      x = [2.2, 3.1, 4., 5.5, 6., 7.]
+      x = [2., 3., 4., 5., 6., 7.]
 
       poisson = self._make_poisson(rate=lam)
       log_cdf = poisson.log_cdf(x)
@@ -121,12 +122,31 @@ class PoissonTest(test.TestCase):
       self.assertEqual(cdf.get_shape(), (6,))
       self.assertAllClose(cdf.eval(), stats.poisson.cdf(x, lam_v))
 
+  def testPoissonCDFNonIntegerValues(self):
+    with self.test_session():
+      batch_size = 6
+      lam = constant_op.constant([3.0] * batch_size)
+      lam_v = 3.0
+      x = np.array([2.2, 3.1, 4., 5.5, 6., 7.], dtype=np.float32)
+
+      poisson = self._make_poisson(rate=lam)
+      cdf = poisson.cdf(x)
+      self.assertEqual(cdf.get_shape(), (6,))
+
+      # The Poisson CDF should be valid on these non-integer values, and
+      # equal to igammac(1 + x, rate).
+      self.assertAllClose(cdf.eval(), special.gammaincc(1. + x, lam_v))
+
+      with self.assertRaisesOpError("cannot contain fractional components"):
+        poisson_validate = self._make_poisson(rate=lam, validate_args=True)
+        poisson_validate.cdf(x).eval()
+
   def testPoissonCdfMultidimensional(self):
     with self.test_session():
       batch_size = 6
       lam = constant_op.constant([[2.0, 4.0, 5.0]] * batch_size)
       lam_v = [2.0, 4.0, 5.0]
-      x = np.array([[2.2, 3.1, 4., 5.5, 6., 7.]], dtype=np.float32).T
+      x = np.array([[2., 3., 4., 5., 6., 7.]], dtype=np.float32).T
 
       poisson = self._make_poisson(rate=lam)
       log_cdf = poisson.log_cdf(x)
diff --git a/tensorflow/contrib/distributions/python/kernel_tests/statistical_testing_test.py b/tensorflow/contrib/distributions/python/kernel_tests/statistical_testing_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..3548ac18078a0b40f117c2bf9e2b34d20cee163b
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/kernel_tests/statistical_testing_test.py
@@ -0,0 +1,166 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for the statistical testing library."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.distributions.python.ops import statistical_testing as st
+from tensorflow.python.framework import errors
+from tensorflow.python.ops import check_ops
+from tensorflow.python.platform import test
+
+
+class StatisticalTestingTest(test.TestCase):
+
+  def test_dkwm_design_mean_one_sample_soundness(self):
+    numbers = [1e-5, 1e-2, 1.1e-1, 0.9, 1., 1.02, 2., 10., 1e2, 1e5, 1e10]
+    rates = [1e-6, 1e-3, 1e-2, 1.1e-1, 0.2, 0.5, 0.7, 1.]
+    with self.test_session() as sess:
+      for ff in rates:
+        for fp in rates:
+          sufficient_n = st.min_num_samples_for_dkwm_mean_test(
+              numbers, 0., 1., false_fail_rate=ff, false_pass_rate=fp)
+          detectable_d = st.min_discrepancy_of_true_means_detectable_by_dkwm(
+              sufficient_n, 0., 1., false_fail_rate=ff, false_pass_rate=fp)
+          sess.run(check_ops.assert_less_equal(detectable_d, numbers))
+
+  def test_dkwm_design_mean_two_sample_soundness(self):
+    numbers = [1e-5, 1e-2, 1.1e-1, 0.9, 1., 1.02, 2., 10., 1e2, 1e5, 1e10]
+    rates = [1e-6, 1e-3, 1e-2, 1.1e-1, 0.2, 0.5, 0.7, 1.]
+    with self.test_session() as sess:
+      for ff in rates:
+        for fp in rates:
+          (sufficient_n1,
+           sufficient_n2) = st.min_num_samples_for_dkwm_mean_two_sample_test(
+               numbers, 0., 1., 0., 1.,
+               false_fail_rate=ff, false_pass_rate=fp)
+          d_fn = st.min_discrepancy_of_true_means_detectable_by_dkwm_two_sample
+          detectable_d = d_fn(
+              sufficient_n1, 0., 1., sufficient_n2, 0., 1.,
+              false_fail_rate=ff, false_pass_rate=fp)
+          sess.run(check_ops.assert_less_equal(detectable_d, numbers))
+
+  def test_true_mean_confidence_interval_by_dkwm_one_sample(self):
+    rng = np.random.RandomState(seed=0)
+
+    num_samples = 5000
+    # 5000 samples is chosen to be enough to find discrepancies of
+    # size 0.1 or more with assurance 1e-6, as confirmed here:
+    with self.test_session() as sess:
+      d = st.min_discrepancy_of_true_means_detectable_by_dkwm(
+          num_samples, 0., 1., false_fail_rate=1e-6, false_pass_rate=1e-6)
+      d = sess.run(d)
+      self.assertLess(d, 0.1)
+
+    # Test that the confidence interval computed for the mean includes
+    # 0.5 and excludes 0.4 and 0.6.
+    with self.test_session() as sess:
+      samples = rng.uniform(size=num_samples).astype(np.float32)
+      (low, high) = st.true_mean_confidence_interval_by_dkwm(
+          samples, 0., 1., error_rate=1e-6)
+      low, high = sess.run([low, high])
+      self.assertGreater(low, 0.4)
+      self.assertLess(low, 0.5)
+      self.assertGreater(high, 0.5)
+      self.assertLess(high, 0.6)
+
+  def test_dkwm_mean_one_sample_assertion(self):
+    rng = np.random.RandomState(seed=0)
+    num_samples = 5000
+
+    # Test that the test assertion agrees that the mean of the standard
+    # uniform distribution is 0.5.
+    samples = rng.uniform(size=num_samples).astype(np.float32)
+    with self.test_session() as sess:
+      sess.run(st.assert_true_mean_equal_by_dkwm(
+          samples, 0., 1., 0.5, false_fail_rate=1e-6))
+
+      # Test that the test assertion confirms that the mean of the
+      # standard uniform distribution is not 0.4.
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.assert_true_mean_equal_by_dkwm(
+            samples, 0., 1., 0.4, false_fail_rate=1e-6))
+
+      # Test that the test assertion confirms that the mean of the
+      # standard uniform distribution is not 0.6.
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.assert_true_mean_equal_by_dkwm(
+            samples, 0., 1., 0.6, false_fail_rate=1e-6))
+
+  def test_dkwm_mean_two_sample_assertion(self):
+    rng = np.random.RandomState(seed=0)
+    num_samples = 15000
+
+    # 15000 samples is chosen to be enough to find discrepancies of
+    # size 0.1 or more with assurance 1e-6, as confirmed here:
+    with self.test_session() as sess:
+      d = st.min_discrepancy_of_true_means_detectable_by_dkwm_two_sample(
+          num_samples, 0., 1., num_samples, 0., 1.,
+          false_fail_rate=1e-6, false_pass_rate=1e-6)
+      d = sess.run(d)
+      self.assertLess(d, 0.1)
+
+    # Test that the test assertion agrees that the standard
+    # uniform distribution has the same mean as itself.
+    samples1 = rng.uniform(size=num_samples).astype(np.float32)
+    samples2 = rng.uniform(size=num_samples).astype(np.float32)
+    with self.test_session() as sess:
+      sess.run(st.assert_true_mean_equal_by_dkwm_two_sample(
+          samples1, 0., 1., samples2, 0., 1., false_fail_rate=1e-6))
+
+      # Test that the test assertion confirms that the mean of the
+      # standard uniform distribution is different from the mean of beta(2, 1).
+      beta_high_samples = rng.beta(2, 1, size=num_samples).astype(np.float32)
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.assert_true_mean_equal_by_dkwm_two_sample(
+            samples1, 0., 1.,
+            beta_high_samples, 0., 1.,
+            false_fail_rate=1e-6))
+
+      # Test that the test assertion confirms that the mean of the
+      # standard uniform distribution is different from the mean of beta(1, 2).
+      beta_low_samples = rng.beta(1, 2, size=num_samples).astype(np.float32)
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.assert_true_mean_equal_by_dkwm_two_sample(
+            samples1, 0., 1.,
+            beta_low_samples, 0., 1.,
+            false_fail_rate=1e-6))
+
+  def test_dkwm_argument_validity_checking(self):
+    rng = np.random.RandomState(seed=0)
+    samples = rng.uniform(size=5000).astype(np.float32)
+
+    # Test that the test library complains if the given samples fall
+    # outside the purported bounds.
+    with self.test_session() as sess:
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.true_mean_confidence_interval_by_dkwm(
+            samples, 0., 0.5, error_rate=0.5))
+      with self.assertRaises(errors.InvalidArgumentError):
+        sess.run(st.true_mean_confidence_interval_by_dkwm(
+            samples, 0.5, 1., error_rate=0.5))
+
+      # But doesn't complain if they don't.
+      op = st.true_mean_confidence_interval_by_dkwm(
+          samples, 0., 1., error_rate=0.5)
+      _ = sess.run(op)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py b/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py
index 93923c3f083c7f5136b55e9021cbd6323684b976..9437f56b1ebc76165edec224928baeb836277163 100644
--- a/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/__init__.py
@@ -26,6 +26,7 @@
 @@Identity
 @@Inline
 @@Invert
+@@Kumaraswamy
 @@MaskedAutoregressiveFlow
 @@Permute
 @@PowerTransform
@@ -59,6 +60,7 @@ from tensorflow.contrib.distributions.python.ops.bijectors.exp import *
 from tensorflow.contrib.distributions.python.ops.bijectors.gumbel import *
 from tensorflow.contrib.distributions.python.ops.bijectors.inline import *
 from tensorflow.contrib.distributions.python.ops.bijectors.invert import *
+from tensorflow.contrib.distributions.python.ops.bijectors.kumaraswamy import *
 from tensorflow.contrib.distributions.python.ops.bijectors.masked_autoregressive import *
 from tensorflow.contrib.distributions.python.ops.bijectors.permute import *
 from tensorflow.contrib.distributions.python.ops.bijectors.power_transform import *
diff --git a/tensorflow/contrib/distributions/python/ops/bijectors/kumaraswamy.py b/tensorflow/contrib/distributions/python/ops/bijectors/kumaraswamy.py
new file mode 100644
index 0000000000000000000000000000000000000000..f5de052c9ed18b1ebf4c174aeea3a951b1ddcd9d
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/ops/bijectors/kumaraswamy.py
@@ -0,0 +1,153 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Kumaraswamy bijector."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops.distributions import bijector
+
+__all__ = [
+    "Kumaraswamy",
+]
+
+
+class Kumaraswamy(bijector.Bijector):
+  """Compute `Y = g(X) = (1 - (1 - X)**(1 / b))**(1 / a), X in [0, 1]`.
+
+  This bijector maps inputs from `[0, 1]` to `[0, 1]`. The forward transform of
+  the bijector applied to a uniform random variable `X ~ U(0, 1)` gives back a
+  random variable with the [Kumaraswamy distribution](
+  https://en.wikipedia.org/wiki/Kumaraswamy_distribution):
+
+  ```none
+  Y ~ Kumaraswamy(a, b)
+  pdf(y; a, b, 0 <= y <= 1) = a * b * y ** (a - 1) * (1 - y**a) ** (b - 1)
+  ```
+  """
+
+  def __init__(self,
+               concentration1=None,
+               concentration0=None,
+               event_ndims=0,
+               validate_args=False,
+               name="kumaraswamy"):
+    """Instantiates the `Kumaraswamy` bijector.
+
+    Args:
+      concentration1: Python `float` scalar indicating the transform power,
+        i.e., `Y = g(X) = (1 - (1 - X)**(1 / b))**(1 / a)` where `a` is
+        `concentration1`.
+      concentration0: Python `float` scalar indicating the transform power,
+        i.e., `Y = g(X) = (1 - (1 - X)**(1 / b))**(1 / a)` where `b` is
+        `concentration0`.
+      event_ndims: Python scalar indicating the number of dimensions associated
+        with a particular draw from the distribution. Currently only zero is
+        supported.
+      validate_args: Python `bool` indicating whether arguments should be
+        checked for correctness.
+      name: Python `str` name given to ops managed by this object.
+
+    Raises:
+      ValueError:  If `event_ndims` is not zero.
+    """
+    self._graph_parents = []
+    self._name = name
+    self._validate_args = validate_args
+
+    event_ndims = ops.convert_to_tensor(event_ndims, name="event_ndims")
+    event_ndims_const = tensor_util.constant_value(event_ndims)
+    if event_ndims_const is not None and event_ndims_const not in (0,):
+      raise ValueError("event_ndims(%s) was not 0" % event_ndims_const)
+    else:
+      if validate_args:
+        event_ndims = control_flow_ops.with_dependencies(
+            [check_ops.assert_equal(
+                event_ndims, 0, message="event_ndims was not 0")],
+            event_ndims)
+
+    with self._name_scope("init", values=[concentration1, concentration0]):
+      concentration1 = self._maybe_assert_valid_concentration(
+          ops.convert_to_tensor(concentration1, name="concentration1"),
+          validate_args=validate_args)
+      concentration0 = self._maybe_assert_valid_concentration(
+          ops.convert_to_tensor(concentration0, name="concentration0"),
+          validate_args=validate_args)
+
+    self._concentration1 = concentration1
+    self._concentration0 = concentration0
+    super(Kumaraswamy, self).__init__(
+        event_ndims=0,
+        validate_args=validate_args,
+        name=name)
+
+  @property
+  def concentration1(self):
+    """The `a` in: `Y = g(X) = (1 - (1 - X)**(1 / b))**(1 / a)`."""
+    return self._concentration1
+
+  @property
+  def concentration0(self):
+    """The `b` in: `Y = g(X) = (1 - (1 - X)**(1 / b))**(1 / a)`."""
+    return self._concentration0
+
+  def _forward(self, x):
+    x = self._maybe_assert_valid(x)
+    return math_ops.exp(
+        math_ops.log1p(-math_ops.exp(math_ops.log1p(-x) / self.concentration0))
+        / self.concentration1)
+
+  def _inverse(self, y):
+    y = self._maybe_assert_valid(y)
+    return math_ops.exp(math_ops.log1p(
+        -(1 - y**self.concentration1)**self.concentration0))
+
+  def _inverse_log_det_jacobian(self, y):
+    y = self._maybe_assert_valid(y)
+    event_dims = self._event_dims_tensor(y)
+    return math_ops.reduce_sum(
+        math_ops.log(self.concentration1) + math_ops.log(self.concentration0) +
+        (self.concentration1 - 1) * math_ops.log(y) +
+        (self.concentration0 - 1) * math_ops.log1p(-y**self.concentration1),
+        axis=event_dims)
+
+  def _maybe_assert_valid_concentration(self, concentration, validate_args):
+    """Checks the validity of a concentration parameter."""
+    if not validate_args:
+      return concentration
+    return control_flow_ops.with_dependencies([
+        check_ops.assert_positive(
+            concentration,
+            message="Concentration parameter must be positive."),
+    ], concentration)
+
+  def _maybe_assert_valid(self, x):
+    if not self.validate_args:
+      return x
+    return control_flow_ops.with_dependencies([
+        check_ops.assert_non_negative(
+            x,
+            message="sample must be non-negative"),
+        check_ops.assert_less_equal(
+            x, array_ops.ones([], self.concentration0.dtype),
+            message="sample must be no larger than `1`."),
+    ], x)
diff --git a/tensorflow/contrib/distributions/python/ops/kumaraswamy.py b/tensorflow/contrib/distributions/python/ops/kumaraswamy.py
index 74d5d8773cf3e69a52554c87d656fea2835c8354..120b38db3cf72e8fce56a7e9293cdf25e75784e2 100644
--- a/tensorflow/contrib/distributions/python/ops/kumaraswamy.py
+++ b/tensorflow/contrib/distributions/python/ops/kumaraswamy.py
@@ -20,15 +20,17 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.contrib.distributions.python.ops import bijectors
+from tensorflow.contrib.distributions.python.ops import distribution_util
+from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import special_math_ops
-from tensorflow.python.ops.distributions import beta
 from tensorflow.python.ops.distributions import distribution
-from tensorflow.python.ops.distributions import util as distribution_util
+from tensorflow.python.ops.distributions import transformed_distribution
+from tensorflow.python.ops.distributions import uniform
 from tensorflow.python.util.tf_export import tf_export
 
 __all__ = [
@@ -60,7 +62,7 @@ def _harmonic_number(x):
 
 
 @tf_export("distributions.Kumaraswamy")
-class Kumaraswamy(beta.Beta):
+class Kumaraswamy(transformed_distribution.TransformedDistribution):
   """Kumaraswamy distribution.
 
   The Kumaraswamy distribution is defined over the `(0, 1)` interval using
@@ -151,59 +153,32 @@ class Kumaraswamy(beta.Beta):
         more of the statistic's batch members are undefined.
       name: Python `str` name prefixed to Ops created by this class.
     """
+    concentration1 = ops.convert_to_tensor(
+        concentration1, name="concentration1")
+    concentration0 = ops.convert_to_tensor(
+        concentration0, name="concentration0")
     super(Kumaraswamy, self).__init__(
-        concentration1=concentration1,
-        concentration0=concentration0,
-        validate_args=validate_args,
-        allow_nan_stats=allow_nan_stats,
+        distribution=uniform.Uniform(
+            low=array_ops.zeros([], dtype=concentration1.dtype),
+            high=array_ops.ones([], dtype=concentration1.dtype),
+            allow_nan_stats=allow_nan_stats),
+        bijector=bijectors.Kumaraswamy(
+            concentration1=concentration1, concentration0=concentration0,
+            validate_args=validate_args),
+        batch_shape=distribution_util.get_broadcast_shape(
+            concentration1, concentration0),
         name=name)
     self._reparameterization_type = distribution.FULLY_REPARAMETERIZED
 
-  def _sample_n(self, n, seed=None):
-    expanded_concentration1 = array_ops.ones_like(
-        self.total_concentration, dtype=self.dtype) * self.concentration1
-    expanded_concentration0 = array_ops.ones_like(
-        self.total_concentration, dtype=self.dtype) * self.concentration0
-    shape = array_ops.concat([[n], self.batch_shape_tensor()], 0)
-    uniform_sample = random_ops.random_uniform(
-        shape=shape, minval=0.0, maxval=1.0, dtype=self.dtype, seed=seed)
-
-    kumaraswamy_sample = (1 - uniform_sample**(1. / expanded_concentration0))**(
-        1. / expanded_concentration1)
-    return kumaraswamy_sample
-
-  @distribution_util.AppendDocstring(_kumaraswamy_sample_note)
-  def _log_cdf(self, x):
-    a = self.concentration1
-    b = self.concentration0
-    return math_ops.log1p(-(1 - x**a)**b)
+  @property
+  def concentration1(self):
+    """Concentration parameter associated with a `1` outcome."""
+    return self.bijector.concentration1
 
-  @distribution_util.AppendDocstring(_kumaraswamy_sample_note)
-  def _cdf(self, x):
-    a = self.concentration1
-    b = self.concentration0
-    return 1 - (1 - x**a)**b
-
-  def _survival_function(self, x):
-    a = self.concentration1
-    b = self.concentration0
-    return (1 - x**a)**b
-
-  def _log_survival_function(self, x):
-    a = self.concentration1
-    b = self.concentration0
-    return b * math_ops.log1p(-x**a)
-
-  def _log_unnormalized_prob(self, x):
-    x = self._maybe_assert_valid_sample(x)
-    a = self.concentration1
-    b = self.concentration0
-    return (a - 1) * math_ops.log(x) + (b - 1) * math_ops.log1p(-x**a)
-
-  def _log_normalization(self):
-    a = self.concentration1
-    b = self.concentration0
-    return -(math_ops.log(a) + math_ops.log(b))
+  @property
+  def concentration0(self):
+    """Concentration parameter associated with a `0` outcome."""
+    return self.bijector.concentration0
 
   def _entropy(self):
     a = self.concentration1
@@ -213,10 +188,11 @@ class Kumaraswamy(beta.Beta):
 
   def _moment(self, n):
     """Compute the n'th (uncentered) moment."""
+    total_concentration = self.concentration1 + self.concentration0
     expanded_concentration1 = array_ops.ones_like(
-        self.total_concentration, dtype=self.dtype) * self.concentration1
+        total_concentration, dtype=self.dtype) * self.concentration1
     expanded_concentration0 = array_ops.ones_like(
-        self.total_concentration, dtype=self.dtype) * self.concentration0
+        total_concentration, dtype=self.dtype) * self.concentration0
     beta_arg0 = 1 + n / expanded_concentration1
     beta_arg = array_ops.stack([beta_arg0, expanded_concentration0], -1)
     log_moment = math_ops.log(expanded_concentration0) + special_math_ops.lbeta(
@@ -246,13 +222,14 @@ class Kumaraswamy(beta.Beta):
           name="nan")
       is_defined = (self.concentration1 > 1.) & (self.concentration0 > 1.)
       return array_ops.where(is_defined, mode, nan)
+
     return control_flow_ops.with_dependencies([
         check_ops.assert_less(
-            array_ops.ones([], dtype=self.dtype),
+            array_ops.ones([], dtype=self.concentration1.dtype),
             self.concentration1,
             message="Mode undefined for concentration1 <= 1."),
         check_ops.assert_less(
-            array_ops.ones([], dtype=self.dtype),
+            array_ops.ones([], dtype=self.concentration0.dtype),
             self.concentration0,
             message="Mode undefined for concentration0 <= 1.")
     ], mode)
diff --git a/tensorflow/contrib/distributions/python/ops/poisson.py b/tensorflow/contrib/distributions/python/ops/poisson.py
index e967dcc90d0712ffc346fb61ee67c44a6d9207cb..02e97c0a2fd004c4fa9382d5367af9f5b034a869 100644
--- a/tensorflow/contrib/distributions/python/ops/poisson.py
+++ b/tensorflow/contrib/distributions/python/ops/poisson.py
@@ -35,9 +35,15 @@ __all__ = [
 
 
 _poisson_sample_note = """
-Note that the input value must be a non-negative floating point tensor with
-dtype `dtype` and whose shape can be broadcast with `self.rate`. `x` is only
-legal if it is non-negative and its components are equal to integer values.
+The Poisson distribution is technically only defined for non-negative integer
+values. When `validate_args=True`, non-integral inputs trigger an assertion.
+
+When `validate_args=False`, calculations are otherwise unchanged regardless of
+whether inputs are integral or non-integral.
+
+When `validate_args=False`, evaluating the pmf at non-integral values
+corresponds to evaluating an unnormalized distribution, which does not
+correspond to evaluations of the cdf.
 """
 
 
@@ -150,10 +156,6 @@ class Poisson(distribution.Distribution):
   def _cdf(self, x):
     if self.validate_args:
       x = distribution_util.embed_check_nonnegative_integer_form(x)
-    else:
-      # Whether or not x is integer-form, the following is well-defined.
-      # However, scipy takes the floor, so we do too.
-      x = math_ops.floor(x)
     return math_ops.igammac(1. + x, self.rate)
 
   def _log_normalization(self):
@@ -162,9 +164,6 @@ class Poisson(distribution.Distribution):
   def _log_unnormalized_prob(self, x):
     if self.validate_args:
       x = distribution_util.embed_check_nonnegative_integer_form(x)
-    else:
-      # For consistency with cdf, we take the floor.
-      x = math_ops.floor(x)
     return x * self.log_rate - math_ops.lgamma(1. + x)
 
   def _mean(self):
diff --git a/tensorflow/contrib/distributions/python/ops/statistical_testing.py b/tensorflow/contrib/distributions/python/ops/statistical_testing.py
new file mode 100644
index 0000000000000000000000000000000000000000..d66c34cc1a45cc09da5138a5f72ae3817690db49
--- /dev/null
+++ b/tensorflow/contrib/distributions/python/ops/statistical_testing.py
@@ -0,0 +1,728 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Statistical test assertions calibrated for their error rates."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import itertools
+
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import gen_math_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn_ops
+
+__all__ = [
+    "true_mean_confidence_interval_by_dkwm",
+    "assert_true_mean_equal_by_dkwm",
+    "min_discrepancy_of_true_means_detectable_by_dkwm",
+    "min_num_samples_for_dkwm_mean_test",
+    "assert_true_mean_equal_by_dkwm_two_sample",
+    "min_discrepancy_of_true_means_detectable_by_dkwm_two_sample",
+    "min_num_samples_for_dkwm_mean_two_sample_test",
+]
+
+
+def _batch_sort_vector(x, ascending=True, name=None):
+  with ops.name_scope(name, "sort_each_row", [x]):
+    x = ops.convert_to_tensor(x, name="x")
+    n = array_ops.shape(x)[-1]
+    if ascending:
+      y, _ = nn_ops.top_k(-x, k=n, sorted=True)
+      y = -y
+    else:
+      y, _ = nn_ops.top_k(x, k=n, sorted=True)
+    y.set_shape(x.shape)
+    return y
+
+
+def _do_maximum_mean(samples, envelope, high, name=None):
+  """Common code between maximum_mean and minimum_mean."""
+  with ops.name_scope(name, "do_maximum_mean", [samples, envelope, high]):
+    n = array_ops.rank(samples)
+    # Move the batch dimension of `samples` to the rightmost position,
+    # where the _batch_sort_vector function wants it.
+    perm = array_ops.concat([math_ops.range(1, n), [0]], axis=0)
+    samples = array_ops.transpose(samples, perm)
+
+    samples = _batch_sort_vector(samples)
+    batch_shape = array_ops.shape(samples)[:-1]
+    n = array_ops.shape(samples)[-1]
+    step = 1. / math_ops.cast(n, dtype=samples.dtype.base_dtype)
+
+    def _loop_body(iter_, total, to_skip):
+      total = array_ops.where(
+          step <= to_skip,
+          total,
+          array_ops.where(
+              to_skip > 0.,
+              total + (step - to_skip) * samples[..., iter_],
+              total + step * samples[..., iter_]))
+      to_skip = array_ops.where(step <= to_skip, to_skip - step, 0.)
+      return [iter_ + 1, total, to_skip]
+
+    _, total, _ = control_flow_ops.while_loop(
+        cond=lambda iter_, *args: iter_ < n,
+        body=_loop_body,
+        loop_vars=[
+            0,
+            array_ops.zeros(batch_shape, dtype=samples.dtype.base_dtype),
+            envelope,  # to_skip
+        ])
+
+  return total + envelope * high
+
+
+def _maximum_mean(samples, envelope, high, name=None):
+  """Returns a stochastic upper bound on the mean of a scalar distribution.
+
+  The idea is that if the true CDF is within an `eps`-envelope of the
+  empirical CDF of the samples, and the support is bounded above, then
+  the mean is bounded above as well.  In symbols,
+
+  ```none
+  sup_x(|F_n(x) - F(x)|) < eps
+  ```
+
+  The 0th dimension of `samples` is interpreted as independent and
+  identically distributed samples.  The remaining dimensions are
+  broadcast together with `envelope` and `high`, and operated on
+  separately.
+
+  Args:
+    samples: Floating-point tensor of samples from the distribution(s)
+      of interest.  Entries are assumed IID across the 0th dimension.
+      The other dimensions must broadcast with `envelope` and `high`.
+    envelope: Floating-point tensor of sizes of admissible CDF
+      envelopes (i.e., the `eps` above).
+    high: Floating-point tensor of upper bounds on the distributions'
+      supports.
+    name: A name for this operation (optional).
+
+  Returns:
+    bound: Floating-point tensor of upper bounds on the true means.
+
+  Raises:
+    InvalidArgumentError: If some `sample` is found to be larger than
+      the corresponding `high`.
+  """
+  with ops.name_scope(name, "maximum_mean", [samples, envelope, high]):
+    samples = ops.convert_to_tensor(samples, name="samples")
+    envelope = ops.convert_to_tensor(envelope, name="envelope")
+    high = ops.convert_to_tensor(high, name="high")
+
+    xmax = math_ops.reduce_max(samples, axis=[0])
+    msg = "Given sample maximum value exceeds expectations"
+    check_op = check_ops.assert_less_equal(xmax, high, message=msg)
+    with ops.control_dependencies([check_op]):
+      return array_ops.identity(_do_maximum_mean(samples, envelope, high))
+
+
+def _minimum_mean(samples, envelope, low, name=None):
+  """Returns a stochastic lower bound on the mean of a scalar distribution.
+
+  The idea is that if the true CDF is within an `eps`-envelope of the
+  empirical CDF of the samples, and the support is bounded below, then
+  the mean is bounded below as well.  In symbols,
+
+  ```none
+  sup_x(|F_n(x) - F(x)|) < eps
+  ```
+
+  The 0th dimension of `samples` is interpreted as independent and
+  identically distributed samples.  The remaining dimensions are
+  broadcast together with `envelope` and `low`, and operated on
+  separately.
+
+  Args:
+    samples: Floating-point tensor of samples from the distribution(s)
+      of interest.  Entries are assumed IID across the 0th dimension.
+      The other dimensions must broadcast with `envelope` and `low`.
+    envelope: Floating-point tensor of sizes of admissible CDF
+      envelopes (i.e., the `eps` above).
+    low: Floating-point tensor of lower bounds on the distributions'
+      supports.
+    name: A name for this operation (optional).
+
+  Returns:
+    bound: Floating-point tensor of lower bounds on the true means.
+
+  Raises:
+    InvalidArgumentError: If some `sample` is found to be smaller than
+      the corresponding `low`.
+  """
+  with ops.name_scope(name, "minimum_mean", [samples, envelope, low]):
+    samples = ops.convert_to_tensor(samples, name="samples")
+    envelope = ops.convert_to_tensor(envelope, name="envelope")
+    low = ops.convert_to_tensor(low, name="low")
+
+    xmin = math_ops.reduce_min(samples, axis=[0])
+    msg = "Given sample minimum value falls below expectations"
+    check_op = check_ops.assert_greater_equal(xmin, low, message=msg)
+    with ops.control_dependencies([check_op]):
+      return - _do_maximum_mean(-samples, envelope, -low)
+
+
+def _dkwm_cdf_envelope(n, error_rate, name=None):
+  """Computes the CDF envelope that the DKWM inequality licenses.
+
+  The [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval)
+  gives a stochastic bound on the distance between the true cumulative
+  distribution function (CDF) of any distribution and its empirical
+  CDF.  To wit, for `n` iid samples from any distribution with CDF F,
+
+  ```none
+  P(sup_x |F_n(x) - F(x)| > eps) <= 2exp(-2n eps^2)
+  ```
+
+  This function computes the envelope size `eps` as a function of the
+  number of samples `n` and the desired limit on the left-hand
+  probability above.
+
+  Args:
+    n: Tensor of numbers of samples drawn.
+    error_rate: Floating-point tensor of admissible rates of mistakes.
+    name: A name for this operation (optional).
+
+  Returns:
+    eps: Tensor of maximum distances the true CDF can be from the
+      empirical CDF.  This scales as `O(sqrt(-log(error_rate)))` and
+      as `O(1 / sqrt(n))`.  The shape is the broadcast of `n` and
+      `error_rate`.
+  """
+  with ops.name_scope(name, "dkwm_cdf_envelope", [n, error_rate]):
+    n = math_ops.cast(n, dtype=error_rate.dtype)
+    return math_ops.sqrt(-gen_math_ops.log(error_rate / 2.) / (2. * n))
+
+
+def _check_shape_dominates(tensor, tensors):
+  """Check that broadcasting `tensor` against `tensors` does not expand it.
+
+  Why?  Because I want to be very sure that the samples tensor is not
+  accidentally enlarged by broadcasting against tensors that are
+  supposed to be describing the distribution(s) sampled from, lest the
+  sample counts end up inflated.
+
+  Args:
+    tensor: A Tensor whose shape is to be protected against broadcasting.
+    tensors: A list of Tensors to check.
+
+  Returns:
+    tensor: `tf.identity(tensor)` with control dependencies attached;
+      be sure to use that downstream.
+  """
+  def check(t):
+    target = array_ops.shape(tensor)[1:]
+    result = array_ops.broadcast_dynamic_shape(target, array_ops.shape(t))
+    # This rank check ensures that I don't get a wrong answer from the
+    # _shapes_ broadcasting against each other.
+    gt = check_ops.assert_greater(array_ops.rank(target), array_ops.rank(t))
+    eq = check_ops.assert_equal(target, result)
+    return gt, eq
+  checks = list(itertools.chain(*[check(t) for t in tensors]))
+  with ops.control_dependencies(checks):
+    return array_ops.identity(array_ops.identity(tensor))
+
+
+def true_mean_confidence_interval_by_dkwm(
+    samples, low, high, error_rate=1e-6, name=None):
+  """Computes a confidence interval for the mean of a scalar distribution.
+
+  In batch mode, computes confidence intervals for all distributions
+  in the batch (which need not be identically distributed).
+
+  Relies on the [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval).
+
+  The probability (over the randomness of drawing the given samples)
+  that any true mean is outside the corresponding returned interval is
+  no more than the given `error_rate`.  The sizes of the intervals
+  scale as `O(1 / sqrt(#samples))`, as `O(high - low)`, and as
+  `O(sqrt(-log(error_rate)))`.
+
+  Note that `error_rate` is a total error rate for all the confidence
+  intervals in the batch.  As such, if the batch is nontrivial, the
+  error rate is not broadcast but divided (evenly) among the batch
+  members.
+
+  Args:
+    samples: Floating-point tensor of samples from the distribution(s)
+      of interest.  Entries are assumed IID across the 0th dimension.
+      The other dimensions must broadcast with `low` and `high`.
+    low: Floating-point tensor of lower bounds on the distributions'
+      supports.
+    high: Floating-point tensor of upper bounds on the distributions'
+      supports.
+    error_rate: *Scalar* admissible total rate of mistakes.
+    name: A name for this operation (optional).
+
+  Returns:
+    low: A floating-point tensor of stochastic lower bounds on the true means.
+    high: A floating-point tensor of stochastic upper bounds on the true means.
+  """
+  with ops.name_scope(
+      name, "true_mean_confidence_interval_by_dkwm",
+      [samples, low, high, error_rate]):
+    samples = ops.convert_to_tensor(samples, name="samples")
+    low = ops.convert_to_tensor(low, name="low")
+    high = ops.convert_to_tensor(high, name="high")
+    error_rate = ops.convert_to_tensor(error_rate, name="error_rate")
+    samples = _check_shape_dominates(samples, [low, high])
+    check_ops.assert_scalar(error_rate)  # Static shape
+    error_rate = _itemwise_error_rate(error_rate, [low, high], samples)
+    n = array_ops.shape(samples)[0]
+    envelope = _dkwm_cdf_envelope(n, error_rate)
+    min_mean = _minimum_mean(samples, envelope, low)
+    max_mean = _maximum_mean(samples, envelope, high)
+    return min_mean, max_mean
+
+
+def _itemwise_error_rate(
+    total_error_rate, param_tensors, sample_tensor=None, name=None):
+  with ops.name_scope(
+      name, "itemwise_error_rate",
+      [total_error_rate, param_tensors, sample_tensor]):
+    result_shape = [1]
+    for p_tensor in param_tensors:
+      result_shape = array_ops.broadcast_dynamic_shape(
+          array_ops.shape(p_tensor), result_shape)
+    if sample_tensor is not None:
+      result_shape = array_ops.broadcast_dynamic_shape(
+          array_ops.shape(sample_tensor)[1:], result_shape)
+    num_items = math_ops.reduce_prod(result_shape)
+    return total_error_rate / math_ops.cast(
+        num_items, dtype=total_error_rate.dtype)
+
+
+def assert_true_mean_equal_by_dkwm(
+    samples, low, high, expected, false_fail_rate=1e-6, name=None):
+  """Asserts the mean of the given distribution is as expected.
+
+  More precisely, fails if there is enough evidence (using the
+  [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval))
+  that the true mean of some distribution from which the given samples are
+  drawn is _not_ the given expected mean with statistical significance
+  `false_fail_rate` or stronger, otherwise passes.  If you also want to
+  check that you are gathering enough evidence that a pass is not
+  spurious, see `min_num_samples_for_dkwm_mean_test` and
+  `min_discrepancy_of_true_means_detectable_by_dkwm`.
+
+  Note that `false_fail_rate` is a total false failure rate for all
+  the assertions in the batch.  As such, if the batch is nontrivial,
+  the assertion will insist on stronger evidence to fail any one member.
+
+  Args:
+    samples: Floating-point tensor of samples from the distribution(s)
+      of interest.  Entries are assumed IID across the 0th dimension.
+      The other dimensions must broadcast with `low` and `high`.
+    low: Floating-point tensor of lower bounds on the distributions'
+      supports.
+    high: Floating-point tensor of upper bounds on the distributions'
+      supports.
+    expected: Floating-point tensor of expected true means.
+    false_fail_rate: *Scalar* admissible total rate of mistakes.
+    name: A name for this operation (optional).
+
+  Returns:
+    check: Op that raises `InvalidArgumentError` if any expected mean is
+      outside the corresponding confidence interval.
+  """
+  with ops.name_scope(
+      name, "assert_true_mean_equal_by_dkwm",
+      [samples, low, high, expected, false_fail_rate]):
+    samples = ops.convert_to_tensor(samples, name="samples")
+    low = ops.convert_to_tensor(low, name="low")
+    high = ops.convert_to_tensor(high, name="high")
+    expected = ops.convert_to_tensor(expected, name="expected")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    samples = _check_shape_dominates(samples, [low, high, expected])
+    min_mean, max_mean = true_mean_confidence_interval_by_dkwm(
+        samples, low, high, error_rate=false_fail_rate)
+    less_op = check_ops.assert_less(
+        min_mean, expected, message="Mean confidence interval too high")
+    with ops.control_dependencies([less_op]):
+      return check_ops.assert_greater(
+          max_mean, expected, message="Mean confidence interval too low")
+
+
+def min_discrepancy_of_true_means_detectable_by_dkwm(
+    n, low, high, false_fail_rate, false_pass_rate, name=None):
+  """Returns the minimum mean discrepancy that a DKWM-based test can detect.
+
+  DKWM is the [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval).
+
+  Note that `false_fail_rate` is a total false failure rate for all
+  the tests in the batch.  As such, if the batch is nontrivial, each
+  member will demand more samples.  The `false_pass_rate` is also
+  interpreted as a total, but is treated asymmetrically: If each test
+  in the batch detects its corresponding discrepancy with probability
+  at least `1 - false_pass_rate`, then running all those tests and
+  failing if any one fails will jointly detect all those discrepancies
+  with the same `false_pass_rate`.
+
+  Args:
+    n: Tensor of numbers of samples to be drawn from the distributions
+      of interest.
+    low: Floating-point tensor of lower bounds on the distributions'
+      supports.
+    high: Floating-point tensor of upper bounds on the distributions'
+      supports.
+    false_fail_rate: *Scalar* admissible total rate of false failures.
+    false_pass_rate: *Scalar* admissible rate of false passes.
+    name: A name for this operation (optional).
+
+  Returns:
+    discr: Tensor of lower bounds on the distances between true
+       means detectable by a DKWM-based test.
+
+  For each batch member `i`, of `K` total, drawing `n[i]` samples from
+  some scalar distribution supported on `[low[i], high[i]]` is enough
+  to detect a difference in means of size `discr[i]` or more.
+  Specifically, we guarantee that (a) if the true mean is the expected
+  mean, `assert_true_mean_equal_by_dkwm` will fail with probability at
+  most `false_fail_rate / K` (which amounts to `false_fail_rate` if
+  applied to the whole batch at once), and (b) if the true mean
+  differs from the expected mean by at least `discr[i]`,
+  `assert_true_mean_equal_by_dkwm` will pass with probability at most
+  `false_pass_rate`.
+
+  The detectable discrepancy scales as
+
+  - `O(high[i] - low[i])`,
+  - `O(1 / sqrt(n[i]))`,
+  - `O(sqrt(-log(false_fail_rate/K)))`, and
+  - `O(sqrt(-log(false_pass_rate)))`.
+  """
+  with ops.name_scope(
+      name, "min_discrepancy_of_true_means_detectable_by_dkwm",
+      [n, low, high, false_fail_rate, false_pass_rate]):
+    n = ops.convert_to_tensor(n, name="n")
+    low = ops.convert_to_tensor(low, name="low")
+    high = ops.convert_to_tensor(high, name="high")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    false_pass_rate = ops.convert_to_tensor(
+        false_pass_rate, name="false_pass_rate")
+    # Algorithm: Assume a true CDF F.  The DKWM inequality gives a
+    # stochastic bound on how far the observed empirical CDF F_n can be.
+    # Then, using the DKWM inequality again gives a stochastic bound on
+    # the farthest candidate true CDF F' that
+    # true_mean_confidence_interval_by_dkwm might consider.  At worst, these
+    # errors may go in the same direction, so the distance between F and
+    # F' is bounded by the sum.
+    # On batching: false fail rates sum, so I need to reduce
+    # the input to account for the batching.  False pass rates
+    # max, so I don't.
+    sampling_envelope = _dkwm_cdf_envelope(n, false_pass_rate)
+    false_fail_rate = _itemwise_error_rate(false_fail_rate, [n, low, high])
+    analysis_envelope = _dkwm_cdf_envelope(n, false_fail_rate)
+    return (high - low) * (sampling_envelope + analysis_envelope)
+
+
+def min_num_samples_for_dkwm_mean_test(
+    discrepancy, low, high,
+    false_fail_rate=1e-6, false_pass_rate=1e-6, name=None):
+  """Returns how many samples suffice for a one-sample DKWM mean test.
+
+  To wit, returns an upper bound on the number of samples necessary to
+  guarantee detecting a mean difference of at least the given
+  `discrepancy`, with the given `false_fail_rate` and `false_pass_rate`,
+  using the [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval)
+  on a scalar distribution supported on `[low, high]`.
+
+  Args:
+    discrepancy: Floating-point tensor of desired upper limits on mean
+      differences that may go undetected with probability higher than
+      `false_pass_rate`.
+    low: Tensor of lower bounds on the distributions' support.
+    high: Tensor of upper bounds on the distributions' support.
+    false_fail_rate: *Scalar* admissible total rate of false failures.
+    false_pass_rate: *Scalar* admissible rate of false passes.
+    name: A name for this operation (optional).
+
+  Returns:
+    n: Tensor of numbers of samples to be drawn from the distributions
+      of interest.
+
+  The `discrepancy`, `low`, and `high` tensors must have
+  broadcast-compatible shapes.
+
+  For each batch member `i`, of `K` total, drawing `n[i]` samples from
+  some scalar distribution supported on `[low[i], high[i]]` is enough
+  to detect a difference in means of size `discrepancy[i]` or more.
+  Specifically, we guarantee that (a) if the true mean is the expected
+  mean, `assert_true_mean_equal_by_dkwm` will fail with probability at
+  most `false_fail_rate / K` (which amounts to `false_fail_rate` if
+  applied to the whole batch at once), and (b) if the true mean
+  differs from the expected mean by at least `discrepancy[i]`,
+  `assert_true_mean_equal_by_dkwm` will pass with probability at most
+  `false_pass_rate`.
+
+  The required number of samples scales
+  as `O((high[i] - low[i])**2)`, `O(-log(false_fail_rate/K))`,
+  `O(-log(false_pass_rate))`, and `O(1 / discrepancy[i]**2)`.
+  """
+  with ops.name_scope(
+      name, "min_num_samples_for_dkwm_mean_test",
+      [low, high, false_fail_rate, false_pass_rate, discrepancy]):
+    discrepancy = ops.convert_to_tensor(
+        discrepancy, name="discrepancy")
+    low = ops.convert_to_tensor(low, name="low")
+    high = ops.convert_to_tensor(high, name="high")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    false_pass_rate = ops.convert_to_tensor(
+        false_pass_rate, name="false_pass_rate")
+    # Could choose to cleverly allocate envelopes, but this is sound.
+    envelope1 = discrepancy / (2. * (high - low))
+    envelope2 = envelope1
+    false_fail_rate = _itemwise_error_rate(
+        false_fail_rate, [low, high, discrepancy])
+    n1 = -math_ops.log(false_fail_rate / 2.) / (2. * envelope1**2)
+    n2 = -math_ops.log(false_pass_rate / 2.) / (2. * envelope2**2)
+    return math_ops.maximum(n1, n2)
+
+
+def assert_true_mean_equal_by_dkwm_two_sample(
+    samples1, low1, high1, samples2, low2, high2,
+    false_fail_rate=1e-6, name=None):
+  """Asserts the means of the given distributions are equal.
+
+  More precisely, fails if there is enough evidence (using the
+  [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval))
+  that the means of the distributions from which the given samples are
+  drawn are _not_ equal with statistical significance `false_fail_rate`
+  or stronger, otherwise passes.  If you also want to check that you
+  are gathering enough evidence that a pass is not spurious, see
+  `min_num_samples_for_dkwm_mean_two_sample_test` and
+  `min_discrepancy_of_true_means_detectable_by_dkwm_two_sample`.
+
+  Note that `false_fail_rate` is a total false failure rate for all
+  the assertions in the batch.  As such, if the batch is nontrivial,
+  the assertion will insist on stronger evidence to fail any one member.
+
+  Args:
+    samples1: Floating-point tensor of samples from the
+      distribution(s) A.  Entries are assumed IID across the 0th
+      dimension.  The other dimensions must broadcast with `low1`,
+      `high1`, `low2`, and `high2`.
+    low1: Floating-point tensor of lower bounds on the supports of the
+      distributions A.
+    high1: Floating-point tensor of upper bounds on the supports of
+      the distributions A.
+    samples2: Floating-point tensor of samples from the
+      distribution(s) B.  Entries are assumed IID across the 0th
+      dimension.  The other dimensions must broadcast with `low1`,
+      `high1`, `low2`, and `high2`.
+    low2: Floating-point tensor of lower bounds on the supports of the
+      distributions B.
+    high2: Floating-point tensor of upper bounds on the supports of
+      the distributions B.
+    false_fail_rate: *Scalar* admissible total rate of mistakes.
+    name: A name for this operation (optional).
+
+  Returns:
+    check: Op that raises `InvalidArgumentError` if any pair of confidence
+      intervals for corresponding true means does not overlap.
+  """
+  with ops.name_scope(
+      name, "assert_true_mean_equal_by_dkwm_two_sample",
+      [samples1, low1, high1, samples2, low2, high2, false_fail_rate]):
+    samples1 = ops.convert_to_tensor(samples1, name="samples1")
+    low1 = ops.convert_to_tensor(low1, name="low1")
+    high1 = ops.convert_to_tensor(high1, name="high1")
+    samples2 = ops.convert_to_tensor(samples2, name="samples2")
+    low2 = ops.convert_to_tensor(low2, name="low2")
+    high2 = ops.convert_to_tensor(high2, name="high2")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    samples1 = _check_shape_dominates(samples1, [low1, high1])
+    samples2 = _check_shape_dominates(samples2, [low2, high2])
+    compatible_samples = check_ops.assert_equal(
+        array_ops.shape(samples1)[1:], array_ops.shape(samples2)[1:])
+    with ops.control_dependencies([compatible_samples]):
+      # Could in principle play games with cleverly allocating
+      # significance instead of the even split below.  It may be possible
+      # to get tighter intervals, in order to obtain a higher power test.
+      # Any allocation strategy that depends only on the support bounds
+      # and sample counts should be valid; however, because the intervals
+      # scale as O(sqrt(-log(false_fail_rate))), there doesn't seem to be much
+      # room to win.
+      min_mean_1, max_mean_1 = true_mean_confidence_interval_by_dkwm(
+          samples1, low1, high1, false_fail_rate / 2.)
+      min_mean_2, max_mean_2 = true_mean_confidence_interval_by_dkwm(
+          samples2, low2, high2, false_fail_rate / 2.)
+      # I want to assert
+      #   not (max_mean_1 < min_mean_2 or min_mean_1 > max_mean_2),
+      # but I think I only have and-combination of asserts, so use DeMorgan.
+      clause1_op = check_ops.assert_greater_equal(max_mean_1, min_mean_2)
+      with ops.control_dependencies([clause1_op]):
+        return check_ops.assert_less_equal(min_mean_1, max_mean_2)
+
+
+def min_discrepancy_of_true_means_detectable_by_dkwm_two_sample(
+    n1, low1, high1, n2, low2, high2,
+    false_fail_rate, false_pass_rate, name=None):
+  """Returns the minimum mean discrepancy for a two-sample DKWM-based test.
+
+  DKWM is the [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval).
+
+  Note that `false_fail_rate` is a total false failure rate for all
+  the tests in the batch.  As such, if the batch is nontrivial, each
+  member will demand more samples.  The `false_pass_rate` is also
+  interpreted as a total, but is treated asymmetrically: If each test
+  in the batch detects its corresponding discrepancy with probability
+  at least `1 - false_pass_rate`, then running all those tests and
+  failing if any one fails will jointly detect all those discrepancies
+  with the same `false_pass_rate`.
+
+  Args:
+    n1: Tensor of numbers of samples to be drawn from the distributions A.
+    low1: Floating-point tensor of lower bounds on the supports of the
+      distributions A.
+    high1: Floating-point tensor of upper bounds on the supports of
+      the distributions A.
+    n2: Tensor of numbers of samples to be drawn from the distributions B.
+    low2: Floating-point tensor of lower bounds on the supports of the
+      distributions B.
+    high2: Floating-point tensor of upper bounds on the supports of
+      the distributions B.
+    false_fail_rate: *Scalar* admissible total rate of false failures.
+    false_pass_rate: *Scalar* admissible rate of false passes.
+    name: A name for this operation (optional).
+
+  Returns:
+    discr: Tensor of lower bounds on the distances between true means
+       detectable by a two-sample DKWM-based test.
+
+  For each batch member `i`, of `K` total, drawing `n1[i]` samples
+  from scalar distribution A supported on `[low1[i], high1[i]]` and `n2[i]`
+  samples from scalar distribution B supported on `[low2[i], high2[i]]`
+  is enough to detect a difference in their true means of size
+  `discr[i]` or more.  Specifically, we guarantee that (a) if their
+  true means are equal, `assert_true_mean_equal_by_dkwm_two_sample`
+  will fail with probability at most `false_fail_rate/K` (which
+  amounts to `false_fail_rate` if applied to the whole batch at once),
+  and (b) if their true means differ by at least `discr[i]`,
+  `assert_true_mean_equal_by_dkwm_two_sample` will pass with
+  probability at most `false_pass_rate`.
+
+  The detectable discrepancy scales as
+
+  - `O(high1[i] - low1[i])`, `O(high2[i] - low2[i])`,
+  - `O(1 / sqrt(n1[i]))`, `O(1 / sqrt(n2[i]))`,
+  - `O(sqrt(-log(false_fail_rate/K)))`, and
+  - `O(sqrt(-log(false_pass_rate)))`.
+  """
+  with ops.name_scope(
+      name, "min_discrepancy_of_true_means_detectable_by_dkwm_two_sample",
+      [n1, low1, high1, n2, low2, high2, false_fail_rate, false_pass_rate]):
+    n1 = ops.convert_to_tensor(n1, name="n1")
+    low1 = ops.convert_to_tensor(low1, name="low1")
+    high1 = ops.convert_to_tensor(high1, name="high1")
+    n2 = ops.convert_to_tensor(n2, name="n2")
+    low2 = ops.convert_to_tensor(low2, name="low2")
+    high2 = ops.convert_to_tensor(high2, name="high2")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    false_pass_rate = ops.convert_to_tensor(
+        false_pass_rate, name="false_pass_rate")
+    det_disc1 = min_discrepancy_of_true_means_detectable_by_dkwm(
+        n1, low1, high1, false_fail_rate / 2., false_pass_rate / 2.)
+    det_disc2 = min_discrepancy_of_true_means_detectable_by_dkwm(
+        n2, low2, high2, false_fail_rate / 2., false_pass_rate / 2.)
+    return det_disc1 + det_disc2
+
+
+def min_num_samples_for_dkwm_mean_two_sample_test(
+    discrepancy, low1, high1, low2, high2,
+    false_fail_rate=1e-6, false_pass_rate=1e-6, name=None):
+  """Returns how many samples suffice for a two-sample DKWM mean test.
+
+  DKWM is the [Dvoretzky-Kiefer-Wolfowitz-Massart inequality]
+  (https://en.wikipedia.org/wiki/CDF-based_nonparametric_confidence_interval).
+
+  Args:
+    discrepancy: Floating-point tensor of desired upper limits on mean
+      differences that may go undetected with probability higher than
+      `false_pass_rate`.
+    low1: Floating-point tensor of lower bounds on the supports of the
+      distributions A.
+    high1: Floating-point tensor of upper bounds on the supports of
+      the distributions A.
+    low2: Floating-point tensor of lower bounds on the supports of the
+      distributions B.
+    high2: Floating-point tensor of upper bounds on the supports of
+      the distributions B.
+    false_fail_rate: *Scalar* admissible total rate of false failures.
+    false_pass_rate: *Scalar* admissible rate of false passes.
+    name: A name for this operation (optional).
+
+  Returns:
+    n1: Tensor of numbers of samples to be drawn from the distributions A.
+    n2: Tensor of numbers of samples to be drawn from the distributions B.
+
+  For each batch member `i`, of `K` total, drawing `n1[i]` samples
+  from scalar distribution A supported on `[low1[i], high1[i]]` and `n2[i]`
+  samples from scalar distribution B supported on `[low2[i], high2[i]]`
+  is enough to detect a difference in their true means of size
+  `discrepancy[i]` or more.  Specifically, we guarantee that (a) if their
+  true means are equal, `assert_true_mean_equal_by_dkwm_two_sample`
+  will fail with probability at most `false_fail_rate/K` (which
+  amounts to `false_fail_rate` if applied to the whole batch at once),
+  and (b) if their true means differ by at least `discrepancy[i]`,
+  `assert_true_mean_equal_by_dkwm_two_sample` will pass with
+  probability at most `false_pass_rate`.
+
+  The required number of samples scales as
+
+  - `O((high1[i] - low1[i])**2)`, `O((high2[i] - low2[i])**2)`,
+  - `O(-log(false_fail_rate/K))`,
+  - `O(-log(false_pass_rate))`, and
+  - `O(1 / discrepancy[i]**2)`.
+  """
+  with ops.name_scope(
+      name, "min_num_samples_for_dkwm_mean_two_sample_test",
+      [low1, high1, low2, high2,
+       false_fail_rate, false_pass_rate, discrepancy]):
+    discrepancy = ops.convert_to_tensor(discrepancy, name="discrepancy")
+    low1 = ops.convert_to_tensor(low1, name="low1")
+    high1 = ops.convert_to_tensor(high1, name="high1")
+    low2 = ops.convert_to_tensor(low2, name="low2")
+    high2 = ops.convert_to_tensor(high2, name="high2")
+    false_fail_rate = ops.convert_to_tensor(
+        false_fail_rate, name="false_fail_rate")
+    false_pass_rate = ops.convert_to_tensor(
+        false_pass_rate, name="false_pass_rate")
+    # Could choose to cleverly allocate discrepancy tolerances and
+    # failure probabilities, but this is sound.
+    n1 = min_num_samples_for_dkwm_mean_test(
+        discrepancy / 2., low1, high1,
+        false_fail_rate / 2., false_pass_rate / 2.)
+    n2 = min_num_samples_for_dkwm_mean_test(
+        discrepancy / 2., low2, high2,
+        false_fail_rate / 2., false_pass_rate / 2.)
+    return n1, n2
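
Reviewer note on `statistical_testing.py`: the arithmetic behind `_dkwm_cdf_envelope`, `min_num_samples_for_dkwm_mean_test`, and `min_discrepancy_of_true_means_detectable_by_dkwm` may be easier to check in plain Python. The following is a minimal scalar sketch, assuming a batch of size one (so no per-item split of `false_fail_rate`). The helper names `dkwm_cdf_envelope`, `min_num_samples`, and `min_discrepancy` are hypothetical stand-ins and are not part of this change.

```python
import math


def dkwm_cdf_envelope(n, error_rate):
  """Returns eps solving 2 * exp(-2 * n * eps**2) == error_rate."""
  return math.sqrt(-math.log(error_rate / 2.) / (2. * n))


def min_num_samples(discrepancy, low, high,
                    false_fail_rate=1e-6, false_pass_rate=1e-6):
  """Scalar analogue of the one-sample sample-count bound (batch size 1)."""
  # The tolerated discrepancy is split evenly between the two envelopes.
  envelope = discrepancy / (2. * (high - low))
  n1 = -math.log(false_fail_rate / 2.) / (2. * envelope**2)
  n2 = -math.log(false_pass_rate / 2.) / (2. * envelope**2)
  return max(n1, n2)


def min_discrepancy(n, low, high, false_fail_rate, false_pass_rate):
  """Scalar analogue of the one-sample detectable-discrepancy bound."""
  sampling_envelope = dkwm_cdf_envelope(n, false_pass_rate)
  analysis_envelope = dkwm_cdf_envelope(n, false_fail_rate)
  return (high - low) * (sampling_envelope + analysis_envelope)


# Round trip: the sample count computed for a target discrepancy of 0.1 on
# [0, 1] should make 0.1 the minimum detectable discrepancy.
n = min_num_samples(0.1, 0., 1.)
discr = min_discrepancy(n, 0., 1., 1e-6, 1e-6)
```

With equal false-fail and false-pass rates the two envelopes coincide, so the round trip recovers the requested discrepancy up to floating-point error. It also shows that the detectable discrepancy grows with the square root of `-log(rate)`, while the sample count grows linearly in it.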
diff --git a/tensorflow/contrib/eager/python/BUILD b/tensorflow/contrib/eager/python/BUILD
index ad40e55cb48aac08eca7022846a0bd07b8accb3f..7fde53476d68428fb0f616e624c6c9c54631a88a 100644
--- a/tensorflow/contrib/eager/python/BUILD
+++ b/tensorflow/contrib/eager/python/BUILD
@@ -11,6 +11,7 @@ py_library(
     srcs_version = "PY2AND3",
     visibility = ["//visibility:public"],
     deps = [
+        ":checkpointable_utils",
         ":datasets",
         ":metrics",
         ":network",
@@ -69,6 +70,7 @@ cuda_py_test(
     srcs = ["datasets_test.py"],
     additional_deps = [
         ":datasets",
+        "//tensorflow/contrib/data/python/ops:transformation_ops",
         "//tensorflow/contrib/lookup:lookup_py",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
@@ -115,6 +117,7 @@ py_library(
     srcs_version = "PY2AND3",
     visibility = ["//tensorflow:internal"],
     deps = [
+        "//tensorflow/contrib/eager/python:checkpointable_utils",
         "//tensorflow/contrib/summary:summary_ops",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:control_flow_ops",
@@ -240,13 +243,13 @@ py_library(
     ],
 )
 
-py_test(
+cuda_py_test(
     name = "checkpointable_utils_test",
     srcs = ["checkpointable_utils_test.py"],
-    srcs_version = "PY2AND3",
-    deps = [
+    additional_deps = [
         ":checkpointable_utils",
         ":network",
+        "@six_archive//:six",
         "//tensorflow/python:constant_op",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
@@ -261,7 +264,7 @@ py_test(
         "//tensorflow/python:variables",
         "//tensorflow/python/eager:context",
         "//tensorflow/python/eager:test",
-        "@six_archive//:six",
+        "//tensorflow/python/keras",
     ],
 )
 
diff --git a/tensorflow/contrib/eager/python/checkpointable_utils.py b/tensorflow/contrib/eager/python/checkpointable_utils.py
index d3c57bc606179a36d110eb5ac5b29a8f7e2469fb..cd742991afe11b4f5357020fe6630940d63a3433 100644
--- a/tensorflow/contrib/eager/python/checkpointable_utils.py
+++ b/tensorflow/contrib/eager/python/checkpointable_utils.py
@@ -17,12 +17,17 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import abc
 import collections
+import weakref
 
 from tensorflow.contrib.eager.proto import checkpointable_object_graph_pb2
+from tensorflow.python import pywrap_tensorflow
+from tensorflow.python.client import session as session_lib
 from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import control_flow_ops
@@ -31,8 +36,10 @@ from tensorflow.python.ops import io_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.training import checkpointable as core_checkpointable
+from tensorflow.python.training import checkpointable_utils as core_checkpointable_utils
 from tensorflow.python.training import optimizer as optimizer_lib
 from tensorflow.python.training import saver as saver_lib
+from tensorflow.python.util import deprecation
 
 
 _ESCAPE_CHAR = "."  # For avoiding conflicts with user-specified names.
@@ -214,7 +221,7 @@ def _serialize_checkpointables(
     object_proto.slot_variables.extend(slot_variables.get(checkpointable, ()))
     object_name = object_names[checkpointable]
     for name, saveable in (
-        checkpointable._gather_tensors_for_checkpoint().items()):  # pylint: disable=protected-access
+        checkpointable._gather_saveables_for_checkpoint().items()):  # pylint: disable=protected-access
       attribute = object_proto.attributes.add()
       attribute.name = name
       attribute.checkpoint_key = "%s/%s/%s" % (
@@ -274,6 +281,37 @@ def _serialize_object_graph(root_checkpointable):
       slot_variables=slot_variables)
 
 
+def gather_initializers(root_checkpointable):
+  """Traverse the object graph and find initialization ops.
+
+  Looks for `Checkpointable` objects which are dependencies of
+  `root_checkpointable` and which have an `initializer` property. Includes
+  initializers for slot variables only if the variable they are slotting for and
+  the optimizer are dependencies of `root_checkpointable` (i.e. if they would be
+  saved with a checkpoint).
+
+  Args:
+    root_checkpointable: A `Checkpointable` object to gather initializers for.
+  Returns:
+    A list of initialization ops.
+  """
+  # TODO(allenl): Extract out gathering logic so the naming logic doesn't have
+  # to run.
+  checkpointable_objects, path_to_root = (
+      _breadth_first_checkpointable_traversal(root_checkpointable))
+  object_names = {
+      obj: _object_prefix_from_path(path)
+      for obj, path in path_to_root.items()}
+  node_ids = {node: node_id for node_id, node
+              in enumerate(checkpointable_objects)}
+  _serialize_slot_variables(
+      checkpointable_objects=checkpointable_objects,
+      node_ids=node_ids,
+      object_names=object_names)
+  return [c.initializer for c in checkpointable_objects
+          if hasattr(c, "initializer") and c.initializer is not None]
+
+
 class _NoRestoreSaveable(saver_lib.BaseSaverBuilder.SaveableObject):
 
   def __init__(self, tensor, name):
@@ -284,130 +322,557 @@ class _NoRestoreSaveable(saver_lib.BaseSaverBuilder.SaveableObject):
     return control_flow_ops.no_op()
 
 
-def save(file_prefix, root_checkpointable, checkpoint_number=None,
-         session=None):
-  """Save a training checkpoint.
+class _LoadStatus(object):
+  """Abstract base for load status callbacks."""
 
-  Args:
-    file_prefix: A prefix to use for the checkpoint filenames
-      (/path/to/directory/and_a_prefix). Names are generated based on this
-      prefix and the global step, if provided.
-    root_checkpointable: A Checkpointable object to save. The checkpoint
-      includes variables created by this object and any Checkpointable objects
-      it depends on.
-    checkpoint_number: An integer variable or Tensor, used to number
-      checkpoints. Typically this value is saved along with other variables in
-      training checkpoints, which will happen automatically if it was created by
-      `root_checkpointable` or one of its dependencies (via
-      `Checkpointable._add_variable`).
-    session: The session to evaluate variables in. Ignored when executing
-      eagerly. If not provided when graph building, the default session is used.
+  @abc.abstractmethod
+  def assert_consumed(self):
+    """Raises an exception unless a non-trivial restoration has completed."""
+    pass
 
-  Returns:
-    The full path to the checkpoint.
+  @abc.abstractmethod
+  def run_restore_ops(self, session=None):
+    """Runs restore ops from the checkpoint. Requires a valid checkpoint."""
+    pass
+
+  @abc.abstractmethod
+  def initialize_or_restore(self, session=None):
+    """Runs restore ops from the checkpoint, or initializes variables."""
+    pass
+
+
+class CheckpointLoadStatus(_LoadStatus):
+  """Checks the status of checkpoint loading and manages restore ops.
+
+  Returned from `Saver.restore`. Since `restore` may defer the loading of values
+  in the checkpoint which don't yet have corresponding Python objects,
+  `CheckpointLoadStatus` provides a callback to verify that checkpoint loading
+  is complete (`assert_consumed`).
+
+  When graph building, `restore` does not run restore ops itself since their
+  creation may be deferred. The `run_restore_ops` method must be called once all
+  Python objects with values to restore have been created and added to the
+  dependency graph (this does not necessarily have to be the whole checkpoint;
+  calling `run_restore_ops` while `assert_consumed` fails is supported and will
+  partially restore the checkpoint).
+
+  See `Saver.restore` for usage examples.
   """
-  named_variables, serialized_graph = _serialize_object_graph(
-      root_checkpointable)
-  if context.in_graph_mode():
-    if session is None:
-      session = ops.get_default_session()
-  else:
-    session = None
-  assert _OBJECT_GRAPH_PROTO_KEY not in named_variables
-  # TODO(allenl): Feed rather than embedding a constant.
-  named_variables[_OBJECT_GRAPH_PROTO_KEY] = _NoRestoreSaveable(
-      tensor=constant_op.constant(
-          serialized_graph.SerializeToString(), dtype=dtypes.string),
-      name=_OBJECT_GRAPH_PROTO_KEY)
-  with ops.device("/device:CPU:0"):
-    save_path = saver_lib.Saver(var_list=named_variables).save(
-        sess=session,
-        save_path=file_prefix,
-        write_meta_graph=False,
-        global_step=checkpoint_number)
-  return save_path
-
-
-class CheckpointLoadStatus(object):
-
-  def __init__(self, checkpoint):
+
+  def __init__(self, checkpoint, feed_dict):
     self._checkpoint = checkpoint
+    self._feed_dict = feed_dict
 
   def assert_consumed(self):
-    """Asserts that all objects in the checkpoint have been created/matched."""
+    """Asserts that all objects in the checkpoint have been created/matched.
+
+    Returns:
+      `self` for chaining.
+    Raises:
+      AssertionError: If there are any Python objects in the dependency graph
+        which have not been restored from this checkpoint or a later `restore`,
+        or if there are any checkpointed values which have not been matched to
+        Python objects.
+    """
     for node_id, node in enumerate(self._checkpoint.object_graph_proto.nodes):
       checkpointable = self._checkpoint.object_by_proto_id.get(node_id, None)
       if checkpointable is None:
-        raise AssertionError("Unresolved object in checkpoint: %s" % (node))
+        raise AssertionError("Unresolved object in checkpoint: %s" % (node,))
       if checkpointable._update_uid < self._checkpoint.restore_uid:  # pylint: disable=protected-access
         raise AssertionError(
-            "Object not assigned a value from checkpoint: %s" % (node))
+            "Object not assigned a value from checkpoint: %s" % (node,))
+    if self._checkpoint.slot_restorations:
+      # Sanity check; this collection should be empty if everything has been
+      # restored.
+      raise AssertionError("Unresolved slot restorations: %s" % (
+          self._checkpoint.slot_restorations,))
+    if self._checkpoint.unused_attributes:
+      raise AssertionError(
+          ("Unused attributes in these objects (the attributes exist in the "
+           "checkpoint but not in the objects): %s") % (
+               self._checkpoint.unused_attributes.items(),))
     return self
 
+  def run_restore_ops(self, session=None):
+    """Run operations to restore objects in the dependency graph."""
+    if context.in_eager_mode():
+      return  # Restore ops have already run eagerly.
+    if session is None:
+      session = ops.get_default_session()
+    session.run(self._checkpoint.restore_ops, feed_dict=self._feed_dict)
+
+  def initialize_or_restore(self, session=None):
+    """Alias for `run_restore_ops`.
+
+    This method has a sibling in `InitializationOnlyStatus` which instead
+    initializes variables. That type is returned if no checkpoint is specified
+    in `Saver.restore`.
+
+    Args:
+      session: The session to run restore ops in. If `None`, uses the default
+        session.
+    """
+    self.run_restore_ops(session=session)
 
-def restore(save_path, root_checkpointable, session=None):
-  """Restore a training checkpoint.
 
-  Restores the values of variables created with `Checkpointable._add_variable`
-  in `root_checkpointable` and any objects that it tracks (transitive). Either
-  assigns values immediately if variables to restore have been created already,
-  or defers restoration until the variables are created. Dependencies added to
-  `root_checkpointable` after this call will be matched if they have a
-  corresponding object in the checkpoint.
+class InitializationOnlyStatus(_LoadStatus):
+  """Returned from `Saver.restore` when no checkpoint has been specified.
 
-  When building a graph, restorations are executed in the default session if
-  `session` is `None`. Variable initializers read checkpointed values.
+  Objects of this type have the same `assert_consumed` method as
+  `CheckpointLoadStatus`, but it always fails. However,
+  `initialize_or_restore` works on objects of both types, and will
+  initialize variables in `InitializationOnlyStatus` objects or restore them
+  otherwise.
+  """
+
+  def __init__(self, root_checkpointable):
+    self._root_checkpointable = root_checkpointable
+
+  def assert_consumed(self):
+    """Assertion for consistency with `CheckpointLoadStatus`. Always fails."""
+    raise AssertionError(
+        "No checkpoint specified (save_path=None); nothing is being restored.")
+
+  def run_restore_ops(self, session=None):
+    """For consistency with `CheckpointLoadStatus`.
+
+    Use `initialize_or_restore` for initializing if no checkpoint was passed
+    to `Saver.restore` and restoring otherwise.
+
+    Args:
+      session: Not used.
+    """
+    raise AssertionError(
+        "No checkpoint specified, so no restore ops are available "
+        "(save_path=None to Saver.restore).")
+
+  def initialize_or_restore(self, session=None):
+    """Runs initialization ops for variables.
+
+    Only objects which would be saved by `Saver.save` will be initialized. See
+    `gather_initializers` for details.
+
+    This method does nothing when executing eagerly (initializers get run
+    eagerly).
+
+    Args:
+      session: The session to run initialization ops in. If `None`, uses the
+        default session.
+    """
+    if context.in_eager_mode():
+      return  # Initializers have already run eagerly.
+    if session is None:
+      session = ops.get_default_session()
+    session.run(gather_initializers(self._root_checkpointable))
+
+
+_DEPRECATED_RESTORE_INSTRUCTIONS = (
+    "Restoring a name-based tf.train.Saver checkpoint using the object-based "
+    "restore API. This mode uses global names to match variables, and so is "
+    "somewhat fragile. It also adds new restore ops to the graph each time it "
+    "is called. Prefer re-encoding training checkpoints in the object-based "
+    "format: run save() on the object-based saver (the same one this message "
+    "is coming from) and use that checkpoint in the future.")
+
+
+class NameBasedSaverStatus(_LoadStatus):
+  """Status for loading a name-based training checkpoint."""
+
+  def __init__(self, object_saver, save_path):
+    self._object_saver = object_saver
+    self._save_path = save_path
+
+  def assert_consumed(self):
+    """Assertion for consistency with `CheckpointLoadStatus`. Always fails."""
+    raise AssertionError(
+        "Restoring a name-based checkpoint. No load status is available.")
+
+  @deprecation.deprecated(
+      date=None, instructions=_DEPRECATED_RESTORE_INSTRUCTIONS)
+  def run_restore_ops(self, session=None):
+    """Load the name-based training checkpoint using a new `tf.train.Saver`."""
+    if session is None and context.in_graph_mode():
+      session = ops.get_default_session()
+    saver_lib.Saver(self._object_saver._global_variable_names()).restore(  # pylint: disable=protected-access
+        sess=session, save_path=self._save_path)
+
+  def initialize_or_restore(self, session=None):
+    """Alias for `run_restore_ops`."""
+    self.run_restore_ops(session=session)
+
+
+class _SessionWithFeedDictAdditions(session_lib.SessionInterface):
+  """Pretends to be a session, inserts extra feeds on run()."""
 
-  To disallow deferred loading, assert immediately that all checkpointed
-  variables have been matched to variable objects:
+  def __init__(self, session, feed_additions):
+    self._wrapped_session = session
+    self._feed_additions = feed_additions
+
+  def run(self, fetches, feed_dict=None, **kwargs):
+    if feed_dict is None:
+      feed_dict = {}
+    else:
+      feed_dict = feed_dict.copy()
+    feed_dict.update(self._feed_additions)
+    return self._wrapped_session.run(
+        fetches=fetches, feed_dict=feed_dict, **kwargs)
+
+
+class CheckpointableSaver(object):
+  """Saves and restores a `Checkpointable` object and its dependencies.
+
+  See `Checkpointable` for details of dependency management. `Saver` wraps
+  `tf.train.Saver` for saving, including extra information about the graph of
+  dependencies between Python objects. When restoring, it uses this information
+  about the save-time dependency graph to more robustly match objects with their
+  checkpointed values. When executing eagerly, it supports restoring variables
+  on object creation (see `Saver.restore`).
+
+  Values in a checkpoint are mapped to `Checkpointable` Python objects
+  (`Variable`s, `Optimizer`s, `Layer`s) based on the names provided when the
+  checkpoint was written. To avoid breaking existing checkpoints when modifying
+  a class, dependency names (the names of attributes to which `Checkpointable`
+  objects are assigned) may not change. These names are local to objects, in
+  contrast to the `Variable.name`-based save/restore from `tf.train.Saver`, and
+  so allow additional program transformations.
+  """
+
+  def __init__(self, root_checkpointable):
+    """Configure saving.
+
+    Args:
+      root_checkpointable: The root of the object graph to save/restore. This
+        object and all of its dependencies are saved in the checkpoint. When
+        restoring, objects are matched and restored starting from this root.
+    """
+    # Allow passing in a weak reference to avoid reference cycles when
+    # `Checkpointable` objects save themselves.
+    self._root_checkpointable_ref = root_checkpointable
+    if context.in_graph_mode():
+      with ops.device("/cpu:0"):
+        self._file_prefix_placeholder = constant_op.constant("model")
+    else:
+      self._file_prefix_placeholder = None
+
+    # Op caching for save
+    self._object_graph_feed_tensor = None
+    self._last_save_object_graph = None
+    self._last_save_saver = None
+
+    # Op caching for restore
+    self._object_graph_restore_tensor = None
+    self._last_restore_object_graph = None
+    self._last_restore_checkpoint = None
+
+  @property
+  def _root_checkpointable(self):
+    if isinstance(self._root_checkpointable_ref, weakref.ref):
+      derefed = self._root_checkpointable_ref()
+      assert derefed is not None
+      return derefed
+    else:
+      return self._root_checkpointable_ref
+
+  def save(self, file_prefix, checkpoint_number=None, session=None):
+    """Save a training checkpoint.
+
+    The saved checkpoint includes variables created by this object and any
+    Checkpointable objects it depends on at the time `Saver.save()` is called.
+
+    Args:
+      file_prefix: A prefix to use for the checkpoint filenames
+        (/path/to/directory/and_a_prefix). Names are generated based on this
+        prefix and `checkpoint_number`, if provided.
+      checkpoint_number: An integer variable or Tensor, used to number
+        checkpoints. Typically this value is saved along with other variables in
+        training checkpoints, which will happen automatically if it was created
+        by `root_checkpointable` or one of its dependencies (via
+        `Checkpointable._add_variable`).
+      session: The session to evaluate variables in. Ignored when executing
+        eagerly. If not provided when graph building, the default session is
+        used.
+
+    Returns:
+      The full path to the checkpoint.
+    """
+    named_variables, graph_proto = _serialize_object_graph(
+        self._root_checkpointable)
+    in_graph_mode = context.in_graph_mode()
+    if in_graph_mode:
+      if session is None:
+        session = ops.get_default_session()
+      if self._object_graph_feed_tensor is None:
+        with ops.device("/cpu:0"):
+          self._object_graph_feed_tensor = constant_op.constant(
+              "", dtype=dtypes.string)
+      object_graph_tensor = self._object_graph_feed_tensor
+      feed_additions = {object_graph_tensor: graph_proto.SerializeToString()}
+    else:
+      session = None
+      with ops.device("/cpu:0"):
+        object_graph_tensor = constant_op.constant(
+            graph_proto.SerializeToString(), dtype=dtypes.string)
+      feed_additions = None
+    assert _OBJECT_GRAPH_PROTO_KEY not in named_variables
+    named_variables[_OBJECT_GRAPH_PROTO_KEY] = _NoRestoreSaveable(
+        tensor=object_graph_tensor,
+        name=_OBJECT_GRAPH_PROTO_KEY)
+    if not in_graph_mode or self._last_save_object_graph != graph_proto:
+      if self._last_save_object_graph is not None and in_graph_mode:
+        raise NotImplementedError(
+            "Using a single Saver to save a mutated object graph is not "
+            "currently supported when graph building. Use a different Saver "
+            "when the object graph changes (save ops will be duplicated), or "
+            "file a feature request if this limitation bothers you.")
+      saver = saver_lib.Saver(var_list=named_variables)
+      if in_graph_mode:
+        self._last_save_saver = saver
+        self._last_save_object_graph = graph_proto
+    else:
+      saver = self._last_save_saver
+    with ops.device("/cpu:0"):
+      save_path = saver.save(
+          sess=_SessionWithFeedDictAdditions(
+              session=session, feed_additions=feed_additions),
+          save_path=file_prefix,
+          write_meta_graph=False,
+          global_step=checkpoint_number)
+    return save_path
+
+  def _global_variable_names(self):
+    """Generate a `tf.train.Saver`-style `var_list` using `variable.name`s."""
+    named_saveables, graph_proto = _serialize_object_graph(
+        self._root_checkpointable)
+    saver_names = {}
+    for object_proto in graph_proto.nodes:
+      for attribute_proto in object_proto.attributes:
+        saver_names[attribute_proto.full_name] = named_saveables[
+            attribute_proto.checkpoint_key]
+    return saver_names
+
+  def restore(self, save_path, session=None):
+    """Restore a training checkpoint.
+
+    Restores `root_checkpointable` and any objects that it tracks
+    (transitive). Either assigns values immediately if variables to restore have
+    been created already, or defers restoration until the variables are
+    created. Dependencies added to the `root_checkpointable` passed to the
+    constructor after this call will be matched if they have a corresponding
+    object in the checkpoint.
+
+    When building a graph, restorations are added to the graph but not run. A
+    session is required to retrieve checkpoint metadata.
+
+    To disallow deferred loading, assert immediately that all checkpointed
+    variables have been matched to variable objects:
+
+    ```python
+    saver = Saver(root)
+    saver.restore(path).assert_consumed()
+    ```
+
+    An exception will be raised unless every object was matched and its
+    variables already exist.
+
+    When graph building, `assert_consumed()` indicates that all of the restore
+    ops which will be created for this checkpoint have been created. They can be
+    run via the `run_restore_ops()` function of the status object:
+
+    ```python
+    saver.restore(path).assert_consumed().run_restore_ops()
+    ```
+
+    If the checkpoint has not been consumed completely, then the list of restore
+    ops will grow as more objects are added to the dependency graph.
+
+    Name-based `tf.train.Saver` checkpoints can be loaded using this
+    method. There is no deferred loading, and names are used to match
+    variables. No restore ops are created/run until `run_restore_ops()` or
+    `initialize_or_restore()` is called on the returned status object, even
+    when executing eagerly. Re-encode name-based checkpoints using this
+    object-based `Saver.save` as soon as possible.
+
+    Args:
+      save_path: The path to the checkpoint, as returned by `save` or
+        `tf.train.latest_checkpoint`. If None (as when there is no latest
+        checkpoint for `tf.train.latest_checkpoint` to return), returns an
+        object which may run initializers for objects in the dependency
+        graph. If the checkpoint was written by the name-based `tf.train.Saver`,
+        names are used to match variables.
+      session: The session to retrieve metadata with. Ignored when executing
+        eagerly. If not provided when graph building, the default session is
+        used.
+
+    Returns:
+      A load status object, which can be used to make assertions about the
+      status of checkpoint restoration and run initialization/restore ops
+      (of type `CheckpointLoadStatus`, or `InitializationOnlyStatus` if
+      `save_path` is `None`).
+
+      If `save_path` points to a name-based checkpoint, a `NameBasedSaverStatus`
+      object is returned which runs restore ops from a name-based saver.
+    """
+    if save_path is None:
+      return InitializationOnlyStatus(self._root_checkpointable)
+    in_graph_mode = context.in_graph_mode()
+    if in_graph_mode:
+      if session is None:
+        session = ops.get_default_session()
+      file_prefix_tensor = self._file_prefix_placeholder
+      file_prefix_feed_dict = {self._file_prefix_placeholder: save_path}
+    else:
+      session = None
+      with ops.device("/cpu:0"):
+        file_prefix_tensor = constant_op.constant(save_path)
+      file_prefix_feed_dict = None
+    try:
+      if not in_graph_mode or self._object_graph_restore_tensor is None:
+        with ops.device("/cpu:0"):
+          object_graph_string, = io_ops.restore_v2(
+              prefix=file_prefix_tensor,
+              tensor_names=[_OBJECT_GRAPH_PROTO_KEY],
+              shape_and_slices=[""],
+              dtypes=[dtypes.string],
+              name="object_graph_proto_read")
+        if in_graph_mode:
+          self._object_graph_restore_tensor = object_graph_string
+      if in_graph_mode:
+        object_graph_string = session.run(
+            self._object_graph_restore_tensor,
+            feed_dict=file_prefix_feed_dict)
+      else:
+        object_graph_string = object_graph_string.numpy()
+    except errors_impl.NotFoundError:
+      # The object graph proto does not exist in this checkpoint. Try again with
+      # name-based saving.
+      return NameBasedSaverStatus(self, save_path)
+
+    object_graph_proto = (
+        checkpointable_object_graph_pb2.CheckpointableObjectGraph())
+    object_graph_proto.ParseFromString(object_graph_string)
+    if in_graph_mode and object_graph_proto == self._last_restore_object_graph:
+      checkpoint = self._last_restore_checkpoint
+    else:
+      if in_graph_mode:
+        dtype_map = None
+      else:
+        reader = pywrap_tensorflow.NewCheckpointReader(save_path)
+        dtype_map = reader.get_variable_to_dtype_map()
+      checkpoint = core_checkpointable_utils._Checkpoint(  # pylint: disable=protected-access
+          object_graph_proto=object_graph_proto,
+          save_path=file_prefix_tensor,
+          dtype_map=dtype_map)
+      if in_graph_mode:
+        if self._last_restore_object_graph is not None:
+          raise NotImplementedError(
+              "Using a single Saver to restore different object graphs is not "
+              "currently supported when graph building. Use a different Saver "
+              "for each object graph (restore ops will be duplicated), or "
+              "file a feature request if this limitation bothers you.")
+        self._last_restore_checkpoint = checkpoint
+        self._last_restore_object_graph = object_graph_proto
+    core_checkpointable._CheckpointPosition(  # pylint: disable=protected-access
+        checkpoint=checkpoint, proto_id=0).restore(self._root_checkpointable)
+    load_status = CheckpointLoadStatus(
+        checkpoint, feed_dict=file_prefix_feed_dict)
+    return load_status
+
+
+class Checkpoint(core_checkpointable.Checkpointable):
+  """A utility class which groups `Checkpointable` objects.
+
+  Accepts arbitrary keyword arguments to its constructor and saves those values
+  with a checkpoint. Maintains a `save_counter` for numbering checkpoints.
+
+  Example usage:
 
   ```python
-  restore(path, root).assert_consumed()
+  import tensorflow as tf
+  import tensorflow.contrib.eager as tfe
+  import os
+
+  checkpoint_directory = "/tmp/training_checkpoints"
+  checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+
+  root = tfe.Checkpoint(optimizer=optimizer, model=model)
+  root.restore(tf.train.latest_checkpoint(checkpoint_directory))
+  for _ in range(num_training_steps):
+    optimizer.minimize( ... )
+  root.save(file_prefix=checkpoint_prefix)
   ```
 
-  An exception will be raised unless every object was matched and its variables
-  already exist.
+  For more manual control over saving, use `tfe.CheckpointableSaver` directly.
 
-  Args:
-    save_path: The path to the checkpoint, as returned by `save` or
-      `tf.train.latest_checkpoint`. If None (as when there is no latest
-      checkpoint for `tf.train.latest_checkpoint` to return), does nothing.
-    root_checkpointable: The root of the object graph to restore. Variables to
-      restore need not have been created yet, but all dependencies on other
-      Checkpointable objects should already be declared. Objects in the
-      dependency graph are matched to objects in the checkpointed graph, and
-      matching objects have their variables restored (or the checkpointed values
-      saved for eventual restoration when the variable is created).
-    session: The session to evaluate assignment ops in. Ignored when executing
-      eagerly. If not provided when graph building, the default session is used.
-  Returns:
-    A CheckpointLoadStatus object, which can be used to make assertions about
-    the status of checkpoint restoration.
+  Attributes:
+    save_counter: Incremented when `save()` is called. Used to number
+      checkpoints.
   """
-  if save_path is None:
-    return
-  if context.in_graph_mode():
-    if session is None:
-      session = ops.get_default_session()
-  else:
-    session = None
-  object_graph_string, = io_ops.restore_v2(
-      prefix=save_path,
-      tensor_names=[_OBJECT_GRAPH_PROTO_KEY],
-      shape_and_slices=[""],
-      dtypes=[dtypes.string],
-      name="object_graph_proto_read")
-  if session is not None:
-    object_graph_string = session.run(object_graph_string)
-  else:
-    object_graph_string = object_graph_string.numpy()
-  object_graph_proto = (
-      checkpointable_object_graph_pb2.CheckpointableObjectGraph())
-  object_graph_proto.ParseFromString(object_graph_string)
-  checkpoint = core_checkpointable._Checkpoint(  # pylint: disable=protected-access
-      object_graph_proto=object_graph_proto,
-      save_path=save_path,
-      session=session)
-  core_checkpointable._CheckpointPosition(  # pylint: disable=protected-access
-      checkpoint=checkpoint, proto_id=0).restore(root_checkpointable)
-  return CheckpointLoadStatus(checkpoint)
+
+  def __init__(self, **kwargs):
+    """Group objects into a training checkpoint.
+
+    Args:
+      **kwargs: Keyword arguments are set as attributes of this object, and are
+        saved with the checkpoint. Attribute values must derive from
+        `CheckpointableBase`.
+    Raises:
+      ValueError: If objects in `kwargs` are not Checkpointable.
+    """
+    super(Checkpoint, self).__init__()
+    for k, v in sorted(kwargs.items(), key=lambda item: item[0]):
+      if not isinstance(v, core_checkpointable.CheckpointableBase):
+        raise ValueError(
+            ("`Checkpoint` was expecting an object derived from "
+             "`CheckpointableBase`, got %s.") % (v,))
+      setattr(self, k, v)
+    self._save_counter = None  # Created lazily for restore-on-create.
+    self._saver = CheckpointableSaver(weakref.ref(self))
+
+  def _maybe_create_save_counter(self):
+    """Create a save counter if it does not yet exist."""
+    if self._save_counter is None:
+      # Initialized to 0 and incremented before saving.
+      with ops.device("/cpu:0"):
+        self._save_counter = add_variable(
+            self, name="save_counter", initializer=0, dtype=dtypes.int64)
+
+  @property
+  def save_counter(self):
+    """An integer variable which starts at zero and is incremented on save.
+
+    Used to number checkpoints.
+
+    Returns:
+      The save counter variable.
+    """
+    self._maybe_create_save_counter()
+    return self._save_counter
+
+  def save(self, file_prefix, session=None):
+    """Save a checkpoint. Wraps `tfe.CheckpointableSaver.save`."""
+    in_graph_mode = context.in_graph_mode()
+    if in_graph_mode:
+      if session is None:
+        session = ops.get_default_session()
+      if self._save_counter is None:
+        # When graph building, if this is a new save counter variable then it
+        # needs to be initialized before assign_add. This is only an issue if
+        # restore() has not been called first.
+        session.run(self.save_counter.initializer)
+    with ops.colocate_with(self.save_counter):
+      assign_op = self.save_counter.assign_add(1)
+    if in_graph_mode:
+      session.run(assign_op)
+    return self._saver.save(
+        file_prefix=file_prefix,
+        checkpoint_number=self.save_counter,
+        session=session)
+
+  def restore(self, save_path):
+    """Restore a checkpoint. Wraps `tfe.CheckpointableSaver.restore`."""
+    status = self._saver.restore(save_path=save_path)
+    # Create the save counter now so it gets initialized with other variables
+    # when graph building. Creating it earlier would lead to double
+    # initialization when executing eagerly.
+    self._maybe_create_save_counter()
+    return status
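
The `Checkpoint` class added above sets keyword arguments as attributes, rejects any value that does not derive from `CheckpointableBase`, and creates its save counter lazily so that it is incremented before each save. A minimal pure-Python sketch of that pattern follows. It has no TensorFlow dependency, and `TrackableBase`, `_Counter`, `Group`, and the `<prefix>-<number>` return value are hypothetical stand-ins for illustration, not names from the patch.

```python
class TrackableBase(object):
  """Hypothetical stand-in for `CheckpointableBase`."""


class _Counter(TrackableBase):
  """Hypothetical stand-in for the int64 save-counter variable."""

  def __init__(self):
    self.value = 0

  def assign_add(self, delta):
    self.value += delta
    return self.value


class Group(TrackableBase):
  """Groups trackable objects as attributes, mirroring `Checkpoint.__init__`."""

  def __init__(self, **kwargs):
    # Sorted so attributes are assigned in a deterministic order, as in the
    # patch.
    for name, value in sorted(kwargs.items(), key=lambda item: item[0]):
      if not isinstance(value, TrackableBase):
        raise ValueError(
            "`Group` was expecting an object derived from "
            "`TrackableBase`, got %s." % (value,))
      setattr(self, name, value)
    self._save_counter = None  # Created lazily on first access.

  @property
  def save_counter(self):
    if self._save_counter is None:
      self._save_counter = _Counter()
    return self._save_counter

  def save(self, file_prefix):
    # Increment before saving, so the first checkpoint is numbered 1.
    number = self.save_counter.assign_add(1)
    # Assumption: the "<prefix>-<number>" naming is illustrative only.
    return "%s-%d" % (file_prefix, number)
```

In this sketch, two successive `save("ckpt")` calls return `"ckpt-1"` and then `"ckpt-2"`. The class in the patch instead delegates the write to `CheckpointableSaver.save` and, when graph building, runs the increment in a session.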
diff --git a/tensorflow/contrib/eager/python/checkpointable_utils_test.py b/tensorflow/contrib/eager/python/checkpointable_utils_test.py
index 1394f0cf0f8a1e2eefaa185c74e376b21f73b688..9ec89edce88235213184798ac0ab6a900656a110 100644
--- a/tensorflow/contrib/eager/python/checkpointable_utils_test.py
+++ b/tensorflow/contrib/eager/python/checkpointable_utils_test.py
@@ -18,98 +18,30 @@ from __future__ import print_function
 
 import functools
 import os
-import unittest
 
 import six
 
 from tensorflow.contrib.eager.python import checkpointable_utils
-from tensorflow.contrib.eager.python import network as network_lib
+from tensorflow.python.client import session as session_lib
 from tensorflow.python.eager import context
 from tensorflow.python.eager import test
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
-from tensorflow.python.layers import base
+from tensorflow.python.keras._impl.keras.engine import training
 from tensorflow.python.layers import core
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import template
 from tensorflow.python.ops import variable_scope
-from tensorflow.python.ops import variables
 from tensorflow.python.training import adam
 from tensorflow.python.training import checkpointable
 from tensorflow.python.training import saver as core_saver
 from tensorflow.python.training import training_util
 
 
-class CheckpointableDenseLayer(core.Dense, checkpointable.Checkpointable):
-
-  def __init__(self, *args, **kwargs):
-    checkpointable.Checkpointable.__init__(self)
-    core.Dense.__init__(self, *args, **kwargs)
-
-  def add_variable(self, name, shape, **kwargs):
-    # Calls both Checkpointable._add_variable and Layer.add_variable. Eventually
-    # Layer.add_variable should inherit from Checkpointable and simply call
-    # super and then do post-processing.
-    return checkpointable.Checkpointable._add_variable_with_custom_getter(
-        self,
-        name=name,
-        shape=shape,
-        getter=functools.partial(core.Dense.add_variable, self),
-        **kwargs)
-
-
-# pylint: disable=not-callable
-class CheckpointableNetwork(network_lib.Network, checkpointable.Checkpointable):
-
-  def __setattr__(self, name, value):
-    if isinstance(value, base.Layer):
-      self.track_layer(value, name=name)
-    # Checkpointable is next in the method resolution order, so this will catch
-    # Checkpointable objects which aren't Layers.
-    super(CheckpointableNetwork, self).__setattr__(name, value)
-
-  def track_layer(self, layer, name):
-    self._track_checkpointable(layer, name=name)
-    return super(CheckpointableNetwork, self).track_layer(layer)
-
-
-class CheckpointableAdam(adam.AdamOptimizer, checkpointable.Checkpointable):
-
-  # NOTE: Copied from Optimizer with modifications to use add_variable
-  # for non-slot variables. These contortions are necessary to maintain
-  # checkpoint compatibility with variable.name based saving.
-  # TODO(allenl): Make this cleaner.
-  def _create_non_slot_variable(self, initial_value, name, colocate_with):
-    """Add an extra variable, not associated with a slot."""
-    if context.in_graph_mode():
-      graph = colocate_with.graph
-    else:
-      graph = None
-
-    key = (name, graph)
-    v = self._non_slot_dict.get(key, None)
-    if v is None:
-      with ops.colocate_with(colocate_with):
-        def _variable_getter(name, shape, dtype, initializer):
-          del shape, dtype  # not used, but there for compatibility
-          return variable_scope.variable(
-              name=name, initial_value=initializer, trainable=False)
-
-        initial_value = ops.convert_to_tensor(initial_value)
-        v = self._add_variable_with_custom_getter(
-            name=name,
-            shape=initial_value.get_shape(),
-            initializer=initial_value,
-            getter=_variable_getter)
-
-      self._non_slot_dict[key] = v
-
-    return v
-
-
 class NonLayerCheckpointable(checkpointable.Checkpointable):
 
   def __init__(self):
@@ -118,61 +50,20 @@ class NonLayerCheckpointable(checkpointable.Checkpointable):
         self, name="a_variable", shape=[])
 
 
-class MyNetwork(CheckpointableNetwork):
-  """A concrete Network for testing."""
+# pylint: disable=not-callable
+class MyModel(training.Model):
+  """A concrete Model for testing."""
 
   def __init__(self):
-    super(MyNetwork, self).__init__()
-    self._named_dense = CheckpointableDenseLayer(1, use_bias=True)
-    self._via_track_layer = self.track_layer(
-        CheckpointableDenseLayer(1, use_bias=False), name="via_track_layer")
+    super(MyModel, self).__init__()
+    self._named_dense = core.Dense(1, use_bias=True)
+    self._second = core.Dense(1, use_bias=False)
     # We can still track Checkpointables which aren't Layers.
     self._non_layer = NonLayerCheckpointable()
 
   def call(self, values):
-    return self._via_track_layer(self._named_dense(values))
-
-
-class Checkpoint(checkpointable.Checkpointable):
-  """A utility class which groups `Checkpointable` objects."""
-
-  def __init__(self, **kwargs):
-    super(Checkpoint, self).__init__()
-    for k, v in sorted(kwargs.items(), key=lambda item: item[0]):
-      setattr(self, k, v)
-    self._save_counter = None
-
-  @property
-  def save_counter(self):
-    """An integer variable which starts at zero and is incremented on save.
-
-    Used to number checkpoints.
-
-    Returns:
-      The save counter variable.
-    """
-    if self._save_counter is None:
-      # Initialized to 0 and incremented before saving.
-      self._save_counter = checkpointable_utils.add_variable(
-          self, name="save_counter", initializer=0, dtype=dtypes.int64)
-    return self._save_counter
-
-  def save(self, file_prefix, session=None):
-    assign_op = self.save_counter.assign_add(1)
-    if context.in_graph_mode():
-      if session is None:
-        session = ops.get_default_session()
-      session.run(assign_op)
-    return checkpointable_utils.save(
-        file_prefix=file_prefix,
-        root_checkpointable=self,
-        checkpoint_number=self.save_counter,
-        session=session)
-
-  def restore(self, save_path):
-    return checkpointable_utils.restore(
-        save_path=save_path,
-        root_checkpointable=self)
+    ret = self._second(self._named_dense(values))
+    return ret
 
 
 class InterfaceTests(test.TestCase):
@@ -207,8 +98,7 @@ class InterfaceTests(test.TestCase):
     with self.assertRaisesRegexp(ValueError, "'duplicate' already exists"):
       checkpointable_utils.add_variable(obj, name="duplicate", shape=[])
 
-    if context.in_graph_mode():
-      self.evaluate(variables.global_variables_initializer())
+    self.evaluate(checkpointable_utils.gather_initializers(obj))
     self.assertEqual("constant_initializer:0", constant_initializer.name)
     self.assertEqual(1, self.evaluate(constant_initializer))
     self.assertEqual("some_variable_scope/ones_initializer:0",
@@ -267,52 +157,50 @@ class CheckpointingTests(test.TestCase):
   @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
   def testNamingWithOptimizer(self):
     input_value = constant_op.constant([[3.]])
-    network = MyNetwork()
-    # A nuisance Network using the same optimizer. Its slot variables should not
+    model = MyModel()
+    # A nuisance Model using the same optimizer. Its slot variables should not
     # go in the checkpoint, since it is never depended on.
-    other_network = MyNetwork()
-    optimizer = CheckpointableAdam(0.001)
+    other_model = MyModel()
+    optimizer = adam.AdamOptimizer(0.001)
     optimizer_step = training_util.get_or_create_global_step()
-    root_checkpointable = Checkpoint(
-        optimizer=optimizer, network=network, optimizer_step=optimizer_step)
+    root_checkpointable = checkpointable_utils.Checkpoint(
+        optimizer=optimizer, model=model, optimizer_step=optimizer_step)
     if context.in_eager_mode():
       optimizer.minimize(
-          lambda: network(input_value),
+          lambda: model(input_value),
           global_step=optimizer_step)
       optimizer.minimize(
-          lambda: other_network(input_value),
+          lambda: other_model(input_value),
           global_step=optimizer_step)
     else:
       train_op = optimizer.minimize(
-          network(input_value), global_step=optimizer_step)
+          model(input_value), global_step=optimizer_step)
       optimizer.minimize(
-          other_network(input_value),
+          other_model(input_value),
           global_step=optimizer_step)
-      self.evaluate(variables.global_variables_initializer())
+      self.evaluate(checkpointable_utils.gather_initializers(
+          root_checkpointable))
       self.evaluate(train_op)
     named_variables, serialized_graph = (
         checkpointable_utils._serialize_object_graph(root_checkpointable))
     expected_checkpoint_names = (
         # Created in the root node, so no prefix.
         "optimizer_step",
-        # No name provided to track_checkpointable(), so the position is used
-        # instead (one-based).
-        "network/via_track_layer/kernel",
-        # track_checkpointable() with a name provided, so that's used
-        "network/_named_dense/kernel",
-        "network/_named_dense/bias",
-        # non-Layer dependency of the network
-        "network/_non_layer/a_variable",
+        "model/_second/kernel",
+        "model/_named_dense/kernel",
+        "model/_named_dense/bias",
+        # non-Layer dependency of the model
+        "model/_non_layer/a_variable",
         # The optimizer creates two non-slot variables
         "optimizer/beta1_power",
         "optimizer/beta2_power",
         # Slot variables
-        "network/via_track_layer/kernel/.OPTIMIZER_SLOT/optimizer/m",
-        "network/via_track_layer/kernel/.OPTIMIZER_SLOT/optimizer/v",
-        "network/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/m",
-        "network/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/v",
-        "network/_named_dense/bias/.OPTIMIZER_SLOT/optimizer/m",
-        "network/_named_dense/bias/.OPTIMIZER_SLOT/optimizer/v",
+        "model/_second/kernel/.OPTIMIZER_SLOT/optimizer/m",
+        "model/_second/kernel/.OPTIMIZER_SLOT/optimizer/v",
+        "model/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/m",
+        "model/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/v",
+        "model/_named_dense/bias/.OPTIMIZER_SLOT/optimizer/m",
+        "model/_named_dense/bias/.OPTIMIZER_SLOT/optimizer/v",
     )
     suffix = "/.ATTRIBUTES/VARIABLE_VALUE"
     expected_checkpoint_names = [
@@ -324,11 +212,11 @@ class CheckpointingTests(test.TestCase):
         "global_step:0",
         named_variables["optimizer_step" + suffix].name)
     self.assertEqual(
-        "my_network/checkpointable_dense_layer_1/kernel:0",
-        named_variables["network/via_track_layer/kernel" + suffix].name)
+        "my_model/dense_1/kernel:0",
+        named_variables["model/_second/kernel" + suffix].name)
     self.assertEqual(
-        "my_network/checkpointable_dense_layer/kernel:0",
-        named_variables["network/_named_dense/kernel" + suffix].name)
+        "my_model/dense/kernel:0",
+        named_variables["model/_named_dense/kernel" + suffix].name)
     self.assertEqual(
         "beta1_power:0",
         named_variables["optimizer/beta1_power" + suffix].name)
@@ -346,106 +234,110 @@ class CheckpointingTests(test.TestCase):
                      serialized_graph.nodes[optimizer_node.children[0].node_id]
                      .attributes[0].full_name)
     self.assertEqual(
-        "my_network/checkpointable_dense_layer/kernel",
+        "my_model/dense/kernel",
         serialized_graph.nodes[optimizer_node.slot_variables[0]
                                .original_variable_node_id]
         .attributes[0].full_name)
     # We strip off the :0 suffix, as variable.name-based saving does.
     self.assertEqual(
-        "my_network/checkpointable_dense_layer/kernel/Adam",
+        "my_model/dense/kernel/Adam",
         serialized_graph.nodes[optimizer_node.slot_variables[0]
                                .slot_variable_node_id]
         .attributes[0].full_name)
     self.assertEqual(
-        "my_network/checkpointable_dense_layer/kernel/Adam:0",
+        "my_model/dense/kernel/Adam:0",
         optimizer.get_slot(
-            var=named_variables["network/_named_dense/kernel" + suffix],
+            var=named_variables["model/_named_dense/kernel" + suffix],
             name="m").name)
     self.assertEqual(
-        "network/_named_dense/kernel" + suffix,
+        "model/_named_dense/kernel" + suffix,
         serialized_graph.nodes[
             optimizer_node.slot_variables[0]
             .original_variable_node_id].attributes[0].checkpoint_key)
     self.assertEqual("m", optimizer_node.slot_variables[0].slot_name)
     self.assertEqual(
-        "network/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/m" + suffix,
+        "model/_named_dense/kernel/.OPTIMIZER_SLOT/optimizer/m" + suffix,
         serialized_graph.nodes[
             optimizer_node.slot_variables[0]
             .slot_variable_node_id].attributes[0].checkpoint_key)
 
   @test_util.run_in_graph_and_eager_modes()
   def testSaveRestore(self):
-    network = MyNetwork()
-    optimizer = CheckpointableAdam(0.001)
-    root_checkpointable = Checkpoint(optimizer=optimizer, network=network)
+    model = MyModel()
+    optimizer = adam.AdamOptimizer(0.001)
+    root_checkpointable = checkpointable_utils.Checkpoint(
+        optimizer=optimizer, model=model)
     input_value = constant_op.constant([[3.]])
     if context.in_eager_mode():
       optimizer.minimize(
-          lambda: network(input_value))
+          lambda: model(input_value))
     else:
-      train_op = optimizer.minimize(network(input_value))
+      train_op = optimizer.minimize(model(input_value))
       # TODO(allenl): Make initialization more pleasant when graph building.
       root_checkpointable.save_counter  # pylint: disable=pointless-statement
-      self.evaluate(variables.global_variables_initializer())
+      self.evaluate(checkpointable_utils.gather_initializers(
+          root_checkpointable))
       self.evaluate(train_op)
     prefix = os.path.join(self.get_temp_dir(), "ckpt")
-    self.evaluate(state_ops.assign(network._named_dense.variables[1], [42.]))
-    m_bias_slot = optimizer.get_slot(network._named_dense.variables[1], "m")
+    self.evaluate(state_ops.assign(model._named_dense.variables[1], [42.]))
+    m_bias_slot = optimizer.get_slot(model._named_dense.variables[1], "m")
     self.evaluate(state_ops.assign(m_bias_slot, [1.5]))
     save_path = root_checkpointable.save(file_prefix=prefix)
-    self.evaluate(state_ops.assign(network._named_dense.variables[1], [43.]))
+    self.evaluate(state_ops.assign(model._named_dense.variables[1], [43.]))
     self.evaluate(state_ops.assign(root_checkpointable.save_counter, 3))
     optimizer_variables = self.evaluate(optimizer.variables())
     self.evaluate(state_ops.assign(m_bias_slot, [-2.]))
     # Immediate restoration
-    root_checkpointable.restore(save_path=save_path).assert_consumed()
-    self.assertAllEqual([42.], self.evaluate(network._named_dense.variables[1]))
+    status = root_checkpointable.restore(save_path=save_path).assert_consumed()
+    status.run_restore_ops()
+    self.assertAllEqual([42.], self.evaluate(model._named_dense.variables[1]))
     self.assertAllEqual(1, self.evaluate(root_checkpointable.save_counter))
     self.assertAllEqual([1.5], self.evaluate(m_bias_slot))
-    with ops.Graph().as_default():
-      on_create_network = MyNetwork()
-      on_create_optimizer = CheckpointableAdam(0.001)
-      on_create_root = Checkpoint(
-          optimizer=on_create_optimizer, network=on_create_network)
-      with self.test_session(graph=ops.get_default_graph()):
-        # Deferred restoration
-        status = on_create_root.restore(save_path=save_path)
-        on_create_network(constant_op.constant([[3.]]))  # create variables
-        self.assertAllEqual(1, self.evaluate(on_create_root.save_counter))
-        self.assertAllEqual([42.],
-                            self.evaluate(
-                                on_create_network._named_dense.variables[1]))
-        on_create_m_bias_slot = on_create_optimizer.get_slot(
-            on_create_network._named_dense.variables[1], "m")
-        # Optimizer slot variables are created when the original variable is
-        # restored.
-        self.assertAllEqual([1.5], self.evaluate(on_create_m_bias_slot))
-        self.assertAllEqual(optimizer_variables[2:],
-                            self.evaluate(on_create_optimizer.variables()))
-        on_create_optimizer._create_slots(
-            [resource_variable_ops.ResourceVariable([1.])])
-        status.assert_consumed()
-        beta1_power, beta2_power = on_create_optimizer._get_beta_accumulators()
-        self.assertAllEqual(optimizer_variables[0], self.evaluate(beta1_power))
-        self.assertAllEqual(optimizer_variables[1], self.evaluate(beta2_power))
+    if context.in_graph_mode():
+      return  # Restore-on-create is only supported when executing eagerly.
+    on_create_model = MyModel()
+    on_create_optimizer = adam.AdamOptimizer(0.001)
+    on_create_root = checkpointable_utils.Checkpoint(
+        optimizer=on_create_optimizer, model=on_create_model)
+    # Deferred restoration
+    status = on_create_root.restore(save_path=save_path)
+    on_create_model(constant_op.constant([[3.]]))  # create variables
+    self.assertAllEqual(1, self.evaluate(on_create_root.save_counter))
+    self.assertAllEqual([42.],
+                        self.evaluate(
+                            on_create_model._named_dense.variables[1]))
+    on_create_m_bias_slot = on_create_optimizer.get_slot(
+        on_create_model._named_dense.variables[1], "m")
+    # Optimizer slot variables are created when the original variable is
+    # restored.
+    self.assertAllEqual([1.5], self.evaluate(on_create_m_bias_slot))
+    self.assertAllEqual(optimizer_variables[2:],
+                        self.evaluate(on_create_optimizer.variables()))
+    on_create_optimizer._create_slots(
+        [resource_variable_ops.ResourceVariable([1.])])
+    status.assert_consumed()
+    beta1_power, beta2_power = on_create_optimizer._get_beta_accumulators()
+    self.assertAllEqual(optimizer_variables[0], self.evaluate(beta1_power))
+    self.assertAllEqual(optimizer_variables[1], self.evaluate(beta2_power))
 
+  # TODO(allenl): Debug garbage created by this test in python3.
   def testDeferredRestorationUsageEager(self):
     """An idiomatic eager execution example."""
     num_training_steps = 10
     checkpoint_directory = self.get_temp_dir()
     checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
     for training_continuation in range(3):
-      network = MyNetwork()
-      optimizer = CheckpointableAdam(0.001)
-      root = Checkpoint(
-          optimizer=optimizer, network=network,
+      model = MyModel()
+      optimizer = adam.AdamOptimizer(0.001)
+      root = checkpointable_utils.Checkpoint(
+          optimizer=optimizer, model=model,
           optimizer_step=training_util.get_or_create_global_step())
       root.restore(core_saver.latest_checkpoint(checkpoint_directory))
       for _ in range(num_training_steps):
         # TODO(allenl): Use a Dataset and serialize/checkpoint it.
         input_value = constant_op.constant([[3.]])
         optimizer.minimize(
-            lambda: network(input_value),  # pylint: disable=cell-var-from-loop
+            lambda: model(input_value),  # pylint: disable=cell-var-from-loop
             global_step=root.optimizer_step)
       root.save(file_prefix=checkpoint_prefix)
       self.assertEqual((training_continuation + 1) * num_training_steps,
@@ -459,39 +351,66 @@ class CheckpointingTests(test.TestCase):
       checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
       for training_continuation in range(3):
         with ops.Graph().as_default():
-          network = MyNetwork()
-          optimizer = CheckpointableAdam(0.001)
-          root = Checkpoint(
-              optimizer=optimizer, network=network,
+          model = MyModel()
+          optimizer = adam.AdamOptimizer(0.001)
+          root = checkpointable_utils.Checkpoint(
+              optimizer=optimizer, model=model,
               global_step=training_util.get_or_create_global_step())
           input_value = constant_op.constant([[3.]])
           train_op = optimizer.minimize(
-              network(input_value),
+              model(input_value),
               global_step=root.global_step)
-          root.save_counter  # pylint: disable=pointless-statement
-          init_op = variables.global_variables_initializer()
           checkpoint_path = core_saver.latest_checkpoint(checkpoint_directory)
           with self.test_session(graph=ops.get_default_graph()) as session:
+            status = root.restore(save_path=checkpoint_path)
+            status.initialize_or_restore(session=session)
             if checkpoint_path is None:
               self.assertEqual(0, training_continuation)
-              session.run(init_op)
-              # Another alternative would be to run initializers automatically
-              # if no checkpoint is being loaded. This would make deferred
-              # loading a bit more useful with graph execution.
+              with self.assertRaises(AssertionError):
+                status.assert_consumed()
             else:
-              checkpointable_utils.restore(
-                  save_path=checkpoint_path,
-                  root_checkpointable=root,
-                  session=session)
+              status.assert_consumed()
             for _ in range(num_training_steps):
               session.run(train_op)
-            root.save(file_prefix=checkpoint_prefix,
-                      session=session)
+            root.save(file_prefix=checkpoint_prefix, session=session)
             self.assertEqual((training_continuation + 1) * num_training_steps,
                              session.run(root.global_step))
             self.assertEqual(training_continuation + 1,
                              session.run(root.save_counter))
 
+  @test_util.run_in_graph_and_eager_modes()
+  def testAgnosticUsage(self):
+    """Graph/eager agnostic usage."""
+    # Creates garbage when executing eagerly, due to ops.Graph() creation.
+    num_training_steps = 10
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    for training_continuation in range(3):
+      with ops.Graph().as_default(), self.test_session(
+          graph=ops.get_default_graph()), test_util.device(use_gpu=True):
+        model = MyModel()
+        optimizer = adam.AdamOptimizer(0.001)
+        root = checkpointable_utils.Checkpoint(
+            optimizer=optimizer, model=model,
+            global_step=training_util.get_or_create_global_step())
+        checkpoint_path = core_saver.latest_checkpoint(checkpoint_directory)
+        status = root.restore(save_path=checkpoint_path)
+        input_value = constant_op.constant([[3.]])
+        train_fn = functools.partial(
+            optimizer.minimize,
+            functools.partial(model, input_value),
+            global_step=root.global_step)
+        if context.in_graph_mode():
+          train_fn = functools.partial(self.evaluate, train_fn())
+        status.initialize_or_restore()
+        for _ in range(num_training_steps):
+          train_fn()
+        root.save(file_prefix=checkpoint_prefix)
+        self.assertEqual((training_continuation + 1) * num_training_steps,
+                         self.evaluate(root.global_step))
+        self.assertEqual(training_continuation + 1,
+                         self.evaluate(root.save_counter))
+
   def _get_checkpoint_name(self, name):
     root = checkpointable.Checkpointable()
     checkpointable_utils.add_variable(
@@ -553,13 +472,16 @@ class CheckpointingTests(test.TestCase):
     self.evaluate(state_ops.assign(original.dep.var, 123.))
     checkpoint_directory = self.get_temp_dir()
     checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
-    save_path = checkpointable_utils.save(checkpoint_prefix, original)
+    save_path = checkpointable_utils.CheckpointableSaver(
+        original).save(checkpoint_prefix)
     load_into = LateDependencies()
-    status = checkpointable_utils.restore(save_path, load_into)
+    status = checkpointable_utils.CheckpointableSaver(
+        load_into).restore(save_path)
     with self.assertRaises(AssertionError):
       status.assert_consumed()
     load_into.add_dep()
     status.assert_consumed()
+    status.run_restore_ops()
     self.assertEqual(123., self.evaluate(load_into.dep.var))
 
   @test_util.run_in_graph_and_eager_modes()
@@ -583,14 +505,15 @@ class CheckpointingTests(test.TestCase):
     self.evaluate(state_ops.assign(dep_after_var.dep.var, -14.))
     checkpoint_directory = self.get_temp_dir()
     checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
-    save_path = checkpointable_utils.save(
-        checkpoint_prefix, dep_after_var)
+    save_path = checkpointable_utils.CheckpointableSaver(dep_after_var).save(
+        checkpoint_prefix)
 
     loaded_dep_after_var = DepAfterVar()
-    status = checkpointable_utils.restore(
-        save_path, loaded_dep_after_var)
+    status = checkpointable_utils.CheckpointableSaver(
+        loaded_dep_after_var).restore(save_path)
     loaded_dep_after_var.add_dep()
     status.assert_consumed()
+    status.run_restore_ops()
     self.assertEqual(-14., self.evaluate(loaded_dep_after_var.dep.var))
 
   @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
@@ -600,43 +523,59 @@ class CheckpointingTests(test.TestCase):
     root = checkpointable.Checkpointable()
     root.var = checkpointable_utils.add_variable(
         root, name="var", initializer=0.)
-    optimizer = CheckpointableAdam(0.1)
+    optimizer = adam.AdamOptimizer(0.1)
     if context.in_graph_mode():
       train_op = optimizer.minimize(root.var)
-      self.evaluate(variables.global_variables_initializer())
+      # Note that `optimizer` has not been added as a dependency of
+      # `root`. Create a one-off grouping so that slot variables for `root.var`
+      # get initialized too.
+      self.evaluate(checkpointable_utils.gather_initializers(
+          checkpointable_utils.Checkpoint(root=root, optimizer=optimizer)))
       self.evaluate(train_op)
     else:
       optimizer.minimize(root.var.read_value)
     self.evaluate(state_ops.assign(root.var, 12.))
-    no_slots_path = checkpointable_utils.save(
-        os.path.join(checkpoint_directory, "no_slots"), root)
+    no_slots_path = checkpointable_utils.CheckpointableSaver(root).save(
+        os.path.join(checkpoint_directory, "no_slots"))
     root.optimizer = optimizer
     self.evaluate(state_ops.assign(root.var, 13.))
     self.evaluate(state_ops.assign(optimizer.get_slot(name="m", var=root.var),
                                    14.))
-    slots_path = checkpointable_utils.save(
-        os.path.join(checkpoint_directory, "with_slots"), root)
+    slots_path = checkpointable_utils.CheckpointableSaver(root).save(
+        os.path.join(checkpoint_directory, "with_slots"))
     new_root = checkpointable.Checkpointable()
     # Load the slot-containing checkpoint (deferred), then immediately overwrite
     # the non-slot variable (also deferred).
-    slot_status = checkpointable_utils.restore(
-        slots_path, new_root)
-    no_slot_status = checkpointable_utils.restore(
-        no_slots_path, new_root)
+    slot_status = checkpointable_utils.CheckpointableSaver(
+        new_root).restore(slots_path)
+    no_slot_status = checkpointable_utils.CheckpointableSaver(
+        new_root).restore(no_slots_path)
     with self.assertRaises(AssertionError):
       no_slot_status.assert_consumed()
     new_root.var = checkpointable_utils.add_variable(
         new_root, name="var", shape=[])
-    self.assertEqual(12., self.evaluate(new_root.var))
     no_slot_status.assert_consumed()
-    new_root.optimizer = CheckpointableAdam(0.1)
+    no_slot_status.run_restore_ops()
+    self.assertEqual(12., self.evaluate(new_root.var))
+    new_root.optimizer = adam.AdamOptimizer(0.1)
     with self.assertRaisesRegexp(AssertionError, "beta1_power"):
       slot_status.assert_consumed()
     self.assertEqual(12., self.evaluate(new_root.var))
-    self.assertEqual(14., self.evaluate(
-        new_root.optimizer.get_slot(name="m", var=new_root.var)))
+    if context.in_eager_mode():
+      # Slot variables are created with restoring initializers only when
+      # executing eagerly.
+      self.assertEqual(14., self.evaluate(
+          new_root.optimizer.get_slot(name="m", var=new_root.var)))
+    else:
+      self.assertIs(new_root.optimizer.get_slot(name="m", var=new_root.var),
+                    None)
     if context.in_graph_mode():
       train_op = new_root.optimizer.minimize(new_root.var)
+      # The slot variable now exists; restore() didn't create it, but we should
+      # now have a restore op for it.
+      slot_status.run_restore_ops()
+      self.assertEqual(14., self.evaluate(
+          new_root.optimizer.get_slot(name="m", var=new_root.var)))
       self.evaluate(train_op)
     else:
       new_root.optimizer.minimize(new_root.var.read_value)
@@ -650,44 +589,47 @@ class CheckpointingTests(test.TestCase):
     save_root.dep.var = checkpointable_utils.add_variable(
         save_root.dep, name="var", initializer=0.)
     self.evaluate(state_ops.assign(save_root.dep.var, 12.))
-    first_path = checkpointable_utils.save(
-        os.path.join(checkpoint_directory, "first"), save_root)
+    saver = checkpointable_utils.CheckpointableSaver(save_root)
+    first_path = saver.save(os.path.join(checkpoint_directory, "first"))
     self.evaluate(state_ops.assign(save_root.dep.var, 13.))
-    second_path = checkpointable_utils.save(
-        os.path.join(checkpoint_directory, "second"), save_root)
+    second_path = saver.save(os.path.join(checkpoint_directory, "second"))
 
     first_root = checkpointable.Checkpointable()
     second_root = checkpointable.Checkpointable()
-    first_status = checkpointable_utils.restore(
-        first_path, first_root)
-    second_status = checkpointable_utils.restore(
-        second_path, second_root)
+    first_status = checkpointable_utils.CheckpointableSaver(
+        first_root).restore(first_path)
+    second_status = checkpointable_utils.CheckpointableSaver(
+        second_root).restore(second_path)
     load_dep = checkpointable.Checkpointable()
     load_dep.var = checkpointable_utils.add_variable(
         load_dep, name="var", shape=[])
     first_root.dep = load_dep
     first_status.assert_consumed()
+    first_status.run_restore_ops()
     self.assertEqual(12., self.evaluate(load_dep.var))
     second_root.dep = load_dep
     second_status.assert_consumed()
+    second_status.run_restore_ops()
     self.assertEqual(13., self.evaluate(load_dep.var))
 
     # Try again with the order of the restore() reversed. The last restore
     # determines the final value.
     first_root = checkpointable.Checkpointable()
     second_root = checkpointable.Checkpointable()
-    second_status = checkpointable_utils.restore(
-        second_path, second_root)
-    first_status = checkpointable_utils.restore(
-        first_path, first_root)
+    second_status = checkpointable_utils.CheckpointableSaver(
+        second_root).restore(second_path)
+    first_status = checkpointable_utils.CheckpointableSaver(
+        first_root).restore(first_path)
     load_dep = checkpointable.Checkpointable()
     load_dep.var = checkpointable_utils.add_variable(
         load_dep, name="var", shape=[])
     first_root.dep = load_dep
     first_status.assert_consumed()
+    first_status.run_restore_ops()
     self.assertEqual(12., self.evaluate(load_dep.var))
     second_root.dep = load_dep
     second_status.assert_consumed()
+    second_status.run_restore_ops()
     self.assertEqual(12., self.evaluate(load_dep.var))
 
   @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
@@ -701,11 +643,11 @@ class CheckpointingTests(test.TestCase):
     save_root.dep_one.dep_three = dep_three
     save_root.dep_two.dep_three = dep_three
     checkpointable_utils.add_variable(dep_three, name="var", initializer=0.)
-    self.evaluate(variables.global_variables_initializer())
-    save_path = checkpointable_utils.save(
-        os.path.join(checkpoint_directory, "ckpt"), save_root)
+    self.evaluate(checkpointable_utils.gather_initializers(save_root))
+    save_path = checkpointable_utils.CheckpointableSaver(save_root).save(
+        os.path.join(checkpoint_directory, "ckpt"))
     load_root = checkpointable.Checkpointable()
-    checkpointable_utils.restore(save_path, load_root)
+    checkpointable_utils.CheckpointableSaver(load_root).restore(save_path)
     load_root.dep_one = checkpointable.Checkpointable()
     load_root.dep_two = checkpointable.Checkpointable()
     load_root.dep_one.dep_three = checkpointable.Checkpointable()
@@ -724,9 +666,9 @@ class CheckpointingTests(test.TestCase):
         save_root.dep_one, name="var1", initializer=32., dtype=dtypes.float64)
     checkpointable_utils.add_variable(
         save_root.dep_two, name="var2", initializer=64., dtype=dtypes.float64)
-    self.evaluate(variables.global_variables_initializer())
-    save_path = checkpointable_utils.save(
-        os.path.join(checkpoint_directory, "ckpt"), save_root)
+    self.evaluate(checkpointable_utils.gather_initializers(save_root))
+    save_path = checkpointable_utils.CheckpointableSaver(save_root).save(
+        os.path.join(checkpoint_directory, "ckpt"))
     load_root = checkpointable.Checkpointable()
     load_root.dep_one = checkpointable.Checkpointable()
     load_root.dep_two = load_root.dep_one
@@ -734,7 +676,9 @@ class CheckpointingTests(test.TestCase):
         load_root.dep_one, name="var1", shape=[], dtype=dtypes.float64)
     v2 = checkpointable_utils.add_variable(
         load_root.dep_one, name="var2", shape=[], dtype=dtypes.float64)
-    checkpointable_utils.restore(save_path, load_root).assert_consumed()
+    status = checkpointable_utils.CheckpointableSaver(load_root).restore(
+        save_path).assert_consumed()
+    status.run_restore_ops()
     self.assertEqual(32., self.evaluate(v1))
     self.assertEqual(64., self.evaluate(v2))
 
@@ -750,14 +694,15 @@ class CheckpointingTests(test.TestCase):
         first, "v1", initializer=[3., 1., 4.])
     second.v = checkpointable_utils.add_variable(
         second, "v2", initializer=[1., 1., 2., 3.])
-    self.evaluate(variables.global_variables_initializer())
+    self.evaluate(checkpointable_utils.gather_initializers(first))
     checkpoint_directory = self.get_temp_dir()
-    save_path = checkpointable_utils.save(
-        os.path.join(checkpoint_directory, "ckpt"), first)
+    save_path = checkpointable_utils.CheckpointableSaver(first).save(
+        os.path.join(checkpoint_directory, "ckpt"))
 
     # Test deferred loading
     first_load = checkpointable.Checkpointable()
-    status = checkpointable_utils.restore(save_path, first_load)
+    status = checkpointable_utils.CheckpointableSaver(
+        first_load).restore(save_path)
     second_load = checkpointable.Checkpointable()
     first_load.second = second_load
     second_load.first = first_load
@@ -768,6 +713,7 @@ class CheckpointingTests(test.TestCase):
     second_load.v = checkpointable_utils.add_variable(
         second_load, "v2", shape=[4])
     status.assert_consumed()
+    status.run_restore_ops()
     self.assertAllEqual([3., 1., 4.], self.evaluate(first_load.v))
     self.assertAllEqual([1., 1., 2., 3.], self.evaluate(second_load.v))
 
@@ -776,8 +722,9 @@ class CheckpointingTests(test.TestCase):
     self.assertAllEqual([2., 7., 1.], self.evaluate(first_load.v))
     self.evaluate(second_load.v.assign([2., 7., 1., 8.]))
     self.assertAllEqual([2., 7., 1., 8.], self.evaluate(second_load.v))
-    checkpointable_utils.restore(
-        save_path, first_load).assert_consumed()
+    status = checkpointable_utils.CheckpointableSaver(first_load).restore(
+        save_path).assert_consumed()
+    status.run_restore_ops()
     self.assertAllEqual([3., 1., 4.], self.evaluate(first_load.v))
     self.assertAllEqual([1., 1., 2., 3.], self.evaluate(second_load.v))
 
@@ -794,24 +741,25 @@ class CheckpointingTests(test.TestCase):
           name="blah", initializer=0.)
       self.evaluate(first.var1.assign(4.))
       self.evaluate(first.var2.assign(8.))
-      save_path = checkpointable_utils.save(
-          checkpoint_prefix, root_checkpointable=first)
+      save_path = checkpointable_utils.CheckpointableSaver(first).save(
+          checkpoint_prefix)
     restore_graph = ops.Graph()
     with restore_graph.as_default(), self.test_session(restore_graph):
       second = checkpointable.Checkpointable()
       second.var2 = variable_scope.get_variable(
           name="blah", initializer=0.)
-      checkpointable_utils.restore(save_path, root_checkpointable=second)
+      status = checkpointable_utils.CheckpointableSaver(
+          second).restore(save_path)
       recreated_var1 = variable_scope.get_variable(
           name="outside_var", initializer=0.)
+      status.run_restore_ops()
       self.assertEqual(8., self.evaluate(second.var2))
       self.evaluate(recreated_var1.assign(-2.))
       self.assertEqual(-2., self.evaluate(recreated_var1))
       second.var1 = recreated_var1
+      status.run_restore_ops()
       self.assertEqual(4., self.evaluate(recreated_var1))
 
-  # TODO(allenl): Saver class that doesn't pollute the graph with constants.
-  @unittest.skip("todo")
   def testManySavesGraph(self):
     """Saves after the first should not modify the graph."""
     with context.graph_mode():
@@ -821,17 +769,15 @@ class CheckpointingTests(test.TestCase):
         checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
         obj = checkpointable.Checkpointable()
         obj.var = variable_scope.get_variable(name="v", initializer=0.)
-        obj.opt = CheckpointableAdam(0.1)
+        obj.opt = adam.AdamOptimizer(0.1)
         obj.opt.minimize(obj.var.read_value())
-        self.evaluate(variables.global_variables_initializer())
-        checkpointable_utils.save(
-            checkpoint_prefix, root_checkpointable=obj)
+        self.evaluate(checkpointable_utils.gather_initializers(obj))
+        saver = checkpointable_utils.CheckpointableSaver(obj)
+        saver.save(checkpoint_prefix)
         before_ops = graph.get_operations()
-        checkpointable_utils.save(
-            checkpoint_prefix, root_checkpointable=obj)
+        saver.save(checkpoint_prefix)
         self.assertEqual(before_ops, graph.get_operations())
 
-  @unittest.skip("todo")
   def testManyRestoresGraph(self):
     """Restores after the first should not modify the graph."""
     with context.graph_mode():
@@ -841,17 +787,262 @@ class CheckpointingTests(test.TestCase):
         checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
         obj = checkpointable.Checkpointable()
         obj.var = variable_scope.get_variable(name="v", initializer=0.)
-        obj.opt = CheckpointableAdam(0.1)
+        obj.opt = adam.AdamOptimizer(0.1)
         obj.opt.minimize(obj.var.read_value())
-        self.evaluate(variables.global_variables_initializer())
-        save_path = checkpointable_utils.save(
-            checkpoint_prefix, root_checkpointable=obj)
-        checkpointable_utils.restore(
-            save_path, root_checkpointable=obj)
+        self.evaluate(checkpointable_utils.gather_initializers(obj))
+        saver = checkpointable_utils.CheckpointableSaver(obj)
+        save_path = saver.save(checkpoint_prefix)
+        saver.restore(save_path)
         before_ops = graph.get_operations()
-        checkpointable_utils.restore(
-            save_path, root_checkpointable=obj)
+        saver.restore(save_path)
         self.assertEqual(before_ops, graph.get_operations())
 
+  def testMultipleGraphsNonSlotVariables(self):
+    with context.graph_mode():
+      checkpoint_directory = self.get_temp_dir()
+      checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+      optimizer = adam.AdamOptimizer(0.001)
+      # Construct a model in one graph
+      first_graph = ops.Graph()
+      first_session = session_lib.Session(graph=first_graph)
+      with first_graph.as_default(), first_session.as_default():
+        first_variable = resource_variable_ops.ResourceVariable([1.])
+        first_root_checkpointable = checkpointable_utils.Checkpoint(
+            optimizer=optimizer, variable=first_variable)
+        train_op = optimizer.minimize(first_variable.read_value)
+        self.evaluate(checkpointable_utils.gather_initializers(
+            first_root_checkpointable))
+        self.evaluate(train_op)
+        self.evaluate(first_variable.assign([1.]))
+        self.evaluate(optimizer.get_slot(
+            var=first_variable, name="m").assign([2.]))
+        beta1_power, _ = optimizer._get_beta_accumulators()
+        self.evaluate(beta1_power.assign(3.))
+
+      # Save and load in a second graph
+      second_graph = ops.Graph()
+      with second_graph.as_default(), session_lib.Session(graph=second_graph):
+        second_variable = resource_variable_ops.ResourceVariable([1.])
+        second_root_checkpointable = checkpointable_utils.Checkpoint(
+            optimizer=optimizer, variable=second_variable)
+        train_op = optimizer.minimize(second_variable.read_value)
+        second_root_checkpointable.restore(None).initialize_or_restore()
+        self.evaluate(train_op)
+        self.evaluate(second_variable.assign([4.]))
+        self.evaluate(optimizer.get_slot(
+            var=second_variable, name="m").assign([5.]))
+        beta1_power, _ = optimizer._get_beta_accumulators()
+        self.evaluate(beta1_power.assign(6.))
+        save_path = second_root_checkpointable.save(checkpoint_prefix)
+        self.evaluate(second_variable.assign([7.]))
+        self.evaluate(optimizer.get_slot(
+            var=second_variable, name="m").assign([8.]))
+        beta1_power, _ = optimizer._get_beta_accumulators()
+        self.assertAllEqual(6., self.evaluate(beta1_power))
+        status = second_root_checkpointable.restore(save_path)
+        status.assert_consumed().run_restore_ops()
+        self.assertAllEqual([4.], self.evaluate(second_variable))
+        self.assertAllEqual([5.], self.evaluate(optimizer.get_slot(
+            var=second_variable, name="m")))
+        beta1_power, _ = optimizer._get_beta_accumulators()
+        self.assertAllEqual(6., self.evaluate(beta1_power))
+
+      # Check that the first graph is unmolested
+      with first_graph.as_default(), first_session.as_default():
+        self.assertAllEqual([1.], self.evaluate(first_variable))
+        self.assertAllEqual([2.], self.evaluate(optimizer.get_slot(
+            var=first_variable, name="m")))
+        beta1_power, _ = optimizer._get_beta_accumulators()
+        self.assertAllEqual(3., self.evaluate(beta1_power))
+
+
+class TemplateTests(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
+  def test_checkpointable_save_restore(self):
+
+    def _templated():
+      v = variable_scope.get_variable(
+          "v", shape=[1], initializer=init_ops.zeros_initializer())
+      v2 = variable_scope.get_variable(
+          "v2", shape=[1], initializer=init_ops.zeros_initializer())
+      return v, v + 1., v2
+
+    save_template = template.make_template("s1", _templated)
+    save_root = checkpointable_utils.Checkpoint(my_template=save_template)
+    v1_save, _, v2_save = save_template()
+    self.evaluate(v1_save.assign([12.]))
+    self.evaluate(v2_save.assign([14.]))
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    save_path = save_root.save(checkpoint_prefix)
+
+    load_template = template.make_template("s2", _templated)
+    load_root = checkpointable_utils.Checkpoint(my_template=load_template)
+    status = load_root.restore(save_path)
+    var, var_plus_one, var2 = load_template()
+    self.assertEqual(2, len(load_template._checkpoint_dependencies))
+    self.assertEqual("v", load_template._checkpoint_dependencies[0].name)
+    self.assertEqual("v2", load_template._checkpoint_dependencies[1].name)
+    status.assert_consumed().run_restore_ops()
+    self.assertAllEqual([12.], self.evaluate(var))
+    self.assertAllEqual([13.], self.evaluate(var_plus_one))
+    self.assertAllEqual([14.], self.evaluate(var2))
+
+  @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
+  def test_checkpointable_save_restore_nested(self):
+
+    def _inner_template():
+      v = variable_scope.get_variable(
+          "v", shape=[1], initializer=init_ops.zeros_initializer())
+      return v
+
+    def _outer_template():
+      first_inner = template.make_template("i1", _inner_template)
+      second_inner = template.make_template("i2", _inner_template)
+      v1 = first_inner()
+      v2 = second_inner()
+      v3 = second_inner()
+      return (first_inner, second_inner), (v1, v2, v3)
+
+    with variable_scope.variable_scope("ignored"):
+      save_template = template.make_template("s1", _outer_template)
+      save_root = checkpointable_utils.Checkpoint(my_template=save_template)
+      (inner_template_one, inner_template_two), _ = save_template()
+    self.evaluate(inner_template_one.variables[0].assign([20.]))
+    self.evaluate(inner_template_two.variables[0].assign([25.]))
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    save_path = save_root.save(checkpoint_prefix)
+
+    load_template = template.make_template("s2", _outer_template)
+    load_root = checkpointable_utils.Checkpoint(my_template=load_template)
+    status = load_root.restore(save_path)
+    (inner_template_one, inner_template_two), (v1, v2, v3) = load_template()
+    outer_template_dependencies = load_root.my_template._checkpoint_dependencies
+    self.assertEqual(2, len(outer_template_dependencies))
+    self.assertEqual("i1", outer_template_dependencies[0].name)
+    self.assertIs(inner_template_one, outer_template_dependencies[0].ref)
+    self.assertEqual("i2", outer_template_dependencies[1].name)
+    self.assertIs(inner_template_two, outer_template_dependencies[1].ref)
+    self.assertEqual(1, len(inner_template_one._checkpoint_dependencies))
+    self.assertEqual("v", inner_template_one._checkpoint_dependencies[0].name)
+    self.assertEqual(1, len(inner_template_two._checkpoint_dependencies))
+    self.assertEqual("v", inner_template_two._checkpoint_dependencies[0].name)
+    status.assert_consumed().run_restore_ops()
+    self.assertAllEqual([20.], self.evaluate(v1))
+    self.assertAllEqual([25.], self.evaluate(v2))
+    self.assertAllEqual([25.], self.evaluate(v3))
+
+
+class CheckpointCompatibilityTests(test.TestCase):
+
+  def _initialized_model(self):
+    input_value = constant_op.constant([[3.]])
+    model = MyModel()
+    optimizer = adam.AdamOptimizer(0.001)
+    optimizer_step = training_util.get_or_create_global_step()
+    root_checkpointable = checkpointable_utils.Checkpoint(
+        optimizer=optimizer, model=model, optimizer_step=optimizer_step)
+    train_op = optimizer.minimize(
+        functools.partial(model, input_value),
+        global_step=optimizer_step)
+    self.evaluate(checkpointable_utils.gather_initializers(
+        root_checkpointable))
+    self.evaluate(train_op)
+    # A regular variable, a slot variable, and a non-slot Optimizer variable
+    # with known values to check when loading.
+    self.evaluate(model._named_dense.bias.assign([1.]))
+    self.evaluate(optimizer.get_slot(
+        var=model._named_dense.bias, name="m").assign([2.]))
+    beta1_power, _ = optimizer._get_beta_accumulators()
+    self.evaluate(beta1_power.assign(3.))
+    return root_checkpointable
+
+  def _set_sentinels(self, root_checkpointable):
+    self.evaluate(root_checkpointable.model._named_dense.bias.assign([101.]))
+    self.evaluate(
+        root_checkpointable.optimizer.get_slot(
+            var=root_checkpointable.model._named_dense.bias, name="m")
+        .assign([102.]))
+    beta1_power, _ = root_checkpointable.optimizer._get_beta_accumulators()
+    self.evaluate(beta1_power.assign(103.))
+
+  def _check_sentinels(self, root_checkpointable):
+    self.assertAllEqual(
+        [1.], self.evaluate(root_checkpointable.model._named_dense.bias))
+    self.assertAllEqual([2.], self.evaluate(
+        root_checkpointable.optimizer.get_slot(
+            var=root_checkpointable.model._named_dense.bias, name="m")))
+    beta1_power, _ = root_checkpointable.optimizer._get_beta_accumulators()
+    self.assertAllEqual(3., self.evaluate(beta1_power))
+
+  def _write_name_based_checkpoint(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    with context.graph_mode():
+      save_graph = ops.Graph()
+      with save_graph.as_default(), self.test_session(
+          graph=save_graph) as session:
+        root = self._initialized_model()
+        name_saver = core_saver.Saver()
+        return name_saver.save(
+            sess=session, save_path=checkpoint_prefix,
+            global_step=root.optimizer_step)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testLoadFromNameBasedSaver(self):
+    """Save a name-based checkpoint, load it using the object-based API."""
+    save_path = self._write_name_based_checkpoint()
+    root = self._initialized_model()
+    self._set_sentinels(root)
+    with self.assertRaises(AssertionError):
+      self._check_sentinels(root)
+    object_saver = checkpointable_utils.CheckpointableSaver(root)
+    status = object_saver.restore(save_path)
+    with self.assertRaises(AssertionError):
+      status.assert_consumed()
+    status.run_restore_ops()
+    self._check_sentinels(root)
+    self._set_sentinels(root)
+    status.initialize_or_restore()
+    self._check_sentinels(root)
+
+  # TODO(allenl): Test for the core name-based saver loading object-based
+  # checkpoints once object-based checkpointing is in core.
+
+  def testSaveGraphLoadEager(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    with context.graph_mode():
+      save_graph = ops.Graph()
+      with save_graph.as_default(), self.test_session(
+          graph=save_graph) as session:
+        root = self._initialized_model()
+        object_saver = checkpointable_utils.CheckpointableSaver(root)
+        save_path = object_saver.save(
+            session=session, file_prefix=checkpoint_prefix)
+    with context.eager_mode():
+      root = self._initialized_model()
+      self._set_sentinels(root)
+      root.restore(save_path).assert_consumed()
+      self._check_sentinels(root)
+
+  def testSaveEagerLoadGraph(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    with context.eager_mode():
+      root = self._initialized_model()
+      object_saver = checkpointable_utils.CheckpointableSaver(root)
+      save_path = object_saver.save(file_prefix=checkpoint_prefix)
+    with context.graph_mode():
+      save_graph = ops.Graph()
+      with save_graph.as_default(), self.test_session(
+          graph=save_graph):
+        root = self._initialized_model()
+        self._set_sentinels(root)
+        root.restore(save_path).assert_consumed().run_restore_ops()
+        self._check_sentinels(root)
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/contrib/eager/python/datasets.py b/tensorflow/contrib/eager/python/datasets.py
index d177bfeab2d1fdc05d7ced54df8723fae2c77fdb..36b7d6d0098baea997ffeaaffab18e0f996a5110 100644
--- a/tensorflow/contrib/eager/python/datasets.py
+++ b/tensorflow/contrib/eager/python/datasets.py
@@ -71,7 +71,7 @@ class Iterator(object):
     if not context.in_eager_mode():
       raise RuntimeError(
           "{} objects can only be used when eager execution is enabled, use "
-          "tf.data.Dataset.make_iterator or "
+          "tf.data.Dataset.make_initializable_iterator or "
           "tf.data.Dataset.make_one_shot_iterator for graph construction".
           format(type(self)))
     with ops.device("/device:CPU:0"):
diff --git a/tensorflow/contrib/eager/python/datasets_test.py b/tensorflow/contrib/eager/python/datasets_test.py
index a1611e92b113839c2dd2a3b2560b0ba90c0a7ef0..35c3c5d3fad0a84bbe4d24c7bb17878583bded4b 100644
--- a/tensorflow/contrib/eager/python/datasets_test.py
+++ b/tensorflow/contrib/eager/python/datasets_test.py
@@ -16,11 +16,14 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import threading
 import time
 
 import numpy as np
 
 from tensorflow.contrib import lookup
+from tensorflow.contrib.data.python.ops import threadpool
+from tensorflow.contrib.data.python.ops import unique
 from tensorflow.contrib.eager.python import datasets
 from tensorflow.python.data import Dataset
 from tensorflow.python.eager import test
@@ -165,6 +168,38 @@ class IteratorTest(test.TestCase):
       x = math_ops.add(x, x)
     self.assertAllEqual([0., 2.], x.numpy())
 
+  def testOverrideThreadPool(self):
+
+    def get_thread_id(_):
+      # Python creates a dummy thread object to represent the current
+      # thread when called from an "alien" thread (such as a
+      # `PrivateThreadPool` thread in this case). It does not include
+      # the TensorFlow-given display name, but it has a unique
+      # identifier that maps one-to-one with the underlying OS thread.
+      return np.array(threading.current_thread().ident).astype(np.int64)
+
+    for num_threads in [1, 2, 4, 8, 16]:
+
+      dataset = (
+          Dataset.range(1000).map(
+              lambda x: script_ops.py_func(get_thread_id, [x], dtypes.int64),
+              num_parallel_calls=32).apply(unique.unique()))
+
+      dataset = threadpool.override_threadpool(
+          dataset,
+          threadpool.PrivateThreadPool(
+              num_threads, display_name='private_thread_pool_%d' % num_threads))
+
+      thread_ids = []
+      for next_element in datasets.Iterator(dataset):
+        thread_ids.append(next_element)
+      self.assertEqual(len(thread_ids), len(set(thread_ids)))
+      self.assertGreater(len(thread_ids), 0)
+      # NOTE(mrry): We don't control the thread pool scheduling, and
+      # so cannot guarantee that all of the threads in the pool will
+      # perform work.
+      self.assertLessEqual(len(thread_ids), num_threads)
+
 
 class DatasetConstructorBenchmark(test.Benchmark):
 
diff --git a/tensorflow/contrib/eager/python/examples/BUILD b/tensorflow/contrib/eager/python/examples/BUILD
index 15a21885f66eface291a39fa0ee1ff28bc297548..c1fd9e0ed020beeb722204edf1adfe1dfcf8ff03 100644
--- a/tensorflow/contrib/eager/python/examples/BUILD
+++ b/tensorflow/contrib/eager/python/examples/BUILD
@@ -8,7 +8,6 @@ py_library(
     deps = [
         "//tensorflow/contrib/eager/python/examples/gan:mnist",
         "//tensorflow/contrib/eager/python/examples/linear_regression",
-        "//tensorflow/contrib/eager/python/examples/mnist",
         "//tensorflow/contrib/eager/python/examples/resnet50",
         "//tensorflow/contrib/eager/python/examples/rnn_colorbot",
         "//tensorflow/contrib/eager/python/examples/rnn_ptb",
diff --git a/tensorflow/contrib/eager/python/examples/gan/mnist.py b/tensorflow/contrib/eager/python/examples/gan/mnist.py
index b9ac79f46c83bb709918e3b72830b90ddcfd71b4..5f51d52622caedc6baa9f9f9950a6fd91761259a 100644
--- a/tensorflow/contrib/eager/python/examples/gan/mnist.py
+++ b/tensorflow/contrib/eager/python/examples/gan/mnist.py
@@ -35,7 +35,7 @@ from tensorflow.examples.tutorials.mnist import input_data
 FLAGS = None
 
 
-class Discriminator(tfe.Network):
+class Discriminator(tf.keras.Model):
   """GAN Discriminator.
 
   A network to differentiate between generated and real handwritten digits.
@@ -56,19 +56,15 @@ class Discriminator(tfe.Network):
     else:
       assert data_format == 'channels_last'
       self._input_shape = [-1, 28, 28, 1]
-    self.conv1 = self.track_layer(tf.layers.Conv2D(64, 5, padding='SAME',
-                                                   data_format=data_format,
-                                                   activation=tf.tanh))
-    self.pool1 = self.track_layer(
-        tf.layers.AveragePooling2D(2, 2, data_format=data_format))
-    self.conv2 = self.track_layer(tf.layers.Conv2D(128, 5,
-                                                   data_format=data_format,
-                                                   activation=tf.tanh))
-    self.pool2 = self.track_layer(
-        tf.layers.AveragePooling2D(2, 2, data_format=data_format))
-    self.flatten = self.track_layer(tf.layers.Flatten())
-    self.fc1 = self.track_layer(tf.layers.Dense(1024, activation=tf.tanh))
-    self.fc2 = self.track_layer(tf.layers.Dense(1, activation=None))
+    self.conv1 = tf.layers.Conv2D(
+        64, 5, padding='SAME', data_format=data_format, activation=tf.tanh)
+    self.pool1 = tf.layers.AveragePooling2D(2, 2, data_format=data_format)
+    self.conv2 = tf.layers.Conv2D(
+        128, 5, data_format=data_format, activation=tf.tanh)
+    self.pool2 = tf.layers.AveragePooling2D(2, 2, data_format=data_format)
+    self.flatten = tf.layers.Flatten()
+    self.fc1 = tf.layers.Dense(1024, activation=tf.tanh)
+    self.fc2 = tf.layers.Dense(1, activation=None)
 
   def call(self, inputs):
     """Return two logits per image estimating input authenticity.
@@ -95,7 +91,7 @@ class Discriminator(tfe.Network):
     return x
 
 
-class Generator(tfe.Network):
+class Generator(tf.keras.Model):
   """Generator of handwritten digits similar to the ones in the MNIST dataset.
   """
 
@@ -116,18 +112,17 @@ class Generator(tfe.Network):
     else:
       assert data_format == 'channels_last'
       self._pre_conv_shape = [-1, 6, 6, 128]
-    self.fc1 = self.track_layer(tf.layers.Dense(6 * 6 * 128,
-                                                activation=tf.tanh))
+    self.fc1 = tf.layers.Dense(6 * 6 * 128, activation=tf.tanh)
 
     # In call(), we reshape the output of fc1 to _pre_conv_shape
 
     # Deconvolution layer. Resulting image shape: (batch, 14, 14, 64)
-    self.conv1 = self.track_layer(tf.layers.Conv2DTranspose(
-        64, 4, strides=2, activation=None, data_format=data_format))
+    self.conv1 = tf.layers.Conv2DTranspose(
+        64, 4, strides=2, activation=None, data_format=data_format)
 
     # Deconvolution layer. Resulting image shape: (batch, 28, 28, 1)
-    self.conv2 = self.track_layer(tf.layers.Conv2DTranspose(
-        1, 2, strides=2, activation=tf.nn.sigmoid, data_format=data_format))
+    self.conv2 = tf.layers.Conv2DTranspose(
+        1, 2, strides=2, activation=tf.nn.sigmoid, data_format=data_format)
 
   def call(self, inputs):
     """Return a batch of generated images.
@@ -168,7 +163,8 @@ def discriminator_loss(discriminator_real_outputs, discriminator_gen_outputs):
   """
 
   loss_on_real = tf.losses.sigmoid_cross_entropy(
-      tf.ones_like(discriminator_real_outputs), discriminator_real_outputs,
+      tf.ones_like(discriminator_real_outputs),
+      discriminator_real_outputs,
       label_smoothing=0.25)
   loss_on_generated = tf.losses.sigmoid_cross_entropy(
       tf.zeros_like(discriminator_gen_outputs), discriminator_gen_outputs)
@@ -198,9 +194,8 @@ def generator_loss(discriminator_gen_outputs):
   return loss
 
 
-def train_one_epoch(generator, discriminator,
-                    generator_optimizer, discriminator_optimizer,
-                    dataset, log_interval, noise_dim):
+def train_one_epoch(generator, discriminator, generator_optimizer,
+                    discriminator_optimizer, dataset, log_interval, noise_dim):
   """Trains `generator` and `discriminator` models on `dataset`.
 
   Args:
@@ -222,14 +217,18 @@ def train_one_epoch(generator, discriminator,
 
     with tf.contrib.summary.record_summaries_every_n_global_steps(log_interval):
       current_batch_size = images.shape[0]
-      noise = tf.random_uniform(shape=[current_batch_size, noise_dim],
-                                minval=-1., maxval=1., seed=batch_index)
+      noise = tf.random_uniform(
+          shape=[current_batch_size, noise_dim],
+          minval=-1.,
+          maxval=1.,
+          seed=batch_index)
 
       with tfe.GradientTape(persistent=True) as g:
         generated_images = generator(noise)
-        tf.contrib.summary.image('generated_images',
-                                 tf.reshape(generated_images, [-1, 28, 28, 1]),
-                                 max_images=10)
+        tf.contrib.summary.image(
+            'generated_images',
+            tf.reshape(generated_images, [-1, 28, 28, 1]),
+            max_images=10)
 
         discriminator_gen_outputs = discriminator(generated_images)
         discriminator_real_outputs = discriminator(images)
@@ -245,17 +244,17 @@ def train_one_epoch(generator, discriminator,
                                       discriminator.variables)
 
       with tf.variable_scope('generator'):
-        generator_optimizer.apply_gradients(zip(generator_grad,
-                                                generator.variables))
+        generator_optimizer.apply_gradients(
+            zip(generator_grad, generator.variables))
       with tf.variable_scope('discriminator'):
-        discriminator_optimizer.apply_gradients(zip(discriminator_grad,
-                                                    discriminator.variables))
+        discriminator_optimizer.apply_gradients(
+            zip(discriminator_grad, discriminator.variables))
 
       if log_interval and batch_index > 0 and batch_index % log_interval == 0:
         print('Batch #%d\tAverage Generator Loss: %.6f\t'
-              'Average Discriminator Loss: %.6f' % (
-                  batch_index, total_generator_loss/batch_index,
-                  total_discriminator_loss/batch_index))
+              'Average Discriminator Loss: %.6f' %
+              (batch_index, total_generator_loss / batch_index,
+               total_discriminator_loss / batch_index))
 
 
 def main(_):
@@ -266,10 +265,9 @@ def main(_):
 
   # Load the datasets
   data = input_data.read_data_sets(FLAGS.data_dir)
-  dataset = (tf.data.Dataset
-             .from_tensor_slices(data.train.images)
-             .shuffle(60000)
-             .batch(FLAGS.batch_size))
+  dataset = (
+      tf.data.Dataset.from_tensor_slices(data.train.images).shuffle(60000)
+      .batch(FLAGS.batch_size))
 
   # Create the models and optimizers
   generator = Generator(data_format)
@@ -294,20 +292,17 @@ def main(_):
         start = time.time()
         with summary_writer.as_default():
           train_one_epoch(generator, discriminator, generator_optimizer,
-                          discriminator_optimizer,
-                          dataset, FLAGS.log_interval, FLAGS.noise)
+                          discriminator_optimizer, dataset, FLAGS.log_interval,
+                          FLAGS.noise)
         end = time.time()
-        print('\nTrain time for epoch #%d (global step %d): %f' % (
-            epoch, global_step.numpy(), end - start))
+        print('\nTrain time for epoch #%d (global step %d): %f' %
+              (epoch, global_step.numpy(), end - start))
 
       all_variables = (
-          generator.variables
-          + discriminator.variables
-          + generator_optimizer.variables()
-          + discriminator_optimizer.variables()
-          + [global_step])
-      tfe.Saver(all_variables).save(
-          checkpoint_prefix, global_step=global_step)
+          generator.variables + discriminator.variables +
+          generator_optimizer.variables() +
+          discriminator_optimizer.variables() + [global_step])
+      tfe.Saver(all_variables).save(checkpoint_prefix, global_step=global_step)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py b/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py
index 6ce4de6ee0bf50400eff339ac04e132252a2b53e..157a6360ea555bba37df008a6458acac0342880b 100644
--- a/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py
+++ b/tensorflow/contrib/eager/python/examples/linear_regression/linear_regression.py
@@ -33,23 +33,13 @@ import tensorflow as tf
 import tensorflow.contrib.eager as tfe
 
 
-class LinearModel(tfe.Network):
-  """A TensorFlow linear regression model.
-
-  Uses TensorFlow's eager execution.
-
-  For those familiar with TensorFlow graphs, notice the absence of
-  `tf.Session`. The `forward()` method here immediately executes and
-  returns output values. The `loss()` method immediately compares the
-  output of `forward()` with the target and returns the MSE loss value.
-  The `fit()` performs gradient-descent training on the model's weights
-  and bias.
-  """
+class LinearModel(tf.keras.Model):
+  """A TensorFlow linear regression model."""
 
   def __init__(self):
     """Constructs a LinearModel object."""
     super(LinearModel, self).__init__()
-    self._hidden_layer = self.track_layer(tf.layers.Dense(1))
+    self._hidden_layer = tf.layers.Dense(1)
 
   def call(self, xs):
     """Invoke the linear model.
diff --git a/tensorflow/contrib/eager/python/examples/mnist/BUILD b/tensorflow/contrib/eager/python/examples/mnist/BUILD
deleted file mode 100644
index c61ec2dbae60a782c0e6589701554b045dcb92ae..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/eager/python/examples/mnist/BUILD
+++ /dev/null
@@ -1,36 +0,0 @@
-licenses(["notice"])  # Apache 2.0
-
-package(default_visibility = ["//tensorflow:internal"])
-
-load("//tensorflow:tensorflow.bzl", "cuda_py_test")
-
-py_binary(
-    name = "mnist",
-    srcs = ["mnist.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        "//tensorflow:tensorflow_py",
-        "//tensorflow/contrib/eager/python:tfe",
-        "//tensorflow/examples/tutorials/mnist:input_data",
-    ],
-)
-
-cuda_py_test(
-    name = "mnist_test",
-    srcs = ["mnist_test.py"],
-    additional_deps = [
-        ":mnist",
-        "//tensorflow/contrib/eager/python:tfe",
-        "//tensorflow:tensorflow_py",
-    ],
-)
-
-cuda_py_test(
-    name = "mnist_graph_test",
-    srcs = ["mnist_graph_test.py"],
-    additional_deps = [
-        ":mnist",
-        "//third_party/py/numpy",
-        "//tensorflow:tensorflow_py",
-    ],
-)
diff --git a/tensorflow/contrib/eager/python/examples/mnist/README.md b/tensorflow/contrib/eager/python/examples/mnist/README.md
index e987996b88ccf54a322749aadec4f9840760a90f..d1c079ff6b5cb187bbcfe2742293982b1bedd2d4 100644
--- a/tensorflow/contrib/eager/python/examples/mnist/README.md
+++ b/tensorflow/contrib/eager/python/examples/mnist/README.md
@@ -1,10 +1 @@
-Classification model for the MNIST dataset using eager execution.
-
-To run:
-
-```
-python mnist.py
-```
-
-`mnist_graph_test.py` demonstrates that the same code that is executed eagerly
-in `mnist.py` is used to construct a TensorFlow graph.
+See https://github.com/tensorflow/models/tree/master/official/mnist/mnist_eager.py
diff --git a/tensorflow/contrib/eager/python/examples/mnist/mnist.py b/tensorflow/contrib/eager/python/examples/mnist/mnist.py
deleted file mode 100644
index 58b1e89d15895cf38331e6f7bd5a311a2f5f6467..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/eager/python/examples/mnist/mnist.py
+++ /dev/null
@@ -1,264 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""A deep MNIST classifier using convolutional layers.
-
-Sample usage:
-  python mnist.py --help
-"""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import argparse
-import os
-import sys
-import time
-
-import tensorflow as tf
-
-import tensorflow.contrib.eager as tfe
-from tensorflow.examples.tutorials.mnist import input_data
-
-FLAGS = None
-
-
-class MNISTModel(tf.keras.Model):
-  """MNIST Network.
-
-  Network structure is equivalent to:
-  https://github.com/tensorflow/tensorflow/blob/r1.6/tensorflow/examples/tutorials/mnist/mnist_deep.py
-  and
-  https://github.com/tensorflow/models/blob/master/tutorials/image/mnist/convolutional.py
-
-  But written using the tf.layers API.
-  """
-
-  def __init__(self, data_format):
-    """Creates a model for classifying a hand-written digit.
-
-    Args:
-      data_format: Either 'channels_first' or 'channels_last'.
-        'channels_first' is typically faster on GPUs while 'channels_last' is
-        typically faster on CPUs. See
-        https://www.tensorflow.org/performance/performance_guide#data_formats
-    """
-    super(MNISTModel, self).__init__(name='')
-    if data_format == 'channels_first':
-      self._input_shape = [-1, 1, 28, 28]
-    else:
-      assert data_format == 'channels_last'
-      self._input_shape = [-1, 28, 28, 1]
-    self.conv1 = tf.layers.Conv2D(
-        32, 5, data_format=data_format, activation=tf.nn.relu)
-    self.conv2 = tf.layers.Conv2D(
-        64, 5, data_format=data_format, activation=tf.nn.relu)
-    self.fc1 = tf.layers.Dense(1024, activation=tf.nn.relu)
-    self.fc2 = tf.layers.Dense(10)
-    self.dropout = tf.layers.Dropout(0.5)
-    self.max_pool2d = tf.layers.MaxPooling2D(
-        (2, 2), (2, 2), padding='SAME', data_format=data_format)
-
-  def call(self, inputs, training=False):
-    """Computes labels from inputs.
-
-    Users should invoke __call__ to run the network, which delegates to this
-    method (and not call this method directly).
-
-    Args:
-      inputs: A batch of images as a Tensor with shape [batch_size, 784].
-      training: True if invoked in the context of training (causing dropout to
-        be applied).  False otherwise.
-
-    Returns:
-      A Tensor with shape [batch_size, 10] containing the predicted logits
-      for each image in the batch, for each of the 10 classes.
-    """
-
-    x = tf.reshape(inputs, self._input_shape)
-    x = self.conv1(x)
-    x = self.max_pool2d(x)
-    x = self.conv2(x)
-    x = self.max_pool2d(x)
-    x = tf.layers.flatten(x)
-    x = self.fc1(x)
-    x = self.dropout(x, training=training)
-    x = self.fc2(x)
-    return x
-
-
-def loss(predictions, labels):
-  return tf.reduce_mean(
-      tf.nn.softmax_cross_entropy_with_logits(
-          logits=predictions, labels=labels))
-
-
-def compute_accuracy(predictions, labels):
-  return tf.reduce_sum(
-      tf.cast(
-          tf.equal(
-              tf.argmax(predictions, axis=1,
-                        output_type=tf.int64),
-              tf.argmax(labels, axis=1,
-                        output_type=tf.int64)),
-          dtype=tf.float32)) / float(predictions.shape[0].value)
-
-
-def train_one_epoch(model, optimizer, dataset, log_interval=None):
-  """Trains model on `dataset` using `optimizer`."""
-
-  tf.train.get_or_create_global_step()
-
-  for (batch, (images, labels)) in enumerate(tfe.Iterator(dataset)):
-    with tf.contrib.summary.record_summaries_every_n_global_steps(10):
-      with tfe.GradientTape() as tape:
-        prediction = model(images, training=True)
-        loss_value = loss(prediction, labels)
-        tf.contrib.summary.scalar('loss', loss_value)
-        tf.contrib.summary.scalar('accuracy',
-                                  compute_accuracy(prediction, labels))
-      grads = tape.gradient(loss_value, model.variables)
-      optimizer.apply_gradients(zip(grads, model.variables))
-      if log_interval and batch % log_interval == 0:
-        print('Batch #%d\tLoss: %.6f' % (batch, loss_value))
-
-
-def test(model, dataset):
-  """Perform an evaluation of `model` on the examples from `dataset`."""
-  avg_loss = tfe.metrics.Mean('loss')
-  accuracy = tfe.metrics.Accuracy('accuracy')
-
-  for (images, labels) in tfe.Iterator(dataset):
-    predictions = model(images, training=False)
-    avg_loss(loss(predictions, labels))
-    accuracy(tf.argmax(predictions, axis=1, output_type=tf.int64),
-             tf.argmax(labels, axis=1, output_type=tf.int64))
-  print('Test set: Average loss: %.4f, Accuracy: %4f%%\n' %
-        (avg_loss.result(), 100 * accuracy.result()))
-  with tf.contrib.summary.always_record_summaries():
-    tf.contrib.summary.scalar('loss', avg_loss.result())
-    tf.contrib.summary.scalar('accuracy', accuracy.result())
-
-
-def load_data(data_dir):
-  """Returns training and test tf.data.Dataset objects."""
-  data = input_data.read_data_sets(data_dir, one_hot=True)
-  train_ds = tf.data.Dataset.from_tensor_slices((data.train.images,
-                                                 data.train.labels))
-  test_ds = tf.data.Dataset.from_tensors((data.test.images, data.test.labels))
-  return (train_ds, test_ds)
-
-
-def main(_):
-  tfe.enable_eager_execution()
-
-  (device, data_format) = ('/gpu:0', 'channels_first')
-  if FLAGS.no_gpu or tfe.num_gpus() <= 0:
-    (device, data_format) = ('/cpu:0', 'channels_last')
-  print('Using device %s, and data format %s.' % (device, data_format))
-
-  # Load the datasets
-  (train_ds, test_ds) = load_data(FLAGS.data_dir)
-  train_ds = train_ds.shuffle(60000).batch(FLAGS.batch_size)
-
-  # Create the model and optimizer
-  model = MNISTModel(data_format)
-  optimizer = tf.train.MomentumOptimizer(FLAGS.lr, FLAGS.momentum)
-
-  if FLAGS.output_dir:
-    train_dir = os.path.join(FLAGS.output_dir, 'train')
-    test_dir = os.path.join(FLAGS.output_dir, 'eval')
-    tf.gfile.MakeDirs(FLAGS.output_dir)
-  else:
-    train_dir = None
-    test_dir = None
-  summary_writer = tf.contrib.summary.create_file_writer(
-      train_dir, flush_millis=10000)
-  test_summary_writer = tf.contrib.summary.create_file_writer(
-      test_dir, flush_millis=10000, name='test')
-  checkpoint_prefix = os.path.join(FLAGS.checkpoint_dir, 'ckpt')
-
-  with tf.device(device):
-    for epoch in range(1, 11):
-      with tfe.restore_variables_on_create(
-          tf.train.latest_checkpoint(FLAGS.checkpoint_dir)):
-        global_step = tf.train.get_or_create_global_step()
-        start = time.time()
-        with summary_writer.as_default():
-          train_one_epoch(model, optimizer, train_ds, FLAGS.log_interval)
-        end = time.time()
-        print('\nTrain time for epoch #%d (global step %d): %f' % (
-            epoch, global_step.numpy(), end - start))
-      with test_summary_writer.as_default():
-        test(model, test_ds)
-      all_variables = (
-          model.variables
-          + optimizer.variables()
-          + [global_step])
-      tfe.Saver(all_variables).save(
-          checkpoint_prefix, global_step=global_step)
-
-
-if __name__ == '__main__':
-  parser = argparse.ArgumentParser()
-  parser.add_argument(
-      '--data-dir',
-      type=str,
-      default='/tmp/tensorflow/mnist/input_data',
-      help='Directory for storing input data')
-  parser.add_argument(
-      '--batch-size',
-      type=int,
-      default=64,
-      metavar='N',
-      help='input batch size for training (default: 64)')
-  parser.add_argument(
-      '--log-interval',
-      type=int,
-      default=10,
-      metavar='N',
-      help='how many batches to wait before logging training status')
-  parser.add_argument(
-      '--output_dir',
-      type=str,
-      default=None,
-      metavar='N',
-      help='Directory to write TensorBoard summaries')
-  parser.add_argument(
-      '--checkpoint_dir',
-      type=str,
-      default='/tmp/tensorflow/mnist/checkpoints/',
-      metavar='N',
-      help='Directory to save checkpoints in (once per epoch)')
-  parser.add_argument(
-      '--lr',
-      type=float,
-      default=0.01,
-      metavar='LR',
-      help='learning rate (default: 0.01)')
-  parser.add_argument(
-      '--momentum',
-      type=float,
-      default=0.5,
-      metavar='M',
-      help='SGD momentum (default: 0.5)')
-  parser.add_argument(
-      '--no-gpu',
-      action='store_true',
-      default=False,
-      help='disables GPU usage even if a GPU is available')
-
-  FLAGS, unparsed = parser.parse_known_args()
-  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
diff --git a/tensorflow/contrib/eager/python/examples/mnist/mnist_graph_test.py b/tensorflow/contrib/eager/python/examples/mnist/mnist_graph_test.py
deleted file mode 100644
index 1af26553120b34d4682b17b1c29c81dc65e421d4..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/eager/python/examples/mnist/mnist_graph_test.py
+++ /dev/null
@@ -1,65 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-import tensorflow as tf
-from tensorflow.contrib.eager.python.examples.mnist import mnist
-
-
-def data_format():
-  return "channels_first" if tf.test.is_gpu_available() else "channels_last"
-
-
-class MNISTGraphTest(tf.test.TestCase):
-
-  def testTrainGraph(self):
-    # The MNISTModel class can be executed eagerly (as in mnist.py and
-    # mnist_test.py) and also be used to construct a TensorFlow graph, which is
-    # then trained in a session.
-    with tf.Graph().as_default():
-      # Generate some random data.
-      batch_size = 64
-      images = np.random.randn(batch_size, 784).astype(np.float32)
-      digits = np.random.randint(low=0, high=10, size=batch_size)
-      labels = np.zeros((batch_size, 10))
-      labels[np.arange(batch_size), digits] = 1.
-
-      # Create a model, optimizer, and dataset as would be done
-      # for eager execution as well.
-      model = mnist.MNISTModel(data_format())
-      optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)
-      dataset = tf.data.Dataset.from_tensors((images, labels))
-
-      # Define the loss tensor (as opposed to a loss function when
-      # using eager execution).
-      (images, labels) = dataset.make_one_shot_iterator().get_next()
-      predictions = model(images, training=True)
-      loss = mnist.loss(predictions, labels)
-
-      train_op = optimizer.minimize(loss)
-      init = tf.global_variables_initializer()
-      with tf.Session() as sess:
-        # Variables have to be initialized in the session.
-        sess.run(init)
-        # Train using the optimizer.
-        sess.run(train_op)
-
-
-if __name__ == "__main__":
-  tf.test.main()
diff --git a/tensorflow/contrib/eager/python/examples/mnist/mnist_test.py b/tensorflow/contrib/eager/python/examples/mnist/mnist_test.py
deleted file mode 100644
index 136085eba21284a42282395e54f32c33bf63b5c3..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/eager/python/examples/mnist/mnist_test.py
+++ /dev/null
@@ -1,80 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import tensorflow as tf
-
-import tensorflow.contrib.eager as tfe
-from tensorflow.contrib.eager.python.examples.mnist import mnist
-
-
-def device():
-  return "/device:GPU:0" if tfe.num_gpus() else "/device:CPU:0"
-
-
-def data_format():
-  return "channels_first" if tfe.num_gpus() else "channels_last"
-
-
-def random_dataset():
-  batch_size = 64
-  images = tf.random_normal([batch_size, 784])
-  digits = tf.random_uniform([batch_size], minval=0, maxval=10, dtype=tf.int32)
-  labels = tf.one_hot(digits, 10)
-  return tf.data.Dataset.from_tensors((images, labels))
-
-
-def train_one_epoch(defun=False):
-  model = mnist.MNISTModel(data_format())
-  if defun:
-    model.call = tfe.defun(model.call)
-  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
-  dataset = random_dataset()
-  with tf.device(device()):
-    tf.train.get_or_create_global_step()
-    mnist.train_one_epoch(model, optimizer, dataset)
-
-
-def evaluate(defun=False):
-  model = mnist.MNISTModel(data_format())
-  dataset = random_dataset()
-  if defun:
-    model.call = tfe.defun(model.call)
-  with tf.device(device()):
-    tf.train.get_or_create_global_step()
-    mnist.test(model, dataset)
-
-
-class MNISTTest(tf.test.TestCase):
-
-  def testTrainOneEpoch(self):
-    train_one_epoch(defun=False)
-
-  def testTest(self):
-    evaluate(defun=False)
-
-  def testTrainOneEpochWithDefunCall(self):
-    train_one_epoch(defun=True)
-
-  def testTestWithDefunCall(self):
-    evaluate(defun=True)
-
-
-if __name__ == "__main__":
-  tfe.enable_eager_execution()
-  tf.test.main()
diff --git a/tensorflow/contrib/eager/python/examples/resnet50/resnet50.py b/tensorflow/contrib/eager/python/examples/resnet50/resnet50.py
index 9982fdb07eefa665379e7be095f4f8017d92cf97..6b59413141f78fc85474850e109454ecdeb68cd3 100644
--- a/tensorflow/contrib/eager/python/examples/resnet50/resnet50.py
+++ b/tensorflow/contrib/eager/python/examples/resnet50/resnet50.py
@@ -27,10 +27,9 @@ from __future__ import print_function
 import functools
 
 import tensorflow as tf
-import tensorflow.contrib.eager as tfe
 
 
-class _IdentityBlock(tfe.Network):
+class _IdentityBlock(tf.keras.Model):
   """_IdentityBlock is the block that has no conv layer at shortcut.
 
   Args:
@@ -50,31 +49,24 @@ class _IdentityBlock(tfe.Network):
     bn_name_base = 'bn' + str(stage) + block + '_branch'
     bn_axis = 1 if data_format == 'channels_first' else 3
 
-    self.conv2a = self.track_layer(
-        tf.layers.Conv2D(
-            filters1, (1, 1),
-            name=conv_name_base + '2a',
-            data_format=data_format))
-    self.bn2a = self.track_layer(
-        tf.layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2a'))
-
-    self.conv2b = self.track_layer(
-        tf.layers.Conv2D(
-            filters2,
-            kernel_size,
-            padding='same',
-            data_format=data_format,
-            name=conv_name_base + '2b'))
-    self.bn2b = self.track_layer(
-        tf.layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2b'))
-
-    self.conv2c = self.track_layer(
-        tf.layers.Conv2D(
-            filters3, (1, 1),
-            name=conv_name_base + '2c',
-            data_format=data_format))
-    self.bn2c = self.track_layer(
-        tf.layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2c'))
+    self.conv2a = tf.layers.Conv2D(
+        filters1, (1, 1), name=conv_name_base + '2a', data_format=data_format)
+    self.bn2a = tf.layers.BatchNormalization(
+        axis=bn_axis, name=bn_name_base + '2a')
+
+    self.conv2b = tf.layers.Conv2D(
+        filters2,
+        kernel_size,
+        padding='same',
+        data_format=data_format,
+        name=conv_name_base + '2b')
+    self.bn2b = tf.layers.BatchNormalization(
+        axis=bn_axis, name=bn_name_base + '2b')
+
+    self.conv2c = tf.layers.Conv2D(
+        filters3, (1, 1), name=conv_name_base + '2c', data_format=data_format)
+    self.bn2c = tf.layers.BatchNormalization(
+        axis=bn_axis, name=bn_name_base + '2c')
 
   def call(self, input_tensor, training=False):
     x = self.conv2a(input_tensor)
@@ -92,7 +84,7 @@ class _IdentityBlock(tfe.Network):
     return tf.nn.relu(x)
 
 
-class _ConvBlock(tfe.Network):
+class _ConvBlock(tf.keras.Model):
   """_ConvBlock is the block that has a conv layer at shortcut.
 
   Args:
@@ -121,41 +113,35 @@ class _ConvBlock(tfe.Network):
     bn_name_base = 'bn' + str(stage) + block + '_branch'
     bn_axis = 1 if data_format == 'channels_first' else 3
 
-    self.conv2a = self.track_layer(
-        tf.layers.Conv2D(
-            filters1, (1, 1),
-            strides=strides,
-            name=conv_name_base + '2a',
-            data_format=data_format))
-    self.bn2a = self.track_layer(
-        tf.layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2a'))
-
-    self.conv2b = self.track_layer(
-        tf.layers.Conv2D(
-            filters2,
-            kernel_size,
-            padding='same',
-            name=conv_name_base + '2b',
-            data_format=data_format))
-    self.bn2b = self.track_layer(
-        tf.layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2b'))
-
-    self.conv2c = self.track_layer(
-        tf.layers.Conv2D(
-            filters3, (1, 1),
-            name=conv_name_base + '2c',
-            data_format=data_format))
-    self.bn2c = self.track_layer(
-        tf.layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '2c'))
-
-    self.conv_shortcut = self.track_layer(
-        tf.layers.Conv2D(
-            filters3, (1, 1),
-            strides=strides,
-            name=conv_name_base + '1',
-            data_format=data_format))
-    self.bn_shortcut = self.track_layer(
-        tf.layers.BatchNormalization(axis=bn_axis, name=bn_name_base + '1'))
+    self.conv2a = tf.layers.Conv2D(
+        filters1, (1, 1),
+        strides=strides,
+        name=conv_name_base + '2a',
+        data_format=data_format)
+    self.bn2a = tf.layers.BatchNormalization(
+        axis=bn_axis, name=bn_name_base + '2a')
+
+    self.conv2b = tf.layers.Conv2D(
+        filters2,
+        kernel_size,
+        padding='same',
+        name=conv_name_base + '2b',
+        data_format=data_format)
+    self.bn2b = tf.layers.BatchNormalization(
+        axis=bn_axis, name=bn_name_base + '2b')
+
+    self.conv2c = tf.layers.Conv2D(
+        filters3, (1, 1), name=conv_name_base + '2c', data_format=data_format)
+    self.bn2c = tf.layers.BatchNormalization(
+        axis=bn_axis, name=bn_name_base + '2c')
+
+    self.conv_shortcut = tf.layers.Conv2D(
+        filters3, (1, 1),
+        strides=strides,
+        name=conv_name_base + '1',
+        data_format=data_format)
+    self.bn_shortcut = tf.layers.BatchNormalization(
+        axis=bn_axis, name=bn_name_base + '1')
 
   def call(self, input_tensor, training=False):
     x = self.conv2a(input_tensor)
@@ -176,7 +162,8 @@ class _ConvBlock(tfe.Network):
     return tf.nn.relu(x)
 
 
-class ResNet50(tfe.Network):
+# pylint: disable=not-callable
+class ResNet50(tf.keras.Model):
   """Instantiates the ResNet50 architecture.
 
   Args:
@@ -220,32 +207,28 @@ class ResNet50(tfe.Network):
     self.include_top = include_top
 
     def conv_block(filters, stage, block, strides=(2, 2)):
-      l = _ConvBlock(
+      return _ConvBlock(
           3,
           filters,
           stage=stage,
           block=block,
           data_format=data_format,
           strides=strides)
-      return self.track_layer(l)
 
     def id_block(filters, stage, block):
-      l = _IdentityBlock(
+      return _IdentityBlock(
           3, filters, stage=stage, block=block, data_format=data_format)
-      return self.track_layer(l)
-
-    self.conv1 = self.track_layer(
-        tf.layers.Conv2D(
-            64, (7, 7),
-            strides=(2, 2),
-            data_format=data_format,
-            padding='same',
-            name='conv1'))
+
+    self.conv1 = tf.layers.Conv2D(
+        64, (7, 7),
+        strides=(2, 2),
+        data_format=data_format,
+        padding='same',
+        name='conv1')
     bn_axis = 1 if data_format == 'channels_first' else 3
-    self.bn_conv1 = self.track_layer(
-        tf.layers.BatchNormalization(axis=bn_axis, name='bn_conv1'))
-    self.max_pool = self.track_layer(
-        tf.layers.MaxPooling2D((3, 3), strides=(2, 2), data_format=data_format))
+    self.bn_conv1 = tf.layers.BatchNormalization(axis=bn_axis, name='bn_conv1')
+    self.max_pool = tf.layers.MaxPooling2D(
+        (3, 3), strides=(2, 2), data_format=data_format)
 
     self.l2a = conv_block([64, 64, 256], stage=2, block='a', strides=(1, 1))
     self.l2b = id_block([64, 64, 256], stage=2, block='b')
@@ -267,13 +250,11 @@ class ResNet50(tfe.Network):
     self.l5b = id_block([512, 512, 2048], stage=5, block='b')
     self.l5c = id_block([512, 512, 2048], stage=5, block='c')
 
-    self.avg_pool = self.track_layer(
-        tf.layers.AveragePooling2D(
-            (7, 7), strides=(7, 7), data_format=data_format))
+    self.avg_pool = tf.layers.AveragePooling2D(
+        (7, 7), strides=(7, 7), data_format=data_format)
 
     if self.include_top:
-      self.fc1000 = self.track_layer(
-          tf.layers.Dense(classes, name='fc1000'))
+      self.fc1000 = tf.layers.Dense(classes, name='fc1000')
     else:
       reduction_indices = [1, 2] if data_format == 'channels_last' else [2, 3]
       reduction_indices = tf.constant(reduction_indices)
@@ -288,7 +269,7 @@ class ResNet50(tfe.Network):
       else:
         self.global_pooling = None
 
-  def call(self, input_tensor, training=False):
+  def call(self, input_tensor, training):
     x = self.conv1(input_tensor)
     x = self.bn_conv1(x, training=training)
     x = tf.nn.relu(x)
diff --git a/tensorflow/contrib/eager/python/examples/resnet50/resnet50_graph_test.py b/tensorflow/contrib/eager/python/examples/resnet50/resnet50_graph_test.py
index 23317886e712323f4b520000e0fd372734fc53a1..551c76b0df71c88919df9cd6d81b4176b23b0ba3 100644
--- a/tensorflow/contrib/eager/python/examples/resnet50/resnet50_graph_test.py
+++ b/tensorflow/contrib/eager/python/examples/resnet50/resnet50_graph_test.py
@@ -55,7 +55,7 @@ class ResNet50GraphTest(tf.test.TestCase):
     with tf.Graph().as_default():
       images = tf.placeholder(tf.float32, image_shape(None))
       model = resnet50.ResNet50(data_format())
-      predictions = model(images)
+      predictions = model(images, training=False)
 
       init = tf.global_variables_initializer()
 
@@ -114,7 +114,7 @@ class ResNet50Benchmarks(tf.test.Benchmark):
     with tf.Graph().as_default():
       images = tf.placeholder(tf.float32, image_shape(None))
       model = resnet50.ResNet50(data_format())
-      predictions = model(images)
+      predictions = model(images, training=False)
 
       init = tf.global_variables_initializer()
 
diff --git a/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py b/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
index 0ff8746884c288f824f5f22ab4c550370d0e0302..65dcc53aab39670cae10846b6996c17d7b4c5ba8 100644
--- a/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
+++ b/tensorflow/contrib/eager/python/examples/resnet50/resnet50_test.py
@@ -71,7 +71,7 @@ class ResNet50Test(tf.test.TestCase):
       model.call = tfe.defun(model.call)
     with tf.device(device):
       images, _ = random_batch(2)
-      output = model(images)
+      output = model(images, training=False)
     self.assertEqual((2, 1000), output.shape)
 
   def test_apply(self):
@@ -85,7 +85,7 @@ class ResNet50Test(tf.test.TestCase):
     model = resnet50.ResNet50(data_format, include_top=False)
     with tf.device(device):
       images, _ = random_batch(2)
-      output = model(images)
+      output = model(images, training=False)
     output_shape = ((2, 2048, 1, 1)
                     if data_format == 'channels_first' else (2, 1, 1, 2048))
     self.assertEqual(output_shape, output.shape)
@@ -95,7 +95,7 @@ class ResNet50Test(tf.test.TestCase):
     model = resnet50.ResNet50(data_format, include_top=False, pooling='avg')
     with tf.device(device):
       images, _ = random_batch(2)
-      output = model(images)
+      output = model(images, training=False)
     self.assertEqual((2, 2048), output.shape)
 
   def test_train(self):
@@ -194,11 +194,11 @@ class ResNet50Benchmarks(tf.test.Benchmark):
     with tf.device(device):
       images, _ = random_batch(batch_size)
       for _ in xrange(num_burn):
-        model(images).cpu()
+        model(images, training=False).cpu()
       gc.collect()
       start = time.time()
       for _ in xrange(num_iters):
-        model(images).cpu()
+        model(images, training=False).cpu()
       self._report(label, start, num_iters, device, batch_size, data_format)
 
   def benchmark_eager_apply(self):
diff --git a/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py b/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py
index aa87b94e7b0876e65405f6bcb2d6aabde36582bf..29f02324544ede172500f799cd84068984d7d87b 100644
--- a/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py
+++ b/tensorflow/contrib/eager/python/examples/rnn_colorbot/rnn_colorbot.py
@@ -109,7 +109,7 @@ def load_dataset(data_dir, url, batch_size):
 
 
 # pylint: disable=not-callable
-class RNNColorbot(tfe.Network):
+class RNNColorbot(tf.keras.Model):
   """Multi-layer (LSTM) RNN that regresses on real-valued vector labels.
   """
 
@@ -127,23 +127,20 @@ class RNNColorbot(tfe.Network):
     self.label_dimension = label_dimension
     self.keep_prob = keep_prob
 
-    # Note the calls to `track_layer` below; these calls register the layers as
-    # network components that house trainable variables.
-    self.cells = [
-        self.track_layer(tf.nn.rnn_cell.BasicLSTMCell(size))
-        for size in rnn_cell_sizes
-    ]
-    self.relu = self.track_layer(
-        tf.layers.Dense(label_dimension, activation=tf.nn.relu, name="relu"))
+    self.cells = self._add_cells(
+        [tf.nn.rnn_cell.BasicLSTMCell(size) for size in rnn_cell_sizes])
+    self.relu = tf.layers.Dense(
+        label_dimension, activation=tf.nn.relu, name="relu")
 
-  def call(self, chars, sequence_length, training=False):
+  def call(self, inputs, training=False):
     """Implements the RNN logic and prediction generation.
 
     Args:
-      chars: a Tensor of dimension [batch_size, time_steps, 256] holding a
-        batch of one-hot encoded color names
-      sequence_length: a Tensor of dimension [batch_size] holding the length
-        of each character sequence (i.e., color name)
+      inputs: A tuple (chars, sequence_length), where chars is a batch of
+        one-hot encoded color names represented as a Tensor with dimensions
+        [batch_size, time_steps, 256] and sequence_length holds the length
+        of each character sequence (color name) as a Tensor with dimension
+        [batch_size].
       training: whether the invocation is happening during training
 
     Returns:
@@ -151,6 +148,7 @@ class RNNColorbot(tfe.Network):
       passing chars through a multi-layer RNN and applying a ReLU to the final
       hidden state.
     """
+    (chars, sequence_length) = inputs
     # Transpose the first and second dimensions so that chars is of shape
     # [time_steps, batch_size, dimension].
     chars = tf.transpose(chars, [1, 0, 2])
@@ -181,6 +179,14 @@ class RNNColorbot(tfe.Network):
     hidden_states = tf.gather_nd(chars, indices)
     return self.relu(hidden_states)
 
+  def _add_cells(self, cells):
+    # "Magic" required for keras.Model classes to track all the variables in
+    # a list of tf.layers.Layer objects.
+    # TODO(ashankar): Figure out API so user code doesn't have to do this.
+    for i, c in enumerate(cells):
+      setattr(self, "cell-%d" % i, c)
+    return cells
+
 
 def loss(labels, predictions):
   """Computes mean squared loss."""
@@ -191,7 +197,7 @@ def test(model, eval_data):
   """Computes the average loss on eval_data, which should be a Dataset."""
   avg_loss = tfe.metrics.Mean("loss")
   for (labels, chars, sequence_length) in tfe.Iterator(eval_data):
-    predictions = model(chars, sequence_length, training=False)
+    predictions = model((chars, sequence_length), training=False)
     avg_loss(loss(labels, predictions))
   print("eval/loss: %.6f\n" % avg_loss.result())
   with tf.contrib.summary.always_record_summaries():
@@ -204,7 +210,7 @@ def train_one_epoch(model, optimizer, train_data, log_interval=10):
   tf.train.get_or_create_global_step()
 
   def model_loss(labels, chars, sequence_length):
-    predictions = model(chars, sequence_length, training=True)
+    predictions = model((chars, sequence_length), training=True)
     loss_value = loss(labels, predictions)
     tf.contrib.summary.scalar("loss", loss_value)
     return loss_value
@@ -277,7 +283,7 @@ def main(_):
       (chars, length) = (tf.identity(chars), tf.identity(length))
       chars = tf.expand_dims(chars, 0)
       length = tf.expand_dims(length, 0)
-      preds = tf.unstack(model(chars, length, training=False)[0])
+      preds = tf.unstack(model((chars, length), training=False)[0])
 
     # Predictions cannot be negative, as they are generated by a ReLU layer;
     # they may, however, be greater than 1.
diff --git a/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py b/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
index 5c5c59c87744f4ffa6db90e5d8d3aa3bc8132756..69cd16d12c32c8c7c4744d8f0b4b1feedf946aa1 100644
--- a/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
+++ b/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py
@@ -39,21 +39,23 @@ from tensorflow.contrib.cudnn_rnn.python.layers import cudnn_rnn
 from tensorflow.contrib.eager.python import tfe
 
 
-class RNN(tfe.Network):
+class RNN(tf.keras.Model):
   """A static RNN.
 
-  Similar to tf.nn.static_rnn, implemented as a tf.layer.Layer.
+  Similar to tf.nn.static_rnn, implemented as a class.
   """
 
   def __init__(self, hidden_dim, num_layers, keep_ratio):
     super(RNN, self).__init__()
     self.keep_ratio = keep_ratio
-    for _ in range(num_layers):
-      self.track_layer(tf.nn.rnn_cell.BasicLSTMCell(num_units=hidden_dim))
+    self.cells = self._add_cells([
+        tf.nn.rnn_cell.BasicLSTMCell(num_units=hidden_dim)
+        for _ in range(num_layers)
+    ])
 
   def call(self, input_seq, training):
     batch_size = int(input_seq.shape[1])
-    for c in self.layers:
+    for c in self.cells:
       state = c.zero_state(batch_size, tf.float32)
       outputs = []
       input_seq = tf.unstack(input_seq, num=int(input_seq.shape[0]), axis=0)
@@ -64,7 +66,19 @@ class RNN(tfe.Network):
       input_seq = tf.stack(outputs, axis=0)
       if training:
         input_seq = tf.nn.dropout(input_seq, self.keep_ratio)
-    return input_seq, None
+    # Returning a list instead of a single tensor so that the line:
+    # y = self.rnn(y, ...)[0]
+    # in PTBModel.call works for both this RNN and CudnnLSTM (which returns a
+    # tuple (output, output_states)).
+    return [input_seq]
+
+  def _add_cells(self, cells):
+    # "Magic" required for keras.Model classes to track all the variables in
+    # a list of tf.layers.Layer objects.
+    # TODO(ashankar): Figure out API so user code doesn't have to do this.
+    for i, c in enumerate(cells):
+      setattr(self, "cell-%d" % i, c)
+    return cells
 
 
 class Embedding(tf.layers.Layer):
@@ -87,7 +101,8 @@ class Embedding(tf.layers.Layer):
     return tf.nn.embedding_lookup(self.embedding, x)
 
 
-class PTBModel(tfe.Network):
+# pylint: disable=not-callable
+class PTBModel(tf.keras.Model):
   """LSTM for word language modeling.
 
   Model described in:
@@ -109,19 +124,16 @@ class PTBModel(tfe.Network):
 
     self.keep_ratio = 1 - dropout_ratio
     self.use_cudnn_rnn = use_cudnn_rnn
-    self.embedding = self.track_layer(Embedding(vocab_size, embedding_dim))
+    self.embedding = Embedding(vocab_size, embedding_dim)
 
     if self.use_cudnn_rnn:
       self.rnn = cudnn_rnn.CudnnLSTM(
           num_layers, hidden_dim, dropout=dropout_ratio)
     else:
       self.rnn = RNN(hidden_dim, num_layers, self.keep_ratio)
-    self.track_layer(self.rnn)
 
-    self.linear = self.track_layer(
-        tf.layers.Dense(
-            vocab_size,
-            kernel_initializer=tf.random_uniform_initializer(-0.1, 0.1)))
+    self.linear = tf.layers.Dense(
+        vocab_size, kernel_initializer=tf.random_uniform_initializer(-0.1, 0.1))
     self._output_shape = [-1, embedding_dim]
 
   def call(self, input_seq, training):
@@ -136,7 +148,7 @@ class PTBModel(tfe.Network):
     y = self.embedding(input_seq)
     if training:
       y = tf.nn.dropout(y, self.keep_ratio)
-    y, _ = self.rnn(y, training=training)
+    y = self.rnn(y, training=training)[0]
     return self.linear(tf.reshape(y, self._output_shape))
 
 
@@ -148,7 +160,7 @@ def clip_gradients(grads_and_vars, clip_ratio):
 
 def loss_fn(model, inputs, targets, training):
   labels = tf.reshape(targets, [-1])
-  outputs = model(inputs, training)
+  outputs = model(inputs, training=training)
   return tf.reduce_mean(
       tf.nn.sparse_softmax_cross_entropy_with_logits(
           labels=labels, logits=outputs))
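Review note on the `_add_cells` helper added to `RNN` and `RNNColorbot` above: the comment in the diff says the `setattr` loop is needed so that a `keras.Model` tracks layers held in a plain list. The sketch below illustrates that idea in pure Python. `Layer`, `TrackingModel`, `NaiveRNN`, and `FixedRNN` are hypothetical stand-ins written for this note, not the Keras implementation; the only assumption carried over from the diff is that tracking happens on direct attribute assignment.

```python
class Layer(object):
    """Hypothetical stand-in for a layer that owns variables."""

    def __init__(self, name):
        self.name = name


class TrackingModel(object):
    """Hypothetical base class that registers layers assigned as attributes."""

    def __init__(self):
        # Create the registry without going through our own __setattr__.
        object.__setattr__(self, "_tracked", [])

    def __setattr__(self, name, value):
        # Only values that are themselves layers are registered; a list of
        # layers is just a list and is not inspected.
        if isinstance(value, Layer):
            self._tracked.append(value)
        object.__setattr__(self, name, value)


class NaiveRNN(TrackingModel):
    """Stores the cells in a list only, so none of them is tracked."""

    def __init__(self, num_layers):
        super(NaiveRNN, self).__init__()
        self.cells = [Layer("lstm_%d" % i) for i in range(num_layers)]


class FixedRNN(TrackingModel):
    """Mirrors the _add_cells pattern from the diff."""

    def __init__(self, num_layers):
        super(FixedRNN, self).__init__()
        self.cells = self._add_cells(
            [Layer("lstm_%d" % i) for i in range(num_layers)])

    def _add_cells(self, cells):
        # Assign every cell as its own attribute so the base class sees it.
        for i, c in enumerate(cells):
            setattr(self, "cell-%d" % i, c)
        return cells


naive = NaiveRNN(2)
fixed = FixedRNN(2)
print(len(naive._tracked), len(fixed._tracked))
```

The attribute names `cell-0`, `cell-1` are not valid identifiers, which is fine for `setattr`/`getattr` and keeps them from clashing with ordinary attributes such as `cells`.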
diff --git a/tensorflow/contrib/eager/python/examples/spinn/BUILD b/tensorflow/contrib/eager/python/examples/spinn/BUILD
index a1f8a759e2a556bc219f0aa13942f293c4f34cfa..98d01ad1d5a70788d2d4cb07031a8d76a6bf628f 100644
--- a/tensorflow/contrib/eager/python/examples/spinn/BUILD
+++ b/tensorflow/contrib/eager/python/examples/spinn/BUILD
@@ -38,5 +38,8 @@ cuda_py_test(
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:framework_test_lib",
     ],
-    tags = ["no_pip"],  # because spinn.py is under third_party/.
+    tags = [
+        "no_cuda_on_cpu_tap",
+        "no_pip",  # because spinn.py is under third_party/.
+    ],
 )
diff --git a/tensorflow/contrib/eager/python/g3doc/guide.md b/tensorflow/contrib/eager/python/g3doc/guide.md
index d97ff6b74cf033617154f7cbbd00cb6492a1d2f4..ebb05051f27841f1cd3d21b6218986e774ed4c9f 100644
--- a/tensorflow/contrib/eager/python/g3doc/guide.md
+++ b/tensorflow/contrib/eager/python/g3doc/guide.md
@@ -22,10 +22,9 @@ to models defined without using eager execution.
 Eager execution is included in TensorFlow versions 1.5 and above.
 Installation instructions at https://www.tensorflow.org/install/
 
-The contents of this guide are compatible with TensorFlow 1.5.
-However, if you run into bugs that are fixed in source but not the
-release, you may want to either [build from
-source](https://www.tensorflow.org/install/install_sources)
+The contents of this guide are compatible with TensorFlow 1.5. However, if you
+run into bugs that are fixed in source but not the release, you may want to
+either [build from source](https://www.tensorflow.org/install/install_sources)
 or try a nightly build. The nightly builds are available as:
 
 - [`pip` packages](https://github.com/tensorflow/tensorflow/blob/master/README.md#installation) and
@@ -570,8 +569,8 @@ for i in range(20001):
 print("Loss on test set: %f" % loss(model, data.test.images, data.test.labels).numpy())
 ```
 
-For a more complete example, see
-[`tensorflow/contrib/eager/python/examples/mnist.py`](https://www.tensorflow.org/code/tensorflow/contrib/eager/python/examples/mnist/mnist.py)
+For a more complete example, see [the example in the tensorflow/models
+repository](https://github.com/tensorflow/models/tree/master/official/mnist/mnist_eager.py).
 
 ### Checkpointing trained variables
 
@@ -860,11 +859,9 @@ eagerly or constructing graphs. This means that you can iteratively develop your
 model with eager execution enabled and later, if needed, use the same code to
 reap the benefits of representing models as computational graphs.
 
-For example,
-[`mnist.py`](https://www.tensorflow.org/code/tensorflow/contrib/eager/python/examples/mnist/mnist.py)
-defines a model that is eagerly executed. That same code is used to construct
-and execute a graph in
-[`mnist_graph_test.py`](https://www.tensorflow.org/code/tensorflow/contrib/eager/python/examples/mnist/mnist_graph_test.py).
+For example, the same model definition used to construct a graph in
+[`mnist.py`](https://github.com/tensorflow/models/tree/master/official/mnist/mnist.py)
+can be trained with eager execution enabled as in [`mnist_eager.py`](https://github.com/tensorflow/models/tree/master/official/mnist/mnist_eager.py).
 
 Other models in the [examples
 directory](https://www.tensorflow.org/code/tensorflow/contrib/eager/python/examples/)
diff --git a/tensorflow/contrib/eager/python/metrics_impl.py b/tensorflow/contrib/eager/python/metrics_impl.py
index ea8dbf2b46ea4bd0e33645ae3c590c4dd13f7a52..a34c4f758ad5a34f9f2864325baf154199e67182 100644
--- a/tensorflow/contrib/eager/python/metrics_impl.py
+++ b/tensorflow/contrib/eager/python/metrics_impl.py
@@ -30,12 +30,12 @@ from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import variable_scope
-
+from tensorflow.python.training import checkpointable
 
 _to_replace = re.compile("[^A-Za-z0-9.]")
 
 
-class Metric(object):
+class Metric(checkpointable.CheckpointableBase):
   """A metric holds state for aggregating statistics over an evaluation run.
 
   Example use with eager execution:
@@ -93,11 +93,12 @@ class Metric(object):
   `aggregate()`, it is for use by TensorFlow infrastructure.
   """
 
-  def __init__(self, name=None):
+  def __init__(self, name=None, use_global_variables=False):
     self._built = False
     self._vars = []
     self._initial_values = {}
     self._updates = []
+    self._use_global_variables = use_global_variables
     name = name or self.__class__.__name__
     # Replace things like spaces in name to create a valid scope name.
     scope_name = _to_replace.sub("_", name)
@@ -245,17 +246,29 @@ class Metric(object):
     """***Only for use by descendants of Metric***."""
     if self._built:
       raise RuntimeError("Can't call add_variable() except in build().")
-    collections = None if context.in_eager_mode() else [
-        ops.GraphKeys.LOCAL_VARIABLES, ops.GraphKeys.METRIC_VARIABLES
-    ]
-    v = variable_scope.get_variable(
-        name,
-        shape,
-        dtype,
-        initializer,
+    if context.in_eager_mode():
+      collections = None
+    else:
+      if self._use_global_variables:
+        collections = [ops.GraphKeys.GLOBAL_VARIABLES]
+      else:
+        collections = [ops.GraphKeys.LOCAL_VARIABLES]
+      collections += [ops.GraphKeys.METRIC_VARIABLES]
+    # Variables are Checkpointable dependencies of Metrics regardless of the
+    # global/local distinction. Users can avoid saving variables by not adding a
+    # dependency on the Metric.
+    v = self._add_variable_with_custom_getter(
+        name=name,
+        shape=shape,
+        dtype=dtype,
+        initializer=initializer,
         trainable=False,
         collections=collections,
-        use_resource=True)
+        use_resource=True,
+        getter=variable_scope.get_variable,
+        # Raise duplicate variable exceptions from get_variable rather than
+        # Checkpointable.
+        overwrite=True)
     self._vars.append(v)
     if context.in_eager_mode():
       self._initial_values[v] = v.value()
@@ -267,8 +280,10 @@ class Mean(Metric):
   # TODO(josh11b): Maybe have a dtype argument that defaults to tf.float64?
   # Or defaults to type of the input if it is tf.float32, else tf.float64?
 
-  def __init__(self, name=None, dtype=dtypes.float64):
-    super(Mean, self).__init__(name=name)
+  def __init__(self, name=None, dtype=dtypes.float64,
+               use_global_variables=False):
+    super(Mean, self).__init__(name=name,
+                               use_global_variables=use_global_variables)
     self.dtype = dtype
 
   def build(self, *args, **kwargs):
diff --git a/tensorflow/contrib/eager/python/metrics_test.py b/tensorflow/contrib/eager/python/metrics_test.py
index a9ecaa3f8bced3043ea0eb0ac3aa8bfa65e9e1ff..6b5450ba89bdfa6e0195f488b75f596b58c463d5 100644
--- a/tensorflow/contrib/eager/python/metrics_test.py
+++ b/tensorflow/contrib/eager/python/metrics_test.py
@@ -18,8 +18,10 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import os
 import tempfile
 
+from tensorflow.contrib.eager.python import checkpointable_utils
 from tensorflow.contrib.eager.python import metrics
 from tensorflow.contrib.summary import summary_ops
 from tensorflow.contrib.summary import summary_test_util
@@ -50,6 +52,19 @@ class MetricsTest(test.TestCase):
       self.assertEqual(
           set(m.variables),
           set(ops.get_collection(ops.GraphKeys.LOCAL_VARIABLES)))
+      self.assertEqual(ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES), [])
+      self.assertEqual(
+          set(m.variables),
+          set(ops.get_collection(ops.GraphKeys.METRIC_VARIABLES)))
+
+  def testUseGlobalVariablesCollections(self):
+    with context.graph_mode(), ops.Graph().as_default():
+      m = metrics.Mean(use_global_variables=True)
+      m(1000)
+      self.assertEqual(
+          set(m.variables),
+          set(ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)))
+      self.assertEqual(ops.get_collection(ops.GraphKeys.LOCAL_VARIABLES), [])
       self.assertEqual(
           set(m.variables),
           set(ops.get_collection(ops.GraphKeys.METRIC_VARIABLES)))
@@ -193,6 +208,31 @@ class MetricsTest(test.TestCase):
       self.assertAllEqual(m2.result().eval(), 2.0)
       self.assertAllEqual(m1.result().eval(), 1.0)
 
+  @test_util.run_in_graph_and_eager_modes()
+  def testSaveRestore(self):
+    checkpoint_directory = self.get_temp_dir()
+    checkpoint_prefix = os.path.join(checkpoint_directory, "ckpt")
+    mean = metrics.Mean()
+    checkpoint = checkpointable_utils.Checkpoint(mean=mean)
+    mean.build()
+    mean._built = True
+    self.evaluate(mean.init_variables())
+    self.evaluate(mean(100.))
+    self.evaluate(mean(200.))
+    save_path = checkpoint.save(checkpoint_prefix)
+    self.evaluate(mean(1000.))
+    checkpoint.restore(save_path).assert_consumed().run_restore_ops()
+    self.evaluate(mean(300.))
+    self.assertAllEqual(200., self.evaluate(mean.value()))
+
+    restore_mean = metrics.Mean()
+    restore_checkpoint = checkpointable_utils.Checkpoint(mean=restore_mean)
+    status = restore_checkpoint.restore(save_path)
+    restore_update = restore_mean(300.)
+    status.assert_consumed().run_restore_ops()
+    self.evaluate(restore_update)
+    self.assertAllEqual(200., self.evaluate(restore_mean.value()))
+    self.assertEqual(3, self.evaluate(restore_mean.denom))
 
 if __name__ == "__main__":
   test.main()
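Review note on `testSaveRestore` above: the expected values (mean of `200.` and `denom` of `3`) follow from the running totals that a mean metric accumulates. The sketch below replays the same sequence of updates in pure Python. `RunningMean` is a hypothetical stand-in, not `metrics.Mean`; the `denom` name mirrors the attribute the test reads, while `numer`, `save`, and `restore` are assumptions made for illustration.

```python
class RunningMean(object):
    """Hypothetical running mean with snapshot/restore of its two totals."""

    def __init__(self):
        self.numer = 0.0  # sum of all values seen so far
        self.denom = 0.0  # number of values seen so far

    def update(self, value):
        self.numer += value
        self.denom += 1.0

    def value(self):
        return self.numer / self.denom

    def save(self):
        # Snapshot the accumulated state, analogous to writing a checkpoint.
        return {"numer": self.numer, "denom": self.denom}

    def restore(self, snapshot):
        # Overwrite the current state with the snapshot.
        self.numer = snapshot["numer"]
        self.denom = snapshot["denom"]


mean = RunningMean()
mean.update(100.)
mean.update(200.)
snapshot = mean.save()   # numer=300, denom=2
mean.update(1000.)       # discarded by the restore below
mean.restore(snapshot)
mean.update(300.)        # numer=600, denom=3
print(mean.value(), mean.denom)
```

The `1000.` update is wiped out by the restore, so only `100.`, `200.`, and `300.` contribute to the final mean.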
diff --git a/tensorflow/contrib/eager/python/tfe.py b/tensorflow/contrib/eager/python/tfe.py
index d32bebf90c1e768d1efec26b3b78bf1a522a8f00..fce7a608531c0630e03a89b6625b6853389489ed 100644
--- a/tensorflow/contrib/eager/python/tfe.py
+++ b/tensorflow/contrib/eager/python/tfe.py
@@ -56,6 +56,10 @@ To use, at program startup, call `tfe.enable_eager_execution()`.
 @@save_network_checkpoint
 @@restore_network_checkpoint
 
+@@Checkpoint
+@@Checkpointable
+@@CheckpointableSaver
+
 @@in_eager_mode
 @@in_graph_mode
 
@@ -74,6 +78,8 @@ from __future__ import print_function
 # pylint:disable=g-bad-import-order,g-import-not-at-top,unused-import
 #
 from tensorflow.contrib.eager.python import metrics
+from tensorflow.contrib.eager.python.checkpointable_utils import CheckpointableSaver
+from tensorflow.contrib.eager.python.checkpointable_utils import Checkpoint
 from tensorflow.contrib.eager.python.datasets import Iterator
 from tensorflow.contrib.eager.python.network import Network
 from tensorflow.contrib.eager.python.network import Sequential
@@ -105,6 +111,7 @@ from tensorflow.python.ops.resource_variable_ops import ResourceVariable as Vari
 from tensorflow.python.ops.variable_scope import EagerVariableStore
 from tensorflow.python.ops import script_ops
 from tensorflow.python.ops import template
+from tensorflow.python.training.checkpointable import Checkpointable
 from tensorflow.python.util.all_util import remove_undocumented
 
 py_func = script_ops.eager_py_func
diff --git a/tensorflow/contrib/estimator/BUILD b/tensorflow/contrib/estimator/BUILD
index 6cdbed5b896577f5622b1bd0123c289c798bc0a5..773c6ab6c79217698c7c598a133082e2553f28f6 100644
--- a/tensorflow/contrib/estimator/BUILD
+++ b/tensorflow/contrib/estimator/BUILD
@@ -138,6 +138,7 @@ py_test(
     size = "medium",
     srcs = ["python/estimator/extenders_test.py"],
     srcs_version = "PY2AND3",
+    tags = ["notsan"],  # b/62863147
     deps = [
         ":extenders",
         "//tensorflow/contrib/data/python/ops:dataset_ops",
@@ -169,6 +170,7 @@ py_library(
         "//tensorflow/python:lookup_ops",
         "//tensorflow/python:math_ops",
         "//tensorflow/python:metrics",
+        "//tensorflow/python:nn",
         "//tensorflow/python:sparse_ops",
         "//tensorflow/python:sparse_tensor",
         "//tensorflow/python:summary",
@@ -191,6 +193,7 @@ py_test(
         ":head",
         "//tensorflow/core:protos_all_py",
         "//tensorflow/python:array_ops",
+        "//tensorflow/python:check_ops",
         "//tensorflow/python:client_testlib",
         "//tensorflow/python:constant_op",
         "//tensorflow/python:control_flow_ops",
diff --git a/tensorflow/contrib/estimator/python/estimator/head.py b/tensorflow/contrib/estimator/python/estimator/head.py
index 238cf287b768eee28b20202084eb244c085c8b75..f95fcc8039cb54c26543781b31013a7676168b0b 100644
--- a/tensorflow/contrib/estimator/python/estimator/head.py
+++ b/tensorflow/contrib/estimator/python/estimator/head.py
@@ -31,6 +31,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import lookup_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import metrics as metrics_lib
+from tensorflow.python.ops import nn
 from tensorflow.python.ops import sparse_ops
 from tensorflow.python.ops.losses import losses
 from tensorflow.python.saved_model import signature_constants
@@ -177,6 +178,7 @@ def regression_head(weight_column=None,
                     label_dimension=1,
                     loss_reduction=losses.Reduction.SUM,
                     loss_fn=None,
+                    inverse_link_fn=None,
                     name=None):
   """Creates a `_Head` for regression using the `mean_squared_error` loss.
 
@@ -195,10 +197,16 @@ def regression_head(weight_column=None,
   `[D0, D1, ... DN]`, `[D0, D1, ... DN, 1]` or
   `[D0, D1, ... DN, label_dimension]`.
 
-  Also supports custom `loss_fn`. `loss_fn` takes `(labels, logits)` or
+  Supports custom `loss_fn`. `loss_fn` takes `(labels, logits)` or
   `(labels, logits, features)` as arguments and returns unreduced loss with
   shape `[D0, D1, ... DN, label_dimension]`.
 
+  Also supports custom `inverse_link_fn`, also known as 'mean function'.
+  `inverse_link_fn` takes `logits` as argument and returns predicted values.
+  This function is the inverse of the link function defined in
+  https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function
+  For example, for Poisson regression, set `inverse_link_fn=tf.exp`.
+
   Args:
     weight_column: A string or a `_NumericColumn` created by
       `tf.feature_column.numeric_column` defining feature column representing
@@ -209,7 +217,9 @@ def regression_head(weight_column=None,
       `[batch_size, label_dimension]`).
     loss_reduction: One of `tf.losses.Reduction` except `NONE`. Describes how to
       reduce training loss over batch. Defaults to `SUM`.
-    loss_fn: Optional loss function.
+    loss_fn: Optional loss function. Defaults to `mean_squared_error`.
+    inverse_link_fn: Optional inverse link function, also known as 'mean
+      function'. Defaults to identity.
     name: name of the head. If provided, summary and metrics keys will be
       suffixed by `"/" + name`. Also used as `name_scope` when creating ops.
 
@@ -224,6 +234,67 @@ def regression_head(weight_column=None,
       label_dimension=label_dimension,
       loss_reduction=loss_reduction,
       loss_fn=loss_fn,
+      inverse_link_fn=inverse_link_fn,
+      name=name)
+
+
+def poisson_regression_head(
+    weight_column=None,
+    label_dimension=1,
+    loss_reduction=losses.Reduction.SUM,
+    compute_full_loss=True,
+    name=None):
+  """Creates a `_Head` for poisson regression using `tf.nn.log_poisson_loss`.
+
+  The loss is the weighted sum over all input dimensions. Namely, if the input
+  labels have shape `[batch_size, label_dimension]`, the loss is the weighted
+  sum over both `batch_size` and `label_dimension`.
+
+  The head expects `logits` with shape `[D0, D1, ... DN, label_dimension]`.
+  In many applications, the shape is `[batch_size, label_dimension]`.
+
+  The `labels` shape must match `logits`, namely
+  `[D0, D1, ... DN, label_dimension]`. If `label_dimension=1`, shape
+  `[D0, D1, ... DN]` is also supported.
+
+  If `weight_column` is specified, weights must be of shape
+  `[D0, D1, ... DN]`, `[D0, D1, ... DN, 1]` or
+  `[D0, D1, ... DN, label_dimension]`.
+
+  This is implemented as a generalized linear model, see
+  https://en.wikipedia.org/wiki/Generalized_linear_model.
+
+  Args:
+    weight_column: A string or a `_NumericColumn` created by
+      `tf.feature_column.numeric_column` defining feature column representing
+      weights. It is used to down weight or boost examples during training. It
+      will be multiplied by the loss of the example.
+    label_dimension: Number of regression labels per example. This is the size
+      of the last dimension of the labels `Tensor` (typically, this has shape
+      `[batch_size, label_dimension]`).
+    loss_reduction: One of `tf.losses.Reduction` except `NONE`. Describes how to
+      reduce training loss over batch. Defaults to `SUM`.
+    compute_full_loss: Whether to include the constant `log(z!)` term in
+      computing the poisson loss. See `tf.nn.log_poisson_loss` for the full
+      documentation.
+    name: name of the head. If provided, summary and metrics keys will be
+      suffixed by `"/" + name`. Also used as `name_scope` when creating ops.
+
+  Returns:
+    An instance of `_Head` for poisson regression.
+
+  Raises:
+    ValueError: If `label_dimension` or `loss_reduction` is invalid.
+  """
+  def _poisson_loss(labels, logits):
+    return nn.log_poisson_loss(
+        targets=labels, log_input=logits, compute_full_loss=compute_full_loss)
+  return head_lib._regression_head_with_mean_squared_error_loss(  # pylint:disable=protected-access
+      weight_column=weight_column,
+      label_dimension=label_dimension,
+      loss_reduction=loss_reduction,
+      loss_fn=_poisson_loss,
+      inverse_link_fn=math_ops.exp,
       name=name)
 
 
diff --git a/tensorflow/contrib/estimator/python/estimator/head_test.py b/tensorflow/contrib/estimator/python/estimator/head_test.py
index 1411635228457218578c0297d4d901e9c86ca91a..76d050cb2833a9bdb1d713bfd3b9901b741efc23 100644
--- a/tensorflow/contrib/estimator/python/estimator/head_test.py
+++ b/tensorflow/contrib/estimator/python/estimator/head_test.py
@@ -32,6 +32,7 @@ from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import string_ops
@@ -1106,5 +1107,75 @@ class MultiLabelHead(test.TestCase):
         expected_metrics=expected_metrics)
 
 
+class PoissonRegressionHead(test.TestCase):
+
+  def setUp(self):
+    ops.reset_default_graph()
+
+  def test_train(self):
+    head = head_lib.poisson_regression_head()
+
+    # Create estimator spec.
+    logits = np.array([[0], [-1], [1]], dtype=np.float32)
+    labels = np.array([[1], [2], [3]], dtype=np.int32)
+    # With x = exp(logits), z = labels.
+    # loss = -ln(exp(-x) * (x^z) / z!)
+    #      = x - z * ln(x) + ln(z!)
+    #      = exp(logits) - labels * logits + ln(labels!)
+    # But for ln(z!) and z > 1, the Stirling approximation is used
+    # ln(z!) = z*ln(z) - z + 0.5*ln(2*pi*z)
+    # loss = [exp(0) - 1 * 0 + ln(1!),
+    #         exp(-1) - 2 * (-1) + 2*ln(2) - 2 + 0.5*ln(2*pi*2),
+    #         exp(1) - 3 * 1 + 3*ln(3) - 3 + 0.5*ln(2*pi*3)]
+    #      = [1.0, 3.020, 1.482]
+    # sum_loss = 5.502
+    expected_loss = 5.502
+    atol = 0.001
+    expected_train_result = b'my_train_op'
+    def _train_op_fn(loss):
+      with ops.control_dependencies((check_ops.assert_near(
+          math_ops.to_float(expected_loss), math_ops.to_float(loss),
+          atol=atol, name='assert_loss'),)):
+        return constant_op.constant(expected_train_result)
+
+    spec = head.create_estimator_spec(
+        features={'x': np.array(((42.,),), dtype=np.int32)},
+        mode=model_fn.ModeKeys.TRAIN,
+        logits=logits,
+        labels=labels,
+        train_op_fn=_train_op_fn)
+
+    with self.test_session() as sess:
+      _initialize_variables(self, spec.scaffold)
+      loss, train_result = sess.run([spec.loss, spec.train_op])
+      self.assertAlmostEqual(expected_loss, loss, delta=atol)
+      self.assertEqual(expected_train_result, train_result)
+
+  def test_predict(self):
+    head = head_lib.poisson_regression_head()
+
+    # Create estimator spec.
+    logits = np.array([[0], [-1], [1]], dtype=np.float32)
+    expected_predictions = np.exp(logits)
+    spec = head.create_estimator_spec(
+        features={'x': np.array(((42.,),), dtype=np.int32)},
+        mode=model_fn.ModeKeys.PREDICT,
+        logits=logits)
+
+    # Assert spec contains expected tensors.
+    keys = prediction_keys.PredictionKeys
+    self.assertItemsEqual(
+        (keys.PREDICTIONS, keys.LOGITS), spec.predictions.keys())
+    self.assertEqual(dtypes.float32, spec.predictions[keys.PREDICTIONS].dtype)
+    self.assertEqual(dtypes.float32, spec.predictions[keys.LOGITS].dtype)
+
+    # Assert predictions.
+    with self.test_session():
+      _initialize_variables(self, spec.scaffold)
+      self.assertAllClose(
+          expected_predictions, spec.predictions[keys.PREDICTIONS].eval())
+      self.assertAllClose(logits, spec.predictions[keys.LOGITS].eval())
+
+
 if __name__ == '__main__':
   test.main()
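Review note on `PoissonRegressionHead.test_train` above: the expected loss of `5.502` can be checked by hand. The sketch below recomputes it with the standard library only, following the derivation in the test comment: `exp(logits) - labels * logits`, plus the Stirling approximation of `ln(labels!)` for labels greater than 1. The helper names are hypothetical, and this is a check of the arithmetic rather than a reimplementation of `tf.nn.log_poisson_loss`.

```python
import math


def stirling_log_factorial(z):
    """Stirling approximation of ln(z!), as written in the test comment."""
    return z * math.log(z) - z + 0.5 * math.log(2.0 * math.pi * z)


def poisson_loss(logit, label):
    """Full Poisson loss for one example, with x = exp(logit) and z = label."""
    loss = math.exp(logit) - label * logit
    if label > 1:
        # For z <= 1, ln(z!) is exactly 0, so nothing is added.
        loss += stirling_log_factorial(label)
    return loss


logits = [0.0, -1.0, 1.0]
labels = [1, 2, 3]

losses = [poisson_loss(x, z) for x, z in zip(logits, labels)]
total = sum(losses)

# Predictions use the inverse link function exp, as in test_predict.
predictions = [math.exp(x) for x in logits]

print(["%.3f" % l for l in losses], "%.3f" % total)
```

The per-example values agree with the `[1.0, 3.020, 1.482]` listed in the test comment to within the test's `atol` of `0.001`.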
diff --git a/tensorflow/contrib/estimator/python/estimator/replicate_model_fn.py b/tensorflow/contrib/estimator/python/estimator/replicate_model_fn.py
index 7134cd3f5a457a322f51066eb791133c3181d3fb..e0fae2c99292385c6dd32cc6002cee2076a2bb20 100644
--- a/tensorflow/contrib/estimator/python/estimator/replicate_model_fn.py
+++ b/tensorflow/contrib/estimator/python/estimator/replicate_model_fn.py
@@ -110,7 +110,8 @@ def replicate_model_fn(model_fn,
   Certain algorithms were chosen for aggregating results of computations on
   multiple towers:
     - Losses from all towers are reduced according to `loss_reduction`.
-    - Gradients are reduced using sum for each trainable variable.
+    - Gradients from all towers are reduced according to `loss_reduction`
+      for each trainable variable.
     - `eval_metrics_ops` are reduced per metric using `reduce_mean`.
     - `EstimatorSpec.predictions` and `EstimatorSpec.export_outputs` are
       reduced using concatenation.
diff --git a/tensorflow/contrib/factorization/python/ops/kmeans.py b/tensorflow/contrib/factorization/python/ops/kmeans.py
index c861cfff544a78617aa1ace730b50c094cf16330..7319eaa7de8db8e4677bdf64af3b0a72c1007a90 100644
--- a/tensorflow/contrib/factorization/python/ops/kmeans.py
+++ b/tensorflow/contrib/factorization/python/ops/kmeans.py
@@ -61,8 +61,8 @@ class _LossRelativeChangeHook(session_run_hook.SessionRunHook):
     loss = run_values.results
     assert loss is not None
     if self._prev_loss:
-      relative_change = (abs(loss - self._prev_loss) /
-                         (1 + abs(self._prev_loss)))
+      relative_change = (
+          abs(loss - self._prev_loss) / (1 + abs(self._prev_loss)))
       if relative_change < self._tolerance:
         run_context.request_stop()
     self._prev_loss = loss
@@ -233,7 +233,57 @@ class _ModelFn(object):
 
 # TODO(agarwal,ands): support sharded input.
 class KMeansClustering(estimator.Estimator):
-  """An Estimator for K-Means clustering."""
+  """An Estimator for K-Means clustering.
+
+  Example:
+  ```
+  import numpy as np
+  import tensorflow as tf
+
+  num_points = 100
+  dimensions = 2
+  points = np.random.uniform(0, 1000, [num_points, dimensions])
+
+  def input_fn():
+    return tf.train.limit_epochs(
+        tf.convert_to_tensor(points, dtype=tf.float32), num_epochs=1)
+
+  num_clusters = 5
+  kmeans = tf.contrib.factorization.KMeansClustering(
+      num_clusters=num_clusters, use_mini_batch=False)
+
+  # train
+  num_iterations = 10
+  previous_centers = None
+  for _ in range(num_iterations):
+    kmeans.train(input_fn)
+    cluster_centers = kmeans.cluster_centers()
+    if previous_centers is not None:
+      print('delta:', cluster_centers - previous_centers)
+    previous_centers = cluster_centers
+    print('score:', kmeans.score(input_fn))
+  print('cluster centers:', cluster_centers)
+
+  # map the input points to their clusters
+  cluster_indices = list(kmeans.predict_cluster_index(input_fn))
+  for i, point in enumerate(points):
+    cluster_index = cluster_indices[i]
+    center = cluster_centers[cluster_index]
+    print('point:', point, 'is in cluster', cluster_index, 'centered at', center)
+  ```
+
+  The `SavedModel` saved by the `export_savedmodel` method does not include the
+  cluster centers. However, the cluster centers may be retrieved from the
+  latest checkpoint saved during training. Specifically,
+  ```
+  kmeans.cluster_centers()
+  ```
+  is equivalent to
+  ```
+  tf.train.load_variable(
+      kmeans.model_dir, KMeansClustering.CLUSTER_CENTERS_VAR_NAME)
+  ```
+  """
 
   # Valid values for the distance_metric constructor argument.
   SQUARED_EUCLIDEAN_DISTANCE = clustering_ops.SQUARED_EUCLIDEAN_DISTANCE
@@ -253,6 +303,9 @@ class KMeansClustering(estimator.Estimator):
   CLUSTER_INDEX = 'cluster_index'
   ALL_DISTANCES = 'all_distances'
 
+  # Variable name used by cluster_centers().
+  CLUSTER_CENTERS_VAR_NAME = clustering_ops.CLUSTERS_VAR_NAME
+
   def __init__(self,
                num_clusters,
                model_dir=None,
@@ -406,4 +459,4 @@ class KMeansClustering(estimator.Estimator):
 
   def cluster_centers(self):
     """Returns the cluster centers."""
-    return self.get_variable_value(clustering_ops.CLUSTERS_VAR_NAME)
+    return self.get_variable_value(KMeansClustering.CLUSTER_CENTERS_VAR_NAME)
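The stopping rule in `_LossRelativeChangeHook` above can be traced with a small standalone sketch. The helper names `relative_change` and `first_stop_step` are hypothetical and exist only for illustration; the formula and the truthiness check on the previous loss are taken from the hook.

```python
def relative_change(loss, prev_loss):
  """Relative change in loss, damped by 1 so a tiny prev_loss is safe."""
  return abs(loss - prev_loss) / (1 + abs(prev_loss))


def first_stop_step(losses, tolerance):
  """Returns the index of the first loss that would request a stop, or None."""
  prev_loss = None
  for step, loss in enumerate(losses):
    # Mirrors `if self._prev_loss:` in the hook. Note that a previous loss of
    # exactly 0 is skipped by this truthiness check, just like None.
    if prev_loss:
      if relative_change(loss, prev_loss) < tolerance:
        return step
    prev_loss = loss
  return None


stop_step = first_stop_step([100.0, 50.0, 49.0, 48.9], tolerance=0.01)
```

With these losses the relative changes are roughly 0.495, 0.0196 and 0.002, so the stop is requested at the fourth value.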
diff --git a/tensorflow/contrib/feature_column/BUILD b/tensorflow/contrib/feature_column/BUILD
index 6fc053759c58d30c24657dd22e7d12be46fc7a7e..8ba0823a71a5aa05ea276bdd7e7117658bee4351 100644
--- a/tensorflow/contrib/feature_column/BUILD
+++ b/tensorflow/contrib/feature_column/BUILD
@@ -25,13 +25,42 @@ py_library(
     srcs = ["__init__.py"],
     srcs_version = "PY2AND3",
     deps = [
-        ":sequential_feature_column",
+        ":sequence_feature_column",
     ],
 )
 
 py_library(
-    name = "sequential_feature_column",
-    srcs = ["python/feature_column/sequential_feature_column.py"],
+    name = "sequence_feature_column",
+    srcs = ["python/feature_column/sequence_feature_column.py"],
     srcs_version = "PY2AND3",
-    deps = [],
+    deps = [
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:check_ops",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:parsing_ops",
+        "//tensorflow/python:sparse_ops",
+        "//tensorflow/python:tensor_shape",
+        "//tensorflow/python:variable_scope",
+        "//tensorflow/python/feature_column",
+    ],
+)
+
+py_test(
+    name = "sequence_feature_column_test",
+    srcs = ["python/feature_column/sequence_feature_column_test.py"],
+    srcs_version = "PY2AND3",
+    tags = ["no_pip"],
+    deps = [
+        ":sequence_feature_column",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:errors",
+        "//tensorflow/python:framework_ops",
+        "//tensorflow/python:sparse_tensor",
+        "//tensorflow/python:training",
+        "//tensorflow/python/feature_column",
+        "//third_party/py/numpy",
+    ],
 )
diff --git a/tensorflow/contrib/feature_column/__init__.py b/tensorflow/contrib/feature_column/__init__.py
index 6da7b126931effae9cc97091a27070d7013450d4..650a80144f2e2445d189bfd28a619aad1cfb13a7 100644
--- a/tensorflow/contrib/feature_column/__init__.py
+++ b/tensorflow/contrib/feature_column/__init__.py
@@ -19,7 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 # pylint: disable=unused-import,line-too-long,wildcard-import
-from tensorflow.contrib.feature_column.python.feature_column.sequential_feature_column import *
+from tensorflow.contrib.feature_column.python.feature_column.sequence_feature_column import *
 
 from tensorflow.python.util.all_util import remove_undocumented
 # pylint: enable=unused-import,line-too-long,wildcard-import
diff --git a/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py
new file mode 100644
index 0000000000000000000000000000000000000000..e446043bdd2d50c5b952fce4738b4526edd4be57
--- /dev/null
+++ b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column.py
@@ -0,0 +1,434 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Experimental methods for tf.feature_column sequence input."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+
+import abc
+import collections
+
+
+from tensorflow.python.feature_column import feature_column as fc
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import parsing_ops
+from tensorflow.python.ops import sparse_ops
+from tensorflow.python.ops import variable_scope
+
+# pylint: disable=protected-access
+# TODO(b/73827486): Support SequenceExample.
+
+
+def sequence_input_layer(
+    features,
+    feature_columns,
+    weight_collections=None,
+    trainable=True):
+  """Builds input layer for sequence input.
+
+  All `feature_columns` must be sequence dense columns with the same
+  `sequence_length`. The output of this method can be fed into sequence
+  networks, such as RNN.
+
+  The output of this method is a 3D `Tensor` of shape `[batch_size, T, D]`.
+  `T` is the maximum sequence length for this batch, which could differ from
+  batch to batch.
+
+  If multiple `feature_columns` are given with `Di` `num_elements` each, their
+  outputs are concatenated. So, the final `Tensor` has shape
+  `[batch_size, T, D0 + D1 + ... + Dn]`.
+
+  Example:
+
+  ```python
+  rating = sequence_numeric_column('rating')
+  watches = sequence_categorical_column_with_identity(
+      'watches', num_buckets=1000)
+  watches_embedding = embedding_column(watches, dimension=10)
+  columns = [rating, watches_embedding]
+
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  input_layer, sequence_length = sequence_input_layer(features, columns)
+
+  rnn_cell = tf.nn.rnn_cell.BasicRNNCell(hidden_size)
+  outputs, state = tf.nn.dynamic_rnn(
+      rnn_cell, inputs=input_layer, sequence_length=sequence_length)
+  ```
+
+  Args:
+    features: A dict mapping keys to tensors.
+    feature_columns: An iterable of dense sequence columns. Valid columns are
+      - `embedding_column` that wraps a `sequence_categorical_column_with_*`
+      - `sequence_numeric_column`.
+    weight_collections: A list of collection names to which the Variable will be
+      added. Note that variables will also be added to collections
+      `tf.GraphKeys.GLOBAL_VARIABLES` and `ops.GraphKeys.MODEL_VARIABLES`.
+    trainable: If `True` also add the variable to the graph collection
+      `GraphKeys.TRAINABLE_VARIABLES`.
+
+  Returns:
+    An `(input_layer, sequence_length)` tuple where:
+    - input_layer: A float `Tensor` of shape `[batch_size, T, D]`.
+        `T` is the maximum sequence length for this batch, which could differ
+        from batch to batch. `D` is the sum of `num_elements` for all
+        `feature_columns`.
+    - sequence_length: An int `Tensor` of shape `[batch_size]`. The sequence
+        length for each example.
+
+  Raises:
+    ValueError: If any of the `feature_columns` is the wrong type.
+  """
+  feature_columns = fc._clean_feature_columns(feature_columns)
+  for c in feature_columns:
+    if not isinstance(c, _SequenceDenseColumn):
+      raise ValueError(
+          'All feature_columns must be of type _SequenceDenseColumn. '
+          'Given (type {}): {}'.format(type(c), c))
+
+  with variable_scope.variable_scope(
+      None, default_name='sequence_input_layer', values=features.values()):
+    builder = fc._LazyBuilder(features)
+    output_tensors = []
+    sequence_lengths = []
+    ordered_columns = []
+    for column in sorted(feature_columns, key=lambda x: x.name):
+      ordered_columns.append(column)
+      with variable_scope.variable_scope(
+          None, default_name=column._var_scope_name):
+        dense_tensor, sequence_length = column._get_sequence_dense_tensor(
+            builder,
+            weight_collections=weight_collections,
+            trainable=trainable)
+        # Flattens the final dimension to produce a 3D Tensor.
+        num_elements = column._variable_shape.num_elements()
+        shape = array_ops.shape(dense_tensor)
+        output_tensors.append(
+            array_ops.reshape(
+                dense_tensor,
+                shape=array_ops.concat([shape[:2], [num_elements]], axis=0)))
+        sequence_lengths.append(sequence_length)
+    fc._verify_static_batch_size_equality(output_tensors, ordered_columns)
+    # TODO(b/73160931): Verify sequence_length equality.
+    return array_ops.concat(output_tensors, -1), sequence_lengths[0]
+
+
+# TODO(b/73160931): Add remaining categorical columns.
+def sequence_categorical_column_with_identity(
+    key, num_buckets, default_value=None):
+  """Returns a feature column that represents sequences of integers.
+
+  Example:
+
+  ```python
+  watches = sequence_categorical_column_with_identity(
+      'watches', num_buckets=1000)
+  watches_embedding = embedding_column(watches, dimension=10)
+  columns = [watches_embedding]
+
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  input_layer, sequence_length = sequence_input_layer(features, columns)
+
+  rnn_cell = tf.nn.rnn_cell.BasicRNNCell(hidden_size)
+  outputs, state = tf.nn.dynamic_rnn(
+      rnn_cell, inputs=input_layer, sequence_length=sequence_length)
+  ```
+
+  Args:
+    key: A unique string identifying the input feature.
+    num_buckets: Range of inputs. Namely, inputs are expected to be in the
+      range `[0, num_buckets)`.
+    default_value: If `None`, this column's graph operations will fail for
+      out-of-range inputs. Otherwise, this value must be in the range
+      `[0, num_buckets)`, and will replace out-of-range inputs.
+
+  Returns:
+    A `_SequenceCategoricalColumn`.
+  """
+  return _SequenceCategoricalColumn(
+      fc.categorical_column_with_identity(
+          key=key,
+          num_buckets=num_buckets,
+          default_value=default_value))
+
+
+# TODO(b/73160931): Merge with embedding_column
+def _sequence_embedding_column(
+    categorical_column, dimension, initializer=None, ckpt_to_load_from=None,
+    tensor_name_in_ckpt=None, max_norm=None, trainable=True):
+  """Returns a feature column that represents sequences of embeddings.
+
+  Use this to convert sequence categorical data into dense representation for
+  input to sequence NN, such as RNN.
+
+  Example:
+
+  ```python
+  watches = sequence_categorical_column_with_identity(
+      'watches', num_buckets=1000)
+  watches_embedding = embedding_column(watches, dimension=10)
+  columns = [watches_embedding]
+
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  input_layer, sequence_length = sequence_input_layer(features, columns)
+
+  rnn_cell = tf.nn.rnn_cell.BasicRNNCell(hidden_size)
+  outputs, state = tf.nn.dynamic_rnn(
+      rnn_cell, inputs=input_layer, sequence_length=sequence_length)
+  ```
+
+  Args:
+    categorical_column: A `_SequenceCategoricalColumn` created with a
+      `sequence_categorical_column_with_*` function.
+    dimension: Integer dimension of the embedding.
+    initializer: Initializer function used to initialize the embeddings.
+    ckpt_to_load_from: String representing checkpoint name/pattern from which to
+      restore column weights. Required if `tensor_name_in_ckpt` is not `None`.
+    tensor_name_in_ckpt: Name of the `Tensor` in `ckpt_to_load_from` from
+      which to restore the column weights. Required if `ckpt_to_load_from` is
+      not `None`.
+    max_norm: If not `None`, embedding values are l2-normalized to this value.
+    trainable: Whether or not the embedding is trainable. Default is True.
+
+  Returns:
+    A `_SequenceEmbeddingColumn`.
+
+  Raises:
+    ValueError: If `categorical_column` is not the right type.
+  """
+  if not isinstance(categorical_column, _SequenceCategoricalColumn):
+    raise ValueError(
+        'categorical_column must be of type _SequenceCategoricalColumn. '
+        'Given (type {}): {}'.format(
+            type(categorical_column), categorical_column))
+  return _SequenceEmbeddingColumn(
+      fc.embedding_column(
+          categorical_column,
+          dimension=dimension,
+          initializer=initializer,
+          ckpt_to_load_from=ckpt_to_load_from,
+          tensor_name_in_ckpt=tensor_name_in_ckpt,
+          max_norm=max_norm,
+          trainable=trainable))
+
+
+def sequence_numeric_column(
+    key,
+    shape=(1,),
+    default_value=0.,
+    dtype=dtypes.float32):
+  """Returns a feature column that represents sequences of numeric data.
+
+  Example:
+
+  ```python
+  temperature = sequence_numeric_column('temperature')
+  columns = [temperature]
+
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  input_layer, sequence_length = sequence_input_layer(features, columns)
+
+  rnn_cell = tf.nn.rnn_cell.BasicRNNCell(hidden_size)
+  outputs, state = tf.nn.dynamic_rnn(
+      rnn_cell, inputs=input_layer, sequence_length=sequence_length)
+  ```
+
+  Args:
+    key: A unique string identifying the input features.
+    shape: The shape of the input data per sequence id. E.g. if `shape=(2,)`,
+      each example must contain `2 * sequence_length` values.
+    default_value: A single value compatible with `dtype` that is used for
+      padding the sparse data into a dense `Tensor`.
+    dtype: The type of values.
+
+  Returns:
+    A `_SequenceNumericColumn`.
+  """
+  # TODO(b/73160931): Add validations.
+  return _SequenceNumericColumn(
+      key,
+      shape=shape,
+      default_value=default_value,
+      dtype=dtype)
+
+
+class _SequenceDenseColumn(fc._FeatureColumn):
+  """Represents dense sequence data."""
+
+  __metaclass__ = abc.ABCMeta
+
+  TensorSequenceLengthPair = collections.namedtuple(  # pylint: disable=invalid-name
+      'TensorSequenceLengthPair', ['dense_tensor', 'sequence_length'])
+
+  @abc.abstractproperty
+  def _variable_shape(self):
+    """`TensorShape` without batch and sequence dimensions."""
+    pass
+
+  @abc.abstractmethod
+  def _get_sequence_dense_tensor(
+      self, inputs, weight_collections=None, trainable=None):
+    """Returns a `TensorSequenceLengthPair`."""
+    pass
+
+
+def _sequence_length_from_sparse_tensor(sp_tensor, num_elements=1):
+  with ops.name_scope(None, 'sequence_length') as name_scope:
+    row_ids = sp_tensor.indices[:, 0]
+    column_ids = sp_tensor.indices[:, 1]
+    column_ids += array_ops.ones_like(column_ids)
+    seq_length = math_ops.to_int64(
+        math_ops.segment_max(column_ids, segment_ids=row_ids) / num_elements)
+    # If the last n rows do not have ids, seq_length will have shape
+    # [batch_size - n]. Pad the remaining values with zeros.
+    n_pad = array_ops.shape(sp_tensor)[:1] - array_ops.shape(seq_length)[:1]
+    padding = array_ops.zeros(n_pad, dtype=seq_length.dtype)
+    return array_ops.concat([seq_length, padding], axis=0, name=name_scope)
+
+
+class _SequenceCategoricalColumn(
+    fc._CategoricalColumn,
+    collections.namedtuple(
+        '_SequenceCategoricalColumn', ['categorical_column'])):
+  """Represents sequences of categorical data."""
+
+  @property
+  def name(self):
+    return self.categorical_column.name
+
+  @property
+  def _parse_example_spec(self):
+    return self.categorical_column._parse_example_spec
+
+  def _transform_feature(self, inputs):
+    return self.categorical_column._transform_feature(inputs)
+
+  @property
+  def _num_buckets(self):
+    return self.categorical_column._num_buckets
+
+  def _get_sparse_tensors(self, inputs, weight_collections=None,
+                          trainable=None):
+    sparse_tensors = self.categorical_column._get_sparse_tensors(inputs)
+    id_tensor = sparse_tensors.id_tensor
+    weight_tensor = sparse_tensors.weight_tensor
+    # Expands final dimension, so that embeddings are not combined during
+    # embedding lookup.
+    check_id_rank = check_ops.assert_equal(
+        array_ops.rank(id_tensor), 2,
+        data=[
+            'Column {} expected ID tensor of rank 2. '.format(self.name),
+            'id_tensor shape: ', array_ops.shape(id_tensor)])
+    with ops.control_dependencies([check_id_rank]):
+      id_tensor = sparse_ops.sparse_reshape(
+          id_tensor,
+          shape=array_ops.concat([id_tensor.dense_shape, [1]], axis=0))
+    if weight_tensor is not None:
+      check_weight_rank = check_ops.assert_equal(
+          array_ops.rank(weight_tensor), 2,
+          data=[
+              'Column {} expected weight tensor of rank 2.'.format(self.name),
+              'weight_tensor shape:', array_ops.shape(weight_tensor)])
+      with ops.control_dependencies([check_weight_rank]):
+        weight_tensor = sparse_ops.sparse_reshape(
+            weight_tensor,
+            shape=array_ops.concat([weight_tensor.dense_shape, [1]], axis=0))
+    return fc._CategoricalColumn.IdWeightPair(id_tensor, weight_tensor)
+
+  def _sequence_length(self, inputs):
+    sparse_tensors = self.categorical_column._get_sparse_tensors(inputs)
+    return _sequence_length_from_sparse_tensor(sparse_tensors.id_tensor)
+
+
+class _SequenceEmbeddingColumn(
+    _SequenceDenseColumn,
+    collections.namedtuple('_SequenceEmbeddingColumn', ['embedding_column'])):
+  """Represents sequences of embeddings."""
+
+  @property
+  def name(self):
+    return self.embedding_column.name
+
+  @property
+  def _parse_example_spec(self):
+    return self.embedding_column._parse_example_spec
+
+  def _transform_feature(self, inputs):
+    return self.embedding_column._transform_feature(inputs)
+
+  @property
+  def _variable_shape(self):
+    return self.embedding_column._variable_shape
+
+  def _get_sequence_dense_tensor(
+      self, inputs, weight_collections=None, trainable=None):
+    dense_tensor = self.embedding_column._get_dense_tensor(
+        inputs=inputs,
+        weight_collections=weight_collections,
+        trainable=trainable)
+    sequence_length = self.embedding_column.categorical_column._sequence_length(
+        inputs)
+    return _SequenceDenseColumn.TensorSequenceLengthPair(
+        dense_tensor=dense_tensor, sequence_length=sequence_length)
+
+
+class _SequenceNumericColumn(
+    _SequenceDenseColumn,
+    collections.namedtuple(
+        '_SequenceNumericColumn',
+        ['key', 'shape', 'default_value', 'dtype'])):
+  """Represents sequences of numeric data."""
+
+  @property
+  def name(self):
+    return self.key
+
+  @property
+  def _parse_example_spec(self):
+    return {self.key: parsing_ops.VarLenFeature(self.dtype)}
+
+  def _transform_feature(self, inputs):
+    return inputs.get(self.key)
+
+  @property
+  def _variable_shape(self):
+    return tensor_shape.TensorShape(self.shape)
+
+  def _get_sequence_dense_tensor(
+      self, inputs, weight_collections=None, trainable=None):
+    # Do nothing with weight_collections and trainable since no variables are
+    # created in this function.
+    del weight_collections
+    del trainable
+    sp_tensor = inputs.get(self)
+    dense_tensor = sparse_ops.sparse_tensor_to_dense(
+        sp_tensor, default_value=self.default_value)
+    # Reshape into [batch_size, T, variable_shape].
+    dense_shape = array_ops.concat(
+        [array_ops.shape(dense_tensor)[:1], [-1], self._variable_shape],
+        axis=0)
+    dense_tensor = array_ops.reshape(dense_tensor, shape=dense_shape)
+    sequence_length = _sequence_length_from_sparse_tensor(
+        sp_tensor, num_elements=self._variable_shape.num_elements())
+    return _SequenceDenseColumn.TensorSequenceLengthPair(
+        dense_tensor=dense_tensor, sequence_length=sequence_length)
+
+# pylint: enable=protected-access
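The arithmetic behind `_sequence_length_from_sparse_tensor` above can be checked with a pure-Python sketch. This is a simplified stand-in, not the TensorFlow code: `sequence_length_from_indices` is a hypothetical helper that takes the sparse indices as `(row, column)` pairs and the batch size explicitly, and it starts every row at zero instead of padding trailing empty rows after a segment max.

```python
def sequence_length_from_indices(indices, batch_size, num_elements=1):
  """Computes per-example sequence lengths from 2D sparse indices.

  Args:
    indices: Iterable of (row, column) pairs of a rank-2 sparse tensor.
    batch_size: Number of rows (first entry of the dense shape).
    num_elements: Number of values that make up one sequence step.

  Returns:
    A list of `batch_size` ints; rows without any index get length 0.
  """
  max_column_plus_one = [0] * batch_size
  for row, column in indices:
    # The largest column id in a row, plus one, is the number of values.
    max_column_plus_one[row] = max(max_column_plus_one[row], column + 1)
  # Each sequence step consumes `num_elements` values.
  return [count // num_elements for count in max_column_plus_one]


# Rows 0, 2 and 4 are empty, as in test_sequence_length_with_zeros.
lengths = sequence_length_from_indices(
    [(1, 0), (3, 0), (3, 1)], batch_size=5)

# shape=(2, 2) gives 4 values per step, as in test_numeric_column_multi_dim.
multi_dim_indices = [(0, c) for c in range(8)] + [(1, c) for c in range(4)]
multi_dim_lengths = sequence_length_from_indices(
    multi_dim_indices, batch_size=2, num_elements=4)
```

The sketch uses integer floor division, whereas the original divides and then casts with `to_int64`; the two agree for the non-negative counts involved here.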
diff --git a/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column_test.py b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..105213680ebcf9ce263f7892f44a343538fc26bf
--- /dev/null
+++ b/tensorflow/contrib/feature_column/python/feature_column/sequence_feature_column_test.py
@@ -0,0 +1,474 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for sequence_feature_column."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.feature_column.python.feature_column import sequence_feature_column as sfc
+from tensorflow.python.feature_column.feature_column import _LazyBuilder
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.platform import test
+from tensorflow.python.training import monitored_session
+
+
+class SequenceInputLayerTest(test.TestCase):
+
+  def test_embedding_column(self):
+    vocabulary_size = 3
+    sparse_input_a = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+    sparse_input_b = sparse_tensor.SparseTensorValue(
+        # example 0, ids [1]
+        # example 1, ids [2, 0]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, 2, 0),
+        dense_shape=(2, 2))
+
+    embedding_dimension_a = 2
+    embedding_values_a = (
+        (1., 2.),  # id 0
+        (3., 4.),  # id 1
+        (5., 6.)  # id 2
+    )
+    embedding_dimension_b = 3
+    embedding_values_b = (
+        (11., 12., 13.),  # id 0
+        (14., 15., 16.),  # id 1
+        (17., 18., 19.)  # id 2
+    )
+    def _get_initializer(embedding_dimension, embedding_values):
+      def _initializer(shape, dtype, partition_info):
+        self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+        self.assertEqual(dtypes.float32, dtype)
+        self.assertIsNone(partition_info)
+        return embedding_values
+      return _initializer
+
+    expected_input_layer = [
+        # example 0, ids_a [2], ids_b [1]
+        [[5., 6., 14., 15., 16.], [0., 0., 0., 0., 0.]],
+        # example 1, ids_a [0, 1], ids_b [2, 0]
+        [[1., 2., 17., 18., 19.], [3., 4., 11., 12., 13.]],
+    ]
+    expected_sequence_length = [1, 2]
+
+    categorical_column_a = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column_a = sfc._sequence_embedding_column(
+        categorical_column_a, dimension=embedding_dimension_a,
+        initializer=_get_initializer(embedding_dimension_a, embedding_values_a))
+    categorical_column_b = sfc.sequence_categorical_column_with_identity(
+        key='bbb', num_buckets=vocabulary_size)
+    embedding_column_b = sfc._sequence_embedding_column(
+        categorical_column_b, dimension=embedding_dimension_b,
+        initializer=_get_initializer(embedding_dimension_b, embedding_values_b))
+
+    input_layer, sequence_length = sfc.sequence_input_layer(
+        features={
+            'aaa': sparse_input_a,
+            'bbb': sparse_input_b,
+        },
+        # Test that columns are reordered alphabetically.
+        feature_columns=[embedding_column_b, embedding_column_a])
+
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('sequence_input_layer/aaa_embedding/embedding_weights:0',
+         'sequence_input_layer/bbb_embedding/embedding_weights:0'),
+        tuple([v.name for v in global_vars]))
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(embedding_values_a, global_vars[0].eval(session=sess))
+      self.assertAllEqual(embedding_values_b, global_vars[1].eval(session=sess))
+      self.assertAllEqual(expected_input_layer, input_layer.eval(session=sess))
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_numeric_column(self):
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0.], [1.]]
+        # example 1, values [[10.]]
+        indices=((0, 0), (0, 1), (1, 0)),
+        values=(0., 1., 10.),
+        dense_shape=(2, 2))
+    expected_input_layer = [
+        [[0.], [1.]],
+        [[10.], [0.]],
+    ]
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    input_layer, sequence_length = sfc.sequence_input_layer(
+        features={'aaa': sparse_input},
+        feature_columns=[numeric_column])
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(expected_input_layer, input_layer.eval(session=sess))
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_numeric_column_multi_dim(self):
+    """Tests sequence_input_layer for multi-dimensional numeric_column."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[[0., 1.],  [2., 3.]], [[4., 5.],  [6., 7.]]]
+        # example 1, [[[10., 11.],  [12., 13.]]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7),
+                 (1, 0), (1, 1), (1, 2), (1, 3)),
+        values=(0., 1., 2., 3., 4., 5., 6., 7., 10., 11., 12., 13.),
+        dense_shape=(2, 8))
+    # The output of numeric_column._get_dense_tensor should be flattened.
+    expected_input_layer = [
+        [[0., 1., 2., 3.], [4., 5., 6., 7.]],
+        [[10., 11., 12., 13.], [0., 0., 0., 0.]],
+    ]
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(2, 2))
+
+    input_layer, sequence_length = sfc.sequence_input_layer(
+        features={'aaa': sparse_input},
+        feature_columns=[numeric_column])
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(expected_input_layer, input_layer.eval(session=sess))
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+
+def _assert_sparse_tensor_value(test_case, expected, actual):
+  test_case.assertEqual(np.int64, np.array(actual.indices).dtype)
+  test_case.assertAllEqual(expected.indices, actual.indices)
+
+  test_case.assertEqual(
+      np.array(expected.values).dtype, np.array(actual.values).dtype)
+  test_case.assertAllEqual(expected.values, actual.values)
+
+  test_case.assertEqual(np.int64, np.array(actual.dense_shape).dtype)
+  test_case.assertAllEqual(expected.dense_shape, actual.dense_shape)
+
+
+class SequenceCategoricalColumnWithIdentityTest(test.TestCase):
+
+  def test_get_sparse_tensors(self):
+    column = sfc.sequence_categorical_column_with_identity(
+        'aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, 2, 0),
+        dense_shape=(2, 2))
+    expected_sparse_ids = sparse_tensor.SparseTensorValue(
+        indices=((0, 0, 0), (1, 0, 0), (1, 1, 0)),
+        values=np.array((1, 2, 0), dtype=np.int64),
+        dense_shape=(2, 2, 1))
+
+    id_weight_pair = column._get_sparse_tensors(_LazyBuilder({'aaa': inputs}))
+
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with monitored_session.MonitoredSession() as sess:
+      _assert_sparse_tensor_value(
+          self,
+          expected_sparse_ids,
+          id_weight_pair.id_tensor.eval(session=sess))
+
+  def test_get_sparse_tensors_inputs3d(self):
+    """Tests _get_sparse_tensors when the input is already 3D Tensor."""
+    column = sfc.sequence_categorical_column_with_identity(
+        'aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0, 0), (1, 0, 0), (1, 1, 0)),
+        values=(1, 2, 0),
+        dense_shape=(2, 2, 1))
+
+    with self.assertRaisesRegexp(
+        errors.InvalidArgumentError,
+        r'Column aaa expected ID tensor of rank 2\.\s*'
+        r'id_tensor shape:\s*\[2 2 1\]'):
+      id_weight_pair = column._get_sparse_tensors(
+          _LazyBuilder({'aaa': inputs}))
+      with monitored_session.MonitoredSession() as sess:
+        id_weight_pair.id_tensor.eval(session=sess)
+
+  def test_sequence_length(self):
+    column = sfc.sequence_categorical_column_with_identity(
+        'aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, 2, 0),
+        dense_shape=(2, 2))
+    expected_sequence_length = [1, 2]
+
+    sequence_length = column._sequence_length(_LazyBuilder({'aaa': inputs}))
+
+    with monitored_session.MonitoredSession() as sess:
+      sequence_length = sess.run(sequence_length)
+      self.assertAllEqual(expected_sequence_length, sequence_length)
+      self.assertEqual(np.int64, sequence_length.dtype)
+
+  def test_sequence_length_with_zeros(self):
+    column = sfc.sequence_categorical_column_with_identity(
+        'aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((1, 0), (3, 0), (3, 1)),
+        values=(1, 2, 0),
+        dense_shape=(5, 2))
+    expected_sequence_length = [0, 1, 0, 2, 0]
+
+    sequence_length = column._sequence_length(_LazyBuilder({'aaa': inputs}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+
+class SequenceEmbeddingColumnTest(test.TestCase):
+
+  def test_get_sequence_dense_tensor(self):
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 1), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(4, 2))
+
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    expected_lookups = [
+        # example 0, ids [2]
+        [[7., 11.], [0., 0.]],
+        # example 1, ids [0, 1]
+        [[1., 2.], [3., 5.]],
+        # example 2, ids []
+        [[0., 0.], [0., 0.]],
+        # example 3, ids [1]
+        [[3., 5.], [0., 0.]],
+    ]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = sfc._sequence_embedding_column(
+        categorical_column, dimension=embedding_dimension,
+        initializer=_initializer)
+
+    embedding_lookup, _ = embedding_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('embedding_weights:0',), tuple([v.name for v in global_vars]))
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(embedding_values, global_vars[0].eval(session=sess))
+      self.assertAllEqual(expected_lookups, embedding_lookup.eval(session=sess))
+
+  def test_sequence_length(self):
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+    expected_sequence_length = [1, 2]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = sfc._sequence_embedding_column(
+        categorical_column, dimension=2)
+
+    _, sequence_length = embedding_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      sequence_length = sess.run(sequence_length)
+      self.assertAllEqual(expected_sequence_length, sequence_length)
+      self.assertEqual(np.int64, sequence_length.dtype)
+
+  def test_sequence_length_with_empty_rows(self):
+    """Tests _sequence_length when some examples do not have ids."""
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids []
+        # example 1, ids [2]
+        # example 2, ids [0, 1]
+        # example 3, ids []
+        # example 4, ids [1]
+        # example 5, ids []
+        indices=((1, 0), (2, 0), (2, 1), (4, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(6, 2))
+    expected_sequence_length = [0, 1, 2, 0, 1, 0]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = sfc._sequence_embedding_column(
+        categorical_column, dimension=2)
+
+    _, sequence_length = embedding_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+
+class SequenceNumericColumnTest(test.TestCase):
+
+  def test_get_sequence_dense_tensor(self):
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0.], [1.]]
+        # example 1, values [[10.]]
+        indices=((0, 0), (0, 1), (1, 0)),
+        values=(0., 1., 10.),
+        dense_shape=(2, 2))
+    expected_dense_tensor = [
+        [[0.], [1.]],
+        [[10.], [0.]],
+    ]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    dense_tensor, _ = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_dense_tensor, dense_tensor.eval(session=sess))
+
+  def test_get_sequence_dense_tensor_with_shape(self):
+    """Tests get_sequence_dense_tensor with shape != (1,)."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0., 1., 2.], [3., 4., 5.]]
+        # example 1, [[10., 11., 12.]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
+                 (1, 0), (1, 1), (1, 2)),
+        values=(0., 1., 2., 3., 4., 5., 10., 11., 12.),
+        dense_shape=(2, 6))
+    expected_dense_tensor = [
+        [[0., 1., 2.], [3., 4., 5.]],
+        [[10., 11., 12.], [0., 0., 0.]],
+    ]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(3,))
+
+    dense_tensor, _ = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_dense_tensor, dense_tensor.eval(session=sess))
+
+  def test_get_dense_tensor_multi_dim(self):
+    """Tests get_sequence_dense_tensor for multi-dim numeric_column."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[[0., 1.],  [2., 3.]], [[4., 5.],  [6., 7.]]]
+        # example 1, [[[10., 11.],  [12., 13.]]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7),
+                 (1, 0), (1, 1), (1, 2), (1, 3)),
+        values=(0., 1., 2., 3., 4., 5., 6., 7., 10., 11., 12., 13.),
+        dense_shape=(2, 8))
+    expected_dense_tensor = [
+        [[[0., 1.], [2., 3.]], [[4., 5.], [6., 7.]]],
+        [[[10., 11.], [12., 13.]], [[0., 0.], [0., 0.]]],
+    ]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(2, 2))
+
+    dense_tensor, _ = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_dense_tensor, dense_tensor.eval(session=sess))
+
+  def test_sequence_length(self):
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0., 1., 2.], [3., 4., 5.]]
+        # example 1, [[10., 11., 12.]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
+                 (1, 0), (1, 1), (1, 2)),
+        values=(0., 1., 2., 3., 4., 5., 10., 11., 12.),
+        dense_shape=(2, 6))
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(3,))
+
+    _, sequence_length = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      sequence_length = sess.run(sequence_length)
+      self.assertAllEqual(expected_sequence_length, sequence_length)
+      self.assertEqual(np.int64, sequence_length.dtype)
+
+  def test_sequence_length_with_shape(self):
+    """Tests _sequence_length with shape !=(1,)."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0.], [1.]]
+        # example 1, values [[10.]]
+        indices=((0, 0), (0, 1), (1, 0)),
+        values=(0., 1., 10.),
+        dense_shape=(2, 2))
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    _, sequence_length = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_sequence_length_with_empty_rows(self):
+    """Tests _sequence_length when some examples do not have values."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values []
+        # example 1, values [[0.], [1.]]
+        # example 2, [[2.]]
+        # example 3, values []
+        # example 4, [[3.]]
+        # example 5, values []
+        indices=((1, 0), (1, 1), (2, 0), (4, 0)),
+        values=(0., 1., 2., 3.),
+        dense_shape=(6, 2))
+    expected_sequence_length = [0, 2, 1, 0, 1, 0]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    _, sequence_length = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/feature_column/python/feature_column/sequential_feature_column.py b/tensorflow/contrib/feature_column/python/feature_column/sequential_feature_column.py
index 690a44ff4368663306733300a1ea70397fb93e1e..4ed7268e7a921284eed7767d870e56ecac39a3b1 100644
--- a/tensorflow/contrib/feature_column/python/feature_column/sequential_feature_column.py
+++ b/tensorflow/contrib/feature_column/python/feature_column/sequential_feature_column.py
@@ -12,8 +12,314 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Experimental methods for tf.feature_column sequential input."""
+"""Experimental methods for tf.feature_column sequence input."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+
+
+import abc
+import collections
+
+
+from tensorflow.python.feature_column import feature_column as fc
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import check_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import parsing_ops
+from tensorflow.python.ops import sparse_ops
+from tensorflow.python.ops import variable_scope
+
+# TODO(b/73160931): Fix pydoc.
+# pylint: disable=g-doc-args,missing-docstring,protected-access
+# TODO(b/73827486): Support SequenceExample.
+
+
+def sequence_input_layer(
+    features,
+    feature_columns,
+    weight_collections=None,
+    trainable=True,
+    scope=None):
+  """Builds input layer for sequence input.
+
+  All `feature_columns` must be sequence dense columns with the same
+  `sequence_length`. The output of this method can be fed into sequence
+  networks, such as RNN.
+
+  The output of this method is a 3D `Tensor` of shape `[batch_size, T, D]`.
+  `T` is the maximum sequence length for this batch, which could differ from
+  batch to batch.
+
+  If multiple `feature_columns` are given with `Di` `num_elements` each, their
+  outputs are concatenated. So, the final `Tensor` has shape
+  `[batch_size, T, D0 + D1 + ... + Dn]`.
+
+  Example:
+
+  ```python
+  rating = sequence_numeric_column('rating')
+  watches = sequence_categorical_column_with_identity(
+      'watches', num_buckets=1000)
+  watches_embedding = embedding_column(watches, dimension=10)
+  columns = [rating, watches_embedding]
+
+  features = tf.parse_example(..., features=make_parse_example_spec(columns))
+  input_layer, sequence_length = sequence_input_layer(features, columns)
+
+  rnn_cell = tf.nn.rnn_cell.BasicRNNCell(hidden_size)
+  outputs, state = tf.nn.dynamic_rnn(
+      rnn_cell, inputs=input_layer, sequence_length=sequence_length)
+  ```
+
+  Returns:
+    An `(input_layer, sequence_length)` tuple where:
+    - input_layer: A float `Tensor` of shape `[batch_size, T, D]`.
+        `T` is the maximum sequence length for this batch, which could differ
+        from batch to batch. `D` is the sum of `num_elements` for all
+        `feature_columns`.
+    - sequence_length: An int `Tensor` of shape `[batch_size]`. The sequence
+        length for each example.
+  Raises:
+    ValueError: If any of the `feature_columns` is the wrong type.
+  """
+  feature_columns = fc._clean_feature_columns(feature_columns)
+  for c in feature_columns:
+    if not isinstance(c, _SequenceDenseColumn):
+      raise ValueError(
+          'All feature_columns must be of type _SequenceDenseColumn. '
+          'Given (type {}): {}'.format(type(c), c))
+
+  with variable_scope.variable_scope(
+      scope, default_name='sequence_input_layer', values=features.values()):
+    builder = fc._LazyBuilder(features)
+    output_tensors = []
+    sequence_lengths = []
+    ordered_columns = []
+    for column in sorted(feature_columns, key=lambda x: x.name):
+      ordered_columns.append(column)
+      with variable_scope.variable_scope(
+          None, default_name=column._var_scope_name):
+        dense_tensor, sequence_length = column._get_sequence_dense_tensor(
+            builder,
+            weight_collections=weight_collections,
+            trainable=trainable)
+        # Flattens the final dimension to produce a 3D Tensor.
+        num_elements = column._variable_shape.num_elements()
+        shape = array_ops.shape(dense_tensor)
+        output_tensors.append(
+            array_ops.reshape(
+                dense_tensor,
+                shape=array_ops.concat([shape[:2], [num_elements]], axis=0)))
+        sequence_lengths.append(sequence_length)
+    fc._verify_static_batch_size_equality(output_tensors, ordered_columns)
+    # TODO(b/73160931): Verify sequence_length equality.
+    return array_ops.concat(output_tensors, -1), sequence_lengths[0]
+
+
+# TODO(b/73160931): Add remaining categorical columns.
+def sequence_categorical_column_with_identity(
+    key, num_buckets, default_value=None):
+  return _SequenceCategoricalColumn(
+      fc.categorical_column_with_identity(
+          key=key,
+          num_buckets=num_buckets,
+          default_value=default_value))
+
+
+# TODO(b/73160931): Merge with embedding_column
+def _sequence_embedding_column(
+    categorical_column, dimension, initializer=None, ckpt_to_load_from=None,
+    tensor_name_in_ckpt=None, max_norm=None, trainable=True):
+  if not isinstance(categorical_column, _SequenceCategoricalColumn):
+    raise ValueError(
+        'categorical_column must be of type _SequenceCategoricalColumn. '
+        'Given (type {}): {}'.format(
+            type(categorical_column), categorical_column))
+  return _SequenceEmbeddingColumn(
+      fc.embedding_column(
+          categorical_column,
+          dimension=dimension,
+          initializer=initializer,
+          ckpt_to_load_from=ckpt_to_load_from,
+          tensor_name_in_ckpt=tensor_name_in_ckpt,
+          max_norm=max_norm,
+          trainable=trainable))
+
+
+def sequence_numeric_column(
+    key,
+    shape=(1,),
+    default_value=0.,
+    dtype=dtypes.float32):
+  # TODO(b/73160931): Add validations.
+  return _SequenceNumericColumn(
+      key,
+      shape=shape,
+      default_value=default_value,
+      dtype=dtype)
+
+
+class _SequenceDenseColumn(fc._FeatureColumn):
+  """Represents dense sequence data."""
+
+  __metaclass__ = abc.ABCMeta
+
+  TensorSequenceLengthPair = collections.namedtuple(  # pylint: disable=invalid-name
+      'TensorSequenceLengthPair', ['dense_tensor', 'sequence_length'])
+
+  @abc.abstractproperty
+  def _variable_shape(self):
+    """`TensorShape` without batch and sequence dimensions."""
+    pass
+
+  @abc.abstractmethod
+  def _get_sequence_dense_tensor(
+      self, inputs, weight_collections=None, trainable=None):
+    """Returns a `TensorSequenceLengthPair`."""
+    pass
+
+
+def _sequence_length_from_sparse_tensor(sp_tensor, num_elements=1):
+  with ops.name_scope(None, 'sequence_length') as name_scope:
+    row_ids = sp_tensor.indices[:, 0]
+    column_ids = sp_tensor.indices[:, 1]
+    column_ids += array_ops.ones_like(column_ids)
+    seq_length = (
+        math_ops.segment_max(column_ids, segment_ids=row_ids) / num_elements)
+    # If the last n rows do not have ids, seq_length will have shape
+    # [batch_size - n]. Pad the remaining values with zeros.
+    n_pad = array_ops.shape(sp_tensor)[:1] - array_ops.shape(seq_length)[:1]
+    padding = array_ops.zeros(n_pad, dtype=seq_length.dtype)
+    return array_ops.concat([seq_length, padding], axis=0, name=name_scope)
+
+
+class _SequenceCategoricalColumn(
+    fc._CategoricalColumn,
+    collections.namedtuple(
+        '_SequenceCategoricalColumn', ['categorical_column'])):
+
+  @property
+  def name(self):
+    return self.categorical_column.name
+
+  @property
+  def _parse_example_spec(self):
+    return self.categorical_column._parse_example_spec
+
+  def _transform_feature(self, inputs):
+    return self.categorical_column._transform_feature(inputs)
+
+  @property
+  def _num_buckets(self):
+    return self.categorical_column._num_buckets
+
+  def _get_sparse_tensors(self, inputs, weight_collections=None,
+                          trainable=None):
+    sparse_tensors = self.categorical_column._get_sparse_tensors(inputs)
+    id_tensor = sparse_tensors.id_tensor
+    weight_tensor = sparse_tensors.weight_tensor
+    # Expands final dimension, so that embeddings are not combined during
+    # embedding lookup.
+    check_id_rank = check_ops.assert_equal(
+        array_ops.rank(id_tensor), 2,
+        data=[
+            'Column {} expected ID tensor of rank 2. '.format(self.name),
+            'id_tensor shape: ', array_ops.shape(id_tensor)])
+    with ops.control_dependencies([check_id_rank]):
+      id_tensor = sparse_ops.sparse_reshape(
+          id_tensor,
+          shape=array_ops.concat([id_tensor.dense_shape, [1]], axis=0))
+    if weight_tensor is not None:
+      check_weight_rank = check_ops.assert_equal(
+          array_ops.rank(weight_tensor), 2,
+          data=[
+              'Column {} expected weight tensor of rank 2. '.format(self.name),
+              'weight_tensor shape: ', array_ops.shape(weight_tensor)])
+      with ops.control_dependencies([check_weight_rank]):
+        weight_tensor = sparse_ops.sparse_reshape(
+            weight_tensor,
+            shape=array_ops.concat([weight_tensor.dense_shape, [1]], axis=0))
+    return fc._CategoricalColumn.IdWeightPair(id_tensor, weight_tensor)
+
+  def _sequence_length(self, inputs):
+    sparse_tensors = self.categorical_column._get_sparse_tensors(inputs)
+    return _sequence_length_from_sparse_tensor(sparse_tensors.id_tensor)
+
+
+class _SequenceEmbeddingColumn(
+    _SequenceDenseColumn,
+    collections.namedtuple('_SequenceEmbeddingColumn', ['embedding_column'])):
+
+  @property
+  def name(self):
+    return self.embedding_column.name
+
+  @property
+  def _parse_example_spec(self):
+    return self.embedding_column._parse_example_spec
+
+  def _transform_feature(self, inputs):
+    return self.embedding_column._transform_feature(inputs)
+
+  @property
+  def _variable_shape(self):
+    return self.embedding_column._variable_shape
+
+  def _get_sequence_dense_tensor(
+      self, inputs, weight_collections=None, trainable=None):
+    dense_tensor = self.embedding_column._get_dense_tensor(
+        inputs=inputs,
+        weight_collections=weight_collections,
+        trainable=trainable)
+    sequence_length = self.embedding_column.categorical_column._sequence_length(
+        inputs)
+    return _SequenceDenseColumn.TensorSequenceLengthPair(
+        dense_tensor=dense_tensor, sequence_length=sequence_length)
+
+
+class _SequenceNumericColumn(
+    _SequenceDenseColumn,
+    collections.namedtuple(
+        '_SequenceNumericColumn',
+        ['key', 'shape', 'default_value', 'dtype'])):
+
+  @property
+  def name(self):
+    return self.key
+
+  @property
+  def _parse_example_spec(self):
+    return {self.key: parsing_ops.VarLenFeature(self.dtype)}
+
+  def _transform_feature(self, inputs):
+    return inputs.get(self.key)
+
+  @property
+  def _variable_shape(self):
+    return tensor_shape.TensorShape(self.shape)
+
+  def _get_sequence_dense_tensor(
+      self, inputs, weight_collections=None, trainable=None):
+    # Do nothing with weight_collections and trainable since no variables are
+    # created in this function.
+    del weight_collections
+    del trainable
+    sp_tensor = inputs.get(self)
+    dense_tensor = sparse_ops.sparse_tensor_to_dense(
+        sp_tensor, default_value=self.default_value)
+    # Reshape into [batch_size, T, variable_shape].
+    dense_shape = array_ops.concat(
+        [array_ops.shape(dense_tensor)[:1], [-1], self._variable_shape],
+        axis=0)
+    dense_tensor = array_ops.reshape(dense_tensor, shape=dense_shape)
+    sequence_length = _sequence_length_from_sparse_tensor(
+        sp_tensor, num_elements=self._variable_shape.num_elements())
+    return _SequenceDenseColumn.TensorSequenceLengthPair(
+        dense_tensor=dense_tensor, sequence_length=sequence_length)
+
+# pylint: enable=g-doc-args,missing-docstring,protected-access
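
Reviewer note on `_sequence_length_from_sparse_tensor` in the file above: the helper derives each example's sequence length from the largest column index stored in that row of the sparse input, then divides by the number of values per time step. The plain-Python sketch below mirrors that idea so the expected values in the tests are easier to check by hand. It is not part of the change; the function name `sequence_lengths` and the use of integer division are assumptions made for illustration (the diff itself uses `/` on tensors).

```python
def sequence_lengths(indices, batch_size, num_elements=1):
    """Illustrative stand-in for the sparse sequence-length computation.

    indices: iterable of (row, column) pairs from a 2D sparse tensor.
    batch_size: first entry of the sparse tensor's dense_shape.
    num_elements: number of values that make up one time step.
    """
    # Rows without any stored value keep a length of zero.
    lengths = [0] * batch_size
    for row, col in indices:
        # The largest column index in a row, plus one, is the number of
        # values stored for that example.
        lengths[row] = max(lengths[row], col + 1)
    # Assumption of this sketch: integer division by values per step.
    return [n // num_elements for n in lengths]


# Inputs copied from test_sequence_length_with_zeros.
print(sequence_lengths(((1, 0), (3, 0), (3, 1)), batch_size=5))

# Inputs copied from the shape=(3,) numeric column test.
flat = [(0, c) for c in range(6)] + [(1, c) for c in range(3)]
print(sequence_lengths(flat, batch_size=2, num_elements=3))
```

The first call matches the `[0, 1, 0, 2, 0]` expectation in the categorical column test, and the second matches `[2, 1]` for a numeric column with three values per step.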
diff --git a/tensorflow/contrib/feature_column/python/feature_column/sequential_feature_column_test.py b/tensorflow/contrib/feature_column/python/feature_column/sequential_feature_column_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..59674869a27c3a40ab9cb3dcede384d1cda7ce27
--- /dev/null
+++ b/tensorflow/contrib/feature_column/python/feature_column/sequential_feature_column_test.py
@@ -0,0 +1,471 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for sequential_feature_column."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.contrib.feature_column.python.feature_column import sequential_feature_column as sfc
+from tensorflow.python.feature_column.feature_column import _LazyBuilder
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import sparse_tensor
+from tensorflow.python.platform import test
+from tensorflow.python.training import monitored_session
+
+
+class SequenceInputLayerTest(test.TestCase):
+
+  def test_embedding_column(self):
+    vocabulary_size = 3
+    sparse_input_a = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+    sparse_input_b = sparse_tensor.SparseTensorValue(
+        # example 0, ids [1]
+        # example 1, ids [2, 0]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, 2, 0),
+        dense_shape=(2, 2))
+
+    embedding_dimension_a = 2
+    embedding_values_a = (
+        (1., 2.),  # id 0
+        (3., 4.),  # id 1
+        (5., 6.)  # id 2
+    )
+    embedding_dimension_b = 3
+    embedding_values_b = (
+        (11., 12., 13.),  # id 0
+        (14., 15., 16.),  # id 1
+        (17., 18., 19.)  # id 2
+    )
+    def _get_initializer(embedding_dimension, embedding_values):
+      def _initializer(shape, dtype, partition_info):
+        self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+        self.assertEqual(dtypes.float32, dtype)
+        self.assertIsNone(partition_info)
+        return embedding_values
+      return _initializer
+
+    expected_input_layer = [
+        # example 0, ids_a [2], ids_b [1]
+        [[5., 6., 14., 15., 16.], [0., 0., 0., 0., 0.]],
+        # example 1, ids_a [0, 1], ids_b [2, 0]
+        [[1., 2., 17., 18., 19.], [3., 4., 11., 12., 13.]],
+    ]
+    expected_sequence_length = [1, 2]
+
+    categorical_column_a = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column_a = sfc._sequence_embedding_column(
+        categorical_column_a, dimension=embedding_dimension_a,
+        initializer=_get_initializer(embedding_dimension_a, embedding_values_a))
+    categorical_column_b = sfc.sequence_categorical_column_with_identity(
+        key='bbb', num_buckets=vocabulary_size)
+    embedding_column_b = sfc._sequence_embedding_column(
+        categorical_column_b, dimension=embedding_dimension_b,
+        initializer=_get_initializer(embedding_dimension_b, embedding_values_b))
+
+    input_layer, sequence_length = sfc.sequence_input_layer(
+        features={
+            'aaa': sparse_input_a,
+            'bbb': sparse_input_b,
+        },
+        # Test that columns are reordered alphabetically.
+        feature_columns=[embedding_column_b, embedding_column_a])
+
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('sequence_input_layer/aaa_embedding/embedding_weights:0',
+         'sequence_input_layer/bbb_embedding/embedding_weights:0'),
+        tuple([v.name for v in global_vars]))
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(embedding_values_a, global_vars[0].eval(session=sess))
+      self.assertAllEqual(embedding_values_b, global_vars[1].eval(session=sess))
+      self.assertAllEqual(expected_input_layer, input_layer.eval(session=sess))
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_numeric_column(self):
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0.], [1.]]
+        # example 1, values [[10.]]
+        indices=((0, 0), (0, 1), (1, 0)),
+        values=(0., 1., 10.),
+        dense_shape=(2, 2))
+    expected_input_layer = [
+        [[0.], [1.]],
+        [[10.], [0.]],
+    ]
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    input_layer, sequence_length = sfc.sequence_input_layer(
+        features={'aaa': sparse_input},
+        feature_columns=[numeric_column])
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(expected_input_layer, input_layer.eval(session=sess))
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_numeric_column_multi_dim(self):
+    """Tests sequence_input_layer for multi-dimensional numeric_column."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[[0., 1.],  [2., 3.]], [[4., 5.],  [6., 7.]]]
+        # example 1, [[[10., 11.],  [12., 13.]]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7),
+                 (1, 0), (1, 1), (1, 2), (1, 3)),
+        values=(0., 1., 2., 3., 4., 5., 6., 7., 10., 11., 12., 13.),
+        dense_shape=(2, 8))
+    # The output of _get_sequence_dense_tensor should be flattened.
+    expected_input_layer = [
+        [[0., 1., 2., 3.], [4., 5., 6., 7.]],
+        [[10., 11., 12., 13.], [0., 0., 0., 0.]],
+    ]
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(2, 2))
+
+    input_layer, sequence_length = sfc.sequence_input_layer(
+        features={'aaa': sparse_input},
+        feature_columns=[numeric_column])
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(expected_input_layer, input_layer.eval(session=sess))
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+
+def _assert_sparse_tensor_value(test_case, expected, actual):
+  test_case.assertEqual(np.int64, np.array(actual.indices).dtype)
+  test_case.assertAllEqual(expected.indices, actual.indices)
+
+  test_case.assertEqual(
+      np.array(expected.values).dtype, np.array(actual.values).dtype)
+  test_case.assertAllEqual(expected.values, actual.values)
+
+  test_case.assertEqual(np.int64, np.array(actual.dense_shape).dtype)
+  test_case.assertAllEqual(expected.dense_shape, actual.dense_shape)
+
+
+class SequenceCategoricalColumnWithIdentityTest(test.TestCase):
+
+  def test_get_sparse_tensors(self):
+    column = sfc.sequence_categorical_column_with_identity(
+        'aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, 2, 0),
+        dense_shape=(2, 2))
+    expected_sparse_ids = sparse_tensor.SparseTensorValue(
+        indices=((0, 0, 0), (1, 0, 0), (1, 1, 0)),
+        values=np.array((1, 2, 0), dtype=np.int64),
+        dense_shape=(2, 2, 1))
+
+    id_weight_pair = column._get_sparse_tensors(_LazyBuilder({'aaa': inputs}))
+
+    self.assertIsNone(id_weight_pair.weight_tensor)
+    with monitored_session.MonitoredSession() as sess:
+      _assert_sparse_tensor_value(
+          self,
+          expected_sparse_ids,
+          id_weight_pair.id_tensor.eval(session=sess))
+
+  def test_get_sparse_tensors_inputs3d(self):
+    """Tests _get_sparse_tensors when the input is already a 3D Tensor."""
+    column = sfc.sequence_categorical_column_with_identity(
+        'aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0, 0), (1, 0, 0), (1, 1, 0)),
+        values=(1, 2, 0),
+        dense_shape=(2, 2, 1))
+
+    with self.assertRaisesRegexp(
+        errors.InvalidArgumentError,
+        r'Column aaa expected ID tensor of rank 2\.\s*'
+        r'id_tensor shape:\s*\[2 2 1\]'):
+      id_weight_pair = column._get_sparse_tensors(
+          _LazyBuilder({'aaa': inputs}))
+      with monitored_session.MonitoredSession() as sess:
+        id_weight_pair.id_tensor.eval(session=sess)
+
+  def test_sequence_length(self):
+    column = sfc.sequence_categorical_column_with_identity(
+        'aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(1, 2, 0),
+        dense_shape=(2, 2))
+    expected_sequence_length = [1, 2]
+
+    sequence_length = column._sequence_length(_LazyBuilder({'aaa': inputs}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_sequence_length_with_zeros(self):
+    column = sfc.sequence_categorical_column_with_identity(
+        'aaa', num_buckets=3)
+    inputs = sparse_tensor.SparseTensorValue(
+        indices=((1, 0), (3, 0), (3, 1)),
+        values=(1, 2, 0),
+        dense_shape=(5, 2))
+    expected_sequence_length = [0, 1, 0, 2, 0]
+
+    sequence_length = column._sequence_length(_LazyBuilder({'aaa': inputs}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+
+class SequenceEmbeddingColumnTest(test.TestCase):
+
+  def test_get_sequence_dense_tensor(self):
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        # example 2, ids []
+        # example 3, ids [1]
+        indices=((0, 0), (1, 0), (1, 1), (3, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(4, 2))
+
+    embedding_dimension = 2
+    embedding_values = (
+        (1., 2.),  # id 0
+        (3., 5.),  # id 1
+        (7., 11.)  # id 2
+    )
+    def _initializer(shape, dtype, partition_info):
+      self.assertAllEqual((vocabulary_size, embedding_dimension), shape)
+      self.assertEqual(dtypes.float32, dtype)
+      self.assertIsNone(partition_info)
+      return embedding_values
+
+    expected_lookups = [
+        # example 0, ids [2]
+        [[7., 11.], [0., 0.]],
+        # example 1, ids [0, 1]
+        [[1., 2.], [3., 5.]],
+        # example 2, ids []
+        [[0., 0.], [0., 0.]],
+        # example 3, ids [1]
+        [[3., 5.], [0., 0.]],
+    ]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = sfc._sequence_embedding_column(
+        categorical_column, dimension=embedding_dimension,
+        initializer=_initializer)
+
+    embedding_lookup, _ = embedding_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    global_vars = ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+    self.assertItemsEqual(
+        ('embedding_weights:0',), tuple([v.name for v in global_vars]))
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(embedding_values, global_vars[0].eval(session=sess))
+      self.assertAllEqual(expected_lookups, embedding_lookup.eval(session=sess))
+
+  def test_sequence_length(self):
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids [2]
+        # example 1, ids [0, 1]
+        indices=((0, 0), (1, 0), (1, 1)),
+        values=(2, 0, 1),
+        dense_shape=(2, 2))
+    expected_sequence_length = [1, 2]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = sfc._sequence_embedding_column(
+        categorical_column, dimension=2)
+
+    _, sequence_length = embedding_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_sequence_length_with_empty_rows(self):
+    """Tests _sequence_length when some examples do not have ids."""
+    vocabulary_size = 3
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, ids []
+        # example 1, ids [2]
+        # example 2, ids [0, 1]
+        # example 3, ids []
+        # example 4, ids [1]
+        # example 5, ids []
+        indices=((1, 0), (2, 0), (2, 1), (4, 0)),
+        values=(2, 0, 1, 1),
+        dense_shape=(6, 2))
+    expected_sequence_length = [0, 1, 2, 0, 1, 0]
+
+    categorical_column = sfc.sequence_categorical_column_with_identity(
+        key='aaa', num_buckets=vocabulary_size)
+    embedding_column = sfc._sequence_embedding_column(
+        categorical_column, dimension=2)
+
+    _, sequence_length = embedding_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+
+class SequenceNumericColumnTest(test.TestCase):
+
+  def test_get_sequence_dense_tensor(self):
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0.], [1.]]
+        # example 1, [[10.]]
+        indices=((0, 0), (0, 1), (1, 0)),
+        values=(0., 1., 10.),
+        dense_shape=(2, 2))
+    expected_dense_tensor = [
+        [[0.], [1.]],
+        [[10.], [0.]],
+    ]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    dense_tensor, _ = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_dense_tensor, dense_tensor.eval(session=sess))
+
+  def test_get_sequence_dense_tensor_with_shape(self):
+    """Tests _get_sequence_dense_tensor with shape != (1,)."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0., 1., 2.], [3., 4., 5.]]
+        # example 1, [[10., 11., 12.]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
+                 (1, 0), (1, 1), (1, 2)),
+        values=(0., 1., 2., 3., 4., 5., 10., 11., 12.),
+        dense_shape=(2, 6))
+    expected_dense_tensor = [
+        [[0., 1., 2.], [3., 4., 5.]],
+        [[10., 11., 12.], [0., 0., 0.]],
+    ]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(3,))
+
+    dense_tensor, _ = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_dense_tensor, dense_tensor.eval(session=sess))
+
+  def test_get_dense_tensor_multi_dim(self):
+    """Tests _get_sequence_dense_tensor for a multi-dim numeric column."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[[0., 1.], [2., 3.]], [[4., 5.], [6., 7.]]]
+        # example 1, [[[10., 11.], [12., 13.]]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7),
+                 (1, 0), (1, 1), (1, 2), (1, 3)),
+        values=(0., 1., 2., 3., 4., 5., 6., 7., 10., 11., 12., 13.),
+        dense_shape=(2, 8))
+    expected_dense_tensor = [
+        [[[0., 1.], [2., 3.]], [[4., 5.], [6., 7.]]],
+        [[[10., 11.], [12., 13.]], [[0., 0.], [0., 0.]]],
+    ]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(2, 2))
+
+    dense_tensor, _ = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_dense_tensor, dense_tensor.eval(session=sess))
+
+  def test_sequence_length(self):
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0., 1., 2.], [3., 4., 5.]]
+        # example 1, [[10., 11., 12.]]
+        indices=((0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
+                 (1, 0), (1, 1), (1, 2)),
+        values=(0., 1., 2., 3., 4., 5., 10., 11., 12.),
+        dense_shape=(2, 6))
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa', shape=(3,))
+
+    _, sequence_length = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_sequence_length_with_shape(self):
+    """Tests _sequence_length with the default shape (1,)."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values [[0.], [1.]]
+        # example 1, [[10.]]
+        indices=((0, 0), (0, 1), (1, 0)),
+        values=(0., 1., 10.),
+        dense_shape=(2, 2))
+    expected_sequence_length = [2, 1]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    _, sequence_length = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+  def test_sequence_length_with_empty_rows(self):
+    """Tests _sequence_length when some examples do not have values."""
+    sparse_input = sparse_tensor.SparseTensorValue(
+        # example 0, values []
+        # example 1, values [[0.], [1.]]
+        # example 2, [[2.]]
+        # example 3, values []
+        # example 4, [[3.]]
+        # example 5, values []
+        indices=((1, 0), (1, 1), (2, 0), (4, 0)),
+        values=(0., 1., 2., 3.),
+        dense_shape=(6, 2))
+    expected_sequence_length = [0, 2, 1, 0, 1, 0]
+    numeric_column = sfc.sequence_numeric_column('aaa')
+
+    _, sequence_length = numeric_column._get_sequence_dense_tensor(
+        _LazyBuilder({'aaa': sparse_input}))
+
+    with monitored_session.MonitoredSession() as sess:
+      self.assertAllEqual(
+          expected_sequence_length, sequence_length.eval(session=sess))
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/framework/BUILD b/tensorflow/contrib/framework/BUILD
index 9e5f54f0973eae899ca65e4098358107053cb7d4..ac043fda0638e61f422e769ab3047a53a1b377bd 100644
--- a/tensorflow/contrib/framework/BUILD
+++ b/tensorflow/contrib/framework/BUILD
@@ -28,7 +28,6 @@ tf_custom_op_py_library(
         "python/framework/graph_util.py",
         "python/framework/tensor_util.py",
         "python/ops/__init__.py",
-        "python/ops/accumulate_n_v2.py",
         "python/ops/arg_scope.py",
         "python/ops/audio_ops.py",
         "python/ops/checkpoint_ops.py",
@@ -63,7 +62,9 @@ tf_custom_op_py_library(
         "//tensorflow/python:math_ops",
         "//tensorflow/python:platform",
         "//tensorflow/python:pywrap_tensorflow",
+        "//tensorflow/python:resource_variable_ops",
         "//tensorflow/python:script_ops",
+        "//tensorflow/python:smart_cond",
         "//tensorflow/python:sparse_tensor",
         "//tensorflow/python:state_ops",
         "//tensorflow/python:state_ops_gen",
@@ -161,23 +162,6 @@ py_test(
     ],
 )
 
-py_test(
-    name = "accumulate_n_v2_test",
-    size = "small",
-    srcs = ["python/ops/accumulate_n_v2_test.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        ":framework_py",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:platform_test",
-        "//tensorflow/python:variables",
-        "//third_party/py/numpy",
-    ],
-)
-
 cuda_py_test(
     name = "critical_section_test",
     size = "medium",
@@ -185,31 +169,14 @@ cuda_py_test(
     additional_deps = [
         "//tensorflow/python:client_testlib",
         ":framework_py",
+        "//tensorflow/python:array_ops",
+        "//tensorflow/python:control_flow_ops",
         "//tensorflow/python:framework_for_generated_wrappers",
         "//tensorflow/python:framework_test_lib",
         "//tensorflow/python:gradients",
         "//tensorflow/python:platform_test",
         "//tensorflow/python:resource_variable_ops",
-    ],
-)
-
-py_test(
-    name = "accumulate_n_v2_eager_test",
-    size = "small",
-    srcs = ["python/ops/accumulate_n_v2_eager_test.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        ":framework_py",
-        "//tensorflow/python:client_testlib",
-        "//tensorflow/python:framework_for_generated_wrappers",
-        "//tensorflow/python:framework_test_lib",
-        "//tensorflow/python:gradients",
-        "//tensorflow/python:math_ops",
-        "//tensorflow/python:resource_variable_ops",
-        "//tensorflow/python/eager:backprop",
-        "//tensorflow/python/eager:context",
-        "//tensorflow/python/eager:tape",
-        "//third_party/py/numpy",
+        "//tensorflow/python:tensor_array_ops",
     ],
 )
 
diff --git a/tensorflow/contrib/framework/__init__.py b/tensorflow/contrib/framework/__init__.py
index a49d42cd525434d4ffd4a6bb0d8854dc707b9280..80632500912e92b74b0de5d66277f79dfcba1938 100644
--- a/tensorflow/contrib/framework/__init__.py
+++ b/tensorflow/contrib/framework/__init__.py
@@ -87,6 +87,9 @@ See the @{$python/contrib.framework} guide.
 
 @@get_placeholders
 
+@@smart_cond
+@@smart_constant_value
+
 @@CriticalSection
 
 @@BoundedTensorSpec
@@ -104,10 +107,10 @@ from tensorflow.contrib.framework.python.ops import *
 
 from tensorflow.python.framework.ops import prepend_name_scope
 from tensorflow.python.framework.ops import strip_name_scope
-
+from tensorflow.python.framework.smart_cond import smart_cond
+from tensorflow.python.framework.smart_cond import smart_constant_value
 from tensorflow.python.framework.tensor_spec import BoundedTensorSpec
 from tensorflow.python.framework.tensor_spec import TensorSpec
-
 from tensorflow.python.util.all_util import remove_undocumented
 
 _allowed_symbols = ['nest']
diff --git a/tensorflow/contrib/framework/python/framework/experimental_test.py b/tensorflow/contrib/framework/python/framework/experimental_test.py
index 8e54e09e04ee3c0ddbd4fa84cc0912cb70c93e62..cfdc7df7d8fd4c1406bf447a79038ac33b11e047 100644
--- a/tensorflow/contrib/framework/python/framework/experimental_test.py
+++ b/tensorflow/contrib/framework/python/framework/experimental_test.py
@@ -49,7 +49,6 @@ class ExperimentalTest(test.TestCase):
                      "\nTHIS FUNCTION IS EXPERIMENTAL. It may change or "
                      "be removed at any time, and without warning."
                      "\n"
-                     "\n"
                      "\nArgs:"
                      "\n  arg0: Arg 0."
                      "\n  arg1: Arg 1."
diff --git a/tensorflow/contrib/framework/python/framework/graph_util.py b/tensorflow/contrib/framework/python/framework/graph_util.py
index 49eec3a3f1a0f357ea3adfade51e71cb0f89942d..2703224b1bf62831b6088558d4f93950fe938c10 100644
--- a/tensorflow/contrib/framework/python/framework/graph_util.py
+++ b/tensorflow/contrib/framework/python/framework/graph_util.py
@@ -85,14 +85,19 @@ def fuse_op(graph_def, input_nodes, output_nodes, output_dtypes,
       if n not in reachable_by_input and n not in output_nodes_set:
         # n is between input and output, i.e., part of the fused op
         next_to_visit = [n]
+        visited = set()
         while next_to_visit:
           cur_node = next_to_visit[0]
+          visited.add(cur_node)
           del next_to_visit[0]
           if cur_node in reachable_by_input and cur_node not in input_nodes_set:
             raise TypeError("Node %s uses input %s not in input_nodes." %
                             (n, cur_node))
           if cur_node not in input_nodes_set:
-            next_to_visit += name_to_input_name[cur_node]
+            next_to_visit += [
+                input_node for input_node in name_to_input_name[cur_node]
+                if input_node not in visited
+            ]
     elif n not in reachable_by_input:
       nodes_post_output.append(n)
 
diff --git a/tensorflow/contrib/framework/python/framework/graph_util_test.py b/tensorflow/contrib/framework/python/framework/graph_util_test.py
index b8a6d109e19211d271c2b15bac66ddacd38fe395..812c5fbd8cb759aef6eb1aad532c03794b2ceaf4 100644
--- a/tensorflow/contrib/framework/python/framework/graph_util_test.py
+++ b/tensorflow/contrib/framework/python/framework/graph_util_test.py
@@ -42,7 +42,8 @@ class GraphUtilTest(test.TestCase):
     graph_def = graph_pb2.GraphDef()
     node_a = GetNewNode('A', 'Placeholder', [])
     node_b = GetNewNode('B', 'Op1', ['A'])
-    node_c = GetNewNode('C', 'Op1', ['B'])
+    # A loop in the part that will be fused.
+    node_c = GetNewNode('C', 'Op1', ['B', 'C'])
     node_d = GetNewNode('D', 'Op1', ['C'])
     node_e = GetNewNode('E', 'Op1', ['D'])
     graph_def.node.extend([node_a, node_b, node_c, node_d, node_e])
diff --git a/tensorflow/contrib/framework/python/ops/accumulate_n_v2.py b/tensorflow/contrib/framework/python/ops/accumulate_n_v2.py
deleted file mode 100644
index 476528b0dd3df05239d5dc402b466e06dd789985..0000000000000000000000000000000000000000
--- a/tensorflow/contrib/framework/python/ops/accumulate_n_v2.py
+++ /dev/null
@@ -1,111 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Ops that will eventually be folded into tensorflow/python/ops/math_ops.py
-"""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-
-from tensorflow.python.eager import context
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_shape
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import gen_math_ops
-from tensorflow.python.ops import math_ops
-
-
-
-def accumulate_n_v2(inputs, shape=None, tensor_dtype=None, name=None):
-  """Returns the element-wise sum of a list of tensors.
-
-  Optionally, pass `shape` and `tensor_dtype` for shape and type checking,
-  otherwise, these are inferred.
-
-  `tf.accumulate_n_v2` performs the same operation as `tf.add_n`, but does not
-  wait for all of its inputs to be ready before beginning to sum. This can
-  save memory if inputs are ready at different times, since minimum temporary
-  storage is proportional to the output size rather than the inputs size.
-
-  Unlike the original `accumulate_n`, `accumulate_n_v2` is differentiable.
-
-  For example:
-
-  ```python
-  a = tf.constant([[1, 2], [3, 4]])
-  b = tf.constant([[5, 0], [0, 6]])
-  tf.accumulate_n_v2([a, b, a])  # [[7, 4], [6, 14]]
-
-  # Explicitly pass shape and type
-  tf.accumulate_n_v2([a, b, a], shape=[2, 2], tensor_dtype=tf.int32)
-                                                                   # [[7,  4],
-                                                                   #  [6, 14]]
-  ```
-
-  Args:
-    inputs: A list of `Tensor` objects, each with same shape and type.
-    shape: Shape of elements of `inputs`.
-    tensor_dtype: The type of `inputs`.
-    name: A name for the operation (optional).
-
-  Returns:
-    A `Tensor` of same shape and type as the elements of `inputs`.
-
-  Raises:
-    ValueError: If `inputs` don't all have same shape and dtype or the shape
-    cannot be inferred.
-  """
-  _INPUTS_ERR_MSG = ValueError("inputs must be a list of at least one Tensor"
-                               "with the same dtype and shape")
-  if not inputs or not isinstance(inputs, (list, tuple)):
-    raise _INPUTS_ERR_MSG
-  inputs = ops.convert_n_to_tensor_or_indexed_slices(inputs)
-  if not all(isinstance(x, ops.Tensor) for x in inputs):
-    raise _INPUTS_ERR_MSG
-  if not all(x.dtype == inputs[0].dtype for x in inputs):
-    raise _INPUTS_ERR_MSG
-  if shape is not None:
-    shape = tensor_shape.as_shape(shape)
-  else:
-    shape = tensor_shape.unknown_shape()
-  for input_tensor in inputs:
-    if isinstance(input_tensor, ops.Tensor):
-      shape = shape.merge_with(input_tensor.get_shape())
-
-  # tensor_dtype is for safety only; operator's output type computed in C++
-  if tensor_dtype is not None and tensor_dtype != inputs[0].dtype:
-    raise TypeError("tensor_dtype is {}, but input is of type {}"
-                    .format(tensor_dtype, inputs[0].dtype))
-
-  if len(inputs) == 1 and name is None:
-    return inputs[0]
-  elif len(inputs) == 1 and name is not None:
-    return array_ops.identity(inputs[0], name=name)
-  elif context.in_eager_mode():
-    # TemporaryVariable not currently supported in eager mode; fall back
-    # onto AddN for now.
-    # TODO(frreiss) remove this once the lifetime of eager variables gets
-    # addressed
-    return math_ops.add_n(inputs, name=name)
-  else:
-    return gen_math_ops._accumulate_nv2(inputs, name=name, shape=shape)
-
-# The following code should eventually be merged into
-# tensorflow/python/ops/math_grad.py
-@ops.RegisterGradient("AccumulateNV2")
-def _AddNGrad(op, grad):
-  """Same as gradient for AddN. Copies the gradient to all inputs."""
-  # Not broadcasting.
-  return [grad] * len(op.inputs)
diff --git a/tensorflow/contrib/framework/python/ops/critical_section_ops.py b/tensorflow/contrib/framework/python/ops/critical_section_ops.py
index 182fec924febb74a23b82b1664d137f033f3b1b4..ab603cc18e12136baea35b10999771c0ada2dd2c 100644
--- a/tensorflow/contrib/framework/python/ops/critical_section_ops.py
+++ b/tensorflow/contrib/framework/python/ops/critical_section_ops.py
@@ -27,7 +27,11 @@ from tensorflow.python.eager import context
 from tensorflow.python.eager import function
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gen_resource_variable_ops
+from tensorflow.python.ops import tensor_array_ops
 from tensorflow.python.util import nest
 
 
@@ -38,7 +42,8 @@ CRITICAL_SECTION_EXECUTIONS = "critical_section_executions"
 
 class _ExecutionSignature(
     collections.namedtuple("_ExecutionSignature",
-                           ("op", "exclusive_resource_access"))):
+                           ("op", "handle",
+                            "resources", "exclusive_resource_access"))):
   """A class storing an `ExecuteInCriticalResource` op and associated attrs."""
   pass
 
@@ -112,16 +117,18 @@ class CriticalSection(object):
   ```
   """
 
-  def __init__(self, name=None, critical_section_def=None, import_scope=None):
+  def __init__(self, name=None, shared_name=None,
+               critical_section_def=None, import_scope=None):
     """Creates a critical section."""
     if critical_section_def and name is not None:
-      raise ValueError("critical_section_def and name are mutually exclusive.")
+      raise ValueError("critical_section_def and shared_name are "
+                       "mutually exclusive.")
     if critical_section_def:
       self._init_from_proto(critical_section_def, import_scope=import_scope)
     else:
-      self._init_from_args(name)
+      self._init_from_args(name, shared_name)
 
-  def _init_from_proto(self, critical_section_def, import_scope):
+  def _init_from_proto(self, critical_section_def, import_scope):  # pylint: disable=invalid-name
     raise NotImplementedError("Not yet implemented")
     # TODO(ebrevdo): Re-enable once CriticalSection is in core.
     # assert isinstance(
@@ -133,18 +140,20 @@ class CriticalSection(object):
     #         critical_section_def.critical_section_name,
     #         import_scope=import_scope))
 
-  def _init_from_args(self, name):
+  def _init_from_args(self, name, shared_name):  # pylint: disable=invalid-name
     """Initialize the CriticalSection from constructor arguments."""
     with ops.name_scope(name, "CriticalSection", []) as name:
-      with ops.control_dependencies(None):
+      with ops.init_scope():
         # pylint: disable=protected-access
-        handle_name = ops._name_from_scope_name(name)
         container = ops.get_default_graph()._container
         # pylint: enable=protected-access
+        if shared_name is None:
+          shared_name = name
         if container is None:
           container = ""
-        self._handle = gen_resource_variable_ops.critical_section_op(
-            shared_name=handle_name, name=name)
+        self._handle = gen_resource_variable_ops.mutex_v2(
+            shared_name=shared_name, container=container, name=name)
+
     if context.in_graph_mode():
       ops.add_to_collections(CRITICAL_SECTIONS, self)
 
@@ -183,68 +192,98 @@ class CriticalSection(object):
     name = kwargs.pop("name", None)
     exclusive_resource_access = kwargs.pop("exclusive_resource_access", True)
 
-    args = nest.map_structure(ops.convert_to_tensor, args)
     with ops.name_scope(name, "critical_section_execute", []):
-      fn_op = function.make_defun_op(fn, *args, **kwargs)
-      flat_dtypes = nest.flatten(fn_op.output_dtypes)
-      flat_shapes = nest.flatten(fn_op.output_shapes)
-      all_inputs = nest.flatten(args) + fn_op.captured_inputs
-      if self._handle in all_inputs:
+      lock = gen_resource_variable_ops.mutex_lock(self._handle)
+
+      with ops.control_dependencies([lock]):
+        c_known_ops = set()
+        c_captured_tensors = set()
+
+        def add_op_internal(op):
+          c_known_ops.add(op)
+          for i in op.inputs:
+            if i.op not in c_known_ops:
+              c_captured_tensors.add(i)
+
+        c = function.HelperContext(add_op_internal)
+        with c:
+          r = fn(*args, **kwargs)
+
+        resource_inputs = set([
+            x for x in
+            list(nest.flatten(args)) + nest.flatten(kwargs.values()) +
+            list(c_captured_tensors)
+            if tensor_util.is_tensor(x) and x.dtype == dtypes.resource])
+
+      if self._handle in resource_inputs:
         raise ValueError("The function fn attempts to access the "
-                         "CriticalSection in which it would be running.  This "
-                         "is illegal and would cause deadlocks.  "
+                         "CriticalSection in which it would be running.  "
+                         "This is illegal and would cause deadlocks.  "
                          "CriticalSection: %s." % self._handle)
 
       if context.in_graph_mode():
         # Collections and op introspection does not work in eager
         # mode.  This is generally ok; since eager mode (as of
         # writing) executes sequentially anyway.
-        all_input_resources = [
-            x for x in all_inputs if x.dtype == dtypes.resource]
         for sg in ops.get_collection(CRITICAL_SECTION_EXECUTIONS):
-          if sg.op.inputs[0].name == self._handle.name:
+          sg_handle_name = ops.convert_to_tensor(sg.handle).name
+          self_handle_name = ops.convert_to_tensor(self._handle).name
+          if sg_handle_name == self_handle_name:
             # Other executions in the same critical section are allowed.
             continue
           if not (exclusive_resource_access or sg.exclusive_resource_access):
             # Neither execution requested exclusive access.
             continue
-          sg_input_names = [y.name for y in sg.op.inputs[1:]]
-          for res in all_input_resources:
-            if res.name in sg_input_names:
-              raise ValueError(
-                  "This execution would access resource %s; but either this "
-                  "execution (CriticalSection: %s) or Execution '%s' "
-                  "(CriticalSection: %s) requested exclusive resource access "
-                  "of this resource for their critical section.  Did you mean "
-                  "to call execute with keyword argument "
-                  "exclusive_resource_access=False?"
-                  % (res.name,
-                     self.name,
-                     sg.op.name,
-                     sg.op.inputs[0].op.name))
-
-      flat_outputs = gen_resource_variable_ops.execute_in_critical_section(
-          critical_section=self._handle,
-          arguments=all_inputs,
-          f=fn_op,
-          output_types=flat_dtypes,
-          output_shapes=flat_shapes)
+          resource_intersection = resource_inputs.intersection(sg.resources)
+          if resource_intersection:
+            raise ValueError(
+                "This execution would access resources: %s.  Either this "
+                "lock (CriticalSection: %s) or lock '%s' "
+                "(CriticalSection: %s) requested exclusive resource access "
+                "of this resource.  Did you mean to call execute with keyword "
+                "argument exclusive_resource_access=False?" %
+                (list(resource_intersection), self._handle.name,
+                 sg.op.name, sg.handle.name))
+
+      def identity(x):  # pylint: disable=invalid-name
+        if isinstance(x, tensor_array_ops.TensorArray):
+          return x.identity()
+        elif isinstance(x, ops.Operation):
+          return control_flow_ops.group(x)
+        elif context.in_eager_mode() and x is None:
+          return None
+        else:
+          return array_ops.identity(x)
+
+      r_flat = [identity(x) for x in nest.flatten(r)]
+
+      with ops.control_dependencies(r_flat):
+        # The identity must run on the same machine as self._handle
+        with ops.colocate_with(self._handle):
+          # Do not use array_ops.identity as there are special
+          # optimizations within TensorFlow which seem to elide it
+          # even when optimizations are disabled(!).
+          ensure_lock_exists = gen_resource_variable_ops.consume_mutex_lock(
+              lock)
+
+        # Make sure that if any element of r is accessed, all of
+        # them are executed together.
+        r = nest.pack_sequence_as(
+            r, control_flow_ops.tuple(nest.flatten(r)))
+
+      with ops.control_dependencies([ensure_lock_exists]):
+        outputs = nest.map_structure(identity, r)
 
       if context.in_graph_mode():
-        if isinstance(flat_outputs, ops.Operation):
-          flat_outputs = [flat_outputs]
-        op = (flat_outputs[0].op if isinstance(flat_outputs[0], ops.Tensor)
-              else flat_outputs[0])
         signature = _ExecutionSignature(
-            op=op,
+            op=lock.op,
+            handle=self._handle,
+            resources=list(resource_inputs),
             exclusive_resource_access=exclusive_resource_access)
         ops.add_to_collections(
             CRITICAL_SECTION_EXECUTIONS, signature)
 
-      return (flat_outputs[0]
-              if (len(flat_outputs) == 1
-                  and isinstance(flat_outputs[0], ops.Operation))
-              else nest.pack_sequence_as(fn_op.output_dtypes, flat_outputs))
+      return outputs
 
   # TODO(ebrevdo): Re-enable once CriticalSection is in core.
 
@@ -276,6 +315,7 @@ class CriticalSection(object):
 
 # def _execution_to_proto_fn(execution_signature, export_scope=None):
 #   """Converts `_ExecutionSignature` to a `CriticalSectionExecutionDef`.
+#   # TODO(ebrevdo): Update for _ExecutionSignature storing resource list.
 
 #   Args:
 #     execution_signature: Instance of `_ExecutionSignature`.
@@ -298,6 +338,7 @@ class CriticalSection(object):
 
 # def _execution_from_proto_fn(op_def, import_scope=None):
 #   """Converts a `CriticalSectionExecutionDef` to a `_ExecutionSignature`."""
+#   # TODO(ebrevdo): Update for _ExecutionSignature storing resource list.
 #   assert isinstance(
 #       op_def, critical_section_pb2.CriticalSectionExecutionDef)
 
diff --git a/tensorflow/contrib/framework/python/ops/critical_section_test.py b/tensorflow/contrib/framework/python/ops/critical_section_test.py
index a416724d3ba1719471d70667e140f9cd2daf86c7..c916592ce1979fe3a79cf28ad4bdac44284cce97 100644
--- a/tensorflow/contrib/framework/python/ops/critical_section_test.py
+++ b/tensorflow/contrib/framework/python/ops/critical_section_test.py
@@ -19,12 +19,10 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.contrib.framework.python.ops import critical_section_ops
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.platform import test
 # TODO(ebrevdo): Re-enable once CriticalSection is in core.
@@ -35,7 +33,7 @@ class CriticalSectionTest(test.TestCase):
 
   @test_util.run_in_graph_and_eager_modes()
   def testCreateCriticalSection(self):
-    cs = critical_section_ops.CriticalSection(name="cs")
+    cs = critical_section_ops.CriticalSection(shared_name="cs")
     v = resource_variable_ops.ResourceVariable(0.0, name="v")
 
     def fn(a, b):
@@ -45,16 +43,72 @@ class CriticalSectionTest(test.TestCase):
         with ops.control_dependencies([nv]):
           return array_ops.identity(c)
 
-    num_concurrent = 1000
+    num_concurrent = 100
     r = [cs.execute(fn, 1.0, 2.0) for _ in range(num_concurrent)]
     self.evaluate(v.initializer)
     r_value = self.evaluate(r)
     self.assertAllClose([2.0 * i for i in range(num_concurrent)],
                         sorted(r_value))
 
+  @test_util.run_in_graph_and_eager_modes()
+  def testCriticalSectionWithControlFlow(self):
+    for outer_cond in [False, True]:
+      for inner_cond in [False, True]:
+        cs = critical_section_ops.CriticalSection(shared_name="cs")
+        v = resource_variable_ops.ResourceVariable(0.0, name="v")
+        num_concurrent = 100
+
+        # pylint: disable=cell-var-from-loop
+        def fn(a, b):
+          c = v.read_value()
+          def true_fn():
+            with ops.control_dependencies([c]):
+              nv = v.assign_add(a * b)
+              with ops.control_dependencies([nv]):
+                return array_ops.identity(c)
+          return control_flow_ops.cond(
+              array_ops.identity(inner_cond), true_fn, lambda: c)
+
+        def execute():
+          return cs.execute(fn, 1.0, 2.0)
+
+        r = [
+            control_flow_ops.cond(array_ops.identity(outer_cond),
+                                  execute,
+                                  v.read_value)
+            for _ in range(num_concurrent)
+        ]
+        # pylint: enable=cell-var-from-loop
+
+        self.evaluate(v.initializer)
+        r_value = self.evaluate(r)
+        if inner_cond and outer_cond:
+          self.assertAllClose([2.0 * i for i in range(num_concurrent)],
+                              sorted(r_value))
+        else:
+          self.assertAllClose([0] * num_concurrent, r_value)
+
+  def testCriticalSectionInParallelDoesntDeadlockOnError(self):
+    # No eager mode execution of this test because eager does not
+    # run fn() in parallel, which is where the deadlock could
+    # potentially occur (in graph mode).
+    cs = critical_section_ops.CriticalSection(shared_name="cs")
+    v = resource_variable_ops.ResourceVariable(0.0, name="v")
+
+    def fn(i):
+      error = control_flow_ops.Assert((i % 2) == 1, ["Error"])
+      with ops.control_dependencies([error]):
+        return v.read_value()
+    num_concurrent = 2
+    r = [cs.execute(fn, i) for i in range(num_concurrent)]
+    self.evaluate(v.initializer)
+    for _ in range(100):
+      with self.assertRaisesOpError("Error"):
+        self.evaluate(r)
+
   @test_util.run_in_graph_and_eager_modes()
   def testCreateCriticalSectionFnReturnsOp(self):
-    cs = critical_section_ops.CriticalSection(name="cs")
+    cs = critical_section_ops.CriticalSection(shared_name="cs")
     v = resource_variable_ops.ResourceVariable(0.0, name="v")
 
     def fn_return_op(a, b):
@@ -62,7 +116,7 @@ class CriticalSectionTest(test.TestCase):
       with ops.control_dependencies([c]):
         nv = v.assign_add(a * b)
         with ops.control_dependencies([nv]):
-          return ()
+          return control_flow_ops.no_op()
 
     num_concurrent = 100
     r = [cs.execute(fn_return_op, 1.0, 2.0) for _ in range(num_concurrent)]
@@ -71,47 +125,25 @@ class CriticalSectionTest(test.TestCase):
     final_v = self.evaluate(v)
     self.assertAllClose(2.0 * num_concurrent, final_v)
 
-  def testCreateCriticalSectionRaw(self):
-    cs = critical_section_ops.CriticalSection(name="cs")
-    v = resource_variable_ops.ResourceVariable(0.0, name="v")
-
-    @function.Defun(dtypes.float32, dtypes.float32)
-    def fn(a, b):
-      c = v.read_value()
-      with ops.control_dependencies([c]):
-        nv = v.assign_add(a * b)
-        with ops.control_dependencies([nv]):
-          return array_ops.identity(c)
-
-    def execute(fn, *args):
-      output_args = fn.definition.signature.output_arg
-      return resource_variable_ops.execute_in_critical_section(
-          critical_section=cs._handle,
-          arguments=list(args) + fn.captured_inputs,
-          f=fn,
-          output_types=[out.type for out in output_args],
-          output_shapes=[tensor_shape.TensorShape(None) for _ in output_args])
-
-    num_concurrent = 1000
-    r = [execute(fn, 1.0, 2.0)[0] for _ in range(num_concurrent)]
-    self.evaluate(v.initializer)
-    r_value = self.evaluate(r)
-    self.assertAllClose([2.0 * i for i in range(num_concurrent)],
-                        sorted(r_value))
-
   def testCollection(self):
-    cs = critical_section_ops.CriticalSection(name="cs")
+    cs = critical_section_ops.CriticalSection(shared_name="cs")
     self.assertIn(
         cs, ops.get_collection(critical_section_ops.CRITICAL_SECTIONS))
-    execute_op = cs.execute(lambda x: x + 1, 1.0).op
+    execute = cs.execute(lambda x: x + 1, 1.0, name="my_execute")
+    execute_op = [
+        x for x in execute.graph.get_operations()
+        if "my_execute" in x.name and "MutexLock" in x.type
+    ][0]
     self.assertIn(
         execute_op,
         [signature.op for signature in
          ops.get_collection(critical_section_ops.CRITICAL_SECTION_EXECUTIONS)])
 
-  @test_util.run_in_graph_and_eager_modes()
   def testRecursiveCriticalSectionAccessIsIllegal(self):
-    cs = critical_section_ops.CriticalSection(name="cs")
+    # This does not work properly in eager mode.  Eager users will
+    # just hit a deadlock if they do this.  But at least it'll be easier
+    # to debug.
+    cs = critical_section_ops.CriticalSection(shared_name="cs")
     def fn(x):
       return cs.execute(lambda x: x+1, x)
     with self.assertRaisesRegexp(
@@ -167,7 +199,7 @@ class CriticalSectionTest(test.TestCase):
   #     self.assertEqual(restored_exec[0].op.name, "imported/%s" % r.op.name)
 
   # def testToProto(self):
-  #   cs = critical_section_ops.CriticalSection(name="cs")
+  #   cs = critical_section_ops.CriticalSection(shared_name="cs")
   #   proto = cs.to_proto()
   #   self.assertEqual(proto.critical_section_name, cs._handle.name)
   #   cs_copy = critical_section_ops.CriticalSection.from_proto(proto)
diff --git a/tensorflow/contrib/gan/python/eval/python/summaries_test.py b/tensorflow/contrib/gan/python/eval/python/summaries_test.py
index 7956db43348c0cc0f3d372e92a2e343f5aa62013..45eb108586bed07434ac29595164745eac6054c1 100644
--- a/tensorflow/contrib/gan/python/eval/python/summaries_test.py
+++ b/tensorflow/contrib/gan/python/eval/python/summaries_test.py
@@ -90,8 +90,7 @@ class SummariesTest(test.TestCase):
     self._test_add_gan_model_image_summaries_impl(get_gan_model, 2, False)
 
   def test_add_gan_model_image_summaries_for_cyclegan(self):
-    self._test_add_gan_model_image_summaries_impl(get_cyclegan_model, 10,
-                                                  True)
+    self._test_add_gan_model_image_summaries_impl(get_cyclegan_model, 10, True)
 
   def _test_add_gan_model_summaries_impl(self, get_model_fn,
                                          expected_num_summary_ops):
diff --git a/tensorflow/contrib/graph_editor/reroute.py b/tensorflow/contrib/graph_editor/reroute.py
index 7ffdbb7139281734917fdb715601b317eb58b82f..95c02a64d47c26e731ef2628fb551529e9bc3f4d 100644
--- a/tensorflow/contrib/graph_editor/reroute.py
+++ b/tensorflow/contrib/graph_editor/reroute.py
@@ -471,9 +471,10 @@ def remove_control_inputs(op, cops):
     if cop not in op.control_inputs:
       raise ValueError("{} is not a control_input of {}".format(op.name,
                                                                 cop.name))
+  control_inputs = [cop for cop in op.control_inputs if cop not in cops]
   # pylint: disable=protected-access
-  op._control_inputs = [cop for cop in op._control_inputs if cop not in cops]
-  op._recompute_node_def()
+  op._remove_all_control_inputs()
+  op._add_control_inputs(control_inputs)
   # pylint: enable=protected-access
 
 
@@ -496,9 +497,6 @@ def add_control_inputs(op, cops):
     if cop in op.control_inputs:
       raise ValueError("{} is already a control_input of {}".format(cop.name,
                                                                     op.name))
-  # pylint: disable=protected-access
-  op._control_inputs += cops
-  op._recompute_node_def()
-  # pylint: enable=protected-access
+  op._add_control_inputs(cops)  # pylint: disable=protected-access
 
 remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/kfac/python/kernel_tests/estimator_test.py b/tensorflow/contrib/kfac/python/kernel_tests/estimator_test.py
index bfdb69ad02caaa57827e0ae6b3c9fc0d0ed03754..b12f7be76907dc206667eb8ee0c750f3b8db57fc 100644
--- a/tensorflow/contrib/kfac/python/kernel_tests/estimator_test.py
+++ b/tensorflow/contrib/kfac/python/kernel_tests/estimator_test.py
@@ -90,49 +90,51 @@ class EstimatorTest(test.TestCase):
   def testEstimatorInitManualRegistration(self):
     with self._graph.as_default():
       # We should be able to build an estimator for only the registered vars.
-      estimator.FisherEstimator([self.weights], 0.1, 0.2, self.layer_collection)
+      estimator.FisherEstimator(lambda: 0.2, [self.weights], 0.1,
+                                self.layer_collection)
 
       # Check that we throw an error if we try to build an estimator for vars
       # that were not manually registered.
       with self.assertRaises(ValueError):
-        estimator.FisherEstimator([self.weights, self.bias], 0.1, 0.2,
+        estimator.FisherEstimator(lambda: 0.2, [self.weights, self.bias], 0.1,
                                   self.layer_collection)
 
       # Check that we throw an error if we don't include registered variables,
       # i.e. self.weights
       with self.assertRaises(ValueError):
-        estimator.FisherEstimator([], 0.1, 0.2, self.layer_collection)
+        estimator.FisherEstimator(lambda: 0.2, [], 0.1, self.layer_collection)
 
   @test.mock.patch.object(utils.SubGraph, "variable_uses", return_value=42)
   def testVariableWrongNumberOfUses(self, mock_uses):
     with self.assertRaises(ValueError):
-      estimator.FisherEstimator([self.weights], 0.1, 0.2, self.layer_collection)
+      estimator.FisherEstimator(lambda: 0.2, [self.weights], 0.1,
+                                self.layer_collection)
 
   def testInvalidEstimationMode(self):
     with self.assertRaises(ValueError):
-      estimator.FisherEstimator([self.weights], 0.1, 0.2, self.layer_collection,
-                                "not_a_real_mode")
+      estimator.FisherEstimator(lambda: 0.2, [self.weights], 0.1,
+                                self.layer_collection, "not_a_real_mode")
 
   def testModeListCorrect(self):
     with self._graph.as_default():
-      est = estimator.FisherEstimator([self.weights], 0.1, 0.2,
+      est = estimator.FisherEstimator(lambda: 0.2, [self.weights], 0.1,
                                       self.layer_collection)
     self.assertItemsEqual(_ALL_ESTIMATION_MODES, est._gradient_fns.keys())
 
   def testAllModesBuild(self):
     for mode in _ALL_ESTIMATION_MODES:
       with self._graph.as_default():
-        estimator.FisherEstimator([self.weights], 0.1, 0.2,
+        estimator.FisherEstimator(lambda: 0.2, [self.weights], 0.1,
                                   self.layer_collection, mode)
 
   def test_cov_update_thunks(self):
     """Ensures covariance update ops run once per global_step."""
     with self._graph.as_default(), self.test_session() as sess:
       fisher_estimator = estimator.FisherEstimator(
+          damping_fn=lambda: 0.2,
           variables=[self.weights],
           layer_collection=self.layer_collection,
-          cov_ema_decay=0.0,
-          damping=0.0)
+          cov_ema_decay=0.0)
 
       # Construct an op that executes one covariance update per step.
       global_step = training_util.get_or_create_global_step()
@@ -176,10 +178,10 @@ class EstimatorTest(test.TestCase):
     """Ensures inverse update ops run once per global_step."""
     with self._graph.as_default(), self.test_session() as sess:
       fisher_estimator = estimator.FisherEstimator(
+          damping_fn=lambda: 0.2,
           variables=[self.weights],
           layer_collection=self.layer_collection,
-          cov_ema_decay=0.0,
-          damping=0.0)
+          cov_ema_decay=0.0)
 
       # Construct op that updates one inverse per global step.
       global_step = training_util.get_or_create_global_step()
diff --git a/tensorflow/contrib/kfac/python/ops/BUILD b/tensorflow/contrib/kfac/python/ops/BUILD
index ee6549b109399766579b6ea18a987ae2c8275983..c26230c2a82ae9529ab13b523b9ec287d17debaf 100644
--- a/tensorflow/contrib/kfac/python/ops/BUILD
+++ b/tensorflow/contrib/kfac/python/ops/BUILD
@@ -144,10 +144,13 @@ py_library(
         ":fisher_estimator",
         "//tensorflow/python:array_ops",
         "//tensorflow/python:control_flow_ops",
+        "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
         "//tensorflow/python:linalg_ops",
         "//tensorflow/python:math_ops",
+        "//tensorflow/python:state_ops",
         "//tensorflow/python:training",
+        "//tensorflow/python:variable_scope",
         "//tensorflow/python:variables",
     ],
 )
diff --git a/tensorflow/contrib/kfac/python/ops/estimator.py b/tensorflow/contrib/kfac/python/ops/estimator.py
index a7b1f9d35c931fc44408be804479e758f28f7110..a7e268c48ae326a4d8fa5fe4a4ed15b8b83a0ed9 100644
--- a/tensorflow/contrib/kfac/python/ops/estimator.py
+++ b/tensorflow/contrib/kfac/python/ops/estimator.py
@@ -83,9 +83,9 @@ class FisherEstimator(object):
   """
 
   def __init__(self,
+               damping_fn,
                variables,
                cov_ema_decay,
-               damping,
                layer_collection,
                estimation_mode="gradients",
                colocate_gradients_with_ops=True,
@@ -94,16 +94,12 @@ class FisherEstimator(object):
     """Create a FisherEstimator object.
 
     Args:
+      damping_fn: Function, accepts no arguments and returns damping value.
       variables: A list of the variables for which to estimate the Fisher. This
           must match the variables registered in layer_collection (if it is not
           None).
       cov_ema_decay: The decay factor used when calculating the covariance
           estimate moving averages.
-      damping: The damping factor used to stabilize training due to errors in
-          the local approximation with the Fisher information matrix, and to
-          regularize the update direction by making it closer to the gradient.
-          (Higher damping means the update looks more like a standard gradient
-          update - see Tikhonov regularization.)
       layer_collection: The layer collection object, which holds the fisher
           blocks, kronecker factors, and losses associated with the
           graph.
@@ -135,10 +131,9 @@ class FisherEstimator(object):
     Raises:
       ValueError: If no losses have been registered with layer_collection.
     """
-
+    self._damping_fn = damping_fn
     self._cov_ema_decay = cov_ema_decay
     self._variables = variables
-    self._damping = damping
     self._estimation_mode = estimation_mode
     self._layers = layer_collection
     self._layers.create_subgraph()
@@ -182,7 +177,7 @@ class FisherEstimator(object):
 
   @property
   def damping(self):
-    return self._damping
+    return self._damping_fn()
 
   def _apply_transformation(self, vecs_and_vars, transform):
     """Applies an block-wise transformation to the corresponding vectors.
diff --git a/tensorflow/contrib/kfac/python/ops/optimizer.py b/tensorflow/contrib/kfac/python/ops/optimizer.py
index 1974b07acfc879dc4bc844db9af88fd1043d6698..5d456bcb79ff00cedc1aaa7244cc8722d21f6e98 100644
--- a/tensorflow/contrib/kfac/python/ops/optimizer.py
+++ b/tensorflow/contrib/kfac/python/ops/optimizer.py
@@ -23,11 +23,14 @@ from tensorflow.contrib.kfac.python.ops import curvature_matrix_vector_products
 from tensorflow.contrib.kfac.python.ops import estimator as est
 # pylint enable=long-line
 
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import linalg_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables as tf_variables
 from tensorflow.python.training import gradient_descent
 
@@ -61,6 +64,8 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
       damping: The damping factor used to stabilize training due to errors in
           the local approximation with the Fisher information matrix, and to
           regularize the update direction by making it closer to the gradient.
+          If damping is adapted during training then this value is used for
+          initializing the damping variable.
           (Higher damping means the update looks more like a standard gradient
           update - see Tikhonov regularization.)
       layer_collection: The layer collection object, which holds the fisher
@@ -105,10 +110,31 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
     if variables is None:
       variables = tf_variables.trainable_variables()
 
+    # The parameters below are required only if damping needs to be adapted.
+    # These parameters can be set by calling
+    # set_damping_adaptation_params() explicitly.
+    self._damping_adaptation_decay = 0.95
+    self._damping_adaptation_interval = 5
+    # Check section 6.5 KFAC paper. omega(1) = pow(damping decay, interval)
+    self._omega = (
+        self._damping_adaptation_decay**self._damping_adaptation_interval)
+    self._adapt_damping = False
+    self._min_damping = 1e-5
+    self._prev_train_batch = None
+    self._is_chief = False
+    self._loss_fn = None
+    self._damping_constant = damping
+    self._damping = None
+    self._rho = None
+    self._prev_loss = None
+    self._q_model_change = None
+    self._update_damping_op = None
+
+    self._layers = layer_collection
     self._fisher_est = est.FisherEstimator(
+        lambda: self.damping,
         variables,
         cov_ema_decay,
-        damping,
         layer_collection,
         estimation_mode=estimation_mode,
         colocate_gradients_with_ops=colocate_gradients_with_ops,
@@ -139,6 +165,60 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
 
     super(KfacOptimizer, self).__init__(learning_rate, name=name)
 
+  def set_damping_adaptation_params(self,
+                                    is_chief,
+                                    prev_train_batch,
+                                    loss_fn,
+                                    min_damping=1e-5,
+                                    damping_adaptation_decay=0.99,
+                                    damping_adaptation_interval=5):
+    """Sets parameters required to adapt damping during training.
+
+    When called, enables damping adaptation according to the Levenberg-Marquardt
+    style rule described in Section 6.5 of "Optimizing Neural Networks with
+    Kronecker-factored Approximate Curvature".
+
+    Args:
+      is_chief: `Boolean`, `True` if the worker is chief.
+      prev_train_batch: Training data used to minimize loss in the previous
+        step. This will be used to evaluate loss by calling
+        `loss_fn(prev_train_batch)`.
+      loss_fn: `function` that takes as input training data tensor and returns
+        a scalar loss.
+      min_damping: `float`(Optional), Minimum value the damping parameter
+        can take. Default value 1e-5.
+      damping_adaptation_decay: `float`(Optional), The `damping` parameter is
+        multiplied by the `damping_adaptation_decay` every
+        `damping_adaptation_interval` number of iterations. Default value 0.99.
+      damping_adaptation_interval: `int`(Optional), Number of steps in between
+        updating the `damping` parameter. Default value 5.
+
+    Raises:
+      ValueError: If `set_damping_adaptation_params` was already called and
+        `adapt_damping` is `True`.
+    """
+    if self._adapt_damping:
+      raise ValueError("Damping adaptation parameters already set.")
+    with variable_scope.variable_scope(self.get_name()):
+      self._adapt_damping = True
+      self._is_chief = is_chief
+      self._prev_train_batch = prev_train_batch
+      self._loss_fn = loss_fn
+      self._damping_adaptation_decay = damping_adaptation_decay
+      self._damping_adaptation_interval = damping_adaptation_interval
+      self._omega = (
+          self._damping_adaptation_decay**self._damping_adaptation_interval)
+      self._min_damping = min_damping
+
+      self._rho = variable_scope.get_variable(
+          "rho", shape=(), dtype=dtypes.float32, trainable=False)  # LM ratio.
+      self._prev_loss = variable_scope.get_variable(
+          "prev_loss", shape=(), dtype=dtypes.float32, trainable=False)
+      self._q_model_change = variable_scope.get_variable(
+          "q_model_change", shape=(), dtype=dtypes.float32, trainable=False)
+      self._damping = variable_scope.get_variable(
+          "damping", initializer=self._damping_constant, trainable=False)
+
   @property
   def cov_update_thunks(self):
     return self._fisher_est.cov_update_thunks
@@ -169,14 +249,34 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
 
   @property
   def damping(self):
-    return self._fisher_est.damping
+    if self._damping is not None:
+      return self._damping
+    else:
+      return self._damping_constant
+
+  @property
+  def damping_adaptation_interval(self):
+    return self._damping_adaptation_interval
 
   def minimize(self, *args, **kwargs):
     kwargs["var_list"] = kwargs.get("var_list") or self.variables
     if set(kwargs["var_list"]) != set(self.variables):
       raise ValueError("var_list doesn't match with set of Fisher-estimating "
                        "variables.")
-    return super(KfacOptimizer, self).minimize(*args, **kwargs)
+    if self._adapt_damping and self._is_chief:
+      global_step = kwargs.get("global_step", None)
+      if global_step is None:
+        raise KeyError("global_step needs to be passed to optimizer.minimize "
+                       "if damping parameter is adapted.")
+      update_damping_op = self._update_damping(self._prev_train_batch,
+                                               global_step)
+      with ops.control_dependencies([update_damping_op]):
+        loss = args[0]
+        loss_assign_op = state_ops.assign(self._prev_loss, loss)
+        train_op = super(KfacOptimizer, self).minimize(*args, **kwargs)
+        return control_flow_ops.group(loss_assign_op, train_op)
+    else:
+      return super(KfacOptimizer, self).minimize(*args, **kwargs)
 
   def compute_gradients(self, *args, **kwargs):
     # args[1] could be our var_list
@@ -296,6 +396,20 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
     coeff = self._update_clip_coeff(grads_and_vars, precon_grads_and_vars)
     return [(pgrad * coeff, var) for pgrad, var in precon_grads_and_vars]
 
+  def _compute_prev_updates(self, variables):
+    """Computes previous updates as negative velocities scaled by learning rate.
+
+    Args:
+      variables: List of variables in the graph that the update will be
+          applied to.
+
+    Returns:
+      List of previous updates applied to the `variables`.
+    """
+    return list(
+        -1 * self._learning_rate * self._zeros_slot(var, "velocity", self._name)
+        for var in variables)
+
   def _compute_qmodel_hyperparams(self, precon_grads, prev_updates, grads,
                                   variables):
     """Compute optimal update hyperparameters from the quadratic model.
@@ -374,9 +488,9 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
       c = ops.convert_to_tensor([[_inner_product_list(grads, precon_grads)],
                                  [_inner_product_list(grads, prev_updates)]])
 
-      sol = _two_by_two_solve(m, c)
-      alpha = -sol[0]
-      mu = -sol[1]
+      sol = -1. * _two_by_two_solve(m, c)
+      alpha = sol[0]
+      mu = sol[1]
       qmodel_change = 0.5 * math_ops.reduce_sum(sol * c)
 
       return alpha, mu, qmodel_change
@@ -404,6 +518,52 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
     return control_flow_ops.cond(
         math_ops.equal(m_22, 0.0), zero_prevupd_case, non_zero_prevupd_case)
 
+  def _assign_q_model_change(self, q_model_change):
+    """Assigns `q_model_change` to `self._q_model_change` if damping is adapted.
+
+    Note only the chief worker does the assignment.
+
+    Args:
+      q_model_change: Scalar tensor of type `float32`.
+
+    Returns:
+      If `adapt_damping` is `True`, returns an assign op; otherwise returns
+      a no_op().
+    """
+    if self._adapt_damping and self._is_chief:
+      q_model_assign_op = state_ops.assign(self._q_model_change, q_model_change)
+    else:
+      q_model_assign_op = control_flow_ops.no_op()
+    return q_model_assign_op
+
+  def _compute_qmodel_hyperparams_wrapper(self, grads_and_vars,
+                                          precon_grads_and_vars):
+    """Wrapper function for `self._compute_qmodel_hyperparams`.
+
+    Constructs a list of preconditioned gradients and variables. Also creates an
+    op to assign the computed q model change to `self._q_model_change`.
+
+    Args:
+      grads_and_vars: List of (gradient, variable) pairs.
+      precon_grads_and_vars: List of (preconditioned gradients, variable)
+        pairs.
+
+    Returns:
+      (alpha, mu, q_model_assign_op), where alpha and mu are chosen to optimize
+      the quadratic model, `q_model_assign_op` assigns the computed q model
+      change to `self._q_model_change`.
+    """
+    precon_grads = list(
+        precon_grad for (precon_grad, _) in precon_grads_and_vars)
+    grads = list(grad for (grad, _) in grads_and_vars)
+    variables = list(var for (_, var) in grads_and_vars)
+    prev_updates = self._compute_prev_updates(variables)
+    # Compute optimal velocity update parameters according to quadratic model
+    alpha, mu, q_model_change = self._compute_qmodel_hyperparams(
+        precon_grads, prev_updates, grads, variables)
+
+    return alpha, mu, self._assign_q_model_change(q_model_change)
+
   def _compute_update_steps(self, grads_and_vars):
     """Computes the update steps for the variables given the gradients.
 
@@ -411,8 +571,10 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
       grads_and_vars: List of (gradient, variable) pairs.
 
     Returns:
-      An 'Operation that computes the update steps for the given variables.
+      A list of tuples (assign_op, var) where `assign_op` assigns the update
+      steps to `var`.
     """
+
     if self._momentum_type == "regular":
       # Compute "preconditioned" gradient.
       precon_grads_and_vars = self._fisher_est.multiply_inverse(grads_and_vars)
@@ -423,8 +585,13 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
                                                    precon_grads_and_vars)
 
       # Update the velocity with this and return it as the step.
-      return self._update_velocities(precon_grads_and_vars, self._momentum)
-
+      if self._adapt_damping and self._is_chief:
+        _, _, q_model_assign_op = self._compute_qmodel_hyperparams_wrapper(
+            grads_and_vars, precon_grads_and_vars)
+        with ops.control_dependencies([q_model_assign_op]):
+          return self._update_velocities(precon_grads_and_vars, self._momentum)
+      else:
+        return self._update_velocities(precon_grads_and_vars, self._momentum)
     elif self._momentum_type == "adam":
       # Update velocity.
       velocities_and_vars = self._update_velocities(grads_and_vars,
@@ -436,23 +603,13 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
       # Compute "preconditioned" gradient.
       precon_grads_and_vars = self._fisher_est.multiply_inverse(grads_and_vars)
 
-      # Extract out singleton lists from the tuple-lists
-      precon_grads = list(
-          precon_grad for (precon_grad, _) in precon_grads_and_vars)
-      grads = list(grad for (grad, _) in grads_and_vars)
-      variables = list(var for (_, var) in grads_and_vars)
-      # previous updates are the negative velocities (up to scaling by LR)
-      prev_updates = list(
-          -self._zeros_slot(var, "velocity", self._name) for var in variables)
-
       # Compute optimal velocity update parameters according to quadratic model
-      alpha, mu, _ = self._compute_qmodel_hyperparams(
-          precon_grads, prev_updates, grads, variables)
+      alpha, mu, q_model_assign_op = self._compute_qmodel_hyperparams_wrapper(
+          grads_and_vars, precon_grads_and_vars)
 
-      # Update the velocity with precon_grads according to these params
-      # and return it as the step.
-      return self._update_velocities(
-          precon_grads_and_vars, mu, vec_coeff=-alpha)
+      with ops.control_dependencies([q_model_assign_op]):
+        return self._update_velocities(
+            precon_grads_and_vars, mu, vec_coeff=-alpha)
 
   def _update_velocities(self, vecs_and_vars, decay, vec_coeff=1.0):
     """Updates the velocities of the variables with the given vectors.
@@ -482,6 +639,51 @@ class KfacOptimizer(gradient_descent.GradientDescentOptimizer):
     # Go through variable and update its associated part of the velocity vector.
     return [_update_velocity(vec, var) for vec, var in vecs_and_vars]
 
+  # TODO(b/73448937): Move all update damping code to a separate class/function.
+  def _update_damping(self, prev_batch, global_step):
+    """Adapts damping parameter. Check KFAC (Section 6.5) for the details.
+
+    The damping parameter is updated according to the Levenberg-Marquardt rule
+    every `self._damping_adaptation_interval` iterations.
+
+    Args:
+      prev_batch: Tensor or tuple of tensors which can be passed to
+        `self._loss_fn` to evaluate loss.
+      global_step: `Variable` which keeps track of number of times the training
+        variables have been updated.
+    Returns:
+      A `tf.cond` op which updates the damping parameter.
+    """
+    def compute_damping():
+      """Adapts damping parameter based on "reduction ratio".
+
+      Reduction ratio captures how closely the quadratic approximation to the
+      loss function approximates the actual loss within a trust region. The
+      damping update tries to make the damping as small as possible while
+      maintaining the property that the quadratic model remains a good local
+      approximation to the loss function.
+
+      Returns:
+        An Op to assign newly computed damping value to `self._damping`.
+      """
+      prev_batch_loss = self._loss_fn(prev_batch)
+      with ops.control_dependencies([prev_batch_loss]):
+        rho_assign = self._rho.assign(
+            (prev_batch_loss - self._prev_loss) / self._q_model_change)
+        with ops.control_dependencies([rho_assign]):
+          new_damping = control_flow_ops.case(
+              [(self._rho < 0.25, lambda: self.damping / self._omega),
+               (self._rho > 0.75, lambda: self.damping * self._omega)],
+              lambda: self.damping)
+          with ops.control_dependencies([new_damping]):
+            new_damping_min = math_ops.maximum(new_damping, self._min_damping)
+            return control_flow_ops.group(self._damping.assign(new_damping_min))
+
+    return control_flow_ops.cond(
+        math_ops.equal(
+            math_ops.mod(global_step + 1, self._damping_adaptation_interval),
+            0), compute_damping, control_flow_ops.no_op)
+
 
 def _inner_product_list(list1, list2):
   return math_ops.add_n(
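Review note: the branch structure of `compute_damping()` above can be checked in isolation with plain Python. This is a minimal sketch, not the optimizer's code: `adapt_damping` and `should_adapt` are hypothetical helper names, and the thresholds 0.25 and 0.75 are copied from the `case` predicates in the diff. The value of `omega` is not shown in this hunk, so the sketch takes it as a parameter.

```python
def adapt_damping(damping, rho, omega, min_damping):
    """Return the next damping value for a given reduction ratio `rho`.

    Mirrors the three branches in the diff: divide by omega when rho < 0.25,
    multiply by omega when rho > 0.75, otherwise keep the current value.
    The result is clipped from below at `min_damping`.
    """
    if rho < 0.25:
        new_damping = damping / omega
    elif rho > 0.75:
        new_damping = damping * omega
    else:
        new_damping = damping
    return max(new_damping, min_damping)


def should_adapt(global_step, interval):
    """True on the steps where the diff's `cond` selects compute_damping."""
    return (global_step + 1) % interval == 0
```

With an `omega` below 1 (an assumption about how the optimizer is configured), a small `rho` raises the damping and a large `rho` lowers it, down to `min_damping`.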
diff --git a/tensorflow/contrib/kfac/python/ops/utils.py b/tensorflow/contrib/kfac/python/ops/utils.py
index f5bd97cb4e7d547394050e944f75b43a40887f34..88e6fb20e8f97528aea2a92752d79344c27bbf24 100644
--- a/tensorflow/contrib/kfac/python/ops/utils.py
+++ b/tensorflow/contrib/kfac/python/ops/utils.py
@@ -241,19 +241,22 @@ class SubGraph(object):
     # Set of all ancestor Tensors, Ops to 'outputs'.
     self._members = set()
 
-    self._recurse_add(outputs)
-
-  def _recurse_add(self, nodes):
-    """Recursively adds all of nodes' ancestors."""
-    for node in nodes:
-      if node in self._members:
-        continue
-      self._members.add(node)
-
-      if isinstance(node, ops.Tensor):
-        self._recurse_add((node.op,))
-      elif isinstance(node, ops.Operation):
-        self._recurse_add(node.inputs)
+    self._iter_add(outputs)
+
+  def _iter_add(self, root):
+    """Iteratively adds all of nodes' ancestors using depth first search."""
+    stack = [root]
+    while stack:
+      nodes = stack.pop()
+      for node in nodes:
+        if node in self._members:
+          continue
+        self._members.add(node)
+
+        if isinstance(node, ops.Tensor):
+          stack.append((node.op,))
+        elif isinstance(node, ops.Operation):
+          stack.append(node.inputs)
 
   def is_member(self, node):
     """Check if 'node' is in this subgraph."""
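Review note: the move from `_recurse_add` to `_iter_add` replaces the call stack with an explicit list, presumably so that very deep graphs no longer run into Python's recursion limit. The sketch below shows the same traversal pattern on a plain dictionary. `collect_ancestors` and the `parents` mapping are hypothetical stand-ins for `SubGraph` and the `Tensor`/`Operation` links; they are not part of the TensorFlow API.

```python
def collect_ancestors(outputs, parents):
    """Collect `outputs` and all of their ancestors without recursion.

    Args:
      outputs: iterable of starting nodes.
      parents: dict mapping a node to an iterable of its parent nodes
        (a hypothetical stand-in for tensor.op / op.inputs).

    Returns:
      A set containing every reachable node.
    """
    members = set()
    stack = [outputs]  # Each stack entry is a group of nodes, as in the diff.
    while stack:
        nodes = stack.pop()
        for node in nodes:
            if node in members:
                continue
            members.add(node)
            stack.append(parents.get(node, ()))
    return members
```

Because the pending work lives in `stack`, the depth of the graph is limited only by available memory.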
diff --git a/tensorflow/contrib/labeled_tensor/python/ops/core.py b/tensorflow/contrib/labeled_tensor/python/ops/core.py
index abc18aa123bb4d40b54d22ec03257c5350118d13..0c6bba758b429a8c4112bc6abb2fae542b5dfc14 100644
--- a/tensorflow/contrib/labeled_tensor/python/ops/core.py
+++ b/tensorflow/contrib/labeled_tensor/python/ops/core.py
@@ -361,6 +361,10 @@ class LabeledTensor(object):
   def dtype(self):
     return self._tensor.dtype
 
+  @property
+  def shape(self):
+    return self._tensor.shape
+
   @property
   def name(self):
     return self._tensor.name
diff --git a/tensorflow/contrib/labeled_tensor/python/ops/core_test.py b/tensorflow/contrib/labeled_tensor/python/ops/core_test.py
index e70b4923749d89aba1bd0187857d762305daeb07..e378db56afb1d4f9463d2c9b0f1fa4c0feea8fb0 100644
--- a/tensorflow/contrib/labeled_tensor/python/ops/core_test.py
+++ b/tensorflow/contrib/labeled_tensor/python/ops/core_test.py
@@ -244,6 +244,9 @@ class LabeledTensorTest(test_util.Base):
   def test_dtype(self):
     self.assertEqual(self.lt.dtype, self.lt.tensor.dtype)
 
+  def test_shape(self):
+    self.assertEqual(self.lt.shape, self.lt.tensor.shape)
+
   def test_get_shape(self):
     self.assertEqual(self.lt.get_shape(), self.lt.tensor.get_shape())
 
diff --git a/tensorflow/contrib/labeled_tensor/python/ops/ops.py b/tensorflow/contrib/labeled_tensor/python/ops/ops.py
index c957b41a49b292225e547ce17b0c5a247810325a..3ba1026383ef146adb32197ae41b5c251155bf46 100644
--- a/tensorflow/contrib/labeled_tensor/python/ops/ops.py
+++ b/tensorflow/contrib/labeled_tensor/python/ops/ops.py
@@ -951,7 +951,7 @@ def define_reduce_op(op_name, reduce_fn):
           intermediate_axes.append(axis)
 
       reduce_op = reduce_fn(
-          labeled_tensor.tensor, reduction_dimensions, keep_dims=True)
+          labeled_tensor.tensor, reduction_dimensions, keepdims=True)
       reduce_lt = core.LabeledTensor(reduce_op, intermediate_axes)
 
       return squeeze(reduce_lt, axes_to_squeeze, name=scope)
diff --git a/tensorflow/contrib/layers/python/layers/feature_column.py b/tensorflow/contrib/layers/python/layers/feature_column.py
index b7d34d6435789e54403926a342481971e854b449..9ccb589d698ad83c9654f5523ccdcb35b031b3da 100644
--- a/tensorflow/contrib/layers/python/layers/feature_column.py
+++ b/tensorflow/contrib/layers/python/layers/feature_column.py
@@ -154,6 +154,7 @@ from tensorflow.python.ops import string_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import deprecation
+from tensorflow.python.util import nest
 
 
 # Imports the core `InputLayer` symbol in contrib during development.
@@ -554,28 +555,70 @@ def sparse_column_with_integerized_feature(column_name,
 class _SparseColumnHashed(_SparseColumn):
   """See `sparse_column_with_hash_bucket`."""
 
+  def __new__(cls,
+              column_name,
+              is_integerized=False,
+              bucket_size=None,
+              lookup_config=None,
+              combiner="sum",
+              dtype=dtypes.string,
+              hash_keys=None):
+    if hash_keys is not None:
+      if not isinstance(hash_keys, list) or not hash_keys:
+        raise ValueError("hash_keys must be a non-empty list.")
+      if (any([not isinstance(key_pair, list) for key_pair in hash_keys]) or
+          any([len(key_pair) != 2 for key_pair in hash_keys]) or
+          any([not isinstance(key, int) for key in nest.flatten(hash_keys)])):
+        raise ValueError(
+            "Each element of hash_keys must be a pair of integers.")
+    obj = super(_SparseColumnHashed, cls).__new__(
+        cls,
+        column_name,
+        is_integerized=is_integerized,
+        bucket_size=bucket_size,
+        lookup_config=lookup_config,
+        combiner=combiner,
+        dtype=dtype)
+    obj.hash_keys = hash_keys
+    return obj
+
   def _do_transform(self, input_tensor):
     if self.dtype.is_integer:
       sparse_values = string_ops.as_string(input_tensor.values)
     else:
       sparse_values = input_tensor.values
 
-    sparse_id_values = string_ops.string_to_hash_bucket_fast(
-        sparse_values, self.bucket_size, name="lookup")
-    return sparse_tensor_py.SparseTensor(input_tensor.indices, sparse_id_values,
-                                         input_tensor.dense_shape)
+    if self.hash_keys:
+      result = []
+      for key in self.hash_keys:
+        sparse_id_values = string_ops.string_to_hash_bucket_strong(
+            sparse_values, self.bucket_size, key)
+        result.append(
+            sparse_tensor_py.SparseTensor(input_tensor.indices,
+                                          sparse_id_values,
+                                          input_tensor.dense_shape))
+      return sparse_ops.sparse_concat(axis=1, sp_inputs=result, name="lookup")
+    else:
+      sparse_id_values = string_ops.string_to_hash_bucket_fast(
+          sparse_values, self.bucket_size, name="lookup")
+      return sparse_tensor_py.SparseTensor(
+          input_tensor.indices, sparse_id_values, input_tensor.dense_shape)
 
 
 def sparse_column_with_hash_bucket(column_name,
                                    hash_bucket_size,
                                    combiner="sum",
-                                   dtype=dtypes.string):
+                                   dtype=dtypes.string,
+                                   hash_keys=None):
   """Creates a _SparseColumn with hashed bucket configuration.
 
   Use this when your sparse features are in string or integer format, but you
   don't have a vocab file that maps each value to an integer ID.
   output_id = Hash(input_feature_string) % bucket_size
 
+  When `hash_keys` is set, one integer ID is created for each key pair in
+  `hash_keys`. This is useful for reducing collisions among hashed IDs.
+
   Args:
     column_name: A string defining sparse column name.
     hash_bucket_size: An int that is > 1. The number of buckets.
@@ -588,6 +631,9 @@ def sparse_column_with_hash_bucket(column_name,
         * "sqrtn": do l2 normalization on features in the column
       For more information: `tf.embedding_lookup_sparse`.
     dtype: The type of features. Only string and integer types are supported.
+    hash_keys: The hash keys to use. It is a list of lists of two uint64s. If
+      None, a simple and fast hashing algorithm is used. Otherwise, one strong
+      hash ID is produced for each pair of uint64s in this argument.
 
   Returns:
     A _SparseColumn with hashed bucket configuration
@@ -600,7 +646,8 @@ def sparse_column_with_hash_bucket(column_name,
       column_name,
       bucket_size=hash_bucket_size,
       combiner=combiner,
-      dtype=dtype)
+      dtype=dtype,
+      hash_keys=hash_keys)
 
 
 class _SparseColumnKeys(_SparseColumn):
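Review note: the `hash_keys` checks in `_SparseColumnHashed.__new__` can be exercised without building a graph. The following is a minimal sketch that reproduces the two error messages from the diff. `validate_hash_keys` is a hypothetical standalone helper, and it checks each pair directly rather than going through `nest.flatten` as the diff does.

```python
def validate_hash_keys(hash_keys):
    """Raise ValueError unless hash_keys is None or a list of int pairs."""
    if hash_keys is None:
        return
    if not isinstance(hash_keys, list) or not hash_keys:
        raise ValueError("hash_keys must be a non-empty list.")
    for key_pair in hash_keys:
        is_pair = isinstance(key_pair, list) and len(key_pair) == 2
        if not is_pair or not all(isinstance(key, int) for key in key_pair):
            raise ValueError(
                "Each element of hash_keys must be a pair of integers.")
```

The accepted and rejected inputs are the same ones covered by the new tests in `feature_column_test.py`: `[[10, 20], [20, 30]]` passes, while `[]`, `1`, `[1, 2]`, `["key"]` and `[[1, 2.0]]` raise.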
diff --git a/tensorflow/contrib/layers/python/layers/feature_column_test.py b/tensorflow/contrib/layers/python/layers/feature_column_test.py
index fc8f153fe3abdc83aca5abfa9a4bb5f5d5531480..1de9ab705655db9863d9c7d2630f24283c83d44d 100644
--- a/tensorflow/contrib/layers/python/layers/feature_column_test.py
+++ b/tensorflow/contrib/layers/python/layers/feature_column_test.py
@@ -329,6 +329,55 @@ class FeatureColumnTest(test.TestCase):
     self.assertEqual(one_hot.sparse_id_column.name, "ids_weighted_by_weights")
     self.assertEqual(one_hot.length, 3)
 
+  def testOneHotColumnWithSparseColumnWithHashKeys(self):
+    input_values = ["marlo", "unknown", "omar"]
+    inputs = constant_op.constant(input_values)
+    hash_keys = [[10, 20], [20, 30]]
+    hash_column = fc.sparse_column_with_hash_bucket(
+        column_name="ids", hash_bucket_size=10, hash_keys=hash_keys)
+    columns_to_tensors = {}
+    columns_to_tensors["ids"] = inputs
+    hash_column.insert_transformed_feature(columns_to_tensors)
+    self.assertEqual(len(columns_to_tensors), 2)
+    self.assertTrue(hash_column in columns_to_tensors)
+
+    one_hot_column = fc.one_hot_column(hash_column)
+    one_hot_output = one_hot_column._to_dnn_input_layer(
+        columns_to_tensors[hash_column])
+
+    expected = np.array([[0., 1., 0., 0., 0., 0., 0., 1., 0.,
+                          0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 1.],
+                         [1., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
+    with self.test_session() as sess:
+      one_hot_value = sess.run(one_hot_output)
+    self.assertTrue(np.array_equal(one_hot_value, expected))
+
+  def testSparseColumnWithHashKeysWithUnexpectedHashKeys(self):
+    with self.assertRaisesRegexp(ValueError,
+                                 "hash_keys must be a non-empty list."):
+      fc.sparse_column_with_hash_bucket(
+          column_name="ids", hash_bucket_size=100, hash_keys=[])
+
+    with self.assertRaisesRegexp(ValueError,
+                                 "hash_keys must be a non-empty list."):
+      fc.sparse_column_with_hash_bucket(
+          column_name="ids", hash_bucket_size=100, hash_keys=1)
+
+    with self.assertRaisesRegexp(
+        ValueError, "Each element of hash_keys must be a pair of integers."):
+      fc.sparse_column_with_hash_bucket(
+          column_name="ids", hash_bucket_size=100, hash_keys=[1, 2])
+
+    with self.assertRaisesRegexp(
+        ValueError, "Each element of hash_keys must be a pair of integers."):
+      fc.sparse_column_with_hash_bucket(
+          column_name="ids", hash_bucket_size=100, hash_keys=["key"])
+
+    with self.assertRaisesRegexp(
+        ValueError, "Each element of hash_keys must be a pair of integers."):
+      fc.sparse_column_with_hash_bucket(
+          column_name="ids", hash_bucket_size=100, hash_keys=[[1, 2.0]])
+
   def testMissingValueInOneHotColumnForWeightedSparseColumn(self):
     # Github issue 12583
     ids = fc.sparse_column_with_keys("ids", ["marlo", "omar", "stringer"])
diff --git a/tensorflow/contrib/layers/python/layers/layers.py b/tensorflow/contrib/layers/python/layers/layers.py
index 5c1ff9ec267f1bccd9bee44a4b19e7ed3ec24cf0..80cbe68870808328b387e2044fe236af5a5e39f8 100644
--- a/tensorflow/contrib/layers/python/layers/layers.py
+++ b/tensorflow/contrib/layers/python/layers/layers.py
@@ -51,7 +51,6 @@ from tensorflow.python.ops import standard_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables as tf_variables
 from tensorflow.python.training import moving_averages
-from tensorflow.python.layers.maxout import maxout
 
 # TODO(b/28426988): Replace legacy_* fns migrated from slim.
 # TODO(b/28426988): Remove legacy_* when all uses have migrated to new API.
@@ -2187,8 +2186,10 @@ def layer_norm(inputs,
 
 
 @add_arg_scope
-def images_to_sequence(inputs, data_format=DATA_FORMAT_NHWC,
-                       outputs_collections=None, scope=None):
+def images_to_sequence(inputs,
+                       data_format=DATA_FORMAT_NHWC,
+                       outputs_collections=None,
+                       scope=None):
   """Convert a batch of images into a batch of sequences.
   Args:
     inputs: a (num_images, height, width, depth) tensor
@@ -2694,8 +2695,11 @@ def separable_convolution2d(
 
 
 @add_arg_scope
-def sequence_to_images(inputs, height, output_data_format='channels_last',
-                       outputs_collections=None, scope=None):
+def sequence_to_images(inputs,
+                       height,
+                       output_data_format='channels_last',
+                       outputs_collections=None,
+                       scope=None):
   """Convert a batch of sequences into a batch of images.
   Args:
     inputs: (num_steps, num_batches, depth) sequence tensor
@@ -2936,6 +2940,53 @@ def unit_norm(inputs, dim, epsilon=1e-7, scope=None):
     return math_ops.div(inputs, array_ops.tile(lengths, multiples))
 
 
+@add_arg_scope
+def maxout(inputs, num_units, axis=-1, scope=None):
+  """Adds a maxout op from https://arxiv.org/abs/1302.4389
+
+  "Maxout Networks" Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron
+  Courville,
+   Yoshua Bengio
+
+  Usually the operation is performed in the filter/channel dimension. This can
+  also be
+  used after fully-connected layers to reduce number of features.
+
+  Arguments:
+    inputs: Tensor input
+    num_units: Specifies how many features will remain after maxout
+      in the `axis` dimension (usually channel).
+      The size of the `axis` dimension must be a multiple of `num_units`.
+    axis: The dimension where max pooling will be performed. Default is the
+    last dimension.
+    scope: Optional scope for variable_scope.
+
+  Returns:
+    A `Tensor` representing the results of the pooling operation.
+
+  Raises:
+    ValueError: if the number of features is not a multiple of `num_units`.
+  """
+  with variable_scope.variable_scope(scope, 'MaxOut', [inputs]):
+    inputs = ops.convert_to_tensor(inputs)
+    shape = inputs.get_shape().as_list()
+    num_channels = shape[axis]
+    if num_channels % num_units:
+      raise ValueError('number of features({}) is not '
+                       'a multiple of num_units({})'.format(
+                           num_channels, num_units))
+    shape[axis] = -1
+    shape += [num_channels // num_units]
+
+    # Dealing with batches with arbitrary sizes
+    for i in range(len(shape)):
+      if shape[i] is None:
+        shape[i] = array_ops.shape(inputs)[i]
+    outputs = math_ops.reduce_max(
+        array_ops.reshape(inputs, shape), -1, keepdims=False)
+    return outputs
+
+
 def poincare_normalize(x, axis=1, epsilon=1e-5, name=None):
   """Project into the Poincare ball with norm <= 1.0 - epsilon.
 
diff --git a/tensorflow/contrib/layers/python/layers/layers_test.py b/tensorflow/contrib/layers/python/layers/layers_test.py
index 0f062adbab3ca9acfb89543b69c7c957bbdf5dd8..997f910a2a97567adbd7ffa3e81a31d2ae0bad7e 100644
--- a/tensorflow/contrib/layers/python/layers/layers_test.py
+++ b/tensorflow/contrib/layers/python/layers/layers_test.py
@@ -4135,5 +4135,31 @@ class LegacyFullyConnectedTest(test.TestCase):
         _layers.legacy_fully_connected(x, 2, activation_fn=nn_ops.softmax)
 
 
+class MaxOutTest(test.TestCase):
+
+  def test_simple(self):
+    inputs = random_ops.random_uniform((64, 10, 36), seed=1)
+    graph = _layers.maxout(inputs, num_units=3)
+    self.assertEqual(graph.get_shape().as_list(), [64, 10, 3])
+
+  def test_fully_connected(self):
+    inputs = random_ops.random_uniform((64, 50), seed=1)
+    graph = _layers.fully_connected(inputs, 50)
+    graph = _layers.maxout(graph, num_units=10)
+    self.assertEqual(graph.get_shape().as_list(), [64, 10])
+
+  def test_nchw(self):
+    inputs = random_ops.random_uniform((10, 100, 100, 3), seed=1)
+    graph = _layers.conv2d(inputs, 10, 3, padding='SAME')
+    graph = _layers.maxout(graph, num_units=1)
+    self.assertEqual(graph.get_shape().as_list(), [10, 100, 100, 1])
+
+  def test_invalid_shape(self):
+    inputs = random_ops.random_uniform((10, 100, 100, 3), seed=1)
+    graph = _layers.conv2d(inputs, 3, 10)
+    with self.assertRaisesRegexp(ValueError, 'number of features'):
+      graph = _layers.maxout(graph, num_units=2)
+
+
 if __name__ == '__main__':
   test.main()
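Review note: for the default `axis=-1`, the reshape-then-`reduce_max` in `maxout` splits the feature dimension into `num_units` consecutive groups and keeps the maximum of each group. The sketch below shows that grouping on nested Python lists. `maxout_last_axis` is a hypothetical helper that handles only 2-D input and only the last axis; it is not a replacement for the layer.

```python
def maxout_last_axis(rows, num_units):
    """Apply maxout over the last axis of a 2-D list of numbers.

    Each row of length C is split into `num_units` consecutive groups of
    size C // num_units, and the maximum of every group is returned.
    """
    outputs = []
    for row in rows:
        num_channels = len(row)
        if num_channels % num_units:
            raise ValueError('number of features({}) is not '
                             'a multiple of num_units({})'.format(
                                 num_channels, num_units))
        group_size = num_channels // num_units
        outputs.append([
            max(row[i * group_size:(i + 1) * group_size])
            for i in range(num_units)
        ])
    return outputs
```

For a row of 6 features and `num_units=3`, the groups are features 0-1, 2-3 and 4-5, which matches the `[64, 10, 36]` to `[64, 10, 3]` shape change checked in `MaxOutTest.test_simple`.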
diff --git a/tensorflow/contrib/layers/python/layers/optimizers.py b/tensorflow/contrib/layers/python/layers/optimizers.py
index cdceea6fee5bdb5aeb6537ea55d25ccf107def4c..69d927e1b3001d14dd1af2f890b07c1a57ab2cfc 100644
--- a/tensorflow/contrib/layers/python/layers/optimizers.py
+++ b/tensorflow/contrib/layers/python/layers/optimizers.py
@@ -41,7 +41,7 @@ OPTIMIZER_CLS_NAMES = {
     "Adagrad": train.AdagradOptimizer,
     "Adam": train.AdamOptimizer,
     "Ftrl": train.FtrlOptimizer,
-    "Momentum": lambda lr: train.MomentumOptimizer(lr, momentum=0.9),
+    "Momentum": lambda learning_rate: train.MomentumOptimizer(learning_rate, momentum=0.9),  # pylint: disable=line-too-long
     "RMSProp": train.RMSPropOptimizer,
     "SGD": train.GradientDescentOptimizer,
 }
diff --git a/tensorflow/contrib/layers/python/layers/optimizers_test.py b/tensorflow/contrib/layers/python/layers/optimizers_test.py
index 1ea25bd1a5685eb6f840e621b5739029a660aa0f..a4461a20e54c289886f1a1beb255de12fc054afe 100644
--- a/tensorflow/contrib/layers/python/layers/optimizers_test.py
+++ b/tensorflow/contrib/layers/python/layers/optimizers_test.py
@@ -61,7 +61,8 @@ class OptimizersTest(test.TestCase):
     optimizers = [
         "SGD", gradient_descent.GradientDescentOptimizer,
         gradient_descent.GradientDescentOptimizer(learning_rate=0.1),
-        lambda lr: gradient_descent.GradientDescentOptimizer(learning_rate=lr)
+        lambda lr: gradient_descent.GradientDescentOptimizer(learning_rate=lr),
+        "Momentum"
     ]
     for optimizer in optimizers:
       with ops.Graph().as_default() as g:
diff --git a/tensorflow/contrib/learn/README.md b/tensorflow/contrib/learn/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..d516bffc5e0327a3400068b35de5503e5a925a54
--- /dev/null
+++ b/tensorflow/contrib/learn/README.md
@@ -0,0 +1,143 @@
+EVERYTHING IN THIS DIRECTORY IS DEPRECATED.
+
+Using functions or classes will result in warnings.
+
+Instructions for converting to current alternatives are included in the
+warnings. A high-level overview is below.
+
+## Canned Estimators
+
+Many canned estimators (subclasses of `Estimator`) have equivalents in core:
+`DNNClassifier`, `DNNRegressor`, `DNNEstimator`, `LinearClassifier`,
+`LinearRegressor`, `DNNLinearCombinedClassifier` and
+`DNNLinearCombinedRegressor`. They are exposed under `tf.estimator`.
+`DNNEstimator`, `LinearEstimator` and `DNNLinearCombinedEstimator`
+are exposed under `tf.contrib.estimator`.
+
+To migrate to the new API, users need to take the following steps:
+
+* Replace `tf.contrib.learn` with `tf.estimator`.
+* If you subclass any of the estimators, stop doing that. You should be able to
+  write a factory method that returns a canned estimator instead. If this is not
+  possible (if you override methods from the canned estimator), consider writing
+  a custom estimator instead. See `tf.estimator.Estimator`.
+* Set `loss_reduction=tf.losses.Reduction.SUM_OVER_BATCH_SIZE` to preserve loss
+  reduction as the average over batch.
+* Some optimizer-related arguments are no longer passed in the estimator
+  constructor. Instead, we provide methods that perform the same job by wrapping
+  an optimizer. Specifically:
+  *  `gradient_clip_norm`: Use `tf.contrib.estimator.clip_gradients_by_norm`
+  *  `embedding_lr_multipliers`: Not supported.
+  Other arguments:
+  * `input_layer_min_slice_size`: Replaced by `input_layer_partitioner`
+  * `enable_centered_bias`: Not supported. Dropping this argument is unlikely to
+    harm your model.
+  * `feature_engineering_fn`: Not supported. You can call your
+    `feature_engineering_fn` inside your input_fn:
+    ```python
+    def new_input_fn():
+      features, labels = old_input_fn()
+      return feature_engineering_fn(features, labels)
+    ```
+* Use `tf.reshape` to reshape labels in your `input_fn`. `tf.estimator`
+  classifiers and regressors expect labels as a 2D Tensor of shape
+  `[batch_size, 1]`, or `[batch_size, n_labels]`. In contrast,
+  `tf.contrib.learn` classifiers and regressors supported labels with shape
+  `[batch_size]`.
+* If you pass custom metrics from the `evaluate()` method call, use
+  `tf.contrib.estimator.add_metrics`.
+* Replace your `serving_input_fn` with a `serving_input_receiver_fn`.
+  Note this should be entirely distinct from your training `input_fn`, so if you
+  previously had one `input_fn` with different "modes", you should now factor
+  that apart.  Where the former returned either a simple `(features, labels)`
+  tuple or `InputFnOps`, you should now return a `ServingInputReceiver`.
+  If you were generating your `serving_input_fn` using the
+  `build_parsing_serving_input_fn` helper, you can simply drop in the
+  replacement `build_parsing_serving_input_receiver_fn`.
+
+Some remaining estimators/classes:
+
+* `DynamicRnnEstimator`:  Consider a custom `model_fn`.
+* `KMeansClustering`: Use `tf.contrib.factorization.KMeansClustering`.
+* `LogisticRegressor`: Not supported. Instead, use `binary_classification_head`
+  with a custom `model_fn`, or with `DNNEstimator`.
+* `StateSavingRnnEstimator`: Consider a custom `model_fn`.
+* SVM: Consider a custom `model_fn`.
+* `LinearComposableModel` and `DNNComposableModel`: Not supported.
+  Consider `tf.contrib.estimator.DNNEstimator`, or write a custom `model_fn`.
+* `MetricSpec`: Deprecated. For adding custom metrics to canned Estimators, use
+  `tf.contrib.estimator.add_metrics`.
+
+## Estimator
+`tf.contrib.learn.Estimator` is migrated to `tf.estimator.Estimator`.
+
+To migrate, users need to take the following steps:
+
+* Replace `tf.contrib.learn.Estimator` with `tf.estimator.Estimator`.
+* If you pass a `config` argument to `Estimator`, this must be
+  `tf.estimator.RunConfig`. You may need to edit your code accordingly.
+* Edit your `model_fn` to return `tf.estimator.EstimatorSpec`. Refer to
+  `EstimatorSpec` for documentation of specific fields.
+* If your `model_fn` uses the `mode` argument, use `tf.estimator.ModeKeys`.
+
+Some related classes:
+* `Evaluable`, `Trainable`: Not supported, merged into `tf.estimator.Estimator`.
+* `ExportStrategy`: Replaced by `tf.estimator.Exporter`.
+
+## Head/MultiHead
+These classes are now supported under `tf.contrib.estimator`, e.g.
+`tf.contrib.estimator.multi_class_head` and `tf.contrib.estimator.multi_head`.
+
+Some differences:
+
+* `multi_class_head`: If you use `tf.contrib.learn.multi_class_head` with
+  `n_classes=2`, switch to `tf.contrib.estimator.binary_classification_head`.
+* `loss_only_head`: Not supported.
+* `poisson_regression_head`: Not supported (yet).
+* `binary_svm_head`: Not supported (yet).
+* `no_op_train_fn`: Replace it with `tf.no_op`.
+
+Some arguments are renamed; please refer to the documentation. In addition:
+
+* `loss_fn`: Supported for `multi_label_head`. If you need it for other heads,
+  please open an issue.
+* `metric_class_ids`: Not supported (yet).
+* `enable_centered_bias`: Not supported. Dropping this argument is unlikely to
+  harm your model.
+* `label_name`: Not needed in `tf.estimator`. If you don’t use `multi_head`,
+  drop this argument. If you use `multi_head`, refer to
+  `tf.contrib.estimator.multi_head` documentation.
+
+## Experiment Class - Distributed Training Tooling
+
+Switch to `tf.estimator.train_and_evaluate`. Some differences:
+
+* Most of the constructor arguments, like `train_input_fn`, `eval_input_fn`,
+  should be wrapped into `tf.estimator.TrainSpec` and `tf.estimator.EvalSpec`.
+* Remove the `experiment_fn`. Instead, create the `Estimator`,
+  `train_spec` and `eval_spec`, then call `tf.estimator.train_and_evaluate`
+  directly.
+* Inside `tf.estimator.EvalSpec`, the `exporter` field is the replacement
+  for `export_strategy`. To be precise, `tf.estimator.LatestExporter` is the
+  replacement for `tf.contrib.learn.make_export_strategy`. If you want to export
+  only at the end of training, use `tf.estimator.FinalExporter`.
+* If the `TF_CONFIG` environment variable is constructed manually, please read
+  the `train_and_evaluate` documentation for the new requirements (in
+  particular, the chief node and evaluator node).
+
+## Other Classes and Functions
+
+* `tf.contrib.learn.datasets` is deprecated. We are adding ready-to-use datasets
+  to tensorflow/models. Many smaller datasets are available from other sources,
+  such as scikit-learn. Some Python processing may have to be written, but this
+  is straightforward to implement using the standard modules.
+* `tf.contrib.learn.preprocessing`: Deprecated. The python-only preprocessing
+  functions are not a good fit for TensorFlow. Please use `tf.data`, and
+  consider tensorflow/transform for more complex use cases.
+* `tf.contrib.learn.models`: Not supported, use canned estimators instead.
+* `tf.contrib.learn.monitors`: Implement `SessionRunHook` instead. Hook
+  implementations are in `tf.train`.
+* `tf.contrib.learn.learn_io`: Use the methods in `tf.estimator.inputs`, such as
+  `tf.estimator.inputs.numpy_input_fn`. Some utility functions have no
+  equivalent; we encourage the use of `tf.data`.
+
diff --git a/tensorflow/contrib/learn/__init__.py b/tensorflow/contrib/learn/__init__.py
index 3698af027e38f1063ad829c26eb179734968f813..79bd73faaf1301a2fc4999b64f88d30542577980 100644
--- a/tensorflow/contrib/learn/__init__.py
+++ b/tensorflow/contrib/learn/__init__.py
@@ -13,8 +13,11 @@
 # limitations under the License.
 # ==============================================================================
 
-# TODO(ptucker,ipolosukhin): Improve descriptions.
-"""High level API for learning.
+"""High level API for learning (DEPRECATED).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 See the @{$python/contrib.learn} guide.
 
diff --git a/tensorflow/contrib/learn/python/__init__.py b/tensorflow/contrib/learn/python/__init__.py
index bbebd5ab9792cb937219cf937f08c4d4e6e44a92..df23aeb2c433c2b4392f706730f715246ce01cea 100644
--- a/tensorflow/contrib/learn/python/__init__.py
+++ b/tensorflow/contrib/learn/python/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""High level API for learning with TensorFlow."""
+"""High level API for learning with TensorFlow (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/__init__.py b/tensorflow/contrib/learn/python/learn/__init__.py
index cdc67c77d5fd1df61016835dc75ba44feb458cf9..76e0e8ac8f19026086959f3b197cfd1a81e65a3e 100644
--- a/tensorflow/contrib/learn/python/learn/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""High level API for learning with TensorFlow."""
+"""High level API for learning with TensorFlow (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/basic_session_run_hooks.py b/tensorflow/contrib/learn/python/learn/basic_session_run_hooks.py
index 2284ec46e971731af74f17678fc0d1d3888419e2..fed1c44d1970bf07c808ace817aa9972d7776d88 100644
--- a/tensorflow/contrib/learn/python/learn/basic_session_run_hooks.py
+++ b/tensorflow/contrib/learn/python/learn/basic_session_run_hooks.py
@@ -12,20 +12,47 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Some common SessionRunHook classes."""
+"""Some common SessionRunHook classes (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.training import basic_session_run_hooks
+from tensorflow.python.util.deprecation import deprecated_alias
 
 # pylint: disable=invalid-name
-LoggingTensorHook = basic_session_run_hooks.LoggingTensorHook
-StopAtStepHook = basic_session_run_hooks.StopAtStepHook
-CheckpointSaverHook = basic_session_run_hooks.CheckpointSaverHook
-StepCounterHook = basic_session_run_hooks.StepCounterHook
-NanLossDuringTrainingError = basic_session_run_hooks.NanLossDuringTrainingError
-NanTensorHook = basic_session_run_hooks.NanTensorHook
-SummarySaverHook = basic_session_run_hooks.SummarySaverHook
+LoggingTensorHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.LoggingTensorHook',
+    'tf.train.LoggingTensorHook',
+    basic_session_run_hooks.LoggingTensorHook)
+StopAtStepHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.StopAtStepHook',
+    'tf.train.StopAtStepHook',
+    basic_session_run_hooks.StopAtStepHook)
+CheckpointSaverHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.CheckpointSaverHook',
+    'tf.train.CheckpointSaverHook',
+    basic_session_run_hooks.CheckpointSaverHook)
+StepCounterHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.StepCounterHook',
+    'tf.train.StepCounterHook',
+    basic_session_run_hooks.StepCounterHook)
+NanLossDuringTrainingError = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.NanLossDuringTrainingError',
+    'tf.train.NanLossDuringTrainingError',
+    basic_session_run_hooks.NanLossDuringTrainingError)
+NanTensorHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.NanTensorHook',
+    'tf.train.NanTensorHook',
+    basic_session_run_hooks.NanTensorHook)
+SummarySaverHook = deprecated_alias(
+    'tf.contrib.learn.basic_session_run_hooks.SummarySaverHook',
+    'tf.train.SummarySaverHook',
+    basic_session_run_hooks.SummarySaverHook)
 # pylint: enable=invalid-name
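Review note: the hunk above swaps plain re-exports for `deprecated_alias(deprecated_name, name, func_or_class)` wrappers, so that using the old `tf.contrib.learn` names points callers at the `tf.train` equivalents. The general pattern can be sketched with the standard `warnings` module. This is an illustration only: `make_deprecated_alias` is a hypothetical helper for plain functions, its argument order simply mirrors the calls in the diff, and it does not reproduce TensorFlow's own implementation or its handling of classes.

```python
import functools
import warnings


def make_deprecated_alias(deprecated_name, name, func):
    """Return a wrapper around `func` that warns about the old name."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            '%s is deprecated; use %s instead.' % (deprecated_name, name),
            DeprecationWarning,
            stacklevel=2)
        return func(*args, **kwargs)

    return wrapper


def new_add(x, y):
    """Stand-in for the function that lives at the new location."""
    return x + y


# The old name keeps working but emits a DeprecationWarning on every call.
old_add = make_deprecated_alias('old_module.add', 'new_module.add', new_add)
```

Calling `old_add(1, 2)` still returns the result of `new_add(1, 2)`; only the warning is added.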
diff --git a/tensorflow/contrib/learn/python/learn/datasets/__init__.py b/tensorflow/contrib/learn/python/learn/datasets/__init__.py
index 7240b0de149051afa045a8113f9e9b212840c311..3c34712ac859d32f549468345950a93d2ed2aa56 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/__init__.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Dataset utilities and synthetic/reference datasets."""
+"""Dataset utilities and synthetic/reference datasets (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -27,6 +32,7 @@ from tensorflow.contrib.learn.python.learn.datasets import base
 from tensorflow.contrib.learn.python.learn.datasets import mnist
 from tensorflow.contrib.learn.python.learn.datasets import synthetic
 from tensorflow.contrib.learn.python.learn.datasets import text_datasets
+from tensorflow.python.util.deprecation import deprecated
 
 # Export load_iris and load_boston.
 load_iris = base.load_iris
@@ -51,6 +57,7 @@ SYNTHETIC = {
 }
 
 
+@deprecated(None, 'Please use tf.data.')
 def load_dataset(name, size='small', test_with_fake_data=False):
   """Loads dataset by name.
 
@@ -73,8 +80,9 @@ def load_dataset(name, size='small', test_with_fake_data=False):
     return DATASETS[name]()
 
 
+@deprecated(None, 'Please use tf.data.')
 def make_dataset(name, n_samples=100, noise=None, seed=42, *args, **kwargs):
-  """Creates binary synthetic datasets
+  """Creates binary synthetic datasets.
 
   Args:
     name: str, name of the dataset to generate
diff --git a/tensorflow/contrib/learn/python/learn/datasets/base.py b/tensorflow/contrib/learn/python/learn/datasets/base.py
index ca720ae5ed26e74da12bd6c5a37231b41442f76f..3b5c9b97c08a388e1f35249967b6cab26861f100 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/base.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/base.py
@@ -12,7 +12,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Base utilities for loading datasets."""
+
+"""Base utilities for loading datasets (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -29,11 +35,14 @@ import numpy as np
 from six.moves import urllib
 
 from tensorflow.python.platform import gfile
+from tensorflow.python.util.deprecation import deprecated
+
 
 Dataset = collections.namedtuple('Dataset', ['data', 'target'])
 Datasets = collections.namedtuple('Datasets', ['train', 'validation', 'test'])
 
 
+@deprecated(None, 'Use tf.data instead.')
 def load_csv_with_header(filename,
                          target_dtype,
                          features_dtype,
@@ -53,6 +62,7 @@ def load_csv_with_header(filename,
   return Dataset(data=data, target=target)
 
 
+@deprecated(None, 'Use tf.data instead.')
 def load_csv_without_header(filename,
                             target_dtype,
                             features_dtype,
@@ -70,6 +80,7 @@ def load_csv_without_header(filename,
   return Dataset(data=data, target=target)
 
 
+@deprecated(None, 'Use tf.data instead.')
 def shrink_csv(filename, ratio):
   """Create a smaller dataset of only 1/ratio of original data."""
   filename_small = filename.replace('.', '_small.')
@@ -84,6 +95,7 @@ def shrink_csv(filename, ratio):
         i += 1
 
 
+@deprecated(None, 'Use sklearn.datasets.')
 def load_iris(data_path=None):
   """Load Iris dataset.
 
@@ -100,6 +112,7 @@ def load_iris(data_path=None):
       data_path, target_dtype=np.int, features_dtype=np.float)
 
 
+@deprecated(None, 'Use sklearn.datasets.')
 def load_boston(data_path=None):
   """Load Boston housing dataset.
 
@@ -116,7 +129,12 @@ def load_boston(data_path=None):
       data_path, target_dtype=np.float, features_dtype=np.float)
 
 
-def retry(initial_delay, max_delay, factor=2.0, jitter=0.25, is_retriable=None):
+@deprecated(None, 'Use the retry module or similar alternatives.')
+def retry(initial_delay,
+          max_delay,
+          factor=2.0,
+          jitter=0.25,
+          is_retriable=None):
   """Simple decorator for wrapping retriable functions.
 
   Args:
@@ -152,7 +170,7 @@ def retry(initial_delay, max_delay, factor=2.0, jitter=0.25, is_retriable=None):
       for delay in delays():
         try:
           return fn(*args, **kwargs)
-        except Exception as e:  # pylint: disable=broad-except)
+        except Exception as e:  # pylint: disable=broad-except
           if is_retriable is None:
             continue
 
@@ -176,11 +194,13 @@ def _is_retriable(e):
   return isinstance(e, IOError) and e.errno in _RETRIABLE_ERRNOS
 
 
+@deprecated(None, 'Please use urllib or similar directly.')
 @retry(initial_delay=1.0, max_delay=16.0, is_retriable=_is_retriable)
 def urlretrieve_with_retry(url, filename=None):
   return urllib.request.urlretrieve(url, filename)
 
 
+@deprecated(None, 'Please write your own downloading logic.')
 def maybe_download(filename, work_directory, source_url):
   """Download the data from source url, unless it's already here.
 
diff --git a/tensorflow/contrib/learn/python/learn/datasets/mnist.py b/tensorflow/contrib/learn/python/learn/datasets/mnist.py
index 37f9175015a239f763c7721cf36ab8063c0a3e32..abbb44c2f5b701829ce16f64eadd8ebc04c84e2c 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/mnist.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/mnist.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Functions for downloading and reading MNIST data."""
+"""Functions for downloading and reading MNIST data (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -27,6 +32,7 @@ from tensorflow.contrib.learn.python.learn.datasets import base
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import random_seed
 from tensorflow.python.platform import gfile
+from tensorflow.python.util.deprecation import deprecated
 
 # CVDF mirror of http://yann.lecun.com/exdb/mnist/
 DEFAULT_SOURCE_URL = 'https://storage.googleapis.com/cvdf-datasets/mnist/'
@@ -37,6 +43,7 @@ def _read32(bytestream):
   return numpy.frombuffer(bytestream.read(4), dtype=dt)[0]
 
 
+@deprecated(None, 'Please use tf.data to implement this functionality.')
 def extract_images(f):
   """Extract the images into a 4D uint8 numpy array [index, y, x, depth].
 
@@ -65,6 +72,7 @@ def extract_images(f):
     return data
 
 
+@deprecated(None, 'Please use tf.one_hot on tensors.')
 def dense_to_one_hot(labels_dense, num_classes):
   """Convert class labels from scalars to one-hot vectors."""
   num_labels = labels_dense.shape[0]
@@ -74,6 +82,7 @@ def dense_to_one_hot(labels_dense, num_classes):
   return labels_one_hot
 
 
+@deprecated(None, 'Please use tf.data to implement this functionality.')
 def extract_labels(f, one_hot=False, num_classes=10):
   """Extract the labels into a 1D uint8 numpy array [index].
 
@@ -103,7 +112,15 @@ def extract_labels(f, one_hot=False, num_classes=10):
 
 
 class DataSet(object):
+  """Container class for a dataset (deprecated).
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Please use alternatives such as official/mnist/dataset.py'
+              ' from tensorflow/models.')
   def __init__(self,
                images,
                labels,
@@ -210,6 +227,8 @@ class DataSet(object):
       return self._images[start:end], self._labels[start:end]
 
 
+@deprecated(None, 'Please use alternatives such as official/mnist/dataset.py'
+            ' from tensorflow/models.')
 def read_data_sets(train_dir,
                    fake_data=False,
                    one_hot=False,
@@ -275,5 +294,7 @@ def read_data_sets(train_dir,
   return base.Datasets(train=train, validation=validation, test=test)
 
 
+@deprecated(None, 'Please use alternatives such as official/mnist/dataset.py'
+            ' from tensorflow/models.')
 def load_mnist(train_dir='MNIST-data'):
   return read_data_sets(train_dir)
diff --git a/tensorflow/contrib/learn/python/learn/datasets/produce_small_datasets.py b/tensorflow/contrib/learn/python/learn/datasets/produce_small_datasets.py
index 6e0ba38941ce4650ede9f7210e284bde2ed8e6a9..a4848fa64a72f031ef35c0c3256e97a7326acd60 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/produce_small_datasets.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/produce_small_datasets.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Produce DBpedia datasets of a smaller size."""
+"""Produce DBpedia datasets of a smaller size (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/datasets/synthetic.py b/tensorflow/contrib/learn/python/learn/datasets/synthetic.py
index 9a843168c27d9cae3f55efe4fe4c688d86c745f3..6a0e3350b3d1052249160a2a997a76de7a5040c3 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/synthetic.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/synthetic.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Synthetic dataset generators."""
+"""Synthetic dataset generators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -21,8 +26,10 @@ from __future__ import print_function
 import numpy as np
 
 from tensorflow.contrib.learn.python.learn.datasets.base import Dataset
+from tensorflow.python.util.deprecation import deprecated
 
 
+@deprecated(None, 'Consider using synthetic datasets from scikit-learn.')
 def circles(n_samples=100,
             noise=None,
             seed=None,
@@ -93,6 +100,7 @@ def circles(n_samples=100,
   return Dataset(data=X[indices], target=y[indices])
 
 
+@deprecated(None, 'Consider using synthetic datasets from scikit-learn.')
 def spirals(n_samples=100,
             noise=None,
             seed=None,
diff --git a/tensorflow/contrib/learn/python/learn/datasets/text_datasets.py b/tensorflow/contrib/learn/python/learn/datasets/text_datasets.py
index 2596a2ecaf1572506504831e8b08fab9b5dbc119..ce9466301728082f8e9d99c90989ba8fe623bcf0 100644
--- a/tensorflow/contrib/learn/python/learn/datasets/text_datasets.py
+++ b/tensorflow/contrib/learn/python/learn/datasets/text_datasets.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Text datasets."""
+"""Text datasets (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -26,10 +31,12 @@ import numpy as np
 
 from tensorflow.contrib.learn.python.learn.datasets import base
 from tensorflow.python.platform import gfile
+from tensorflow.python.util.deprecation import deprecated
 
 DBPEDIA_URL = 'https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz'
 
 
+@deprecated(None, 'See contrib/learn/README.md.')
 def maybe_download_dbpedia(data_dir):
   """Download if DBpedia data is not present."""
   train_path = os.path.join(data_dir, 'dbpedia_csv/train.csv')
@@ -41,6 +48,7 @@ def maybe_download_dbpedia(data_dir):
     tfile.extractall(data_dir)
 
 
+@deprecated(None, 'See contrib/learn/README.md.')
 def load_dbpedia(size='small', test_with_fake_data=False):
   """Get DBpedia datasets from CSV files."""
   if not test_with_fake_data:
diff --git a/tensorflow/contrib/learn/python/learn/estimators/__init__.py b/tensorflow/contrib/learn/python/learn/estimators/__init__.py
index 4981750c94c7ac31e23b7a3f71ca30e3c9573a20..3e64595f312bcc2a2e8dcba589fb993a249b684b 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""An estimator is a rule for calculating an estimate of a given quantity.
+"""An estimator is a rule for calculating an estimate of a given quantity (deprecated).
+
+These classes are deprecated and replaced with `tf.estimator`.
+
+See [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 # Estimators
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/_sklearn.py b/tensorflow/contrib/learn/python/learn/estimators/_sklearn.py
index 15277415a1ce83dc1d4a334e60fe1933ba244df0..1f0e4663d060a3850e2002b27f809fde1db47e48 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/_sklearn.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/_sklearn.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 # ==============================================================================
 
-"""sklearn cross-support."""
+"""sklearn cross-support (deprecated)."""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -132,6 +132,8 @@ class _TransformerMixin():
 class NotFittedError(ValueError, AttributeError):
   """Exception class to raise if estimator is used before fitting.
 
+  USE OF THIS EXCEPTION IS DEPRECATED.
+
   This class inherits from both ValueError and AttributeError to help with
   exception handling and backward compatibility.
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/composable_model.py b/tensorflow/contrib/learn/python/learn/estimators/composable_model.py
index a02c726c74946d93b8e1726473db746220b00795..1fa58271e2b886cd143683a759145fd750791473 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/composable_model.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/composable_model.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""TensorFlow composable models used as building blocks for estimators."""
+"""TensorFlow composable models used as building blocks for estimators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -34,6 +39,7 @@ from tensorflow.python.ops import nn
 from tensorflow.python.ops import partitioned_variables
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.summary import summary
+from tensorflow.python.util.deprecation import deprecated
 
 
 class _ComposableModel(object):
@@ -46,6 +52,7 @@ class _ComposableModel(object):
   _ComposableModel and its subclasses are not part of the public tf.learn API.
   """
 
+  @deprecated(None, "Please use model_fns in tf.estimator.")
   def __init__(self,
                num_label_columns,
                optimizer,
@@ -141,6 +148,10 @@ class _ComposableModel(object):
 class LinearComposableModel(_ComposableModel):
   """A _ComposableModel that implements linear regression.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Instances of this class can be used to build estimators through the use
   of composition.
   """
@@ -252,6 +263,10 @@ class LinearComposableModel(_ComposableModel):
 class DNNComposableModel(_ComposableModel):
   """A _ComposableModel that implements a DNN.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Instances of this class can be used to build estimators through the use
   of composition.
   """
diff --git a/tensorflow/contrib/learn/python/learn/estimators/constants.py b/tensorflow/contrib/learn/python/learn/estimators/constants.py
index fc69e810244a182b864be856e6720f8584f7aa65..d2548946bc77dea7c452d61c7e2b6e12c3d6239a 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/constants.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/constants.py
@@ -13,9 +13,11 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Constants regarding Estimators.
+"""Constants regarding Estimators (deprecated).
 
-This file is obsoleted in the move of Estimator to core.
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 """
 from __future__ import absolute_import
 from __future__ import division
@@ -25,6 +27,8 @@ from __future__ import print_function
 class ProblemType(object):
   """Enum-like values for the type of problem that the model solves.
 
+  THIS CLASS IS DEPRECATED.
+
   These values are used when exporting the model to produce the appropriate
   signature function for serving.
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/debug.py b/tensorflow/contrib/learn/python/learn/estimators/debug.py
index 9d5f6c2bf969d7c85d251bf1b06a0307a41b2297..24b067b7e38b12df3d1d0c49f626344217218571 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/debug.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/debug.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Debug estimators.
+"""Debug estimators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 Debug estimators are bias-only estimators that can be used for debugging
 and as simple baselines.
@@ -118,6 +122,10 @@ def debug_model_fn(features, labels, mode, params, config=None):
 class DebugClassifier(estimator.Estimator):
   """A classifier for TensorFlow Debug models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Example:
 
   ```python
@@ -237,6 +245,10 @@ class DebugClassifier(estimator.Estimator):
 class DebugRegressor(estimator.Estimator):
   """A regressor for TensorFlow Debug models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Example:
 
   ```python
diff --git a/tensorflow/contrib/learn/python/learn/estimators/dnn.py b/tensorflow/contrib/learn/python/learn/estimators/dnn.py
index c17b41c0f767e19d9c3635a8f60347a49b297cfb..eabebb7e881558471c343c0573cc9a8f4a425312 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/dnn.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/dnn.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Deep Neural Network estimators."""
+"""Deep Neural Network estimators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -212,6 +217,10 @@ def _dnn_model_fn(features, labels, mode, params, config=None):
 class DNNClassifier(estimator.Estimator):
   """A classifier for TensorFlow DNN models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Example:
 
   ```python
@@ -521,6 +530,10 @@ class DNNClassifier(estimator.Estimator):
 class DNNRegressor(estimator.Estimator):
   """A regressor for TensorFlow DNN models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Example:
 
   ```python
@@ -796,6 +809,10 @@ class DNNRegressor(estimator.Estimator):
 class DNNEstimator(estimator.Estimator):
   """A Estimator for TensorFlow DNN models with user specified _Head.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Example:
 
   ```python
diff --git a/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py b/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py
index 726612235050def6e7addb503cc6646a25de0e42..3d85533d92d17095bae9a69f229171e1bf61ba10 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorFlow estimators for Linear and DNN joined training models."""
+"""TensorFlow estimators for Linear and DNN joined training models (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -372,6 +377,10 @@ def _dnn_linear_combined_model_fn(features, labels, mode, params, config=None):
 class DNNLinearCombinedEstimator(estimator.Estimator):
   """An estimator for TensorFlow Linear and DNN joined training models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Note: New users must set `fix_global_step_increment_bug=True` when creating an
   estimator.
 
@@ -490,6 +499,10 @@ class DNNLinearCombinedEstimator(estimator.Estimator):
 class DNNLinearCombinedClassifier(estimator.Estimator):
   """A classifier for TensorFlow Linear and DNN joined training models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Note: New users must set `fix_global_step_increment_bug=True` when creating an
   estimator.
 
@@ -832,6 +845,10 @@ class DNNLinearCombinedClassifier(estimator.Estimator):
 class DNNLinearCombinedRegressor(estimator.Estimator):
   """A regressor for TensorFlow Linear and DNN joined training models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Note: New users must set `fix_global_step_increment_bug=True` when creating an
   estimator.
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/dynamic_rnn_estimator.py b/tensorflow/contrib/learn/python/learn/estimators/dynamic_rnn_estimator.py
index 69440e823ef1ed2d739f28bc14587891f2de80bb..a703dc66e922d48ceb64edc2a979061b8e45db49 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/dynamic_rnn_estimator.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/dynamic_rnn_estimator.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Estimator for Dynamic RNNs."""
+"""Estimator for Dynamic RNNs (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -540,6 +545,12 @@ def _get_dynamic_rnn_model_fn(
 
 
 class DynamicRnnEstimator(estimator.Estimator):
+  """Dynamically unrolled RNN (deprecated).
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   def __init__(self,
                problem_type,
diff --git a/tensorflow/contrib/learn/python/learn/estimators/estimator.py b/tensorflow/contrib/learn/python/learn/estimators/estimator.py
index 4b63e08ab3372849309ee5d28d754de82e9632f4..5262e04e16ee85d1672dd495f05084ff07c8dd18 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/estimator.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/estimator.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Base Estimator class."""
+"""Base Estimator class (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -138,6 +143,7 @@ def _get_input_fn(x, y, input_fn, feed_fn, batch_size, shuffle=False, epochs=1):
   return df.input_builder, df.get_feed_dict_fn()
 
 
+@deprecated(None, 'Please specify feature columns explicitly.')
 def infer_real_valued_columns_from_input_fn(input_fn):
   """Creates `FeatureColumn` objects for inputs defined by `input_fn`.
 
@@ -158,6 +164,7 @@ def infer_real_valued_columns_from_input_fn(input_fn):
     return layers.infer_real_valued_columns(features)
 
 
+@deprecated(None, 'Please specify feature columns explicitly.')
 def infer_real_valued_columns_from_input(x):
   """Creates `FeatureColumn` objects for inputs defined by input `x`.
 
@@ -389,6 +396,10 @@ class BaseEstimator(sklearn.BaseEstimator, evaluable.Evaluable,
                     trainable.Trainable):
   """Abstract BaseEstimator class to train and evaluate TensorFlow models.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Users should not instantiate or subclass this class. Instead, use an
   `Estimator`.
   """
@@ -399,6 +410,8 @@ class BaseEstimator(sklearn.BaseEstimator, evaluable.Evaluable,
   # TODO(wicke): Remove this once launcher takes over config functionality
   _Config = run_config.RunConfig  # pylint: disable=invalid-name
 
+  @deprecated(None, 'Please replace uses of any Estimator from tf.contrib.learn'
+              ' with an Estimator from tf.estimator.*')
   def __init__(self, model_dir=None, config=None):
     """Initializes a BaseEstimator instance.
 
@@ -1074,6 +1087,10 @@ def _identity_feature_engineering_fn(features, labels):
 
 class Estimator(BaseEstimator):
   """Estimator class is the basic TensorFlow model trainer/evaluator.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
   """
 
   def __init__(self,
@@ -1458,8 +1475,14 @@ class Estimator(BaseEstimator):
 # For time of deprecation x,y from Estimator allow direct access.
 # pylint: disable=protected-access
 class SKCompat(sklearn.BaseEstimator):
-  """Scikit learn wrapper for TensorFlow Learn Estimator."""
+  """Scikit learn wrapper for TensorFlow Learn Estimator.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Please switch to the Estimator interface.')
   def __init__(self, estimator):
     self._estimator = estimator
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/estimator_test_utils.py b/tensorflow/contrib/learn/python/learn/estimators/estimator_test_utils.py
index fd47710e3015de9ae6a453f98978b0ef8f88968c..e4c31396baf8271c49395a2b87b454dbc77177e2 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/estimator_test_utils.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/estimator_test_utils.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Utils for Estimator."""
+"""Utils for Estimator (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/estimators/head.py b/tensorflow/contrib/learn/python/learn/estimators/head.py
index 9b124b2c19f16bbc9b2afeadb82a32006e1a0ae9..2b4b6eff39f4fc8a20a149edfc07d2f4f27a9bae 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/head.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/head.py
@@ -12,8 +12,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Abstractions for the head(s) of a model.
+"""Abstractions for the head(s) of a model (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 """
+
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
@@ -47,11 +52,16 @@ from tensorflow.python.summary import summary
 from tensorflow.python.training import training
 from tensorflow.python.util import tf_decorator
 from tensorflow.python.util import tf_inspect
+from tensorflow.python.util.deprecation import deprecated
 
 
 class Head(object):
   """Interface for the head/top of a model.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Given logits (or output of a hidden layer), a Head knows how to compute
   predictions, loss, default metric and export signature. It is meant to,
 
@@ -177,6 +187,7 @@ class Head(object):
     raise NotImplementedError("Calling an abstract method.")
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def regression_head(label_name=None,
                     weight_column_name=None,
                     label_dimension=1,
@@ -216,6 +227,7 @@ def regression_head(label_name=None,
       link_fn=(link_fn if link_fn is not None else array_ops.identity))
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def poisson_regression_head(label_name=None,
                             weight_column_name=None,
                             label_dimension=1,
@@ -254,6 +266,7 @@ def poisson_regression_head(label_name=None,
 # TODO(zakaria): Consider adding a _RegressionHead for logistic_regression
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def multi_class_head(n_classes,
                      label_name=None,
                      weight_column_name=None,
@@ -335,6 +348,7 @@ def multi_class_head(n_classes,
       label_keys=label_keys)
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def binary_svm_head(
     label_name=None,
     weight_column_name=None,
@@ -370,6 +384,7 @@ def binary_svm_head(
       thresholds=thresholds)
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def multi_label_head(n_classes,
                      label_name=None,
                      weight_column_name=None,
@@ -430,6 +445,7 @@ def multi_label_head(n_classes,
       loss_fn=_wrap_custom_loss_fn(loss_fn) if loss_fn else None)
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def loss_only_head(loss_fn, head_name=None):
   """Creates a Head that contains only loss terms.
 
@@ -447,6 +463,7 @@ def loss_only_head(loss_fn, head_name=None):
   return _LossOnlyHead(loss_fn, head_name=head_name)
 
 
+@deprecated(None, "Please switch to tf.contrib.estimator.*_head.")
 def multi_head(heads, loss_weights=None):
   """Creates a MultiHead stemming from same logits/hidden layer.
 
@@ -479,6 +496,7 @@ def multi_head(heads, loss_weights=None):
   return _MultiHead(heads, loss_merger=_weighted_loss_merger)
 
 
+@deprecated(None, "Use 'lambda _: tf.no_op()'.")
 def no_op_train_fn(loss):
   del loss
   return control_flow_ops.no_op()
diff --git a/tensorflow/contrib/learn/python/learn/estimators/kmeans.py b/tensorflow/contrib/learn/python/learn/estimators/kmeans.py
index 8f9d6fc318a357853bdb8e3264f6691b410006b1..66ebcfd1d81904b9afe5be6bd1a648fe325e1e0b 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/kmeans.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/kmeans.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Implementation of k-means clustering on top of `Estimator` API.
+"""Implementation of k-means clustering on top of `Estimator` API (deprecated).
 
 This module is deprecated. Please use
 @{tf.contrib.factorization.KMeansClustering} instead of
@@ -153,7 +153,12 @@ def _kmeans_clustering_model_fn(features, labels, mode, params, config):
 
 # TODO(agarwal,ands): support sharded input.
 class KMeansClustering(estimator.Estimator):
-  """An Estimator for K-Means clustering."""
+  """An Estimator for K-Means clustering.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
   SQUARED_EUCLIDEAN_DISTANCE = clustering_ops.SQUARED_EUCLIDEAN_DISTANCE
   COSINE_DISTANCE = clustering_ops.COSINE_DISTANCE
   RANDOM_INIT = clustering_ops.RANDOM_INIT
diff --git a/tensorflow/contrib/learn/python/learn/estimators/linear.py b/tensorflow/contrib/learn/python/learn/estimators/linear.py
index 37aa8b339622415d082933cdf66d2472a4119b48..64d7ecc68e7abb1d36a3eb098fedd8184d6e9d77 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/linear.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/linear.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Linear Estimators."""
+"""Linear Estimators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -305,6 +310,10 @@ class _SdcaUpdateWeightsHook(session_run_hook.SessionRunHook):
 class LinearClassifier(estimator.Estimator):
   """Linear classifier model.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Train a linear model to classify instances into one of multiple possible
   classes. When number of possible classes is 2, this is binary classification.
 
@@ -625,6 +634,10 @@ class LinearClassifier(estimator.Estimator):
 class LinearRegressor(estimator.Estimator):
   """Linear regressor model.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Train a linear regression model to predict label value given observation of
   feature values.
 
@@ -860,6 +873,10 @@ class LinearRegressor(estimator.Estimator):
 class LinearEstimator(estimator.Estimator):
   """Linear model with user specified head.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Train a generalized linear model to predict label value given observation of
   feature values.
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/logistic_regressor.py b/tensorflow/contrib/learn/python/learn/estimators/logistic_regressor.py
index fb339160d58e09d4ffd50090f2dbbcec08bebe47..3cbcc6e98de1c915c302617e4591c9baa33adeaf 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/logistic_regressor.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/logistic_regressor.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Logistic regression (aka binary classifier) class.
+"""Logistic regression (aka binary classifier) class (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 This defines some useful basic metrics for using logistic regression to classify
 a binary event (0 vs 1).
@@ -75,6 +79,10 @@ def LogisticRegressor(  # pylint: disable=invalid-name
     feature_engineering_fn=None):
   """Builds a logistic regression Estimator for binary classification.
 
+  THIS FUNCTION IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   This method provides a basic Estimator with some additional metrics for custom
   binary classification models, including AUC, precision/recall and accuracy.
 
diff --git a/tensorflow/contrib/learn/python/learn/estimators/metric_key.py b/tensorflow/contrib/learn/python/learn/estimators/metric_key.py
index 99388f116b345bd038f2985606c6203011597ea2..f264248e44d9aa48f26ee32e36746bd4c3145a8d 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/metric_key.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/metric_key.py
@@ -12,14 +12,20 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Enum for metric keys."""
+"""Enum for metric keys (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 
 class MetricKey(object):
-  """Metric key strings."""
+  """Metric key strings (deprecated)."""
+
   LOSS = "loss"
   AUC = "auc"
   AUC_PR = "auc_precision_recall"
diff --git a/tensorflow/contrib/learn/python/learn/estimators/model_fn.py b/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
index 44e6c7c52dac524a22e9099e33e2aef82f8fe7ba..dcb161180c99ce71195c820217e8bdaf79d70901 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/model_fn.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Classes and methods related to model_fn."""
+"""Classes and methods related to model_fn (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -37,10 +42,13 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.saved_model import signature_constants
 from tensorflow.python.training import session_run_hook
+from tensorflow.python.util.deprecation import deprecated
 
 
 class ModeKeys(object):
-  """Standard names for model modes.
+  """Standard names for model modes (deprecated).
+
+  THIS CLASS IS DEPRECATED.
 
   The following standard keys are defined:
 
@@ -65,8 +73,16 @@ class ModelFnOps(
         'output_alternatives', 'training_chief_hooks', 'training_hooks',
         'scaffold', 'mode'
     ])):
-  """Ops returned from a model_fn."""
+  """Ops returned from a model_fn.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'When switching to tf.estimator.Estimator, use '
+              'tf.estimator.EstimatorSpec. You can use the `estimator_spec`'
+              ' method to create an equivalent one.')
   def __new__(cls,
               mode,
               predictions=None,
diff --git a/tensorflow/contrib/learn/python/learn/estimators/prediction_key.py b/tensorflow/contrib/learn/python/learn/estimators/prediction_key.py
index f8d87b8914307a86eb2f46123a28ff11eb925eda..6fd2fc9d592cef4e44a640e2f27cb28b367d44d5 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/prediction_key.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/prediction_key.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Enum for model prediction keys.
+"""Enum for model prediction keys (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 This file is obsoleted in the move of Estimator to core.
 """
@@ -22,6 +26,8 @@ from __future__ import print_function
 
 
 class PredictionKey(object):
+  """THIS CLASS IS DEPRECATED."""
+
   CLASSES = "classes"
   PROBABILITIES = "probabilities"
   LOGITS = "logits"
diff --git a/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py b/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py
index 2752bc2d90ee0f51b2c40ccc4d24a4eb21cff38f..215022e5d9e5d3cd5d6a96583b325b19a1719568 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/rnn_common.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Common operations for RNN Estimators."""
+"""Common operations for RNN Estimators (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/estimators/run_config.py b/tensorflow/contrib/learn/python/learn/estimators/run_config.py
index fd90fd1cc6277e7d80287aefdbab6134dac7c0d5..1d161093de01ef838d0c75ec9a39574c7529bd57 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/run_config.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/run_config.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Run Config."""
+"""Run Config (deprecated, use tf.estimator.RunConfig instead).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -29,11 +34,12 @@ from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.estimator import run_config as core_run_config
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import server_lib
+from tensorflow.python.util.deprecation import deprecated
 
 
 # A list of the property names in RunConfig user allows to change. They will
 # not affect the execution framework, so when execution framework checks the
-# `uid` of the RunConfig, it should be ingored.
+# `uid` of the RunConfig, it should be ignored.
 _DEFAULT_UID_WHITE_LIST = [
     'tf_random_seed',
     'save_summary_steps',
@@ -47,6 +53,7 @@ _DEFAULT_UID_WHITE_LIST = [
 
 
 class Environment(object):
+  """DEPRECATED CLASS."""
   # For running general distributed training.
   CLOUD = 'cloud'
   # For running Google-internal distributed training.
@@ -56,6 +63,7 @@ class Environment(object):
 
 
 class TaskType(object):
+  """DEPRECATED CLASS."""
   MASTER = 'master'
   PS = 'ps'
   WORKER = 'worker'
@@ -64,6 +72,8 @@ class TaskType(object):
 class ClusterConfig(object):
   """This class specifies the configurations for a distributed run.
 
+  THIS CLASS IS DEPRECATED. Use tf.estimator.RunConfig instead.
+
   If you're using an `Estimator`, you should probably use the subclass
   RunConfig instead.
   """
@@ -211,10 +221,13 @@ class ClusterConfig(object):
 class RunConfig(ClusterConfig, core_run_config.RunConfig):
   """This class specifies the configurations for an `Estimator` run.
 
-  This class is the implementation of @{tf.estimator.RunConfig} interface.
+  This class is a deprecated implementation of @{tf.estimator.RunConfig}
+  interface.
   """
   _USE_DEFAULT = 0
 
+  @deprecated(None, 'When switching to tf.estimator.Estimator, use'
+              ' tf.estimator.RunConfig instead.')
   def __init__(self,
                master=None,
                num_cores=0,
diff --git a/tensorflow/contrib/learn/python/learn/estimators/state_saving_rnn_estimator.py b/tensorflow/contrib/learn/python/learn/estimators/state_saving_rnn_estimator.py
index 0cea35e219a4457417a161a3ac4ac4292fd24f53..de78c72c3ae3ef14f5f7c46b1d47f82e8266c7c6 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/state_saving_rnn_estimator.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/state_saving_rnn_estimator.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Estimator for State Saving RNNs."""
+"""Estimator for State Saving RNNs (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -528,6 +533,12 @@ def _get_rnn_model_fn(cell_type,
 
 
 class StateSavingRnnEstimator(estimator.Estimator):
+  """RNN with static unrolling and state saving (deprecated).
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   def __init__(self,
                problem_type,
diff --git a/tensorflow/contrib/learn/python/learn/estimators/svm.py b/tensorflow/contrib/learn/python/learn/estimators/svm.py
index 72920d73c0c92886e54f533ad7fe170fe27d9870..3459997baba16fc0d4045e50819ecdd0e7121657 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/svm.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/svm.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Support Vector Machine (SVM) Estimator."""
+"""Support Vector Machine (SVM) Estimator (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -36,6 +41,10 @@ def _as_iterable(preds, output):
 class SVM(estimator.Estimator):
   """Support Vector Machine (SVM) model for binary classification.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Currently, only linear SVMs are supported. For the underlying optimization
   problem, the `SDCAOptimizer` is used. For performance and convergence tuning,
   the num_loss_partitions parameter passed to `SDCAOptimizer` (see `__init__()`
diff --git a/tensorflow/contrib/learn/python/learn/estimators/tensor_signature.py b/tensorflow/contrib/learn/python/learn/estimators/tensor_signature.py
index a120bc6cc3975a3d4559d018c8aa74ff42a16d2d..71b5658dd174d2b47e33860844359f68e6768026 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/tensor_signature.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/tensor_signature.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorSignature class and utilities."""
+"""TensorSignature class and utilities (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -33,6 +38,10 @@ class TensorSignature(collections.namedtuple(
     "TensorSignature", ["dtype", "shape", "is_sparse"])):
   """Signature of the `Tensor` object.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Useful to check compatibility of tensors.
 
   Example:
diff --git a/tensorflow/contrib/learn/python/learn/estimators/test_data.py b/tensorflow/contrib/learn/python/learn/estimators/test_data.py
index ed201bfc58f273e6587850032386c2686aea4148..e4b057b4f5a9e081c2d891bd9828ffc315e51e91 100644
--- a/tensorflow/contrib/learn/python/learn/estimators/test_data.py
+++ b/tensorflow/contrib/learn/python/learn/estimators/test_data.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Test data utilities."""
+"""Test data utilities (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/evaluable.py b/tensorflow/contrib/learn/python/learn/evaluable.py
index 8f6cd39864b437f163dd7c1140dc88755ce98529..10881ca885599bc81386e15f814a2687d907f63b 100644
--- a/tensorflow/contrib/learn/python/learn/evaluable.py
+++ b/tensorflow/contrib/learn/python/learn/evaluable.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""`Evaluable` interface."""
+"""`Evaluable` interface (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -23,6 +28,10 @@ import abc
 
 class Evaluable(object):
   """Interface for objects that are evaluatable by, e.g., `Experiment`.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
   """
   __metaclass__ = abc.ABCMeta
 
diff --git a/tensorflow/contrib/learn/python/learn/experiment.py b/tensorflow/contrib/learn/python/learn/experiment.py
index bec976afd2719138117976381669ca3292360480..9a7c4cd685b90cf3ac8922bdb031aa935c1aa64f 100644
--- a/tensorflow/contrib/learn/python/learn/experiment.py
+++ b/tensorflow/contrib/learn/python/learn/experiment.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Experiment class collecting information needed for a single training run."""
+"""Experiment class collecting information for a single training run (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -25,7 +30,6 @@ import os
 import time
 
 from tensorflow.contrib.framework import deprecated
-from tensorflow.contrib.framework import deprecated_args
 from tensorflow.contrib.framework.python.framework import experimental
 from tensorflow.contrib.learn.python.learn import evaluable
 from tensorflow.contrib.learn.python.learn import export_strategy
@@ -118,6 +122,10 @@ class _EvalAndExportListener(basic_session_run_hooks.CheckpointSaverListener):
 class Experiment(object):
   """Experiment is a class containing all information needed to train a model.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   After an experiment is created (by passing an Estimator and inputs for
   training and evaluation), an Experiment instance knows how to invoke training
   and eval loops in a sensible fashion for distributed training.
@@ -125,16 +133,8 @@ class Experiment(object):
 
   # TODO(ispir): remove delay_workers_by_global_step and make global step based
   # waiting as only behavior.
-  @deprecated_args(
-      "2016-10-23",
-      "local_eval_frequency is deprecated as local_run will be renamed to "
-      "train_and_evaluate. Use min_eval_frequency and call train_and_evaluate "
-      "instead. Note, however, that the default for min_eval_frequency is 1, "
-      "meaning models will be evaluated every time a new checkpoint is "
-      "available. In contrast, the default for local_eval_frequency is None, "
-      "resulting in evaluation occurring only after training has completed. "
-      "min_eval_frequency is ignored when calling the deprecated local_run.",
-      "local_eval_frequency")
+  @deprecated(None, "Please switch to tf.estimator.train_and_evaluate. You will"
+              " also have to convert to a tf.estimator.Estimator.")
   def __init__(self,
                estimator,
                train_input_fn,
@@ -152,7 +152,8 @@ class Experiment(object):
                export_strategies=None,
                train_steps_per_iteration=None,
                checkpoint_and_export=False,
-               saving_listeners=None):
+               saving_listeners=None,
+               check_interval_secs=5):
     """Constructor for `Experiment`.
 
     Creates an Experiment instance. None of the functions passed to this
@@ -190,8 +191,9 @@ class Experiment(object):
         number of steps between evaluations. Of course, evaluation does not
         occur if no new snapshot is available, hence, this is the minimum.
         If 0, the evaluation will only happen after training.
-        If None, defaults to 1, unless model_dir is on GCS, in which case the
-        default is 1000.
+        If None, defaults to 1. To avoid checking for new checkpoints too
+        frequently, the interval is further limited to be at least
+        check_interval_secs between checks.
       delay_workers_by_global_step: if `True` delays training workers
         based on global step instead of time.
       export_strategies: Iterable of `ExportStrategy`s, or a single one, or
@@ -215,7 +217,10 @@ class Experiment(object):
       saving_listeners: list of `CheckpointSaverListener` objects. Used by
         tf.estimator.Estimator for callbacks that run immediately before or
         after checkpoint savings.
-
+      check_interval_secs:
+        Minimum time between subsequent checks for a new checkpoint. This
+        mostly applies if both min_eval_frequency and the time spent per
+        training step are low.
     Raises:
       ValueError: if `estimator` does not implement Estimator interface,
         or if export_strategies has the wrong type.
@@ -261,13 +266,9 @@ class Experiment(object):
     self._continuous_eval_throttle_secs = continuous_eval_throttle_secs
     self._checkpoint_and_export = checkpoint_and_export
     self._saving_listeners = saving_listeners
-    # Using 1 on a non-cached file system requires a lot of overhead to
-    # read the checkpoint state file. This is particular bad on GCS, so
-    # we use a different default. This is a temporary band-aid, to be
-    # fixed holistically later (b/36498507).
-    default_min_eval_frequency = 1000 if _is_gcs(estimator.model_dir) else 1
     self._min_eval_frequency = min_eval_frequency if (
-        min_eval_frequency is not None) else default_min_eval_frequency
+        min_eval_frequency is not None) else 1
+    self._check_interval_secs = check_interval_secs
     self._delay_workers_by_global_step = delay_workers_by_global_step
     self._train_monitors = train_monitors[:] if train_monitors else []
     self._eval_hooks = eval_hooks[:] if eval_hooks else []
@@ -646,12 +647,19 @@ class Experiment(object):
         self._train_monitors += [saver_hook]
       else:
         if self._min_eval_frequency:
+          # Using a low min_eval_frequency (default is 1) on a non-cached file
+          # system requires a lot of overhead to read the checkpoint state file.
+          # This is particularly bad on GCS and CNS. See also b/36498507 for
+          # context. `check_interval_secs = 5` avoids polling a remote
+          # filesystem too often.
+
           self._train_monitors += [
               monitors.ValidationMonitor(
                   input_fn=self._eval_input_fn,
                   eval_steps=self._eval_steps,
                   metrics=self._eval_metrics,
                   every_n_steps=self._min_eval_frequency,
+                  check_interval_secs=self._check_interval_secs,
                   name=eval_dir_suffix,
                   hooks=self._eval_hooks)
           ]
@@ -928,7 +936,3 @@ def _new_attr_context(obj, attr):
     yield
   finally:
     setattr(obj, attr, saved)
-
-
-def _is_gcs(model_dir):
-  return model_dir and model_dir.startswith("gs://")
diff --git a/tensorflow/contrib/learn/python/learn/experiment_test.py b/tensorflow/contrib/learn/python/learn/experiment_test.py
index 545d7d8924c0c10544e6113e2968b7ae3d2090fc..d10927a0cdd5c67c8d2a8e569153235ee175ec4d 100644
--- a/tensorflow/contrib/learn/python/learn/experiment_test.py
+++ b/tensorflow/contrib/learn/python/learn/experiment_test.py
@@ -674,37 +674,11 @@ class ExperimentTest(test.TestCase):
   def test_min_eval_frequency_defaults(self):
     def dummy_model_fn(features, labels):  # pylint: disable=unused-argument
       pass
-
-    # The default value when model_dir is on GCS is 1000
-    estimator = core_estimator.Estimator(dummy_model_fn, 'gs://dummy_bucket')
-    ex = experiment.Experiment(
-        estimator, train_input_fn=None, eval_input_fn=None)
-    self.assertEquals(ex._min_eval_frequency, 1000)
-
-    # The default value when model_dir is not on GCS is 1
     estimator = core_estimator.Estimator(dummy_model_fn, '/tmp/dummy')
     ex = experiment.Experiment(
         estimator, train_input_fn=None, eval_input_fn=None)
     self.assertEquals(ex._min_eval_frequency, 1)
 
-    # Make sure default not used when explicitly set
-    estimator = core_estimator.Estimator(dummy_model_fn, 'gs://dummy_bucket')
-    ex = experiment.Experiment(
-        estimator,
-        min_eval_frequency=123,
-        train_input_fn=None,
-        eval_input_fn=None)
-    self.assertEquals(ex._min_eval_frequency, 123)
-
-    # Make sure default not used when explicitly set as 0
-    estimator = core_estimator.Estimator(dummy_model_fn, 'gs://dummy_bucket')
-    ex = experiment.Experiment(
-        estimator,
-        min_eval_frequency=0,
-        train_input_fn=None,
-        eval_input_fn=None)
-    self.assertEquals(ex._min_eval_frequency, 0)
-
   def test_continuous_train_and_eval(self):
     for est in self._estimators_for_tests(eval_dict={'global_step': 100}):
       if isinstance(est, core_estimator.Estimator):
diff --git a/tensorflow/contrib/learn/python/learn/export_strategy.py b/tensorflow/contrib/learn/python/learn/export_strategy.py
index 55a8b824312b89e0ac66513242191f4201ac212a..075cab536ecb5279e7e6f23abb0b70c75043a7ec 100644
--- a/tensorflow/contrib/learn/python/learn/export_strategy.py
+++ b/tensorflow/contrib/learn/python/learn/export_strategy.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""ExportStrategy class represents different flavors of model export."""
+"""ExportStrategy class represents different flavors of model export (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -21,6 +26,7 @@ from __future__ import print_function
 import collections
 
 from tensorflow.python.util import tf_inspect
+from tensorflow.python.util.deprecation import deprecated
 
 __all__ = ['ExportStrategy']
 
@@ -30,6 +36,10 @@ class ExportStrategy(
                            ['name', 'export_fn', 'strip_default_attrs'])):
   """A class representing a type of model export.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Typically constructed by a utility function specific to the exporter, such as
   `saved_model_export_utils.make_export_strategy()`.
 
@@ -56,6 +66,8 @@ class ExportStrategy(
         forward compatibility of the resulting `SavedModel`.
   """
 
+  @deprecated(None, 'Please switch to tf.estimator.train_and_evaluate, and use '
+              'tf.estimator.Exporter.')
   def __new__(cls, name, export_fn, strip_default_attrs=None):
     return super(ExportStrategy, cls).__new__(
         cls, name, export_fn, strip_default_attrs)
diff --git a/tensorflow/contrib/learn/python/learn/graph_actions.py b/tensorflow/contrib/learn/python/learn/graph_actions.py
index 98365c05f663e5d2a06703457fc5663d7135f7d9..a997fab723a16dddf150aa9397863605e4e77933 100644
--- a/tensorflow/contrib/learn/python/learn/graph_actions.py
+++ b/tensorflow/contrib/learn/python/learn/graph_actions.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""High level operations on graphs."""
+"""High level operations on graphs (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -68,6 +73,7 @@ def clear_summary_writers():
   return summary_io.SummaryWriterCache.clear()
 
 
+@deprecated(None, 'Use `SummaryWriterCache.get` directly.')
 def get_summary_writer(logdir):
   """Returns single SummaryWriter per logdir in current run.
 
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/__init__.py b/tensorflow/contrib/learn/python/learn/learn_io/__init__.py
index 06c3782a471537cf3879450e6bd20899a35d96ac..8b133a4440d8cbc19abca64f972791fc16ade6f8 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/__init__.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tools to allow different io formats."""
+"""Tools to allow different io formats (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py b/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py
index 7d666391cea3c0a52a2cb7e324c00d5f480710d5..e0a1948d95a727675dac8ff3ce9f55c35d5f8d8d 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/dask_io.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Methods to allow dask.DataFrame."""
+"""Methods to allow dask.DataFrame (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -21,6 +26,8 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.util.deprecation import deprecated
+
 try:
   # pylint: disable=g-import-not-at-top
   import dask.dataframe as dd
@@ -60,6 +67,7 @@ def _construct_dask_df_with_divisions(df):
     return dd.Series(merge(dsk, df.dask), name, df.name, divisions)
 
 
+@deprecated(None, 'Please feed input to tf.data to support dask.')
 def extract_dask_data(data):
   """Extract data from dask.Series or dask.DataFrame for predictors.
 
@@ -81,6 +89,7 @@ def extract_dask_data(data):
     return data
 
 
+@deprecated(None, 'Please feed input to tf.data to support dask.')
 def extract_dask_labels(labels):
   """Extract data from dask.Series or dask.DataFrame for labels.
 
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py b/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py
index 96be8b1bc402479d5611965f27abb197363cb939..c45b1d186471125776d6536112aebb66bb5ad558 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/data_feeder.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Implementations of different data feeders to provide data for TF trainer."""
+"""Implementations of different data feeders to provide data for TF trainer (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 # TODO(ipolosukhin): Replace this module with feed-dict queue runners & queues.
 
@@ -31,6 +36,7 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.util.deprecation import deprecated
 
 # pylint: disable=g-multiple-import,g-bad-import-order
 from .pandas_io import HAS_PANDAS, extract_pandas_data, extract_pandas_matrix, extract_pandas_labels
@@ -101,6 +107,7 @@ def _is_iterable(x):
   return hasattr(x, 'next') or hasattr(x, '__next__')
 
 
+@deprecated(None, 'Please use tensorflow/transform or tf.data.')
 def setup_train_data_feeder(x,
                             y,
                             n_classes,
@@ -188,6 +195,7 @@ def _batch_data(x, batch_size=None):
     yield np.matrix(chunk)
 
 
+@deprecated(None, 'Please use tensorflow/transform or tf.data.')
 def setup_predict_data_feeder(x, batch_size=None):
   """Returns an iterable for feeding into predict step.
 
@@ -219,6 +227,7 @@ def setup_predict_data_feeder(x, batch_size=None):
   return [x]
 
 
+@deprecated(None, 'Please use tensorflow/transform or tf.data.')
 def setup_processor_data_feeder(x):
   """Sets up processor iterable.
 
@@ -233,6 +242,7 @@ def setup_processor_data_feeder(x):
   return x
 
 
+@deprecated(None, 'Please convert numpy dtypes explicitly.')
 def check_array(array, dtype):
   """Checks array on dtype and converts it if different.
 
@@ -275,8 +285,14 @@ def _check_dtype(dtype):
 
 
 class DataFeeder(object):
-  """Data feeder is an example class to sample data for TF trainer."""
+  """Data feeder is an example class to sample data for TF trainer.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Please use tensorflow/transform or tf.data.')
   def __init__(self,
                x,
                y,
@@ -563,6 +579,10 @@ class DataFeeder(object):
 class StreamingDataFeeder(DataFeeder):
   """Data feeder for TF trainer that reads data from iterator.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Streaming data feeder allows to read data as it comes it from disk or
   somewhere else. It's custom to have this iterators rotate infinetly over
   the dataset, to allow control of how much to learn on the trainer side.
@@ -771,11 +791,16 @@ class StreamingDataFeeder(DataFeeder):
 class DaskDataFeeder(object):
   """Data feeder for that reads data from dask.Series and dask.DataFrame.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Numpy arrays can be serialized to disk and it's possible to do random seeks
   into them. DaskDataFeeder will remove requirement to have full dataset in the
   memory and still do random seeks for sampling of batches.
   """
 
+  @deprecated(None, 'Please feed input to tf.data to support dask.')
   def __init__(self,
                x,
                y,
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/generator_io.py b/tensorflow/contrib/learn/python/learn/learn_io/generator_io.py
index 884faf8335e2a3ca1d27d2d93b4c817131648774..f8aaa0c9e3e5b589a6ad47678dba3dc38de7c471 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/generator_io.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/generator_io.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Methods to allow generator of dict with numpy arrays."""
+"""Methods to allow generator of dict with numpy arrays (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -23,8 +28,10 @@ from types import FunctionType
 from types import GeneratorType
 
 from tensorflow.python.estimator.inputs.queues.feeding_functions import _enqueue_data as enqueue_data
+from tensorflow.python.util.deprecation import deprecated
 
 
+@deprecated(None, 'Please use tf.data.')
 def generator_input_fn(x,
                        target_key=None,
                        batch_size=128,
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/graph_io.py b/tensorflow/contrib/learn/python/learn/learn_io/graph_io.py
index 3a46c239688017f9204d2c6182a6f81cd325a417..9e816f54b6cf8dee84c6d62406ab3db700054d06 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/graph_io.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/graph_io.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Methods to read data in the graph."""
+"""Methods to read data in the graph (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -34,11 +39,13 @@ from tensorflow.python.platform import gfile
 from tensorflow.python.summary import summary
 from tensorflow.python.training import input as input_ops
 from tensorflow.python.training import queue_runner
+from tensorflow.python.util.deprecation import deprecated
 
 # Default name for key in the feature dict.
 KEY_FEATURE_NAME = '__key__'
 
 
+@deprecated(None, 'Use tf.data.')
 def read_batch_examples(file_pattern,
                         batch_size,
                         reader,
@@ -106,6 +113,7 @@ def read_batch_examples(file_pattern,
   return examples
 
 
+@deprecated(None, 'Use tf.data.')
 def read_keyed_batch_examples(file_pattern,
                               batch_size,
                               reader,
@@ -175,6 +183,7 @@ def read_keyed_batch_examples(file_pattern,
       seed=seed)
 
 
+@deprecated(None, 'Use tf.data.')
 def read_keyed_batch_examples_shared_queue(file_pattern,
                                            batch_size,
                                            reader,
@@ -452,6 +461,7 @@ def _read_keyed_batch_examples_helper(file_pattern,
     return queued_examples_with_keys
 
 
+@deprecated(None, 'Use tf.data.')
 def read_keyed_batch_features(file_pattern,
                               batch_size,
                               features,
@@ -540,6 +550,7 @@ def read_keyed_batch_features(file_pattern,
         name=scope)
 
 
+@deprecated(None, 'Use tf.data.')
 def read_keyed_batch_features_shared_queue(file_pattern,
                                            batch_size,
                                            features,
@@ -620,6 +631,7 @@ def read_keyed_batch_features_shared_queue(file_pattern,
         name=scope)
 
 
+@deprecated(None, 'Use tf.data.')
 def queue_parsed_features(parsed_features,
                           keys=None,
                           feature_queue_capacity=100,
@@ -742,6 +754,7 @@ def queue_parsed_features(parsed_features,
     return dequeued_keys, dequeued_parsed_features
 
 
+@deprecated(None, 'Use tf.data.')
 def read_batch_features(file_pattern,
                         batch_size,
                         features,
@@ -821,6 +834,7 @@ def read_batch_features(file_pattern,
   return features
 
 
+@deprecated(None, 'Use tf.data.')
 def read_batch_record_features(file_pattern,
                                batch_size,
                                features,
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/numpy_io.py b/tensorflow/contrib/learn/python/learn/learn_io/numpy_io.py
index 692438807fbd7febb156d4db73b5d3deba6c987d..29552d24f1eaa0d85a99c8b09f69d007e7e4fe9f 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/numpy_io.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/numpy_io.py
@@ -12,15 +12,22 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Methods to allow dict of numpy arrays."""
+"""Methods to allow dict of numpy arrays (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.estimator.inputs.numpy_io import numpy_input_fn as core_numpy_input_fn
+from tensorflow.python.util.deprecation import deprecated
 
 
+@deprecated(None, 'Use tf.estimator.inputs.numpy_input_fn.')
 def numpy_input_fn(x,
                    y=None,
                    batch_size=128,
diff --git a/tensorflow/contrib/learn/python/learn/learn_io/pandas_io.py b/tensorflow/contrib/learn/python/learn/learn_io/pandas_io.py
index ede7558eafa9237dc63aa95a62e599c5e9755822..b4ef055f5ae484ec704ad42efcf2c00c4a7a4f56 100644
--- a/tensorflow/contrib/learn/python/learn/learn_io/pandas_io.py
+++ b/tensorflow/contrib/learn/python/learn/learn_io/pandas_io.py
@@ -13,13 +13,19 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Methods to allow pandas.DataFrame."""
+"""Methods to allow pandas.DataFrame (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.estimator.inputs.pandas_io import pandas_input_fn as core_pandas_input_fn
+from tensorflow.python.util.deprecation import deprecated
 
 try:
   # pylint: disable=g-import-not-at-top
@@ -47,6 +53,7 @@ PANDAS_DTYPES = {
 }
 
 
+@deprecated(None, 'Please use tf.estimator.inputs.pandas_input_fn.')
 def pandas_input_fn(x,
                     y=None,
                     batch_size=128,
@@ -66,6 +73,7 @@ def pandas_input_fn(x,
                               target_column=target_column)
 
 
+@deprecated(None, 'Please access pandas data directly.')
 def extract_pandas_data(data):
   """Extract data from pandas.DataFrame for predictors.
 
@@ -96,6 +104,7 @@ def extract_pandas_data(data):
                      'float, or bool. Found: ' + ', '.join(error_report))
 
 
+@deprecated(None, 'Please access pandas data directly.')
 def extract_pandas_matrix(data):
   """Extracts numpy matrix from pandas DataFrame.
 
@@ -111,6 +120,7 @@ def extract_pandas_matrix(data):
   return data.as_matrix()
 
 
+@deprecated(None, 'Please access pandas data directly.')
 def extract_pandas_labels(labels):
   """Extract data from pandas.DataFrame for labels.
 
diff --git a/tensorflow/contrib/learn/python/learn/learn_runner.py b/tensorflow/contrib/learn/python/learn/learn_runner.py
index 2af723a0d64822e81fa0fbeb106ab812de6ab4e8..d719a3e488b9905ef7903e21d90dbaae0449735c 100644
--- a/tensorflow/contrib/learn/python/learn/learn_runner.py
+++ b/tensorflow/contrib/learn/python/learn/learn_runner.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Runs an Experiment."""
+"""Runs an Experiment (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -22,6 +27,7 @@ from tensorflow.contrib.learn.python.learn.estimators import run_config as run_c
 from tensorflow.contrib.learn.python.learn.experiment import Experiment
 from tensorflow.contrib.training.python.training import hparam as hparam_lib
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.util.deprecation import deprecated
 
 
 # TODO(xiejw): Refactor the learn_runner to make code reusable.
@@ -99,6 +105,7 @@ def _wrapped_experiment_fn_with_uid_check(experiment_fn, require_hparams=False):
   return wrapped_experiment_fn
 
 
+@deprecated(None, 'Use tf.estimator.train_and_evaluate.')
 def run(experiment_fn, output_dir=None, schedule=None, run_config=None,
         hparams=None):
   """Make and run an experiment.
@@ -218,6 +225,7 @@ def run(experiment_fn, output_dir=None, schedule=None, run_config=None,
   return _execute_schedule(experiment, schedule)
 
 
+@deprecated(None, 'Use tf.estimator.train_and_evaluate.')
 def tune(experiment_fn, tuner):
   """Tune an experiment with hyper-parameters.
 
diff --git a/tensorflow/contrib/learn/python/learn/learn_runner_lib.py b/tensorflow/contrib/learn/python/learn/learn_runner_lib.py
index 7d9b1c7716f0ab1f2274ca53406175240b613027..ba2d067787c1dfd4e4820ecc916f1053e9f3cf60 100644
--- a/tensorflow/contrib/learn/python/learn/learn_runner_lib.py
+++ b/tensorflow/contrib/learn/python/learn/learn_runner_lib.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Utilities to run and tune an Experiment.
+"""Utilities to run and tune an Experiment (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 @@run
 @@tune
diff --git a/tensorflow/contrib/learn/python/learn/metric_spec.py b/tensorflow/contrib/learn/python/learn/metric_spec.py
index 6440bc204b8e339ff51311dcc87b36f556b94092..97220365d5dddb82b602369f06bea021a86d584f 100644
--- a/tensorflow/contrib/learn/python/learn/metric_spec.py
+++ b/tensorflow/contrib/learn/python/learn/metric_spec.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""The metric spec class to flexibly connect models and metrics."""
+"""The metric spec class to flexibly connect models and metrics (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -22,6 +27,7 @@ import six
 
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import tf_inspect
+from tensorflow.python.util.deprecation import deprecated
 
 
 def _assert_named_args(sentinel):
@@ -223,6 +229,10 @@ def _adapt_metric_fn(
 class MetricSpec(object):
   """MetricSpec connects a model to metric functions.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   The MetricSpec class contains all information necessary to connect the
   output of a `model_fn` to the metrics (usually, streaming metrics) that are
   used in evaluation.
@@ -284,6 +294,7 @@ class MetricSpec(object):
 
   """
 
+  @deprecated(None, 'Use tf.estimator.EstimatorSpec.eval_metric_ops.')
   def __init__(self,
                metric_fn,
                prediction_key=None,
diff --git a/tensorflow/contrib/learn/python/learn/models.py b/tensorflow/contrib/learn/python/learn/models.py
index 4283240d018c949bb35aeb12032d2ee8b75884a5..bd4bbf9f8c9ad7e8a0fc06d8c0dc24672536c158 100644
--- a/tensorflow/contrib/learn/python/learn/models.py
+++ b/tensorflow/contrib/learn/python/learn/models.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Various high level TF models."""
+"""Various high level TF models (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -28,8 +33,10 @@ from tensorflow.python.ops import array_ops as array_ops_
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.summary import summary
+from tensorflow.python.util.deprecation import deprecated
 
 
+@deprecated(None, 'Consider using a tf.estimator.LinearRegressor.')
 def linear_regression_zero_init(x, y):
   """Linear regression subgraph with zero-value initial weights and bias.
 
@@ -43,6 +50,7 @@ def linear_regression_zero_init(x, y):
   return linear_regression(x, y, init_mean=0.0, init_stddev=0.0)
 
 
+@deprecated(None, 'Consider using a tf.estimator.LinearClassifier.')
 def logistic_regression_zero_init(x, y):
   """Logistic regression subgraph with zero-value initial weights and bias.
 
@@ -56,6 +64,7 @@ def logistic_regression_zero_init(x, y):
   return logistic_regression(x, y, init_mean=0.0, init_stddev=0.0)
 
 
+@deprecated(None, 'Consider using a class from tf.estimator.')
 def linear_regression(x, y, init_mean=None, init_stddev=1.0):
   """Creates linear regression TensorFlow subgraph.
 
@@ -107,6 +116,7 @@ def linear_regression(x, y, init_mean=None, init_stddev=1.0):
     return losses_ops.mean_squared_error_regressor(x, y, weights, bias)
 
 
+@deprecated(None, 'Consider using a class from tf.estimator.')
 def logistic_regression(x,
                         y,
                         class_weight=None,
@@ -203,6 +213,7 @@ def _reverse_seq(input_seq, lengths):
   return result
 
 
+@deprecated(None, 'Please consider `tf.nn.bidirectional_dynamic_rnn`.')
 def bidirectional_rnn(cell_fw,
                       cell_bw,
                       inputs,
@@ -283,6 +294,7 @@ def bidirectional_rnn(cell_fw,
 # End of TensorFlow 0.7
 
 
+@deprecated(None, 'Please consider tensorflow/tensor2tensor.')
 def get_rnn_model(rnn_size, cell_type, num_layers, input_op_fn, bidirectional,
                   target_predictor_fn, sequence_length, initial_state,
                   attn_length, attn_size, attn_vec_size):
diff --git a/tensorflow/contrib/learn/python/learn/monitored_session.py b/tensorflow/contrib/learn/python/learn/monitored_session.py
index 22602e9f69d972505d83a66a6f9183b5e4d15c44..ac0433f1775feeed2ec3cf49291da01500bef01b 100644
--- a/tensorflow/contrib/learn/python/learn/monitored_session.py
+++ b/tensorflow/contrib/learn/python/learn/monitored_session.py
@@ -13,7 +13,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""A wrapper of Session API which runs hooks."""
+"""A wrapper of Session API which runs hooks (deprecated).
+
+These are deprecated aliases for classes and functions in `tf.train`. Please use
+those directly.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/monitors.py b/tensorflow/contrib/learn/python/learn/monitors.py
index 51381a7427c919592b8e818c4b46dba974992610..77f7c73d5412d40b338eaff4cf04d99fd0892723 100644
--- a/tensorflow/contrib/learn/python/learn/monitors.py
+++ b/tensorflow/contrib/learn/python/learn/monitors.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Monitors instrument the training process.
+"""Monitors instrument the training process (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 @@get_default_monitors
 @@BaseMonitor
@@ -59,6 +63,10 @@ from tensorflow.python.util import tf_inspect
 class BaseMonitor(object):
   """Base class for Monitors.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Defines basic interfaces of Monitors.
   Monitors can either be run on all workers or, more commonly, restricted
   to run exclusively on the elected chief worker.
@@ -229,6 +237,10 @@ def _extract_output(outputs, request):
 class EveryN(BaseMonitor):
   """Base class for monitors that execute callbacks every N steps.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   This class adds three new callbacks:
     - every_n_step_begin
     - every_n_step_end
@@ -418,6 +430,10 @@ class StopAtStep(BaseMonitor):
 class PrintTensor(EveryN):
   """Prints given tensors every N steps.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   This is an `EveryN` monitor and has consistent semantic for `every_n`
   and `first_n`.
 
@@ -455,9 +471,12 @@ class PrintTensor(EveryN):
 class LoggingTrainable(EveryN):
   """Writes trainable variable values into log every N steps.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Write the tensors in trainable variables `every_n` steps,
   starting with the `first_n`th step.
-
   """
 
   def __init__(self, scope=None, every_n=100, first_n=1):
@@ -493,7 +512,12 @@ class LoggingTrainable(EveryN):
 
 
 class SummarySaver(EveryN):
-  """Saves summaries every N steps."""
+  """Saves summaries every N steps.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   def __init__(self,
                summary_op,
@@ -554,6 +578,10 @@ class SummarySaver(EveryN):
 class ValidationMonitor(EveryN):
   """Runs evaluation of a given estimator, at most every N steps.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Note that the evaluation is done based on the saved checkpoint, which will
   usually be older than the current step.
 
@@ -573,7 +601,8 @@ class ValidationMonitor(EveryN):
                early_stopping_rounds=None,
                early_stopping_metric="loss",
                early_stopping_metric_minimize=True,
-               name=None):
+               name=None,
+               check_interval_secs=5):
     """Initializes a ValidationMonitor.
 
     Args:
@@ -600,6 +629,9 @@ class ValidationMonitor(EveryN):
           loss metrics like mean squared error, and False for performance
           metrics like accuracy.
       name: See `BaseEstimator.evaluate`.
+      check_interval_secs: Only check for a new checkpoint if at least this
+          many seconds have passed. Ignored if None. Default is 5 seconds.
+
 
     Raises:
       ValueError: If both x and input_fn are provided.
@@ -626,6 +658,8 @@ class ValidationMonitor(EveryN):
     self._early_stopped = False
     self._latest_path = None
     self._latest_path_step = None
+    self._last_checkpoint_check_time = None
+    self._check_interval_secs = check_interval_secs
 
   @property
   def early_stopped(self):
@@ -690,6 +724,16 @@ class ValidationMonitor(EveryN):
     # that's what is being evaluated.
     if self._estimator is None:
       raise ValueError("Missing call to set_estimator.")
+    current_time = time.time()
+    if (self._check_interval_secs is not None and
+        self._last_checkpoint_check_time is not None and
+        current_time - self._last_checkpoint_check_time <=
+        self._check_interval_secs):
+      logging.debug(
+          "Skipping evaluation since less than %d seconds have passed since "
+          "last check for a new checkpoint.", self._check_interval_secs)
+      return False
+    self._last_checkpoint_check_time = current_time
     # Check that we are not running evaluation on the same checkpoint.
     latest_path = saver_lib.latest_checkpoint(self._estimator.model_dir)
     if latest_path is None:
@@ -740,6 +784,10 @@ class ValidationMonitor(EveryN):
 class CaptureVariable(EveryN):
   """Captures a variable's values into a collection.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   This monitor is useful for unit testing. You should exercise caution when
   using this monitor in production, since it never discards values.
 
@@ -778,6 +826,7 @@ class CaptureVariable(EveryN):
     self._var_values[step] = _extract_output(outputs, self._var_name)
 
 
+@deprecation.deprecated(None, "Use tf.train.MonitoredTrainingSession.")
 def get_default_monitors(loss_op=None,
                          summary_op=None,
                          save_summary_steps=100,
@@ -812,6 +861,10 @@ def get_default_monitors(loss_op=None,
 class GraphDump(BaseMonitor):
   """Dumps almost all tensors in the graph at every step.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Note, this is very expensive, prefer `PrintTensor` in production.
   """
 
@@ -901,7 +954,12 @@ class GraphDump(BaseMonitor):
 
 
 class ExportMonitor(EveryN):
-  """Monitor that exports Estimator every N steps."""
+  """Monitor that exports Estimator every N steps.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   @deprecation.deprecated("2017-03-25",
                           "ExportMonitor is deprecated. Please pass an "
@@ -1024,7 +1082,12 @@ class ExportMonitor(EveryN):
 
 
 class CheckpointSaver(BaseMonitor):
-  """Saves checkpoints every N steps or N seconds."""
+  """Saves checkpoints every N steps or N seconds.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   def __init__(self,
                checkpoint_dir,
@@ -1109,7 +1172,12 @@ class CheckpointSaver(BaseMonitor):
 
 
 class StepCounter(EveryN):
-  """Steps per second monitor."""
+  """Steps per second monitor.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
   def __init__(self, every_n_steps=100, output_dir=None, summary_writer=None):
     super(StepCounter, self).__init__(every_n_steps=every_n_steps)
@@ -1149,6 +1217,10 @@ class NanLossDuringTrainingError(RuntimeError):
 class NanLoss(EveryN):
   """NaN Loss monitor.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Monitors loss and stops training if loss is NaN.
   Can either fail with exception or just stop training.
   """
diff --git a/tensorflow/contrib/learn/python/learn/monitors_test.py b/tensorflow/contrib/learn/python/learn/monitors_test.py
index b2b24776c60183113a5f936dd276ff312d6d0079..5c34d0ddb01f3bcdc407e6926e7c5b73be1863b4 100644
--- a/tensorflow/contrib/learn/python/learn/monitors_test.py
+++ b/tensorflow/contrib/learn/python/learn/monitors_test.py
@@ -385,7 +385,11 @@ class MonitorsTest(test.TestCase):
     estimator.evaluate.return_value = validation_outputs
 
     monitor = learn.monitors.ValidationMonitor(
-        x=constant_op.constant(2.0), every_n_steps=0, early_stopping_rounds=2)
+        x=constant_op.constant(2.0),
+        every_n_steps=0,
+        early_stopping_rounds=2,
+        check_interval_secs=None)
+
     self._assert_validation_monitor(monitor)
     monitor.set_estimator(estimator)
     with ops.Graph().as_default() as g, self.test_session(g):
diff --git a/tensorflow/contrib/learn/python/learn/ops/__init__.py b/tensorflow/contrib/learn/python/learn/ops/__init__.py
index 33962e34cc685ce2c830a7bbfd1b5c626bcd8b31..efb1f47cf5bb2dcd0fb37b7b85cd8f170d56e4d1 100644
--- a/tensorflow/contrib/learn/python/learn/ops/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/ops/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Various TensorFlow Ops."""
+"""Various TensorFlow Ops (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py b/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py
index fa3b7323e343371e986b763d30a8a44620894549..b3b067b8e1a4eb9f644e8e55587b3405d91a0189 100644
--- a/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py
+++ b/tensorflow/contrib/learn/python/learn/ops/embeddings_ops.py
@@ -13,7 +13,11 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorFlow Ops to work with embeddings.
+"""TensorFlow Ops to work with embeddings (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 Note: categorical variables are handled via embeddings in many cases.
 For example, in case of words.
diff --git a/tensorflow/contrib/learn/python/learn/ops/losses_ops.py b/tensorflow/contrib/learn/python/learn/ops/losses_ops.py
index b040ab3bb6c516158589a8e30d56fff1f7728951..92976d1539c7ddc226b81f903beee82b798ec8db 100644
--- a/tensorflow/contrib/learn/python/learn/ops/losses_ops.py
+++ b/tensorflow/contrib/learn/python/learn/ops/losses_ops.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorFlow Ops for loss computation."""
+"""TensorFlow Ops for loss computation (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py b/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py
index 45727faab4362abeab18f77861353eb53976023a..aa37cb4a76e2a6157bf077d327248353bd516472 100644
--- a/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py
+++ b/tensorflow/contrib/learn/python/learn/ops/seq2seq_ops.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorFlow Ops for Sequence to Sequence models."""
+"""TensorFlow Ops for Sequence to Sequence models (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -26,8 +31,10 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn
 from tensorflow.python.ops import variable_scope as vs
+from tensorflow.python.util.deprecation import deprecated
 
 
+@deprecated(None, 'Please use tf.nn/tf.layers directly.')
 def sequence_classifier(decoding, labels, sampling_decoding=None, name=None):
   """Returns predictions and loss for sequence of predictions.
 
@@ -57,6 +64,7 @@ def sequence_classifier(decoding, labels, sampling_decoding=None, name=None):
     return array_ops.stack(predictions, axis=1), loss
 
 
+@deprecated(None, 'Please use tf.nn/tf.layers directly.')
 def seq2seq_inputs(x, y, input_length, output_length, sentinel=None, name=None):
   """Processes inputs for Sequence to Sequence models.
 
@@ -87,6 +95,7 @@ def seq2seq_inputs(x, y, input_length, output_length, sentinel=None, name=None):
     return in_x, in_y, out_y
 
 
+@deprecated(None, 'Please use tf.nn/tf.layers directly.')
 def rnn_decoder(decoder_inputs, initial_state, cell, scope=None):
   """RNN Decoder that creates training and sampling sub-graphs.
 
@@ -123,6 +132,7 @@ def rnn_decoder(decoder_inputs, initial_state, cell, scope=None):
   return outputs, states, sampling_outputs, sampling_states
 
 
+@deprecated(None, 'Please use tf.nn/tf.layers directly.')
 def rnn_seq2seq(encoder_inputs,
                 decoder_inputs,
                 encoder_cell,
diff --git a/tensorflow/contrib/learn/python/learn/preprocessing/__init__.py b/tensorflow/contrib/learn/python/learn/preprocessing/__init__.py
index 7bcc177d4ea0ab57f092d68888a72de2b2fd5edc..e8c6e1acf80f0791421bee59aff30e67bccb44b2 100644
--- a/tensorflow/contrib/learn/python/learn/preprocessing/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/preprocessing/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Preprocessing tools useful for building models."""
+"""Preprocessing tools useful for building models (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/preprocessing/categorical.py b/tensorflow/contrib/learn/python/learn/preprocessing/categorical.py
index 154739d497ec1029026eaca1e93b37cd225f1050..faba3b2025e8abb51d1989c3fafbd5e711d6559b 100644
--- a/tensorflow/contrib/learn/python/learn/preprocessing/categorical.py
+++ b/tensorflow/contrib/learn/python/learn/preprocessing/categorical.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Implements preprocessing transformers for categorical variables."""
+"""Implements preprocessing transformers for categorical variables (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -22,6 +27,8 @@ from __future__ import print_function
 import math
 import numpy as np
 
+from tensorflow.python.util.deprecation import deprecated
+
 # pylint: disable=g-bad-import-order
 from . import categorical_vocabulary
 from ..learn_io.data_feeder import setup_processor_data_feeder
@@ -31,10 +38,16 @@ from ..learn_io.data_feeder import setup_processor_data_feeder
 class CategoricalProcessor(object):
   """Maps documents to sequences of word ids.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   As a common convention, Nan values are handled as unknown tokens.
   Both float('nan') and np.nan are accepted.
   """
 
+  @deprecated(None, 'Please use tensorflow/transform or tf.data for sequence '
+              'processing.')
   def __init__(self, min_frequency=0, share=False, vocabularies=None):
     """Initializes a CategoricalProcessor instance.
 
diff --git a/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py b/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py
index 5709955c49fba50ca4a299a443a2902bbd9c6b23..3ac370a6ab4423846e810900514445ad5269b680 100644
--- a/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py
+++ b/tensorflow/contrib/learn/python/learn/preprocessing/categorical_vocabulary.py
@@ -13,7 +13,11 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Categorical vocabulary classes to map categories to indexes.
+"""Categorical vocabulary classes to map categories to indexes (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 Can be used for categorical variables, sparse variables and words.
 """
@@ -25,14 +29,21 @@ from __future__ import print_function
 import collections
 import six
 
+from tensorflow.python.util.deprecation import deprecated
+
 
 class CategoricalVocabulary(object):
   """Categorical variables vocabulary class.
 
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+
   Accumulates and provides mapping from classes to indexes.
   Can be easily used for words.
   """
 
+  @deprecated(None, 'Please use tensorflow/transform or tf.data.')
   def __init__(self, unknown_token="<UNK>", support_reverse=True):
     self._unknown_token = unknown_token
     self._mapping = {unknown_token: 0}
diff --git a/tensorflow/contrib/learn/python/learn/preprocessing/text.py b/tensorflow/contrib/learn/python/learn/preprocessing/text.py
index 3af2074c2a46f0258c04111fff0235ba8309625e..f2b6776be7789a9433bfe41eb9354b74347059ec 100644
--- a/tensorflow/contrib/learn/python/learn/preprocessing/text.py
+++ b/tensorflow/contrib/learn/python/learn/preprocessing/text.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Implements a number of text preprocessing utilities."""
+"""Implements a number of text preprocessing utilities (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -24,6 +29,7 @@ import numpy as np
 import six
 
 from tensorflow.python.platform import gfile
+from tensorflow.python.util.deprecation import deprecated
 
 from .categorical_vocabulary import CategoricalVocabulary  # pylint: disable=g-bad-import-order
 
@@ -38,6 +44,7 @@ TOKENIZER_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",
                           re.UNICODE)
 
 
+@deprecated(None, 'Please use tensorflow/transform or tf.data.')
 def tokenizer(iterator):
   """Tokenizer generator.
 
@@ -51,9 +58,16 @@ def tokenizer(iterator):
     yield TOKENIZER_RE.findall(value)
 
 
+@deprecated(None, 'Please use tensorflow/transform or tf.data.')
 class ByteProcessor(object):
-  """Maps documents into sequence of ids for bytes."""
+  """Maps documents into sequence of ids for bytes.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Please use tensorflow/transform or tf.data.')
   def __init__(self, max_document_length):
     self.max_document_length = max_document_length
 
@@ -108,8 +122,14 @@ class ByteProcessor(object):
 
 
 class VocabularyProcessor(object):
-  """Maps documents to sequences of word ids."""
+  """Maps documents to sequences of word ids.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Please use tensorflow/transform or tf.data.')
   def __init__(self,
                max_document_length,
                min_frequency=0,
diff --git a/tensorflow/contrib/learn/python/learn/session_run_hook.py b/tensorflow/contrib/learn/python/learn/session_run_hook.py
index a8ba2be97206f2b974d256ad2c62c21a4e3e55d8..87edc9b720bdb3edcd5f2dcd1662d14da53c51cf 100644
--- a/tensorflow/contrib/learn/python/learn/session_run_hook.py
+++ b/tensorflow/contrib/learn/python/learn/session_run_hook.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""This file is deprecated. Use tensorflow.python.training.session_run_hook."""
+"""This file is deprecated. Use `tensorflow.python.training.session_run_hook`.
+
+See [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/summary_writer_cache.py b/tensorflow/contrib/learn/python/learn/summary_writer_cache.py
index 919d415c302b8ec17202aad34ff0bee69bfee2c7..d663cf5fb79c428b0e70d66b0f1305f0559a05c9 100644
--- a/tensorflow/contrib/learn/python/learn/summary_writer_cache.py
+++ b/tensorflow/contrib/learn/python/learn/summary_writer_cache.py
@@ -12,7 +12,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Wrapper for a Session-like object that handles threads and recovery.
+"""Wrapper for a Session-like object that handles threads and recovery (deprecated).
+
+These are deprecated aliases for classes and functions in `tf.train`. Please use
+those directly.
 
 Based on an original design of Illia Polosukhin.
 """
diff --git a/tensorflow/contrib/learn/python/learn/trainable.py b/tensorflow/contrib/learn/python/learn/trainable.py
index 429b6040be21d8cbe1f2bba58090366552fdfbe7..a1a3f20dcd8cb5ff7baa559ac41d5e5c40780511 100644
--- a/tensorflow/contrib/learn/python/learn/trainable.py
+++ b/tensorflow/contrib/learn/python/learn/trainable.py
@@ -12,7 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""`Trainable` interface."""
+"""`Trainable` interface (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -23,6 +28,8 @@ import abc
 
 class Trainable(object):
   """Interface for objects that are trainable by, e.g., `Experiment`.
+
+  THIS CLASS IS DEPRECATED.
   """
   __metaclass__ = abc.ABCMeta
 
diff --git a/tensorflow/contrib/learn/python/learn/utils/__init__.py b/tensorflow/contrib/learn/python/learn/utils/__init__.py
index 48978d0ac34cec2b18e6794dcf3b260bc3b683c4..66d8dc6fd43b383919a16515bc96be492a253bf6 100644
--- a/tensorflow/contrib/learn/python/learn/utils/__init__.py
+++ b/tensorflow/contrib/learn/python/learn/utils/__init__.py
@@ -13,7 +13,12 @@
 # limitations under the License.
 # ==============================================================================
 
-"""TensorFlow Learn Utils."""
+"""TensorFlow Learn Utils (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/utils/export.py b/tensorflow/contrib/learn/python/learn/utils/export.py
index cb34cb1d26b6812c7f3f39e9f965615de5a8ef07..3eacac7a3d3dcff4d39025fdee88e16e385b1b84 100644
--- a/tensorflow/contrib/learn/python/learn/utils/export.py
+++ b/tensorflow/contrib/learn/python/learn/utils/export.py
@@ -13,14 +13,18 @@
 # limitations under the License.
 # ==============================================================================
 
-"""Export utilities."""
+"""Export utilities (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
+"""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 from tensorflow.contrib.framework import deprecated
-from tensorflow.python.training import training_util
 from tensorflow.contrib.session_bundle import exporter
 from tensorflow.contrib.session_bundle import gc
 from tensorflow.python.client import session as tf_session
@@ -32,6 +36,7 @@ from tensorflow.python.ops import lookup_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import saver as tf_saver
+from tensorflow.python.training import training_util
 
 
 @deprecated('2017-03-25', 'Please use Estimator.export_savedmodel() instead.')
diff --git a/tensorflow/contrib/learn/python/learn/utils/gc.py b/tensorflow/contrib/learn/python/learn/utils/gc.py
index 226915987a4934626066b12810f579ae675107b2..916aecbea88b10bbef316ffb89d4c4d89667cb29 100644
--- a/tensorflow/contrib/learn/python/learn/utils/gc.py
+++ b/tensorflow/contrib/learn/python/learn/utils/gc.py
@@ -13,7 +13,11 @@
 # limitations under the License.
 # ==============================================================================
 
-r"""System for specifying garbage collection (GC) of path based data.
+r"""System for specifying garbage collection (GC) of path based data (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 This framework allows for GC of data specified by path names, for example files
 on disk.  gc.Path objects each represent a single item stored at a path and may
@@ -73,10 +77,12 @@ import os
 
 from tensorflow.python.platform import gfile
 from tensorflow.python.util import compat
+from tensorflow.python.util.deprecation import deprecated
 
 Path = collections.namedtuple('Path', 'path export_version')
 
 
+@deprecated(None, 'Please implement your own file management or use Saver.')
 def largest_export_versions(n):
   """Creates a filter that keeps the largest n export versions.
 
@@ -97,6 +103,7 @@ def largest_export_versions(n):
   return keep
 
 
+@deprecated(None, 'Please implement your own file management or use Saver.')
 def one_of_every_n_export_versions(n):
   """Creates a filter that keeps one of every n export versions.
 
@@ -128,6 +135,7 @@ def one_of_every_n_export_versions(n):
   return keep
 
 
+@deprecated(None, 'Please implement your own file management or use Saver.')
 def mod_export_version(n):
   """Creates a filter that keeps every export that is a multiple of n.
 
@@ -146,6 +154,7 @@ def mod_export_version(n):
   return keep
 
 
+@deprecated(None, 'Please implement your own file management or use Saver.')
 def union(lf, rf):
   """Creates a filter that keeps the union of two filters.
 
@@ -163,6 +172,7 @@ def union(lf, rf):
   return keep
 
 
+@deprecated(None, 'Please implement your own file management or use Saver.')
 def negation(f):
   """Negate a filter.
 
@@ -179,6 +189,7 @@ def negation(f):
   return keep
 
 
+@deprecated(None, 'Please implement your own file name management.')
 def get_paths(base_dir, parser):
   """Gets a list of Paths in a given directory.
 
diff --git a/tensorflow/contrib/learn/python/learn/utils/input_fn_utils.py b/tensorflow/contrib/learn/python/learn/utils/input_fn_utils.py
index b2521933e524e7ec24d73d4b5171f33e507dd88c..b92eb9fea8b7ccea56c781df74dcfa1cc5508e48 100644
--- a/tensorflow/contrib/learn/python/learn/utils/input_fn_utils.py
+++ b/tensorflow/contrib/learn/python/learn/utils/input_fn_utils.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Utilities for creating input_fns.
+"""Utilities for creating input_fns (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 Contents of this file are moved to tensorflow/python/estimator/export.py.
 InputFnOps is renamed to ServingInputReceiver.
@@ -32,13 +36,17 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import parsing_ops
+from tensorflow.python.util.deprecation import deprecated
 
 
 class InputFnOps(collections.namedtuple('InputFnOps',
                                         ['features',
                                          'labels',
                                          'default_inputs'])):
-  """A return type for an input_fn.
+  """A return type for an input_fn (deprecated).
+
+  THIS CLASS IS DEPRECATED. Please use tf.estimator.export.ServingInputReceiver
+  instead.
 
   This return type is currently only supported for serving input_fn.
   Training and eval input_fn should return a `(features, labels)` tuple.
@@ -56,6 +64,8 @@ class InputFnOps(collections.namedtuple('InputFnOps',
   """
 
 
+@deprecated(None, 'Please use '
+            'tf.estimator.export.build_parsing_serving_input_receiver_fn.')
 def build_parsing_serving_input_fn(feature_spec, default_batch_size=None):
   """Build an input_fn appropriate for serving, expecting fed tf.Examples.
 
@@ -84,6 +94,8 @@ def build_parsing_serving_input_fn(feature_spec, default_batch_size=None):
   return input_fn
 
 
+@deprecated(None, 'Please use '
+            'tf.estimator.export.build_raw_serving_input_receiver_fn.')
 def build_default_serving_input_fn(features, default_batch_size=None):
   """Build an input_fn appropriate for serving, expecting feature Tensors.
 
diff --git a/tensorflow/contrib/learn/python/learn/utils/inspect_checkpoint.py b/tensorflow/contrib/learn/python/learn/utils/inspect_checkpoint.py
index 6a63fb545a56e6040b0b0c3bbb6a17cd96925895..6dbaa15f8391b0044be8e30ca191753beb88db93 100644
--- a/tensorflow/contrib/learn/python/learn/utils/inspect_checkpoint.py
+++ b/tensorflow/contrib/learn/python/learn/utils/inspect_checkpoint.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""A simple script for inspect checkpoint files."""
+"""A simple script for inspecting checkpoint files (deprecated)."""
 
 from __future__ import absolute_import
 from __future__ import division
diff --git a/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py b/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
index 1593380007b2799fb1d17e92408ab19a7b47fe1e..213619a1877d898dc7c55f6b8c340df5c1afbf27 100644
--- a/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
+++ b/tensorflow/contrib/learn/python/learn/utils/saved_model_export_utils.py
@@ -12,7 +12,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Utilities supporting export to SavedModel.
+"""Utilities supporting export to SavedModel (deprecated).
+
+This module and all its submodules are deprecated. See
+[contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+for migration instructions.
 
 Some contents of this file are moved to tensorflow/python/estimator/export.py:
 
@@ -52,8 +56,9 @@ from tensorflow.python.saved_model import signature_constants
 from tensorflow.python.saved_model import signature_def_utils
 from tensorflow.python.summary import summary_iterator
 from tensorflow.python.training import saver
-
 from tensorflow.python.util import compat
+from tensorflow.python.util.deprecation import deprecated
+
 
 # A key for use in the input_alternatives dict indicating the default input.
 # This is the input that will be expected when a serving request does not
@@ -77,6 +82,7 @@ FEATURES_INPUT_ALTERNATIVE_KEY = 'features_input_alternative'
 _FALLBACK_DEFAULT_OUTPUT_ALTERNATIVE_KEY = 'default_output_alternative'
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def build_standardized_signature_def(input_tensors, output_tensors,
                                      problem_type):
   """Build a SignatureDef using problem type and input and output Tensors.
@@ -156,6 +162,7 @@ def _is_regression_problem(problem_type, input_tensors, output_tensors):
           len(input_tensors) == 1 and len(output_tensors) == 1)
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def get_input_alternatives(input_ops):
   """Obtain all input alternatives using the input_fn output and heuristics."""
   input_alternatives = {}
@@ -181,6 +188,7 @@ def get_input_alternatives(input_ops):
   return input_alternatives, features
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def get_output_alternatives(model_fn_ops, default_output_alternative_key=None):
   """Obtain all output alternatives using the model_fn output and heuristics.
 
@@ -246,6 +254,7 @@ def get_output_alternatives(model_fn_ops, default_output_alternative_key=None):
                        sorted(output_alternatives.keys())))
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def build_all_signature_defs(input_alternatives, output_alternatives,
                              actual_default_output_alternative_key):
   """Build `SignatureDef`s from all pairs of input and output alternatives."""
@@ -279,6 +288,7 @@ def build_all_signature_defs(input_alternatives, output_alternatives,
 MAX_DIRECTORY_CREATION_ATTEMPTS = 10
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def get_timestamped_export_dir(export_dir_base):
   """Builds a path to a new subdirectory within the base directory.
 
@@ -317,6 +327,7 @@ def get_timestamped_export_dir(export_dir_base):
                      '{} attempts.'.format(MAX_DIRECTORY_CREATION_ATTEMPTS))
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def get_temp_export_dir(timestamped_export_dir):
   """Builds a directory name based on the argument but starting with 'temp-'.
 
@@ -344,6 +355,7 @@ def _export_version_parser(path):
   return path._replace(export_version=int(filename))
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def get_most_recent_export(export_dir_base):
   """Locate the most recent SavedModel export in a directory of many exports.
 
@@ -363,6 +375,7 @@ def get_most_recent_export(export_dir_base):
   return next(iter(results or []), None)
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def garbage_collect_exports(export_dir_base, exports_to_keep):
   """Deletes older exports, retaining only a given number of the most recent.
 
@@ -387,6 +400,7 @@ def garbage_collect_exports(export_dir_base, exports_to_keep):
       logging.warn('Can not delete %s recursively: %s', p.path, e)
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def make_export_strategy(serving_input_fn,
                          default_output_alternative_key=None,
                          assets_extra=None,
@@ -469,6 +483,8 @@ def make_export_strategy(serving_input_fn,
   return export_strategy.ExportStrategy('Servo', export_fn, strip_default_attrs)
 
 
+@deprecated(None,
+            'Use tf.estimator.export.build_parsing_serving_input_receiver_fn')
 def make_parsing_export_strategy(feature_columns,
                                  default_output_alternative_key=None,
                                  assets_extra=None,
@@ -555,8 +571,14 @@ def _default_compare_fn(curr_best_eval_result, cand_eval_result):
 
 
 class BestModelSelector(object):
-  """A helper that keeps track of export selection candidates."""
+  """A helper that keeps track of export selection candidates.
+
+  THIS CLASS IS DEPRECATED. See
+  [contrib/learn/README.md](https://www.tensorflow.org/code/tensorflow/contrib/learn/README.md)
+  for general migration instructions.
+  """
 
+  @deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
   def __init__(self, event_file_pattern=None, compare_fn=None):
     """Constructor of this class.
 
@@ -622,6 +644,7 @@ class BestModelSelector(object):
     return best_eval_result
 
 
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def make_best_model_export_strategy(
     serving_input_fn,
     exports_to_keep=1,
@@ -707,6 +730,7 @@ def make_best_model_export_strategy(
 
 # TODO(b/67013778): Revisit this approach when corresponding changes to
 # TF Core are finalized.
+@deprecated(None, 'Switch to tf.estimator.Exporter and associated utilities.')
 def extend_export_strategy(base_export_strategy,
                            post_export_fn,
                            post_export_name=None):
diff --git a/tensorflow/contrib/lite/arena_planner.cc b/tensorflow/contrib/lite/arena_planner.cc
index 87b17c338e7afc33d32dd9688cc0825ac319dd19..8e47e2375e2e306c345a2b6caa2411abd9b3ceb0 100644
--- a/tensorflow/contrib/lite/arena_planner.cc
+++ b/tensorflow/contrib/lite/arena_planner.cc
@@ -128,6 +128,11 @@ TfLiteStatus ArenaPlanner::PlanAllocations() {
 }
 
 TfLiteStatus ArenaPlanner::ExecuteAllocations(int first_node, int last_node) {
+  // Grow the size of `allocs_` if necessary. This allows allocating temporary
+  // tensors in an op's `prepare` function.
+  TF_LITE_ENSURE(context_, graph_info_->num_tensors() >= allocs_.size());
+  allocs_.resize(graph_info_->num_tensors());
+
   TF_LITE_ENSURE_STATUS(CalculateAllocations(first_node, last_node));
   TF_LITE_ENSURE_STATUS(Commit());
 
diff --git a/tensorflow/contrib/lite/builtin_ops.h b/tensorflow/contrib/lite/builtin_ops.h
index 4ebd1586de791eecf0304637bde76232d9f0a11d..88cdf1d46312f1e610825f23f3d8d357b0762bac 100644
--- a/tensorflow/contrib/lite/builtin_ops.h
+++ b/tensorflow/contrib/lite/builtin_ops.h
@@ -23,6 +23,9 @@ limitations under the License.
 extern "C" {
 #endif  // __cplusplus
 
+// The enum for builtin operators.
+// Note: CUSTOM and DELEGATE are 2 special ops which are not real builtin
+// ops.
 typedef enum {
   kTfLiteBuiltinAdd = 0,
   kTfLiteBuiltinAveragePool2d = 1,
@@ -71,6 +74,9 @@ typedef enum {
   kTfLiteBuiltinExp = 47,
   kTfLiteBuiltinTopkV2 = 48,
   kTfLiteBuiltinSplit = 49,
+  kTfLiteBuiltinLogSoftmax = 50,
+  kTfLiteBuiltinDelegate = 51,
+  kTfLiteBuiltinBidirectionalSequenceLstm = 52,
 } TfLiteBuiltinOperator;
 
 #ifdef __cplusplus
diff --git a/tensorflow/contrib/lite/context.h b/tensorflow/contrib/lite/context.h
index b0c4d3431f9a67bc87d51ada91ed73f1661023a2..ed7f4515fa4437d61a37be93616c28a046295c5a 100644
--- a/tensorflow/contrib/lite/context.h
+++ b/tensorflow/contrib/lite/context.h
@@ -258,7 +258,7 @@ typedef struct TfLiteContext {
   TfLiteStatus (*GetExecutionPlan)(struct TfLiteContext* context,
                                    TfLiteIntArray** execution_plan);
 
-  // An tensor of tensors in the interpreter context (of length `tensors_size`)
+  // An array of tensors in the interpreter context (of length `tensors_size`)
   TfLiteTensor* tensors;
 
   // opaque full context ptr (an opaque c++ data structure)
@@ -283,7 +283,8 @@ typedef struct TfLiteContext {
                                          TfLiteNode** node,
                                          TfLiteRegistration** registration);
 
-  // Replace ops with delegate.
+  // Replace ops with one or more stub delegate operations. This function
+  // does not take ownership of `nodes_to_replace`.
   TfLiteStatus (*ReplaceSubgraphsWithDelegateKernels)(
       struct TfLiteContext*, TfLiteRegistration registration,
       const TfLiteIntArray* nodes_to_replace);
diff --git a/tensorflow/contrib/lite/interpreter.cc b/tensorflow/contrib/lite/interpreter.cc
index 028449211b8108d004df4d1cd8a58b4a08df6604..0f5e17f0de0d828771e1fdbeac0e172f2ed9159c 100644
--- a/tensorflow/contrib/lite/interpreter.cc
+++ b/tensorflow/contrib/lite/interpreter.cc
@@ -25,13 +25,7 @@ limitations under the License.
 #include "tensorflow/contrib/lite/kernels/gemm_support.h"
 #include "tensorflow/contrib/lite/memory_planner.h"
 #include "tensorflow/contrib/lite/nnapi_delegate.h"
-
-namespace {
-
-// std::vector preallocation tuning.
-constexpr const int kSlotsToReserve = 128;
-
-}  // namespace
+#include "tensorflow/contrib/lite/schema/schema_generated.h"
 
 namespace tflite {
 
@@ -84,8 +78,8 @@ Interpreter::Interpreter(ErrorReporter* error_reporter)
   context_.GetExecutionPlan = nullptr;
 
   // Reserve some space for the tensors to avoid excessive resizing.
-  tensors_.reserve(kSlotsToReserve);
-  nodes_and_registration_.reserve(kSlotsToReserve);
+  tensors_.reserve(kTensorsReservedCapacity);
+  nodes_and_registration_.reserve(kTensorsReservedCapacity);
   next_execution_plan_index_to_prepare_ = 0;
   UseNNAPI(false);
 }
@@ -115,6 +109,9 @@ TfLiteStatus Interpreter::ReplaceSubgraphsWithDelegateKernels(
 
 TfLiteStatus Interpreter::ReplaceSubgraphsWithDelegateKernels(
     TfLiteRegistration registration, const TfLiteIntArray* nodes_to_replace) {
+  // Annotate the registration as DELEGATE op.
+  registration.builtin_code = BuiltinOperator_DELEGATE;
+
   // Analyze the graph to find all independent subgraphs that are either
   // fully not-this-delegate or this-delegate computation.
   InterpreterInfo info(this);
@@ -298,7 +295,20 @@ TfLiteStatus Interpreter::AddNodeWithParameters(
         OpInit(*registration,
                reinterpret_cast<const char*>(builtin_data_deleter.get()), 0);
   }
+
   node.builtin_data = builtin_data_deleter.release();
+  // TODO(ycling): Filling `custom_initial_data` and `custom_initial_data_size`
+  // properly for nodes generated by ReplaceSubgraphsWithDelegateKernels.
+  if (registration->builtin_code == BuiltinOperator_CUSTOM) {
+    // When it's a CUSTOM op, the `custom_options` field in the Flatbuffer
+    // `Operator` table is passed in.
+    node.custom_initial_data = init_data;
+    node.custom_initial_data_size = init_data_size;
+  } else {
+    node.custom_initial_data = nullptr;
+    node.custom_initial_data_size = 0;
+  }
+
   node_and_reg.second = *registration;
   execution_plan_.push_back(new_node_index);
   return kTfLiteOk;
@@ -336,6 +346,7 @@ TfLiteStatus Interpreter::PrepareOpsStartingAt(
     TfLiteNode& node = nodes_and_registration_[node_index].first;
     const TfLiteRegistration& registration =
         nodes_and_registration_[node_index].second;
+    EnsureTensorsVectorCapacity();
     if (OpPrepare(registration, &node) == kTfLiteError) {
       return kTfLiteError;
     }
@@ -413,6 +424,7 @@ TfLiteStatus Interpreter::Invoke() {
     TfLiteNode& node = nodes_and_registration_[node_index].first;
     const TfLiteRegistration& registration =
         nodes_and_registration_[node_index].second;
+    EnsureTensorsVectorCapacity();
     if (OpInvoke(registration, &node) == kTfLiteError) {
       status = kTfLiteError;
     }
diff --git a/tensorflow/contrib/lite/interpreter.h b/tensorflow/contrib/lite/interpreter.h
index bab56a9d72f8992a9d8af23f92133c7c918fd46d..04c19644a026bff0f3693f7b05832393bafd0324 100644
--- a/tensorflow/contrib/lite/interpreter.h
+++ b/tensorflow/contrib/lite/interpreter.h
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/error_reporter.h"
 #include "tensorflow/contrib/lite/memory_planner.h"
+#include "tensorflow/contrib/lite/schema/schema_generated.h"
 
 namespace tflite {
 
@@ -258,6 +259,20 @@ class Interpreter {
   // contain new nodes that replace 1 more nodes.
   TfLiteStatus ModifyGraphWithDelegate(TfLiteDelegate* delegate);
 
+  // WARNING: This is a deprecated interface and will be removed as soon as
+  // possible.  Please do not use it.
+  // TODO(impjdi): Remove this interface after resolving dependencies.
+  void set_model(const Model* model) { model_ = const_cast<Model*>(model); }
+  Model* model() const { return model_; }
+
+  // The default capacity of `tensors_` vector.
+  static constexpr int kTensorsReservedCapacity = 128;
+  // The capacity headroom of `tensors_` vector before calling ops'
+  // `prepare` and `invoke` function. In these functions, it's guaranteed
+  // that allocating up to `kTensorsCapacityHeadroom` more tensors won't
+  // invalidate pointers to existing tensors.
+  static constexpr int kTensorsCapacityHeadroom = 16;
+
  private:
   // Give 'op_reg' a chance to initialize itself using the contents of
   // 'buffer'.
@@ -370,6 +385,18 @@ class Interpreter {
   static TfLiteStatus GetExecutionPlan(struct TfLiteContext* context,
                                        TfLiteIntArray** execution_plan);
 
+  // Ensures that `tensors_` has at least `kTensorsCapacityHeadroom` extra
+  // capacity. Calling this function may invalidate existing pointers to
+  // tensors. After calling this function, adding `kTensorsCapacityHeadroom`
+  // more tensors won't invalidate the pointer to existing tensors.
+  void EnsureTensorsVectorCapacity() {
+    const int required_capacity = tensors_size() + kTensorsCapacityHeadroom;
+    if (required_capacity > tensors_.capacity()) {
+      tensors_.reserve(required_capacity);
+      context_.tensors = tensors_.data();
+    }
+  }
+
   // A pure C data structure used to communicate with the pure C plugin
   // interface. To avoid copying tensor metadata, this is also the definitive
   // structure to store tensors.
@@ -425,6 +452,11 @@ class Interpreter {
   std::unique_ptr<NNAPIDelegate> nnapi_delegate_;
 
   std::unique_ptr<MemoryPlanner> memory_planner_;
+
+  // WARNING: This is a deprecated interface and will be removed as soon as
+  // possible.  Please do not use it.
+  // TODO(impjdi): Remove this interface after resolving dependencies.
+  Model* model_ = nullptr;
 };
 
 }  // namespace tflite
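
The headroom guarantee documented for `EnsureTensorsVectorCapacity()` above rests on a `std::vector` property: growing a vector up to its reserved capacity does not reallocate, so pointers to its elements stay valid. The following is a minimal sketch of that idea only. `TensorStore`, `Add`, `EnsureCapacity`, and `PointerStableAfterAdding` are hypothetical names invented for illustration and are not part of TensorFlow Lite.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the interpreter's tensor storage (not TFLite code).
struct TensorStore {
  // Mirrors the role of kTensorsCapacityHeadroom in the patch.
  static constexpr std::size_t kHeadroom = 16;

  std::vector<int> tensors;
  const int* data_ptr = nullptr;  // Plays the role of context_.tensors.

  // Reserves enough capacity that kHeadroom more elements fit without a
  // reallocation, then refreshes the cached base pointer.
  void EnsureCapacity() {
    const std::size_t required = tensors.size() + kHeadroom;
    if (required > tensors.capacity()) {
      tensors.reserve(required);
    }
    data_ptr = tensors.data();
  }

  // Appends `count` default-initialized elements and refreshes the pointer.
  void Add(std::size_t count) {
    assert(count > 0);
    tensors.resize(tensors.size() + count);
    data_ptr = tensors.data();
  }
};

// Returns true if adding `count` elements after EnsureCapacity() left the
// base pointer unchanged.
bool PointerStableAfterAdding(std::size_t initial, std::size_t count) {
  TensorStore store;
  store.Add(initial);
  store.EnsureCapacity();
  const int* before = store.data_ptr;
  store.Add(count);
  return before == store.data_ptr;
}
```

Adding up to `kHeadroom` elements keeps the pointer stable by the standard's reallocation rules. Adding more than that usually moves the storage, which is what `TestExceedHeadroom` in the patch expects, although the standard allows `reserve` to allocate extra capacity.
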
diff --git a/tensorflow/contrib/lite/interpreter_test.cc b/tensorflow/contrib/lite/interpreter_test.cc
index 28c96e5dde6ffa62bb073db9716a00f91c6e0bdf..2e6727b32361ab771354a3954e5e4d8f9fa833a5 100644
--- a/tensorflow/contrib/lite/interpreter_test.cc
+++ b/tensorflow/contrib/lite/interpreter_test.cc
@@ -561,6 +561,46 @@ TEST(BasicInterpreter, TestCustomErrorReporter) {
   ASSERT_EQ(reporter.calls, 1);
 }
 
+TEST(InterpreterTensorsCapacityTest, TestWithinHeadroom) {
+  Interpreter interpreter;
+  ASSERT_EQ(interpreter.AddTensors(Interpreter::kTensorsReservedCapacity),
+            kTfLiteOk);
+  TfLiteRegistration registration = {nullptr, nullptr, nullptr, nullptr};
+  registration.prepare = [](TfLiteContext* context, TfLiteNode* node) {
+    TfLiteTensor* first_tensor = context->tensors;
+
+    int new_tensor_index;
+    context->AddTensors(context, Interpreter::kTensorsCapacityHeadroom,
+                        &new_tensor_index);
+    EXPECT_EQ(first_tensor, context->tensors);
+    return kTfLiteOk;
+  };
+  ASSERT_EQ(interpreter.AddNodeWithParameters({0}, {1}, nullptr, 0, nullptr,
+                                              &registration),
+            kTfLiteOk);
+  ASSERT_EQ(interpreter.AllocateTensors(), kTfLiteOk);
+}
+
+TEST(InterpreterTensorsCapacityTest, TestExceedHeadroom) {
+  Interpreter interpreter;
+  ASSERT_EQ(interpreter.AddTensors(Interpreter::kTensorsReservedCapacity),
+            kTfLiteOk);
+  TfLiteRegistration registration = {nullptr, nullptr, nullptr, nullptr};
+  registration.prepare = [](TfLiteContext* context, TfLiteNode* node) {
+    TfLiteTensor* first_tensor = context->tensors;
+
+    int new_tensor_index;
+    context->AddTensors(context, Interpreter::kTensorsCapacityHeadroom + 1,
+                        &new_tensor_index);
+    EXPECT_NE(first_tensor, context->tensors);
+    return kTfLiteOk;
+  };
+  ASSERT_EQ(interpreter.AddNodeWithParameters({0}, {1}, nullptr, 0, nullptr,
+                                              &registration),
+            kTfLiteOk);
+  ASSERT_EQ(interpreter.AllocateTensors(), kTfLiteOk);
+}
+
 // Test fixture that allows playing with execution plans. It creates a two
 // node graph that can be executed in either [0,1] order or [1,0] order.
 // The CopyOp records when it is invoked in the class member run_order_
diff --git a/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/Camera2BasicFragment.java b/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/Camera2BasicFragment.java
index d2048b41b1e76fe42c919c9b889df5be8a94957f..9b9fdffab557060f0211a0ce361b002cc7d03956 100644
--- a/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/Camera2BasicFragment.java
+++ b/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/Camera2BasicFragment.java
@@ -659,8 +659,7 @@ public class Camera2BasicFragment extends Fragment
       showToast("Uninitialized Classifier or invalid context.");
       return;
     }
-    Bitmap bitmap =
-        textureView.getBitmap(classifier.getImageSizeX(), classifier.getImageSizeY());
+    Bitmap bitmap = textureView.getBitmap(classifier.getImageSizeX(), classifier.getImageSizeY());
     String textToShow = classifier.classifyFrame(bitmap);
     bitmap.recycle();
     showToast(textToShow);
diff --git a/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/ImageClassifier.java b/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/ImageClassifier.java
index c319bff9f11546ac4d49c6b34c5ecdbc41547d58..c57bb348c5b386a59327c7b1bc769717ca755269 100644
--- a/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/ImageClassifier.java
+++ b/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/ImageClassifier.java
@@ -88,8 +88,11 @@ public abstract class ImageClassifier {
     labelList = loadLabelList(activity);
     imgData =
         ByteBuffer.allocateDirect(
-                DIM_BATCH_SIZE * getImageSizeX() * getImageSizeY() * DIM_PIXEL_SIZE *
-                        getNumBytesPerChannel());
+            DIM_BATCH_SIZE
+                * getImageSizeX()
+                * getImageSizeY()
+                * DIM_PIXEL_SIZE
+                * getNumBytesPerChannel());
     imgData.order(ByteOrder.nativeOrder());
     filterLabelProbArray = new float[FILTER_STAGES][getNumLabels()];
     Log.d(TAG, "Created a Tensorflow Lite Image Classifier.");
@@ -208,44 +211,50 @@ public abstract class ImageClassifier {
 
   /**
    * Get the name of the model file stored in Assets.
+   *
    * @return
    */
   protected abstract String getModelPath();
 
   /**
    * Get the name of the label file stored in Assets.
+   *
    * @return
    */
   protected abstract String getLabelPath();
 
   /**
    * Get the image size along the x axis.
+   *
    * @return
    */
   protected abstract int getImageSizeX();
 
   /**
    * Get the image size along the y axis.
+   *
    * @return
    */
   protected abstract int getImageSizeY();
 
   /**
    * Get the number of bytes that is used to store a single color channel value.
+   *
    * @return
    */
   protected abstract int getNumBytesPerChannel();
 
   /**
    * Add pixelValue to byteBuffer.
+   *
    * @param pixelValue
    */
   protected abstract void addPixelValue(int pixelValue);
 
   /**
-   * Read the probability value for the specified label
-   * This is either the original value as it was read from the net's output or the updated value
-   * after the filter was applied.
+   * Read the probability value for the specified label. This is either the original value as it
+   * was read from the net's output or the updated value after the filter was applied.
+   *
    * @param labelIndex
    * @return
    */
@@ -253,29 +262,32 @@ public abstract class ImageClassifier {
 
   /**
    * Set the probability value for the specified label.
+   *
    * @param labelIndex
    * @param value
    */
   protected abstract void setProbability(int labelIndex, Number value);
 
   /**
-   * Get the normalized probability value for the specified label.
-   * This is the final value as it will be shown to the user.
+   * Get the normalized probability value for the specified label. This is the final value as it
+   * will be shown to the user.
+   *
    * @return
    */
   protected abstract float getNormalizedProbability(int labelIndex);
 
   /**
-   * Run inference using the prepared input in {@link #imgData}.
-   * Afterwards, the result will be provided by getProbability().
+   * Run inference using the prepared input in {@link #imgData}. Afterwards, the result will be
+   * provided by getProbability().
    *
-   * This additional method is necessary, because we don't have a common base for different
+   * <p>This additional method is necessary, because we don't have a common base for different
    * primitive data types.
    */
   protected abstract void runInference();
 
   /**
    * Get the total number of labels.
+   *
    * @return
    */
   protected int getNumLabels() {
diff --git a/tensorflow/contrib/lite/java/proguard.flags b/tensorflow/contrib/lite/java/proguard.flags
new file mode 100644
index 0000000000000000000000000000000000000000..8ee3d7e7ae728b27789336ac56208acdf13ee424
--- /dev/null
+++ b/tensorflow/contrib/lite/java/proguard.flags
@@ -0,0 +1,3 @@
+-keepclassmembers class org.tensorflow.lite.NativeInterpreterWrapper {
+    private long inferenceDurationNanoseconds;
+}
\ No newline at end of file
diff --git a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
index dd883d69d2065236ee29012b9bde99972aefbcf7..9e47e921a6a62527345984df7c6112cd38e7ea73 100644
--- a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
+++ b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/Interpreter.java
@@ -80,6 +80,9 @@ public final class Interpreter implements AutoCloseable {
   /**
    * Runs model inference if the model takes only one input, and provides only one output.
    *
+   * <p>Warning: The API runs much faster if {@link ByteBuffer} is used as input data type. Please
+   * consider using {@link ByteBuffer} to feed input data for better performance.
+   *
    * @param input an array or multidimensional array, or a {@link ByteBuffer} of primitive types
    *     including int, float, long, and byte. {@link ByteBuffer} is the preferred way to pass large
    *     input data. When {@link ByteBuffer} is used, its content should remain unchanged until
@@ -96,6 +99,9 @@ public final class Interpreter implements AutoCloseable {
   /**
    * Runs model inference if the model takes multiple inputs, or returns multiple outputs.
    *
+   * <p>Warning: The API runs much faster if {@link ByteBuffer} is used as input data type. Please
+   * consider using {@link ByteBuffer} to feed input data for better performance.
+   *
    * @param inputs an array of input data. The inputs should be in the same order as inputs of the
    *     model. Each input can be an array or multidimensional array, or a {@link ByteBuffer} of
    *     primitive types including int, float, long, and byte. {@link ByteBuffer} is the preferred
@@ -161,6 +167,27 @@ public final class Interpreter implements AutoCloseable {
     return wrapper.getOutputIndex(opName);
   }
 
+  /**
+   * Returns native inference timing.
+   * <p>IllegalStateException will be thrown if the {@link Interpreter} has already been
+   * closed.
+   */
+  public Long getLastNativeInferenceDurationNanoseconds() {
+    if (wrapper == null) {
+      throw new IllegalStateException("The interpreter has already been closed.");
+    }
+    return wrapper.getLastNativeInferenceDurationNanoseconds();
+  }
+
+  /** Turns on/off Android NNAPI for hardware acceleration when it is available. */
+  public void setUseNNAPI(boolean useNNAPI) {
+    if (wrapper != null) {
+      wrapper.setUseNNAPI(useNNAPI);
+    } else {
+      throw new IllegalStateException("NativeInterpreterWrapper has already been closed.");
+    }
+  }
+
   /** Release resources associated with the {@code Interpreter}. */
   @Override
   public void close() {
diff --git a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java
index 5ee594dec492ad2fee22e603a6de311b3fed4cac..bca4a3cae603cd05635ff1b5a22f58c04180e8bf 100644
--- a/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java
+++ b/tensorflow/contrib/lite/java/src/main/java/org/tensorflow/lite/NativeInterpreterWrapper.java
@@ -35,6 +35,7 @@ final class NativeInterpreterWrapper implements AutoCloseable {
     errorHandle = createErrorReporter(ERROR_BUFFER_SIZE);
     modelHandle = createModel(modelPath, errorHandle);
     interpreterHandle = createInterpreter(modelHandle, errorHandle);
+    isMemoryAllocated = true;
   }
 
   /**
@@ -47,6 +48,7 @@ final class NativeInterpreterWrapper implements AutoCloseable {
     errorHandle = createErrorReporter(ERROR_BUFFER_SIZE);
     modelHandle = createModelWithBuffer(modelByteBuffer, errorHandle);
     interpreterHandle = createInterpreter(modelHandle, errorHandle);
+    isMemoryAllocated = true;
   }
 
   /** Releases resources associated with this {@code NativeInterpreterWrapper}. */
@@ -59,6 +61,7 @@ final class NativeInterpreterWrapper implements AutoCloseable {
     modelByteBuffer = null;
     inputsIndexes = null;
     outputsIndexes = null;
+    isMemoryAllocated = false;
   }
 
   /** Sets inputs, runs model inference and returns outputs. */
@@ -91,11 +94,21 @@ final class NativeInterpreterWrapper implements AutoCloseable {
                 i, inputs.length));
       }
     }
+    inferenceDurationNanoseconds = -1;
     long[] outputsHandles =
-        run(interpreterHandle, errorHandle, sizes, dataTypes, numsOfBytes, inputs);
+        run(
+            interpreterHandle,
+            errorHandle,
+            sizes,
+            dataTypes,
+            numsOfBytes,
+            inputs,
+            this,
+            isMemoryAllocated);
     if (outputsHandles == null || outputsHandles.length == 0) {
       throw new IllegalStateException("Interpreter has no outputs.");
     }
+    isMemoryAllocated = true;
     Tensor[] outputs = new Tensor[outputsHandles.length];
     for (int i = 0; i < outputsHandles.length; ++i) {
       outputs[i] = Tensor.fromHandle(outputsHandles[i]);
@@ -109,14 +122,18 @@ final class NativeInterpreterWrapper implements AutoCloseable {
       Object[] sizes,
       int[] dtypes,
       int[] numsOfBytes,
-      Object[] values);
+      Object[] values,
+      NativeInterpreterWrapper wrapper,
+      boolean memoryAllocated);
 
   /** Resizes dimensions of a specific input. */
   void resizeInput(int idx, int[] dims) {
-    resizeInput(interpreterHandle, errorHandle, idx, dims);
+    if (resizeInput(interpreterHandle, errorHandle, idx, dims)) {
+      isMemoryAllocated = false;
+    }
   }
 
-  private static native void resizeInput(
+  private static native boolean resizeInput(
       long interpreterHandle, long errorHandle, int inputIdx, int[] dims);
 
   void setUseNNAPI(boolean useNNAPI) {
@@ -236,6 +253,14 @@ final class NativeInterpreterWrapper implements AutoCloseable {
     }
   }
 
+  /**
+   * Gets the last inference duration in nanoseconds. It returns null if there is no previous
+   * inference run or the last inference run failed.
+   */
+  Long getLastNativeInferenceDurationNanoseconds() {
+    return (inferenceDurationNanoseconds < 0) ? null : inferenceDurationNanoseconds;
+  }
+
   private static final int ERROR_BUFFER_SIZE = 512;
 
   private long errorHandle;
@@ -246,12 +271,16 @@ final class NativeInterpreterWrapper implements AutoCloseable {
 
   private int inputSize;
 
+  private long inferenceDurationNanoseconds = -1;
+
   private MappedByteBuffer modelByteBuffer;
 
   private Map<String, Integer> inputsIndexes;
 
   private Map<String, Integer> outputsIndexes;
 
+  private boolean isMemoryAllocated = false;
+
   private static native String[] getInputNames(long interpreterHandle);
 
   private static native String[] getOutputNames(long interpreterHandle);
diff --git a/tensorflow/contrib/lite/java/src/main/native/duration_utils_jni.cc b/tensorflow/contrib/lite/java/src/main/native/duration_utils_jni.cc
new file mode 100644
index 0000000000000000000000000000000000000000..0e08a04370592f6e3c92b5811fa7e163f808e03c
--- /dev/null
+++ b/tensorflow/contrib/lite/java/src/main/native/duration_utils_jni.cc
@@ -0,0 +1,38 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <jni.h>
+#include <time.h>
+
+namespace tflite {
+
+// Gets the current time from the monotonic clock as a timespec.
+timespec getCurrentTime() {
+  timespec time;
+  clock_gettime(CLOCK_MONOTONIC, &time);
+  return time;
+}
+
+// Computes the time diff from two timespecs. Returns '-1' if 'stop' is earlier
+// than 'start'.
+jlong timespec_diff_nanoseconds(struct timespec* start, struct timespec* stop) {
+  jlong result = stop->tv_sec - start->tv_sec;
+  if (result < 0) return -1;
+  result = 1000000000 * result + (stop->tv_nsec - start->tv_nsec);
+  if (result < 0) return -1;
+  return result;
+}
+
+}  // namespace tflite
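
The timing helpers above can be tried without JNI. The following is a minimal sketch, assuming a POSIX system that provides `clock_gettime`. `DiffNanoseconds` and `TimeNanoseconds` are hypothetical names for illustration, and `std::int64_t` stands in for `jlong`.

```cpp
#include <time.h>

#include <cstdint>

// Nanoseconds from `start` to `stop`, or -1 if `stop` is earlier than `start`.
// Assumes both timespecs are normalized (0 <= tv_nsec < 1e9).
std::int64_t DiffNanoseconds(const timespec& start, const timespec& stop) {
  const std::int64_t seconds =
      static_cast<std::int64_t>(stop.tv_sec) - start.tv_sec;
  if (seconds < 0) return -1;
  const std::int64_t result =
      seconds * 1000000000LL + (stop.tv_nsec - start.tv_nsec);
  return result < 0 ? -1 : result;
}

// Times a callable on the monotonic clock, the way the patch brackets
// Invoke() with two getCurrentTime() calls.
template <typename Fn>
std::int64_t TimeNanoseconds(Fn&& work) {
  timespec before;
  timespec after;
  clock_gettime(CLOCK_MONOTONIC, &before);
  work();
  clock_gettime(CLOCK_MONOTONIC, &after);
  return DiffNanoseconds(before, after);
}
```

The nanosecond difference can be negative while the second difference is positive, for example 1.9 s to 2.1 s, which is why the seconds are scaled and added before the final sign check.
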
diff --git a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc
index c346f9f92e360c0722ebac440d790da6441ceecf..475b467face2736f9a1b8f9e713fe2d86b9fa77e 100644
--- a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc
+++ b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.cc
@@ -149,6 +149,45 @@ TfLiteStatus checkInputs(JNIEnv* env, tflite::Interpreter* interpreter,
   return kTfLiteOk;
 }
 
+// Checks whether the dimensions of a tensor differ from the given
+// dimensions. Returns true if they differ, else false.
+bool areDimsDifferent(JNIEnv* env, TfLiteTensor* tensor, jintArray dims) {
+  int num_dims = static_cast<int>(env->GetArrayLength(dims));
+  jint* ptr = env->GetIntArrayElements(dims, nullptr);
+  if (ptr == nullptr) {
+    throwException(env, kIllegalArgumentException,
+                   "Empty dimensions of input array.");
+    return true;
+  }
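+  // Compare first, then release on every path so that a mismatch does not
+  // leak the elements obtained from GetIntArrayElements.
+  bool different = tensor->dims->size != num_dims;
+  for (int i = 0; !different && i < num_dims; ++i) {
+    if (ptr[i] != tensor->dims->data[i]) {
+      different = true;
+    }
+  }
+  env->ReleaseIntArrayElements(dims, ptr, JNI_ABORT);
+  return different;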
+}
+
+bool areInputDimensionsTheSame(JNIEnv* env, tflite::Interpreter* interpreter,
+                               int input_size, jobjectArray sizes) {
+  if (interpreter->inputs().size() != input_size) {
+    return false;
+  }
+  for (int i = 0; i < input_size; ++i) {
+    int input_idx = interpreter->inputs()[i];
+    jintArray dims =
+        static_cast<jintArray>(env->GetObjectArrayElement(sizes, i));
+    TfLiteTensor* target = interpreter->tensor(input_idx);
+    if (areDimsDifferent(env, target, dims)) return false;
+    env->DeleteLocalRef(dims);
+    if (env->ExceptionCheck()) return false;
+  }
+  return true;
+}
+
 TfLiteStatus resizeInputs(JNIEnv* env, tflite::Interpreter* interpreter,
                           int input_size, jobjectArray sizes) {
   for (int i = 0; i < input_size; ++i) {
@@ -344,6 +383,15 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_createInterpreter(
     throwException(env, kIllegalArgumentException,
                    "Cannot create interpreter: %s",
                    error_reporter->CachedErrorMessage());
+    return 0;
+  }
+  // allocates memory
+  status = interpreter->AllocateTensors();
+  if (status != kTfLiteOk) {
+    throwException(env, kNullPointerException,
+                   "Can not allocate memory for the interpreter: %s",
+                   error_reporter->CachedErrorMessage());
+    return 0;
   }
   return reinterpret_cast<jlong>(interpreter.release());
 }
@@ -353,7 +401,7 @@ JNIEXPORT jlongArray JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
     JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle,
     jobjectArray sizes, jintArray data_types, jintArray nums_of_bytes,
-    jobjectArray values) {
+    jobjectArray values, jobject wrapper, jboolean memory_allocated) {
   tflite::Interpreter* interpreter =
       convertLongToInterpreter(env, interpreter_handle);
   if (interpreter == nullptr) return nullptr;
@@ -365,25 +413,29 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
   TfLiteStatus status = checkInputs(env, interpreter, input_size, data_types,
                                     nums_of_bytes, values, sizes);
   if (status != kTfLiteOk) return nullptr;
-  // resizes inputs
-  status = resizeInputs(env, interpreter, input_size, sizes);
-  if (status != kTfLiteOk) {
-    throwException(env, kNullPointerException, "Can not resize the input: %s",
-                   error_reporter->CachedErrorMessage());
-    return nullptr;
-  }
-  // allocates memory
-  status = interpreter->AllocateTensors();
-  if (status != kTfLiteOk) {
-    throwException(env, kNullPointerException,
-                   "Can not allocate memory for the given inputs: %s",
-                   error_reporter->CachedErrorMessage());
-    return nullptr;
+  if (!memory_allocated ||
+      !areInputDimensionsTheSame(env, interpreter, input_size, sizes)) {
+    // resizes inputs
+    status = resizeInputs(env, interpreter, input_size, sizes);
+    if (status != kTfLiteOk) {
+      throwException(env, kNullPointerException, "Can not resize the input: %s",
+                     error_reporter->CachedErrorMessage());
+      return nullptr;
+    }
+    // allocates memory
+    status = interpreter->AllocateTensors();
+    if (status != kTfLiteOk) {
+      throwException(env, kNullPointerException,
+                     "Can not allocate memory for the given inputs: %s",
+                     error_reporter->CachedErrorMessage());
+      return nullptr;
+    }
   }
   // sets inputs
   status = setInputs(env, interpreter, input_size, data_types, nums_of_bytes,
                      values);
   if (status != kTfLiteOk) return nullptr;
+  timespec beforeInference = ::tflite::getCurrentTime();
   // runs inference
   if (interpreter->Invoke() != kTfLiteOk) {
     throwException(env, kIllegalArgumentException,
@@ -391,6 +443,17 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
                    error_reporter->CachedErrorMessage());
     return nullptr;
   }
+  timespec afterInference = ::tflite::getCurrentTime();
+  jclass wrapper_clazz = env->GetObjectClass(wrapper);
+  jfieldID fid =
+      env->GetFieldID(wrapper_clazz, "inferenceDurationNanoseconds", "J");
+  if (env->ExceptionCheck()) {
+    env->ExceptionClear();
+  } else if (fid != nullptr) {
+    env->SetLongField(
+        wrapper, fid,
+        ::tflite::timespec_diff_nanoseconds(&beforeInference, &afterInference));
+  }
   // returns outputs
   const std::vector<int>& results = interpreter->outputs();
   if (results.empty()) {
@@ -438,29 +501,37 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputDims(
   return outputs;
 }
 
-JNIEXPORT void JNICALL
+JNIEXPORT jboolean JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_resizeInput(
     JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle,
     jint input_idx, jintArray dims) {
   BufferErrorReporter* error_reporter =
       convertLongToErrorReporter(env, error_handle);
-  if (error_reporter == nullptr) return;
+  if (error_reporter == nullptr) return JNI_FALSE;
   tflite::Interpreter* interpreter =
       convertLongToInterpreter(env, interpreter_handle);
-  if (interpreter == nullptr) return;
+  if (interpreter == nullptr) return JNI_FALSE;
   const int idx = static_cast<int>(input_idx);
   if (idx < 0 || idx >= interpreter->inputs().size()) {
     throwException(env, kIllegalArgumentException,
                    "Can not resize %d-th input for a model having %d inputs.",
                    idx, interpreter->inputs().size());
+    return JNI_FALSE;
   }
-  TfLiteStatus status = interpreter->ResizeInputTensor(
-      interpreter->inputs()[idx], convertJIntArrayToVector(env, dims));
-  if (status != kTfLiteOk) {
-    throwException(env, kIllegalArgumentException,
-                   "Failed to resize %d-th input: %s", idx,
-                   error_reporter->CachedErrorMessage());
+  // check whether it is resizing with the same dimensions.
+  TfLiteTensor* target = interpreter->tensor(interpreter->inputs()[idx]);
+  bool is_changed = areDimsDifferent(env, target, dims);
+  if (is_changed) {
+    TfLiteStatus status = interpreter->ResizeInputTensor(
+        interpreter->inputs()[idx], convertJIntArrayToVector(env, dims));
+    if (status != kTfLiteOk) {
+      throwException(env, kIllegalArgumentException,
+                     "Failed to resize %d-th input: %s", idx,
+                     error_reporter->CachedErrorMessage());
+      return JNI_FALSE;
+    }
   }
+  return is_changed ? JNI_TRUE : JNI_FALSE;
 }
 
 JNIEXPORT void JNICALL Java_org_tensorflow_lite_NativeInterpreterWrapper_delete(
diff --git a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h
index c52a7e4e439936344be26d5761fb5747db64794a..f7c2d9bf82a90e0e3156d18ba0bcaa30265942fa 100644
--- a/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h
+++ b/tensorflow/contrib/lite/java/src/main/native/nativeinterpreterwrapper_jni.h
@@ -18,6 +18,7 @@ limitations under the License.
 
 #include <jni.h>
 #include <stdio.h>
+#include <time.h>
 #include <vector>
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/interpreter.h"
@@ -28,6 +29,9 @@ limitations under the License.
 namespace tflite {
 // This is to be provided at link-time by a library.
 extern std::unique_ptr<OpResolver> CreateOpResolver();
+extern timespec getCurrentTime();
+extern jlong timespec_diff_nanoseconds(struct timespec* start,
+                                       struct timespec* stop);
 }  // namespace tflite
 
 #ifdef __cplusplus
@@ -104,13 +108,14 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_createInterpreter(
 /*
  *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
  *  Method:
- *  Signature: (JJ[Ljava/lang/Object;[I[I[Ljava/lang/Object;)[J
+ *  Signature:
+ * (JJ[Ljava/lang/Object;[I[I[Ljava/lang/Object;Ljava/lang/Object;Z)[J
  */
 JNIEXPORT jlongArray JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_run(
     JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle,
     jobjectArray sizes, jintArray data_types, jintArray nums_of_bytes,
-    jobjectArray values);
+    jobjectArray values, jobject wrapper, jboolean memory_allocated);
 
 /*
  *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
@@ -127,11 +132,12 @@ Java_org_tensorflow_lite_NativeInterpreterWrapper_getInputDims(
 /*
  *  Class:     org_tensorflow_lite_NativeInterpreterWrapper
  *  Method:
- *  Signature: (JJI[I)
+ *  Signature: (JJI[I)Z
  *
- * It resizes dimensions of a input.
+ * Returns true if the input tensor was resized to different dimensions;
+ * returns false otherwise.
  */
-JNIEXPORT void JNICALL
+JNIEXPORT jboolean JNICALL
 Java_org_tensorflow_lite_NativeInterpreterWrapper_resizeInput(
     JNIEnv* env, jclass clazz, jlong interpreter_handle, jlong error_handle,
     jint input_idx, jintArray dims);
diff --git a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java
index 424b3de6c97672e310c54230a7ac1204f46d9ac8..61d6c35ec86beebf78dd81e17e145863516802fa 100644
--- a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java
+++ b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/InterpreterTest.java
@@ -218,4 +218,52 @@ public final class InterpreterTest {
     int index = interpreter.getOutputIndex("MobilenetV1/Predictions/Softmax");
     assertThat(index).isEqualTo(0);
   }
+
+  @Test
+  public void testTurnOffNNAPI() throws Exception {
+    Path path = MODEL_FILE.toPath();
+    FileChannel fileChannel =
+        (FileChannel) Files.newByteChannel(path, EnumSet.of(StandardOpenOption.READ));
+    MappedByteBuffer mappedByteBuffer =
+        fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());
+    Interpreter interpreter = new Interpreter(mappedByteBuffer);
+    interpreter.setUseNNAPI(true);
+    float[] oneD = {1.23f, 6.54f, 7.81f};
+    float[][] twoD = {oneD, oneD, oneD, oneD, oneD, oneD, oneD, oneD};
+    float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
+    float[][][][] fourD = {threeD, threeD};
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    interpreter.run(fourD, parsedOutputs);
+    float[] outputOneD = parsedOutputs[0][0][0];
+    float[] expected = {3.69f, 19.62f, 23.43f};
+    assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
+    interpreter.setUseNNAPI(false);
+    interpreter.run(fourD, parsedOutputs);
+    outputOneD = parsedOutputs[0][0][0];
+    assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
+    interpreter.close();
+    fileChannel.close();
+  }
+
+  @Test
+  public void testTurnOnNNAPI() throws Exception {
+    Path path = MODEL_FILE.toPath();
+    FileChannel fileChannel =
+        (FileChannel) Files.newByteChannel(path, EnumSet.of(StandardOpenOption.READ));
+    MappedByteBuffer mappedByteBuffer =
+        fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());
+    Interpreter interpreter = new Interpreter(mappedByteBuffer);
+    interpreter.setUseNNAPI(true);
+    float[] oneD = {1.23f, 6.54f, 7.81f};
+    float[][] twoD = {oneD, oneD, oneD, oneD, oneD, oneD, oneD, oneD};
+    float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
+    float[][][][] fourD = {threeD, threeD};
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    interpreter.run(fourD, parsedOutputs);
+    float[] outputOneD = parsedOutputs[0][0][0];
+    float[] expected = {3.69f, 19.62f, 23.43f};
+    assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
+    interpreter.close();
+    fileChannel.close();
+  }
 }
diff --git a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java
index 90323555d88419d837a76bca7de6d9998e388fca..6371fb59dc64c141ff24ca8dfefedbe763b5097d 100644
--- a/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java
+++ b/tensorflow/contrib/lite/java/src/test/java/org/tensorflow/lite/NativeInterpreterWrapperTest.java
@@ -94,6 +94,30 @@ public final class NativeInterpreterWrapperTest {
     wrapper.close();
   }
 
+  @Test
+  public void testRunWithInputsOfSameDims() {
+    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
+    float[] oneD = {1.23f, -6.54f, 7.81f};
+    float[][] twoD = {oneD, oneD, oneD, oneD, oneD, oneD, oneD, oneD};
+    float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
+    float[][][][] fourD = {threeD, threeD};
+    Object[] inputs = {fourD};
+    Tensor[] outputs = wrapper.run(inputs);
+    assertThat(outputs.length).isEqualTo(1);
+    float[][][][] parsedOutputs = new float[2][8][8][3];
+    outputs[0].copyTo(parsedOutputs);
+    float[] outputOneD = parsedOutputs[0][0][0];
+    float[] expected = {3.69f, -19.62f, 23.43f};
+    assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
+    outputs = wrapper.run(inputs);
+    assertThat(outputs.length).isEqualTo(1);
+    parsedOutputs = new float[2][8][8][3];
+    outputs[0].copyTo(parsedOutputs);
+    outputOneD = parsedOutputs[0][0][0];
+    assertThat(outputOneD).usingTolerance(0.1f).containsExactly(expected).inOrder();
+    wrapper.close();
+  }
+
   @Test
   public void testRunWithInt() {
     NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(INT_MODEL_PATH);
@@ -417,4 +441,45 @@ public final class NativeInterpreterWrapperTest {
     assertThat(shape[1]).isEqualTo(3);
     assertThat(shape[2]).isEqualTo(1);
   }
+
+  @Test
+  public void testGetInferenceLatency() {
+    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
+    float[] oneD = {1.23f, 6.54f, 7.81f};
+    float[][] twoD = {oneD, oneD, oneD, oneD, oneD, oneD, oneD, oneD};
+    float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
+    float[][][][] fourD = {threeD, threeD};
+    Object[] inputs = {fourD};
+    Tensor[] outputs = wrapper.run(inputs);
+    assertThat(outputs.length).isEqualTo(1);
+    assertThat(wrapper.getLastNativeInferenceDurationNanoseconds()).isGreaterThan(0L);
+    wrapper.close();
+  }
+
+  @Test
+  public void testGetInferenceLatencyWithNewWrapper() {
+    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
+    assertThat(wrapper.getLastNativeInferenceDurationNanoseconds()).isNull();
+    wrapper.close();
+  }
+
+  @Test
+  public void testGetLatencyAfterFailedInference() {
+    NativeInterpreterWrapper wrapper = new NativeInterpreterWrapper(FLOAT_MODEL_PATH);
+    float[] oneD = {1.23f, 6.54f, 7.81f};
+    float[][] twoD = {oneD, oneD, oneD, oneD, oneD, oneD, oneD};
+    float[][][] threeD = {twoD, twoD, twoD, twoD, twoD, twoD, twoD, twoD};
+    float[][][][] fourD = {threeD, threeD};
+    Object[] inputs = {fourD};
+    try {
+      wrapper.run(inputs);
+      fail();
+    } catch (IllegalArgumentException e) {
+      assertThat(e)
+          .hasMessageThat()
+          .contains("0-th input dimension should be [?,8,8,3], but found [?,8,7,3]");
+    }
+    assertThat(wrapper.getLastNativeInferenceDurationNanoseconds()).isNull();
+    wrapper.close();
+  }
 }
diff --git a/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java b/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java
index 8660cabf709e6531a5667a16e5cf43a93c7135bd..a5c13053d71374c0bdb8652bb22029d7025e20eb 100644
--- a/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java
+++ b/tensorflow/contrib/lite/java/src/testhelper/java/org/tensorflow/lite/TestHelper.java
@@ -32,4 +32,19 @@ public class TestHelper {
       throw new IllegalArgumentException("Interpreter has not initialized; Failed to setUseNNAPI.");
     }
   }
+
+  /**
+   * Gets the last inference duration in nanoseconds. It returns null if there is no previous
+   * inference run or the last inference run failed.
+   *
+   * @param interpreter an instance of {@code Interpreter}. If it is not initialized, an {@code
+   *     IllegalArgumentException} will be thrown.
+   */
+  public static Long getLastNativeInferenceDurationNanoseconds(Interpreter interpreter) {
+    if (interpreter != null && interpreter.wrapper != null) {
+      return interpreter.wrapper.getLastNativeInferenceDurationNanoseconds();
+    } else {
+      throw new IllegalArgumentException("Interpreter has not initialized; Failed to get latency.");
+    }
+  }
 }
diff --git a/tensorflow/contrib/lite/kernels/BUILD b/tensorflow/contrib/lite/kernels/BUILD
index b59dc5ffb339caade28626d1954d41bc821fae41..956bd35fe67b3a487f5eb545a827908e12127455 100644
--- a/tensorflow/contrib/lite/kernels/BUILD
+++ b/tensorflow/contrib/lite/kernels/BUILD
@@ -104,6 +104,7 @@ cc_library(
         "add.cc",
         "basic_rnn.cc",
         "batch_to_space_nd.cc",
+        "bidirectional_sequence_lstm.cc",
         "bidirectional_sequence_rnn.cc",
         "concatenation.cc",
         "conv.cc",
@@ -282,6 +283,18 @@ tf_cc_test(
     ],
 )
 
+tf_cc_test(
+    name = "bidirectional_sequence_lstm_test",
+    size = "small",
+    srcs = ["bidirectional_sequence_lstm_test.cc"],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
 tf_cc_test(
     name = "unidirectional_sequence_lstm_test",
     size = "small",
@@ -513,6 +526,19 @@ tf_cc_test(
     ],
 )
 
+tf_cc_test(
+    name = "log_softmax_test",
+    size = "small",
+    srcs = ["log_softmax_test.cc"],
+    deps = [
+        ":builtin_ops",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite/kernels:test_util",
+        "//tensorflow/contrib/lite/kernels/internal:reference_base",
+        "@com_google_googletest//:gtest",
+    ],
+)
+
 tf_cc_test(
     name = "lsh_projection_test",
     size = "small",
diff --git a/tensorflow/contrib/lite/kernels/activations.cc b/tensorflow/contrib/lite/kernels/activations.cc
index 3c5c77815d0f2592ab549152b4d77f45b967a660..093761c43c1cb41ddb2245da13c963014b51271c 100644
--- a/tensorflow/contrib/lite/kernels/activations.cc
+++ b/tensorflow/contrib/lite/kernels/activations.cc
@@ -63,6 +63,33 @@ TfLiteStatus GenericPrepare(TfLiteContext* context, TfLiteNode* node) {
                                TfLiteIntArrayCopy(input->dims));
 }
 
+TfLiteStatus TanhPrepare(TfLiteContext* context, TfLiteNode* node) {
+  OpData* data = reinterpret_cast<OpData*>(node->user_data);
+
+  TF_LITE_ENSURE_EQ(context, NumInputs(node), 1);
+  TF_LITE_ENSURE_EQ(context, NumOutputs(node), 1);
+  TfLiteTensor* input = GetInput(context, node, 0);
+  TfLiteTensor* output = GetOutput(context, node, 0);
+  TF_LITE_ENSURE_EQ(context, input->type, output->type);
+
+  if (input->type == kTfLiteUInt8) {
+    static constexpr int kInputIntegerBits = 4;
+
+    const double input_real_multiplier =
+        input->params.scale *
+        static_cast<double>(1 << (31 - kInputIntegerBits));
+
+    QuantizeMultiplierGreaterThanOne(input_real_multiplier,
+                                     &data->input_multiplier,
+                                     &data->input_left_shift);
+    data->input_range_radius =
+        CalculateInputRadius(kInputIntegerBits, data->input_left_shift);
+  }
+
+  return context->ResizeTensor(context, output,
+                               TfLiteIntArrayCopy(input->dims));
+}
+
 TfLiteStatus SigmoidPrepare(TfLiteContext* context, TfLiteNode* node) {
   OpData* data = reinterpret_cast<OpData*>(node->user_data);
 
@@ -180,6 +207,7 @@ TfLiteStatus Relu6Eval(TfLiteContext* context, TfLiteNode* node) {
 }
 
 TfLiteStatus TanhEval(TfLiteContext* context, TfLiteNode* node) {
+  OpData* data = reinterpret_cast<OpData*>(node->user_data);
   TfLiteTensor* input = GetInput(context, node, 0);
   TfLiteTensor* output = GetOutput(context, node, 0);
   switch (input->type) {
@@ -191,6 +219,14 @@ TfLiteStatus TanhEval(TfLiteContext* context, TfLiteNode* node) {
       for (; in < in_end; in++, out++) *out = std::tanh(*in);
       return kTfLiteOk;
     } break;
+    case kTfLiteUInt8: {
+      optimized_ops::Tanh(GetTensorData<uint8_t>(input), GetTensorDims(input),
+                          input->params.zero_point, data->input_range_radius,
+                          data->input_multiplier, data->input_left_shift,
+                          GetTensorData<uint8_t>(output),
+                          GetTensorDims(output));
+      return kTfLiteOk;
+    } break;
     default:
       context->ReportError(context, "Only float32 supported currently.");
       return kTfLiteError;
@@ -337,6 +373,21 @@ TfLiteStatus SoftmaxEval(TfLiteContext* context, TfLiteNode* node) {
   }
 }
 
+TfLiteStatus LogSoftmaxEval(TfLiteContext* context, TfLiteNode* node) {
+  TfLiteTensor* input = GetInput(context, node, 0);
+  TfLiteTensor* output = GetOutput(context, node, 0);
+  switch (input->type) {
+    case kTfLiteFloat32:
+      optimized_ops::LogSoftmax(
+          GetTensorData<float>(input), GetTensorDims(input),
+          GetTensorData<float>(output), GetTensorDims(output));
+      return kTfLiteOk;
+    default:
+      context->ReportError(context, "Only float32 supported currently.");
+      return kTfLiteError;
+  }
+}
+
 }  // namespace activations
 
 TfLiteRegistration* Register_RELU() {
@@ -361,8 +412,8 @@ TfLiteRegistration* Register_RELU6() {
 }
 
 TfLiteRegistration* Register_TANH() {
-  static TfLiteRegistration r = {/*init=*/nullptr, /*free=*/nullptr,
-                                 activations::GenericPrepare,
+  static TfLiteRegistration r = {activations::Init, activations::Free,
+                                 activations::TanhPrepare,
                                  activations::TanhEval};
   return &r;
 }
@@ -381,6 +432,13 @@ TfLiteRegistration* Register_SOFTMAX() {
   return &r;
 }
 
+TfLiteRegistration* Register_LOG_SOFTMAX() {
+  static TfLiteRegistration r = {activations::Init, activations::Free,
+                                 activations::GenericPrepare,
+                                 activations::LogSoftmaxEval};
+  return &r;
+}
+
 }  // namespace builtin
 }  // namespace ops
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/activations_test.cc b/tensorflow/contrib/lite/kernels/activations_test.cc
index 68d49944e51b043b6b82aa1589d22f6ebed37574..b9a96e3f79677c5a94ade86a6b334abb4c265fa1 100644
--- a/tensorflow/contrib/lite/kernels/activations_test.cc
+++ b/tensorflow/contrib/lite/kernels/activations_test.cc
@@ -52,6 +52,14 @@ class BaseActivationsOpModel : public SingleOpModel {
     BuildInterpreter({GetShape(input_)});
   }
 
+  BaseActivationsOpModel(BuiltinOperator type, const TensorData &input,
+                         const TensorData &output) {
+    input_ = AddInput(input);
+    output_ = AddOutput(output);
+    SetBuiltinOp(type, BuiltinOptions_NONE, 0);
+    BuildInterpreter({GetShape(input_)});
+  }
+
  protected:
   int input_;
   int output_;
@@ -143,6 +151,27 @@ TEST(FloatActivationsOpTest, Tanh) {
                              })));
 }
 
+TEST(QuantizedActivationsOpTest, Tanh) {
+  QuantizedActivationsOpModel m(
+      BuiltinOperator_TANH,
+      /*input=*/{TensorType_UINT8, {1, 2, 4, 1}, -8, 8},
+      /*output=*/{TensorType_UINT8, {1, 2, 4, 1}, -1, 1});
+  m.SetInput({
+      0, -6, 2, 4,   //
+      -4, -2, 8, 1,  //
+  });
+  m.Invoke();
+  EXPECT_THAT(m.GetDequantizedOutput(),
+              ElementsAreArray(ArrayFloatNear(
+                  {
+                      0.0, -0.999987, 0.964027, 0.999329,     //
+                      -0.996078, -0.96402, 0.99999, 0.76159,  //
+                  },
+                  4 * (1. / 256))));
+  EXPECT_THAT(m.GetOutput(),
+              ElementsAreArray({128, 0, 251, 255, 0, 5, 255, 226}));
+}
+
 TEST(FloatActivationsOpTest, Sigmoid) {
   FloatActivationsOpModel m(BuiltinOperator_LOGISTIC,
                             /*input=*/{TensorType_FLOAT32, {1, 2, 4, 1}});
@@ -313,6 +342,47 @@ TEST(QuantizedActivationsOpTest, Softmax2D) {
                                              kQuantizedTolerance)));
 }
 
+// This contains the same test values as the Softmax test, but the reference
+// answers were generated with the following Python snippet:
+//   logits1 = tf.constant([[0, -6, 2, 4],[3, -2, 10, 1]], dtype=tf.float32)
+//   logits2 = tf.constant([[0,-6],[2,4],[3,-2],[10,1]], dtype=tf.float32)
+//   lsm1 = tf.nn.log_softmax(logits1)
+//   lsm2 = tf.nn.log_softmax(logits2)
+//   with tf.Session() as sess:
+//     print('lsm1', sess.run(lsm1))
+//     print('lsm2', sess.run(lsm2))
+
+TEST(FloatActivationsOpTest, LogSoftmax) {
+  FloatActivationsOpModel m(BuiltinOperator_LOG_SOFTMAX,
+                            /*input=*/{TensorType_FLOAT32, {2, 4}});
+  m.SetInput({
+      0, -6, 2, 4,   //
+      3, -2, 10, 1,  //
+  });
+  m.Invoke();
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear({
+                                 -4.14297, -10.14297, -2.14297, -.142971,    //
+                                 -7.00104, -12.00104, -.00104087, -9.00104,  //
+                             })));
+
+  // Same input, but a different shape.
+  FloatActivationsOpModel m2(BuiltinOperator_LOG_SOFTMAX,
+                             /*input=*/{TensorType_FLOAT32, {4, 2}});
+  m2.SetInput({
+      0, -6,  //
+      2, 4,   //
+      3, -2,  //
+      10, 1,  //
+  });
+  m2.Invoke();
+  EXPECT_THAT(m2.GetOutput(), ElementsAreArray(ArrayFloatNear({
+                                  -.00247565, -6.00247,   //
+                                  -2.12692, -.126928,     //
+                                  -.00671534, -5.00671,   //
+                                  -.000123374, -9.00012,  //
+                              })));
+}
+
 }  // namespace
 }  // namespace tflite
 
diff --git a/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm.cc b/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm.cc
new file mode 100644
index 0000000000000000000000000000000000000000..a64ac42bc43336db928d2682e290f5263f3db0f4
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm.cc
@@ -0,0 +1,702 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <unistd.h>
+#include <cassert>
+#include <cmath>
+#include <cstdio>
+#include <cstdlib>
+#include <iostream>
+#include <limits>
+
+#include "tensorflow/contrib/lite/builtin_op_data.h"
+#include "tensorflow/contrib/lite/context.h"
+#include "tensorflow/contrib/lite/kernels/activation_functor.h"
+#include "tensorflow/contrib/lite/kernels/internal/kernel_utils.h"
+#include "tensorflow/contrib/lite/kernels/internal/tensor_utils.h"
+#include "tensorflow/contrib/lite/kernels/kernel_util.h"
+#include "tensorflow/contrib/lite/kernels/op_macros.h"
+
+namespace tflite {
+namespace ops {
+namespace builtin {
+namespace bidirectional_sequence_lstm {
+
+// Input Tensors of size {max_time, n_batch, n_input}
+constexpr int kInputTensor = 0;
+
+// Forward LSTM cell tensors.
+// Input weight tensors of size: {n_cell, n_input}
+constexpr int kFwInputToInputWeightsTensor = 1;  // Optional
+constexpr int kFwInputToForgetWeightsTensor = 2;
+constexpr int kFwInputToCellWeightsTensor = 3;
+constexpr int kFwInputToOutputWeightsTensor = 4;
+
+// Recurrent weight tensors of size {n_cell, n_output}
+constexpr int kFwRecurrentToInputWeightsTensor = 5;  // Optional
+constexpr int kFwRecurrentToForgetWeightsTensor = 6;
+constexpr int kFwRecurrentToCellWeightsTensor = 7;
+constexpr int kFwRecurrentToOutputWeightsTensor = 8;
+
+// Peephole weights tensors of size {n_cell}, representing a diagonal matrix.
+constexpr int kFwCellToInputWeightsTensor = 9;    // Optional
+constexpr int kFwCellToForgetWeightsTensor = 10;  // Optional
+constexpr int kFwCellToOutputWeightsTensor = 11;  // Optional
+
+// Gates bias tensors of size {n_cell}
+constexpr int kFwInputGateBiasTensor = 12;  // Optional
+constexpr int kFwForgetGateBiasTensor = 13;
+constexpr int kFwCellGateBiasTensor = 14;
+constexpr int kFwOutputGateBiasTensor = 15;
+
+// Projection weight tensor of size {n_output, n_cell}
+constexpr int kFwProjectionWeightsTensor = 16;  // Optional
+// Projection bias tensor of size {n_output}
+constexpr int kFwProjectionBiasTensor = 17;  // Optional
+
+// Backward LSTM cell tensors.
+// Input weight tensors of size: {n_cell, n_input}
+constexpr int kBwInputToInputWeightsTensor = 18;  // Optional
+constexpr int kBwInputToForgetWeightsTensor = 19;
+constexpr int kBwInputToCellWeightsTensor = 20;
+constexpr int kBwInputToOutputWeightsTensor = 21;
+
+// Recurrent weight tensors of size {n_cell, n_output}
+constexpr int kBwRecurrentToInputWeightsTensor = 22;  // Optional
+constexpr int kBwRecurrentToForgetWeightsTensor = 23;
+constexpr int kBwRecurrentToCellWeightsTensor = 24;
+constexpr int kBwRecurrentToOutputWeightsTensor = 25;
+
+// Peephole weights tensors of size {n_cell}, representing a diagonal matrix.
+constexpr int kBwCellToInputWeightsTensor = 26;   // Optional
+constexpr int kBwCellToForgetWeightsTensor = 27;  // Optional
+constexpr int kBwCellToOutputWeightsTensor = 28;  // Optional
+
+// Gates bias tensors of size {n_cell}
+constexpr int kBwInputGateBiasTensor = 29;  // Optional
+constexpr int kBwForgetGateBiasTensor = 30;
+constexpr int kBwCellGateBiasTensor = 31;
+constexpr int kBwOutputGateBiasTensor = 32;
+
+// Projection weight tensor of size {n_output, n_cell}
+constexpr int kBwProjectionWeightsTensor = 33;  // Optional
+// Projection bias tensor of size {n_output}
+constexpr int kBwProjectionBiasTensor = 34;  // Optional
+
+// Output tensors.
+constexpr int kFwScratchBufferTensor = 0;
+constexpr int kFwOutputStateTensor = 1;
+constexpr int kFwCellStateTensor = 2;
+constexpr int kFwOutputTensor = 3;
+
+constexpr int kBwScratchBufferTensor = 4;
+constexpr int kBwOutputStateTensor = 5;
+constexpr int kBwCellStateTensor = 6;
+constexpr int kBwOutputTensor = 7;
+
+// Check that the input tensor dimensions match each other.
+TfLiteStatus CheckLstmTensorDimensions(
+    TfLiteContext* context, TfLiteNode* node, int n_input, int n_output,
+    int n_cell, int input_to_input_weights_tensor,
+    int input_to_forget_weights_tensor, int input_to_cell_weights_tensor,
+    int input_to_output_weights_tensor, int recurrent_to_input_weights_tensor,
+    int recurrent_to_forget_weights_tensor,
+    int recurrent_to_cell_weights_tensor,
+    int recurrent_to_output_weights_tensor, int cell_to_input_weights_tensor,
+    int cell_to_forget_weights_tensor, int cell_to_output_weights_tensor,
+    int input_gate_bias_tensor, int forget_gate_bias_tensor,
+    int cell_gate_bias_tensor, int output_gate_bias_tensor,
+    int projection_weights_tensor, int projection_bias_tensor) {
+  auto* params = reinterpret_cast<TfLiteLSTMParams*>(node->builtin_data);
+
+  // Making sure clipping parameters have valid values.
+  // == 0 means no clipping
+  //  > 0 means clipping
+  TF_LITE_ENSURE(context, params->cell_clip >= 0);
+  TF_LITE_ENSURE(context, params->proj_clip >= 0);
+
+  TfLiteTensor* input_to_input_weights =
+      GetOptionalInputTensor(context, node, input_to_input_weights_tensor);
+  if (input_to_input_weights) {
+    TF_LITE_ENSURE_EQ(context, input_to_input_weights->dims->size, 2);
+    TF_LITE_ENSURE_EQ(context, input_to_input_weights->dims->data[0], n_cell);
+    TF_LITE_ENSURE_EQ(context, input_to_input_weights->dims->data[1], n_input);
+  }
+
+  TfLiteTensor* input_to_forget_weights =
+      GetInput(context, node, input_to_forget_weights_tensor);
+  TF_LITE_ENSURE_EQ(context, input_to_forget_weights->dims->size, 2);
+  TF_LITE_ENSURE_EQ(context, input_to_forget_weights->dims->data[0], n_cell);
+  TF_LITE_ENSURE_EQ(context, input_to_forget_weights->dims->data[1], n_input);
+
+  TfLiteTensor* input_to_cell_weights =
+      GetInput(context, node, input_to_cell_weights_tensor);
+  TF_LITE_ENSURE_EQ(context, input_to_cell_weights->dims->size, 2);
+  TF_LITE_ENSURE_EQ(context, input_to_cell_weights->dims->data[0], n_cell);
+  TF_LITE_ENSURE_EQ(context, input_to_cell_weights->dims->data[1], n_input);
+
+  TfLiteTensor* recurrent_to_input_weights =
+      GetOptionalInputTensor(context, node, recurrent_to_input_weights_tensor);
+  if (recurrent_to_input_weights) {
+    TF_LITE_ENSURE_EQ(context, recurrent_to_input_weights->dims->size, 2);
+    TF_LITE_ENSURE_EQ(context, recurrent_to_input_weights->dims->data[0],
+                      n_cell);
+    TF_LITE_ENSURE_EQ(context, recurrent_to_input_weights->dims->data[1],
+                      n_output);
+  }
+
+  TfLiteTensor* recurrent_to_forget_weights =
+      GetInput(context, node, recurrent_to_forget_weights_tensor);
+  TF_LITE_ENSURE_EQ(context, recurrent_to_forget_weights->dims->size, 2);
+  TF_LITE_ENSURE_EQ(context, recurrent_to_forget_weights->dims->data[0],
+                    n_cell);
+  TF_LITE_ENSURE_EQ(context, recurrent_to_forget_weights->dims->data[1],
+                    n_output);
+
+  TfLiteTensor* recurrent_to_cell_weights =
+      GetInput(context, node, recurrent_to_cell_weights_tensor);
+  TF_LITE_ENSURE_EQ(context, recurrent_to_cell_weights->dims->size, 2);
+  TF_LITE_ENSURE_EQ(context, recurrent_to_cell_weights->dims->data[0], n_cell);
+  TF_LITE_ENSURE_EQ(context, recurrent_to_cell_weights->dims->data[1],
+                    n_output);
+
+  // We make sure the input-gate's parameters are either both present (regular
+  // LSTM) or not at all (CIFG-LSTM).
+  const bool cifg_weights_all_or_none =
+      ((input_to_input_weights != nullptr) &&
+       (recurrent_to_input_weights != nullptr)) ||
+      ((input_to_input_weights == nullptr) &&
+       (recurrent_to_input_weights == nullptr));
+  TF_LITE_ENSURE(context, cifg_weights_all_or_none == true);
+
+  TfLiteTensor* cell_to_input_weights =
+      GetOptionalInputTensor(context, node, cell_to_input_weights_tensor);
+  if (cell_to_input_weights) {
+    TF_LITE_ENSURE_EQ(context, cell_to_input_weights->dims->size, 1);
+    TF_LITE_ENSURE_EQ(context, cell_to_input_weights->dims->data[0], n_cell);
+  }
+
+  TfLiteTensor* cell_to_forget_weights =
+      GetOptionalInputTensor(context, node, cell_to_forget_weights_tensor);
+  if (cell_to_forget_weights) {
+    TF_LITE_ENSURE_EQ(context, cell_to_forget_weights->dims->size, 1);
+    TF_LITE_ENSURE_EQ(context, cell_to_forget_weights->dims->data[0], n_cell);
+  }
+
+  TfLiteTensor* cell_to_output_weights =
+      GetOptionalInputTensor(context, node, cell_to_output_weights_tensor);
+  if (cell_to_output_weights) {
+    TF_LITE_ENSURE_EQ(context, cell_to_output_weights->dims->size, 1);
+    TF_LITE_ENSURE_EQ(context, cell_to_output_weights->dims->data[0], n_cell);
+  }
+
+  // Make sure the peephole weights are either all present or all absent.
+  const bool use_cifg = (input_to_input_weights == nullptr);
+  const bool peephole_weights_all_or_none =
+      ((cell_to_input_weights != nullptr || use_cifg) &&
+       (cell_to_forget_weights != nullptr) &&
+       (cell_to_output_weights != nullptr)) ||
+      ((cell_to_input_weights == nullptr) &&
+       (cell_to_forget_weights == nullptr) &&
+       (cell_to_output_weights == nullptr));
+  TF_LITE_ENSURE(context, peephole_weights_all_or_none == true);
+
+  // Make sure the input gate bias is present only when not a CIFG-LSTM.
+  TfLiteTensor* input_gate_bias =
+      GetOptionalInputTensor(context, node, input_gate_bias_tensor);
+  if (use_cifg) {
+    TF_LITE_ENSURE_EQ(context, input_gate_bias, nullptr);
+  } else {
+    TF_LITE_ENSURE_EQ(context, input_gate_bias->dims->size, 1);
+    TF_LITE_ENSURE_EQ(context, input_gate_bias->dims->data[0], n_cell);
+  }
+
+  TfLiteTensor* forget_gate_bias =
+      GetInput(context, node, forget_gate_bias_tensor);
+  TF_LITE_ENSURE_EQ(context, forget_gate_bias->dims->size, 1);
+  TF_LITE_ENSURE_EQ(context, forget_gate_bias->dims->data[0], n_cell);
+
+  TfLiteTensor* cell_bias = GetInput(context, node, cell_gate_bias_tensor);
+  TF_LITE_ENSURE_EQ(context, cell_bias->dims->size, 1);
+  TF_LITE_ENSURE_EQ(context, cell_bias->dims->data[0], n_cell);
+
+  TfLiteTensor* output_gate_bias =
+      GetInput(context, node, output_gate_bias_tensor);
+  TF_LITE_ENSURE_EQ(context, output_gate_bias->dims->size, 1);
+  TF_LITE_ENSURE_EQ(context, output_gate_bias->dims->data[0], n_cell);
+
+  TfLiteTensor* projection_weights =
+      GetOptionalInputTensor(context, node, projection_weights_tensor);
+  if (projection_weights) {
+    TF_LITE_ENSURE_EQ(context, projection_weights->dims->size, 2);
+    TF_LITE_ENSURE_EQ(context, projection_weights->dims->data[0], n_output);
+    TF_LITE_ENSURE_EQ(context, projection_weights->dims->data[1], n_cell);
+  }
+
+  TfLiteTensor* projection_bias =
+      GetOptionalInputTensor(context, node, projection_bias_tensor);
+  if (projection_bias) {
+    TF_LITE_ENSURE_EQ(context, projection_bias->dims->size, 1);
+    TF_LITE_ENSURE_EQ(context, projection_bias->dims->data[0], n_output);
+  }
+
+  // Making sure the projection tensors are consistent:
+  // 1) If projection weight is not present, then projection bias should not be
+  // present.
+  // 2) If projection weight is present, then projection bias is optional.
+  // TODO(ghodrat): make sure this is correct.
+  const bool projection_tensors_consistent =
+      ((projection_weights != nullptr) || (projection_bias == nullptr));
+  TF_LITE_ENSURE(context, projection_tensors_consistent == true);
+
+  return kTfLiteOk;
+}
+
+TfLiteStatus CheckInputTensorDimensions(TfLiteContext* context,
+                                        TfLiteNode* node, int n_input,
+                                        int n_output, int n_cell) {
+  TF_LITE_ENSURE_OK(context, CheckLstmTensorDimensions(
+      context, node, n_input, n_output, n_cell, kFwInputToInputWeightsTensor,
+      kFwInputToForgetWeightsTensor, kFwInputToCellWeightsTensor,
+      kFwInputToOutputWeightsTensor, kFwRecurrentToInputWeightsTensor,
+      kFwRecurrentToForgetWeightsTensor, kFwRecurrentToCellWeightsTensor,
+      kFwRecurrentToOutputWeightsTensor, kFwCellToInputWeightsTensor,
+      kFwCellToForgetWeightsTensor, kFwCellToOutputWeightsTensor,
+      kFwInputGateBiasTensor, kFwForgetGateBiasTensor, kFwCellGateBiasTensor,
+      kFwOutputGateBiasTensor, kFwProjectionWeightsTensor,
+      kFwProjectionBiasTensor));
+
+  TF_LITE_ENSURE_OK(context, CheckLstmTensorDimensions(
+      context, node, n_input, n_output, n_cell, kBwInputToInputWeightsTensor,
+      kBwInputToForgetWeightsTensor, kBwInputToCellWeightsTensor,
+      kBwInputToOutputWeightsTensor, kBwRecurrentToInputWeightsTensor,
+      kBwRecurrentToForgetWeightsTensor, kBwRecurrentToCellWeightsTensor,
+      kBwRecurrentToOutputWeightsTensor, kBwCellToInputWeightsTensor,
+      kBwCellToForgetWeightsTensor, kBwCellToOutputWeightsTensor,
+      kBwInputGateBiasTensor, kBwForgetGateBiasTensor, kBwCellGateBiasTensor,
+      kBwOutputGateBiasTensor, kBwProjectionWeightsTensor,
+      kBwProjectionBiasTensor));
+
+  // TODO: check that forward and backward tensors match along required dims.
+  return kTfLiteOk;
+}
+
+// Resize the output, state and scratch tensors based on the sizes of the input
+// tensors. Also check that the sizes of the input tensors match each other.
+TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
+  // Check we have all the inputs and outputs we need.
+  TF_LITE_ENSURE_EQ(context, node->inputs->size, 35);
+  TF_LITE_ENSURE_EQ(context, node->outputs->size, 8);
+
+  // Inferring batch size, number of outputs and sequence length and
+  // number of cells from the input tensors.
+  TfLiteTensor* input = GetInput(context, node, kInputTensor);
+  TF_LITE_ENSURE_EQ(context, input->dims->size, 3);
+  const int max_time = input->dims->data[0];
+  const int n_batch = input->dims->data[1];
+  const int n_input = input->dims->data[2];
+
+  TfLiteTensor* fw_input_to_output_weights =
+      GetInput(context, node, kFwInputToOutputWeightsTensor);
+  const int n_fw_cell = fw_input_to_output_weights->dims->data[0];
+  TF_LITE_ENSURE_EQ(context, fw_input_to_output_weights->dims->size, 2);
+  TF_LITE_ENSURE_EQ(context, fw_input_to_output_weights->dims->data[1],
+                    n_input);
+
+  TfLiteTensor* fw_recurrent_to_output_weights =
+      GetInput(context, node, kFwRecurrentToOutputWeightsTensor);
+  TF_LITE_ENSURE_EQ(context, fw_recurrent_to_output_weights->dims->size, 2);
+  TF_LITE_ENSURE_EQ(context, fw_recurrent_to_output_weights->dims->data[0],
+                    n_fw_cell);
+  const int n_fw_output = fw_recurrent_to_output_weights->dims->data[1];
+
+  // Check that the input tensor dimensions match each other.
+  CheckInputTensorDimensions(context, node, n_input, n_fw_output, n_fw_cell);
+
+  // Get the pointer to output, state and scratch buffer tensors.
+  TfLiteTensor* fw_output = GetOutput(context, node, kFwOutputTensor);
+  TfLiteTensor* fw_output_state =
+      GetOutput(context, node, kFwOutputStateTensor);
+  TfLiteTensor* fw_cell_state = GetOutput(context, node, kFwCellStateTensor);
+  // TODO(ghodrat): Modify this as soon as we have a finalized method for
+  // scratch buffers.
+  TfLiteTensor* fw_scratch_buffer =
+      GetOutput(context, node, kFwScratchBufferTensor);
+
+  // Resize the output and output_state tensors.
+  TfLiteIntArray* fw_output_size = TfLiteIntArrayCreate(3);
+  fw_output_size->data[0] = max_time;
+  fw_output_size->data[1] = n_batch;
+  fw_output_size->data[2] = n_fw_output;
+  TF_LITE_ENSURE_OK(context,
+                    context->ResizeTensor(context, fw_output, fw_output_size));
+
+  TfLiteIntArray* fw_output_state_size = TfLiteIntArrayCreate(2);
+  fw_output_state_size->data[0] = n_batch;
+  fw_output_state_size->data[1] = n_fw_output;
+  TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, fw_output_state,
+                                                   fw_output_state_size));
+
+  // Resize the cell state tensor.
+  TfLiteIntArray* fw_cell_size = TfLiteIntArrayCreate(2);
+  fw_cell_size->data[0] = n_batch;
+  fw_cell_size->data[1] = n_fw_cell;
+  TF_LITE_ENSURE_OK(
+      context, context->ResizeTensor(context, fw_cell_state, fw_cell_size));
+
+  // Mark state tensors as persistent tensors.
+  fw_output_state->allocation_type = kTfLiteArenaRwPersistent;
+  fw_cell_state->allocation_type = kTfLiteArenaRwPersistent;
+
+  TfLiteTensor* fw_input_to_input_weights =
+      GetOptionalInputTensor(context, node, kFwInputToInputWeightsTensor);
+  const bool fw_use_cifg = (fw_input_to_input_weights == nullptr);
+  TfLiteIntArray* fw_scratch_buffer_size = TfLiteIntArrayCreate(2);
+  fw_scratch_buffer_size->data[0] = n_batch;
+  if (fw_use_cifg) {
+    // Reserving space for Cell, Forget, Output gates
+    fw_scratch_buffer_size->data[1] = n_fw_cell * 3;
+  } else {
+    // Reserving space for Input, Cell, Forget, Output gates
+    fw_scratch_buffer_size->data[1] = n_fw_cell * 4;
+  }
+  TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, fw_scratch_buffer,
+                                                   fw_scratch_buffer_size));
+  // Same for the backward cell.
+  TfLiteTensor* bw_input_to_output_weights =
+      GetInput(context, node, kBwInputToOutputWeightsTensor);
+  const int n_bw_cell = bw_input_to_output_weights->dims->data[0];
+  TF_LITE_ENSURE_EQ(context, bw_input_to_output_weights->dims->size, 2);
+  TF_LITE_ENSURE_EQ(context, bw_input_to_output_weights->dims->data[1],
+                    n_input);
+
+  TfLiteTensor* bw_recurrent_to_output_weights =
+      GetInput(context, node, kBwRecurrentToOutputWeightsTensor);
+  TF_LITE_ENSURE_EQ(context, bw_recurrent_to_output_weights->dims->size, 2);
+  TF_LITE_ENSURE_EQ(context, bw_recurrent_to_output_weights->dims->data[0],
+                    n_bw_cell);
+  const int n_bw_output = bw_recurrent_to_output_weights->dims->data[1];
+
+  // Check that the input tensor dimensions match each other.
+  CheckInputTensorDimensions(context, node, n_input, n_bw_output, n_bw_cell);
+
+  // Get the pointer to output, state and scratch buffer tensors.
+  TfLiteTensor* bw_output = GetOutput(context, node, kBwOutputTensor);
+  TfLiteTensor* bw_output_state =
+      GetOutput(context, node, kBwOutputStateTensor);
+  TfLiteTensor* bw_cell_state = GetOutput(context, node, kBwCellStateTensor);
+  // TODO(ghodrat): Modify this as soon as we have a finalized method for
+  // scratch buffers.
+  TfLiteTensor* bw_scratch_buffer =
+      GetOutput(context, node, kBwScratchBufferTensor);
+
+  // Resize the output and output_state tensors.
+  TfLiteIntArray* bw_output_size = TfLiteIntArrayCreate(3);
+  bw_output_size->data[0] = max_time;
+  bw_output_size->data[1] = n_batch;
+  bw_output_size->data[2] = n_bw_output;
+  TF_LITE_ENSURE_OK(context,
+                    context->ResizeTensor(context, bw_output, bw_output_size));
+
+  TfLiteIntArray* bw_output_state_size = TfLiteIntArrayCreate(2);
+  bw_output_state_size->data[0] = n_batch;
+  bw_output_state_size->data[1] = n_bw_output;
+  TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, bw_output_state,
+                                                   bw_output_state_size));
+
+  // Resize the cell state tensor.
+  TfLiteIntArray* bw_cell_size = TfLiteIntArrayCreate(2);
+  bw_cell_size->data[0] = n_batch;
+  bw_cell_size->data[1] = n_bw_cell;
+  TF_LITE_ENSURE_OK(
+      context, context->ResizeTensor(context, bw_cell_state, bw_cell_size));
+
+  // Mark state tensors as persistent tensors.
+  bw_output_state->allocation_type = kTfLiteArenaRwPersistent;
+  bw_cell_state->allocation_type = kTfLiteArenaRwPersistent;
+
+  TfLiteTensor* bw_input_to_input_weights =
+      GetOptionalInputTensor(context, node, kBwInputToInputWeightsTensor);
+  const bool bw_use_cifg = (bw_input_to_input_weights == nullptr);
+  TfLiteIntArray* bw_scratch_buffer_size = TfLiteIntArrayCreate(2);
+  bw_scratch_buffer_size->data[0] = n_batch;
+  if (bw_use_cifg) {
+    // Reserving space for Cell, Forget, Output gates
+    bw_scratch_buffer_size->data[1] = n_bw_cell * 3;
+  } else {
+    // Reserving space for Input, Cell, Forget, Output gates
+    bw_scratch_buffer_size->data[1] = n_bw_cell * 4;
+  }
+  TF_LITE_ENSURE_OK(context, context->ResizeTensor(context, bw_scratch_buffer,
+                                                   bw_scratch_buffer_size));
+  return kTfLiteOk;
+}
+
+// The LSTM Op engine.
+TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
+  auto* params = reinterpret_cast<TfLiteLSTMParams*>(node->builtin_data);
+
+  // Input tensor.
+  TfLiteTensor* input = GetInput(context, node, kInputTensor);
+  const int max_time = input->dims->data[0];
+  const int n_batch = input->dims->data[1];
+  const int n_input = input->dims->data[2];
+
+  // Tensors for the forward cell.
+  TfLiteTensor* fw_input_to_input_weights =
+      GetOptionalInputTensor(context, node, kFwInputToInputWeightsTensor);
+  TfLiteTensor* fw_input_to_forget_weights =
+      GetInput(context, node, kFwInputToForgetWeightsTensor);
+  TfLiteTensor* fw_input_to_cell_weights =
+      GetInput(context, node, kFwInputToCellWeightsTensor);
+  TfLiteTensor* fw_input_to_output_weights =
+      GetInput(context, node, kFwInputToOutputWeightsTensor);
+
+  TfLiteTensor* fw_recurrent_to_input_weights =
+      GetOptionalInputTensor(context, node, kFwRecurrentToInputWeightsTensor);
+  TfLiteTensor* fw_recurrent_to_forget_weights =
+      GetInput(context, node, kFwRecurrentToForgetWeightsTensor);
+  TfLiteTensor* fw_recurrent_to_cell_weights =
+      GetInput(context, node, kFwRecurrentToCellWeightsTensor);
+  TfLiteTensor* fw_recurrent_to_output_weights =
+      GetInput(context, node, kFwRecurrentToOutputWeightsTensor);
+
+  TfLiteTensor* fw_cell_to_input_weights =
+      GetOptionalInputTensor(context, node, kFwCellToInputWeightsTensor);
+  TfLiteTensor* fw_cell_to_forget_weights =
+      GetOptionalInputTensor(context, node, kFwCellToForgetWeightsTensor);
+  TfLiteTensor* fw_cell_to_output_weights =
+      GetOptionalInputTensor(context, node, kFwCellToOutputWeightsTensor);
+
+  TfLiteTensor* fw_input_gate_bias =
+      GetOptionalInputTensor(context, node, kFwInputGateBiasTensor);
+  TfLiteTensor* fw_forget_gate_bias =
+      GetInput(context, node, kFwForgetGateBiasTensor);
+  TfLiteTensor* fw_cell_bias = GetInput(context, node, kFwCellGateBiasTensor);
+  TfLiteTensor* fw_output_gate_bias =
+      GetInput(context, node, kFwOutputGateBiasTensor);
+
+  TfLiteTensor* fw_projection_weights =
+      GetOptionalInputTensor(context, node, kFwProjectionWeightsTensor);
+  TfLiteTensor* fw_projection_bias =
+      GetOptionalInputTensor(context, node, kFwProjectionBiasTensor);
+
+  TfLiteTensor* fw_output_state =
+      GetOutput(context, node, kFwOutputStateTensor);
+  TfLiteTensor* fw_cell_state = GetOutput(context, node, kFwCellStateTensor);
+  TfLiteTensor* fw_output = GetOutput(context, node, kFwOutputTensor);
+
+  // Tensors for the backward cell.
+  TfLiteTensor* bw_input_to_input_weights =
+      GetOptionalInputTensor(context, node, kBwInputToInputWeightsTensor);
+  TfLiteTensor* bw_input_to_forget_weights =
+      GetInput(context, node, kBwInputToForgetWeightsTensor);
+  TfLiteTensor* bw_input_to_cell_weights =
+      GetInput(context, node, kBwInputToCellWeightsTensor);
+  TfLiteTensor* bw_input_to_output_weights =
+      GetInput(context, node, kBwInputToOutputWeightsTensor);
+
+  TfLiteTensor* bw_recurrent_to_input_weights =
+      GetOptionalInputTensor(context, node, kBwRecurrentToInputWeightsTensor);
+  TfLiteTensor* bw_recurrent_to_forget_weights =
+      GetInput(context, node, kBwRecurrentToForgetWeightsTensor);
+  TfLiteTensor* bw_recurrent_to_cell_weights =
+      GetInput(context, node, kBwRecurrentToCellWeightsTensor);
+  TfLiteTensor* bw_recurrent_to_output_weights =
+      GetInput(context, node, kBwRecurrentToOutputWeightsTensor);
+
+  TfLiteTensor* bw_cell_to_input_weights =
+      GetOptionalInputTensor(context, node, kBwCellToInputWeightsTensor);
+  TfLiteTensor* bw_cell_to_forget_weights =
+      GetOptionalInputTensor(context, node, kBwCellToForgetWeightsTensor);
+  TfLiteTensor* bw_cell_to_output_weights =
+      GetOptionalInputTensor(context, node, kBwCellToOutputWeightsTensor);
+
+  TfLiteTensor* bw_input_gate_bias =
+      GetOptionalInputTensor(context, node, kBwInputGateBiasTensor);
+  TfLiteTensor* bw_forget_gate_bias =
+      GetInput(context, node, kBwForgetGateBiasTensor);
+  TfLiteTensor* bw_cell_bias = GetInput(context, node, kBwCellGateBiasTensor);
+  TfLiteTensor* bw_output_gate_bias =
+      GetInput(context, node, kBwOutputGateBiasTensor);
+
+  TfLiteTensor* bw_projection_weights =
+      GetOptionalInputTensor(context, node, kBwProjectionWeightsTensor);
+  TfLiteTensor* bw_projection_bias =
+      GetOptionalInputTensor(context, node, kBwProjectionBiasTensor);
+
+  TfLiteTensor* bw_output_state =
+      GetOutput(context, node, kBwOutputStateTensor);
+  TfLiteTensor* bw_cell_state = GetOutput(context, node, kBwCellStateTensor);
+  TfLiteTensor* bw_output = GetOutput(context, node, kBwOutputTensor);
+
+  // n_cell and n_output will be the same size when there is no projection.
+  const int n_fw_cell = fw_input_to_output_weights->dims->data[0];
+  const int n_fw_output = fw_recurrent_to_output_weights->dims->data[1];
+
+  // Since we have already checked that weights are all there or none, we can
+  // check the existence of only one to get the condition.
+  const bool fw_use_cifg = (fw_input_to_input_weights == nullptr);
+  const bool fw_use_peephole = (fw_cell_to_output_weights != nullptr);
+
+  // Index the scratch buffers pointers to the global scratch buffer.
+  TfLiteTensor* fw_scratch_buffer =
+      GetOutput(context, node, kFwScratchBufferTensor);
+  float* fw_input_gate_scratch = nullptr;
+  float* fw_cell_scratch = nullptr;
+  float* fw_forget_gate_scratch = nullptr;
+  float* fw_output_gate_scratch = nullptr;
+  if (fw_use_cifg) {
+    fw_cell_scratch = fw_scratch_buffer->data.f;
+    fw_forget_gate_scratch = fw_scratch_buffer->data.f + n_fw_cell * n_batch;
+    fw_output_gate_scratch =
+        fw_scratch_buffer->data.f + 2 * n_fw_cell * n_batch;
+  } else {
+    fw_input_gate_scratch = fw_scratch_buffer->data.f;
+    fw_cell_scratch = fw_scratch_buffer->data.f + n_fw_cell * n_batch;
+    fw_forget_gate_scratch =
+        fw_scratch_buffer->data.f + 2 * n_fw_cell * n_batch;
+    fw_output_gate_scratch =
+        fw_scratch_buffer->data.f + 3 * n_fw_cell * n_batch;
+  }
+
+  // Check optional tensors, the respective pointers can be null.
+  const float* fw_input_to_input_weights_ptr =
+      (fw_use_cifg) ? nullptr : fw_input_to_input_weights->data.f;
+  const float* fw_recurrent_to_input_weights_ptr =
+      (fw_use_cifg) ? nullptr : fw_recurrent_to_input_weights->data.f;
+  const float* fw_input_gate_bias_ptr =
+      (fw_use_cifg) ? nullptr : fw_input_gate_bias->data.f;
+  const float* fw_cell_to_input_weights_ptr =
+      (fw_use_peephole && !fw_use_cifg) ? fw_cell_to_input_weights->data.f
+                                        : nullptr;
+  const float* fw_cell_to_forget_weights_ptr =
+      (fw_use_peephole) ? fw_cell_to_forget_weights->data.f : nullptr;
+  const float* fw_cell_to_output_weights_ptr =
+      (fw_use_peephole) ? fw_cell_to_output_weights->data.f : nullptr;
+  const float* fw_projection_weights_ptr = (fw_projection_weights == nullptr)
+                                               ? nullptr
+                                               : fw_projection_weights->data.f;
+  const float* fw_projection_bias_ptr =
+      (fw_projection_bias == nullptr) ? nullptr : fw_projection_bias->data.f;
+
+  // Loop through the sequence.
+  for (int t = 0; t < max_time; t++) {
+    const float* input_ptr_batch = input->data.f + t * n_batch * n_input;
+    float* output_ptr_time = fw_output->data.f + t * n_batch * n_fw_output;
+
+    kernel_utils::LstmStep(
+        input_ptr_batch, fw_input_to_input_weights_ptr,
+        fw_input_to_forget_weights->data.f, fw_input_to_cell_weights->data.f,
+        fw_input_to_output_weights->data.f, fw_recurrent_to_input_weights_ptr,
+        fw_recurrent_to_forget_weights->data.f,
+        fw_recurrent_to_cell_weights->data.f,
+        fw_recurrent_to_output_weights->data.f, fw_cell_to_input_weights_ptr,
+        fw_cell_to_forget_weights_ptr, fw_cell_to_output_weights_ptr,
+        fw_input_gate_bias_ptr, fw_forget_gate_bias->data.f,
+        fw_cell_bias->data.f, fw_output_gate_bias->data.f,
+        fw_projection_weights_ptr, fw_projection_bias_ptr, params, n_batch,
+        n_fw_cell, n_input, n_fw_output, fw_output_state->data.f,
+        fw_cell_state->data.f, fw_input_gate_scratch, fw_forget_gate_scratch,
+        fw_cell_scratch, fw_output_gate_scratch, output_ptr_time);
+  }
+
+  // n_cell and n_output will be the same size when there is no projection.
+  const int n_bw_cell = bw_input_to_output_weights->dims->data[0];
+  const int n_bw_output = bw_recurrent_to_output_weights->dims->data[1];
+
+  // Since we have already checked that weights are all there or none, we can
+  // check the existence of only one to get the condition.
+  const bool bw_use_cifg = (bw_input_to_input_weights == nullptr);
+  const bool bw_use_peephole = (bw_cell_to_output_weights != nullptr);
+
+  // Index the scratch buffers pointers to the global scratch buffer.
+  TfLiteTensor* bw_scratch_buffer =
+      GetOutput(context, node, kBwScratchBufferTensor);
+  float* bw_input_gate_scratch = nullptr;
+  float* bw_cell_scratch = nullptr;
+  float* bw_forget_gate_scratch = nullptr;
+  float* bw_output_gate_scratch = nullptr;
+  if (bw_use_cifg) {
+    bw_cell_scratch = bw_scratch_buffer->data.f;
+    bw_forget_gate_scratch = bw_scratch_buffer->data.f + n_bw_cell * n_batch;
+    bw_output_gate_scratch =
+        bw_scratch_buffer->data.f + 2 * n_bw_cell * n_batch;
+  } else {
+    bw_input_gate_scratch = bw_scratch_buffer->data.f;
+    bw_cell_scratch = bw_scratch_buffer->data.f + n_bw_cell * n_batch;
+    bw_forget_gate_scratch =
+        bw_scratch_buffer->data.f + 2 * n_bw_cell * n_batch;
+    bw_output_gate_scratch =
+        bw_scratch_buffer->data.f + 3 * n_bw_cell * n_batch;
+  }
+
+  // Check optional tensors, the respective pointers can be null.
+  const float* bw_input_to_input_weights_ptr =
+      (bw_use_cifg) ? nullptr : bw_input_to_input_weights->data.f;
+  const float* bw_recurrent_to_input_weights_ptr =
+      (bw_use_cifg) ? nullptr : bw_recurrent_to_input_weights->data.f;
+  const float* bw_input_gate_bias_ptr =
+      (bw_use_cifg) ? nullptr : bw_input_gate_bias->data.f;
+  const float* bw_cell_to_input_weights_ptr =
+      (bw_use_peephole && !bw_use_cifg) ? bw_cell_to_input_weights->data.f
+                                        : nullptr;
+  const float* bw_cell_to_forget_weights_ptr =
+      (bw_use_peephole) ? bw_cell_to_forget_weights->data.f : nullptr;
+  const float* bw_cell_to_output_weights_ptr =
+      (bw_use_peephole) ? bw_cell_to_output_weights->data.f : nullptr;
+  const float* bw_projection_weights_ptr = (bw_projection_weights == nullptr)
+                                               ? nullptr
+                                               : bw_projection_weights->data.f;
+  const float* bw_projection_bias_ptr =
+      (bw_projection_bias == nullptr) ? nullptr : bw_projection_bias->data.f;
+
+  // Loop through the sequence backwards.
+  for (int t = max_time - 1; t >= 0; t--) {
+    const float* input_ptr_batch = input->data.f + t * n_batch * n_input;
+    float* output_ptr_time = bw_output->data.f + t * n_batch * n_bw_output;
+
+    kernel_utils::LstmStep(
+        input_ptr_batch, bw_input_to_input_weights_ptr,
+        bw_input_to_forget_weights->data.f, bw_input_to_cell_weights->data.f,
+        bw_input_to_output_weights->data.f, bw_recurrent_to_input_weights_ptr,
+        bw_recurrent_to_forget_weights->data.f,
+        bw_recurrent_to_cell_weights->data.f,
+        bw_recurrent_to_output_weights->data.f, bw_cell_to_input_weights_ptr,
+        bw_cell_to_forget_weights_ptr, bw_cell_to_output_weights_ptr,
+        bw_input_gate_bias_ptr, bw_forget_gate_bias->data.f,
+        bw_cell_bias->data.f, bw_output_gate_bias->data.f,
+        bw_projection_weights_ptr, bw_projection_bias_ptr, params, n_batch,
+        n_bw_cell, n_input, n_bw_output, bw_output_state->data.f,
+        bw_cell_state->data.f, bw_input_gate_scratch, bw_forget_gate_scratch,
+        bw_cell_scratch, bw_output_gate_scratch, output_ptr_time);
+  }
+
+  // Backward step.
+  return kTfLiteOk;
+}
+
+}  // namespace bidirectional_sequence_lstm
+
+TfLiteRegistration* Register_BIDIRECTIONAL_SEQUENCE_LSTM() {
+  static TfLiteRegistration r = {/*init=*/nullptr, /*free=*/nullptr,
+                                 bidirectional_sequence_lstm::Prepare,
+                                 bidirectional_sequence_lstm::Eval};
+  return &r;
+}
+
+}  // namespace builtin
+}  // namespace ops
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm_test.cc b/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..cca857bac0633ded01d40273d2e9e8dde488d61e
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/bidirectional_sequence_lstm_test.cc
@@ -0,0 +1,1411 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+// Unit test for TFLite Bidirectional LSTM op.
+
+#include <iomanip>
+#include <memory>
+#include <vector>
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+using ::testing::ElementsAreArray;
+
+class BidirectionalLSTMOpModel : public SingleOpModel {
+ public:
+  BidirectionalLSTMOpModel(int n_batch, int n_input, int n_cell, int n_output,
+                           int sequence_length, bool use_cifg,
+                           bool use_peephole, bool use_projection_weights,
+                           bool use_projection_bias, float cell_clip,
+                           float proj_clip,
+                           const std::vector<std::vector<int>>& input_shapes)
+      : n_batch_(n_batch),
+        n_input_(n_input),
+        n_fw_cell_(n_cell),
+        n_bw_cell_(n_cell),
+        n_fw_output_(n_output),
+        n_bw_output_(n_output),
+        sequence_length_(sequence_length) {
+    input_ = AddInput(TensorType_FLOAT32);
+
+    if (use_cifg) {
+      fw_input_to_input_weights_ = AddNullInput();
+    } else {
+      fw_input_to_input_weights_ = AddInput(TensorType_FLOAT32);
+    }
+
+    fw_input_to_forget_weights_ = AddInput(TensorType_FLOAT32);
+    fw_input_to_cell_weights_ = AddInput(TensorType_FLOAT32);
+    fw_input_to_output_weights_ = AddInput(TensorType_FLOAT32);
+
+    if (use_cifg) {
+      fw_recurrent_to_input_weights_ = AddNullInput();
+    } else {
+      fw_recurrent_to_input_weights_ = AddInput(TensorType_FLOAT32);
+    }
+
+    fw_recurrent_to_forget_weights_ = AddInput(TensorType_FLOAT32);
+    fw_recurrent_to_cell_weights_ = AddInput(TensorType_FLOAT32);
+    fw_recurrent_to_output_weights_ = AddInput(TensorType_FLOAT32);
+
+    if (use_peephole) {
+      if (use_cifg) {
+        fw_cell_to_input_weights_ = AddNullInput();
+      } else {
+        fw_cell_to_input_weights_ = AddInput(TensorType_FLOAT32);
+      }
+      fw_cell_to_forget_weights_ = AddInput(TensorType_FLOAT32);
+      fw_cell_to_output_weights_ = AddInput(TensorType_FLOAT32);
+    } else {
+      fw_cell_to_input_weights_ = AddNullInput();
+      fw_cell_to_forget_weights_ = AddNullInput();
+      fw_cell_to_output_weights_ = AddNullInput();
+    }
+
+    if (use_cifg) {
+      fw_input_gate_bias_ = AddNullInput();
+    } else {
+      fw_input_gate_bias_ = AddInput(TensorType_FLOAT32);
+    }
+    fw_forget_gate_bias_ = AddInput(TensorType_FLOAT32);
+    fw_cell_bias_ = AddInput(TensorType_FLOAT32);
+    fw_output_gate_bias_ = AddInput(TensorType_FLOAT32);
+
+    if (use_projection_weights) {
+      fw_projection_weights_ = AddInput(TensorType_FLOAT32);
+      if (use_projection_bias) {
+        fw_projection_bias_ = AddInput(TensorType_FLOAT32);
+      } else {
+        fw_projection_bias_ = AddNullInput();
+      }
+    } else {
+      fw_projection_weights_ = AddNullInput();
+      fw_projection_bias_ = AddNullInput();
+    }
+
+    fw_scratch_buffer_ = AddOutput(TensorType_FLOAT32);
+    // TODO(ghodrat): Modify these states when we have a permanent solution for
+    // persistent buffer.
+    fw_output_state_ = AddOutput(TensorType_FLOAT32);
+    fw_cell_state_ = AddOutput(TensorType_FLOAT32);
+    fw_output_ = AddOutput(TensorType_FLOAT32);
+
+    if (use_cifg) {
+      bw_input_to_input_weights_ = AddNullInput();
+    } else {
+      bw_input_to_input_weights_ = AddInput(TensorType_FLOAT32);
+    }
+
+    bw_input_to_forget_weights_ = AddInput(TensorType_FLOAT32);
+    bw_input_to_cell_weights_ = AddInput(TensorType_FLOAT32);
+    bw_input_to_output_weights_ = AddInput(TensorType_FLOAT32);
+
+    if (use_cifg) {
+      bw_recurrent_to_input_weights_ = AddNullInput();
+    } else {
+      bw_recurrent_to_input_weights_ = AddInput(TensorType_FLOAT32);
+    }
+
+    bw_recurrent_to_forget_weights_ = AddInput(TensorType_FLOAT32);
+    bw_recurrent_to_cell_weights_ = AddInput(TensorType_FLOAT32);
+    bw_recurrent_to_output_weights_ = AddInput(TensorType_FLOAT32);
+
+    if (use_peephole) {
+      if (use_cifg) {
+        bw_cell_to_input_weights_ = AddNullInput();
+      } else {
+        bw_cell_to_input_weights_ = AddInput(TensorType_FLOAT32);
+      }
+      bw_cell_to_forget_weights_ = AddInput(TensorType_FLOAT32);
+      bw_cell_to_output_weights_ = AddInput(TensorType_FLOAT32);
+    } else {
+      bw_cell_to_input_weights_ = AddNullInput();
+      bw_cell_to_forget_weights_ = AddNullInput();
+      bw_cell_to_output_weights_ = AddNullInput();
+    }
+
+    if (use_cifg) {
+      bw_input_gate_bias_ = AddNullInput();
+    } else {
+      bw_input_gate_bias_ = AddInput(TensorType_FLOAT32);
+    }
+    bw_forget_gate_bias_ = AddInput(TensorType_FLOAT32);
+    bw_cell_bias_ = AddInput(TensorType_FLOAT32);
+    bw_output_gate_bias_ = AddInput(TensorType_FLOAT32);
+
+    if (use_projection_weights) {
+      bw_projection_weights_ = AddInput(TensorType_FLOAT32);
+      if (use_projection_bias) {
+        bw_projection_bias_ = AddInput(TensorType_FLOAT32);
+      } else {
+        bw_projection_bias_ = AddNullInput();
+      }
+    } else {
+      bw_projection_weights_ = AddNullInput();
+      bw_projection_bias_ = AddNullInput();
+    }
+
+    bw_scratch_buffer_ = AddOutput(TensorType_FLOAT32);
+    // TODO(ghodrat): Modify these states when we have a permanent solution for
+    // persistent buffer.
+    bw_output_state_ = AddOutput(TensorType_FLOAT32);
+    bw_cell_state_ = AddOutput(TensorType_FLOAT32);
+    bw_output_ = AddOutput(TensorType_FLOAT32);
+
+    SetBuiltinOp(BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM,
+                 BuiltinOptions_LSTMOptions,
+                 CreateLSTMOptions(builder_, ActivationFunctionType_TANH,
+                                   cell_clip, proj_clip)
+                     .Union());
+    BuildInterpreter(input_shapes);
+  }
+
+  // Set weights in forward and backward cells to be the same.
+  void SetInputToInputWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_input_to_input_weights_, f);
+    PopulateTensor(bw_input_to_input_weights_, f);
+  }
+
+  void SetInputToForgetWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_input_to_forget_weights_, f);
+    PopulateTensor(bw_input_to_forget_weights_, f);
+  }
+
+  void SetInputToCellWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_input_to_cell_weights_, f);
+    PopulateTensor(bw_input_to_cell_weights_, f);
+  }
+
+  void SetInputToOutputWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_input_to_output_weights_, f);
+    PopulateTensor(bw_input_to_output_weights_, f);
+  }
+
+  void SetRecurrentToInputWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_recurrent_to_input_weights_, f);
+    PopulateTensor(bw_recurrent_to_input_weights_, f);
+  }
+
+  void SetRecurrentToForgetWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_recurrent_to_forget_weights_, f);
+    PopulateTensor(bw_recurrent_to_forget_weights_, f);
+  }
+
+  void SetRecurrentToCellWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_recurrent_to_cell_weights_, f);
+    PopulateTensor(bw_recurrent_to_cell_weights_, f);
+  }
+
+  void SetRecurrentToOutputWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_recurrent_to_output_weights_, f);
+    PopulateTensor(bw_recurrent_to_output_weights_, f);
+  }
+
+  void SetCellToInputWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_cell_to_input_weights_, f);
+    PopulateTensor(bw_cell_to_input_weights_, f);
+  }
+
+  void SetCellToForgetWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_cell_to_forget_weights_, f);
+    PopulateTensor(bw_cell_to_forget_weights_, f);
+  }
+
+  void SetCellToOutputWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_cell_to_output_weights_, f);
+    PopulateTensor(bw_cell_to_output_weights_, f);
+  }
+
+  void SetInputGateBias(std::initializer_list<float> f) {
+    PopulateTensor(fw_input_gate_bias_, f);
+    PopulateTensor(bw_input_gate_bias_, f);
+  }
+
+  void SetForgetGateBias(std::initializer_list<float> f) {
+    PopulateTensor(fw_forget_gate_bias_, f);
+    PopulateTensor(bw_forget_gate_bias_, f);
+  }
+
+  void SetCellBias(std::initializer_list<float> f) {
+    PopulateTensor(fw_cell_bias_, f);
+    PopulateTensor(bw_cell_bias_, f);
+  }
+
+  void SetOutputGateBias(std::initializer_list<float> f) {
+    PopulateTensor(fw_output_gate_bias_, f);
+    PopulateTensor(bw_output_gate_bias_, f);
+  }
+
+  void SetProjectionWeights(std::initializer_list<float> f) {
+    PopulateTensor(fw_projection_weights_, f);
+    PopulateTensor(bw_projection_weights_, f);
+  }
+
+  void SetProjectionBias(std::initializer_list<float> f) {
+    PopulateTensor(fw_projection_bias_, f);
+    PopulateTensor(bw_projection_bias_, f);
+  }
+
+  void ResetFwOutputAndCellStates() {
+    const int zero_buffer_size = n_fw_cell_ * n_batch_;
+    std::unique_ptr<float[]> zero_buffer(new float[zero_buffer_size]);
+    memset(zero_buffer.get(), 0, zero_buffer_size * sizeof(float));
+    PopulateTensor(fw_output_state_, 0, zero_buffer.get(),
+                   zero_buffer.get() + zero_buffer_size);
+    PopulateTensor(fw_cell_state_, 0, zero_buffer.get(),
+                   zero_buffer.get() + zero_buffer_size);
+  }
+
+  void ResetBwOutputAndCellStates() {
+    // The output state holds n_bw_output_ values per batch and the cell state
+    // holds n_bw_cell_ values per batch; the two differ under projection.
+    std::vector<float> zero_output(n_bw_output_ * n_batch_, 0.0f);
+    std::vector<float> zero_cell(n_bw_cell_ * n_batch_, 0.0f);
+    PopulateTensor(bw_output_state_, 0, zero_output.data(),
+                   zero_output.data() + zero_output.size());
+    PopulateTensor(bw_cell_state_, 0, zero_cell.data(),
+                   zero_cell.data() + zero_cell.size());
+  }
+
+  void SetInput(int offset, float* begin, float* end) {
+    PopulateTensor(input_, offset, begin, end);
+  }
+
+  std::vector<float> GetFwOutput() { return ExtractVector<float>(fw_output_); }
+  std::vector<float> GetBwOutput() { return ExtractVector<float>(bw_output_); }
+
+  int num_inputs() { return n_input_; }
+  int num_fw_outputs() { return n_fw_output_; }
+  int num_bw_outputs() { return n_bw_output_; }
+  int num_fw_cells() { return n_fw_cell_; }
+  int num_bw_cells() { return n_bw_cell_; }
+  int num_batches() { return n_batch_; }
+  int sequence_length() { return sequence_length_; }
+
+ private:
+  int input_;
+  int fw_input_to_input_weights_;
+  int fw_input_to_forget_weights_;
+  int fw_input_to_cell_weights_;
+  int fw_input_to_output_weights_;
+
+  int fw_recurrent_to_input_weights_;
+  int fw_recurrent_to_forget_weights_;
+  int fw_recurrent_to_cell_weights_;
+  int fw_recurrent_to_output_weights_;
+
+  int fw_cell_to_input_weights_;
+  int fw_cell_to_forget_weights_;
+  int fw_cell_to_output_weights_;
+
+  int fw_input_gate_bias_;
+  int fw_forget_gate_bias_;
+  int fw_cell_bias_;
+  int fw_output_gate_bias_;
+
+  int fw_projection_weights_;
+  int fw_projection_bias_;
+
+  int bw_input_to_input_weights_;
+  int bw_input_to_forget_weights_;
+  int bw_input_to_cell_weights_;
+  int bw_input_to_output_weights_;
+
+  int bw_recurrent_to_input_weights_;
+  int bw_recurrent_to_forget_weights_;
+  int bw_recurrent_to_cell_weights_;
+  int bw_recurrent_to_output_weights_;
+
+  int bw_cell_to_input_weights_;
+  int bw_cell_to_forget_weights_;
+  int bw_cell_to_output_weights_;
+
+  int bw_input_gate_bias_;
+  int bw_forget_gate_bias_;
+  int bw_cell_bias_;
+  int bw_output_gate_bias_;
+
+  int bw_projection_weights_;
+  int bw_projection_bias_;
+
+  int fw_output_;
+  int fw_output_state_;
+  int fw_cell_state_;
+  int fw_scratch_buffer_;
+
+  int bw_output_;
+  int bw_output_state_;
+  int bw_cell_state_;
+  int bw_scratch_buffer_;
+
+  int n_batch_;
+  int n_input_;
+  int n_fw_cell_;
+  int n_bw_cell_;
+  int n_fw_output_;
+  int n_bw_output_;
+  int sequence_length_;
+};
+
+TEST(LSTMOpTest, BlackBoxTestNoCifgNoPeepholeNoProjectionNoClipping) {
+  const int n_batch = 1;
+  const int n_input = 2;
+  // n_cell and n_output have the same size when there is no projection.
+  const int n_cell = 4;
+  const int n_output = 4;
+  const int sequence_length = 3;
+
+  BidirectionalLSTMOpModel lstm(
+      n_batch, n_input, n_cell, n_output, sequence_length, /*use_cifg=*/false,
+      /*use_peephole=*/false, /*use_projection_weights=*/false,
+      /*use_projection_bias=*/false, /*cell_clip=*/0.0, /*proj_clip=*/0.0,
+      {
+          {sequence_length, n_batch, n_input},  // input tensor
+
+          // Forward cell
+          {n_cell, n_input},  // input_to_input_weight tensor
+          {n_cell, n_input},  // input_to_forget_weight tensor
+          {n_cell, n_input},  // input_to_cell_weight tensor
+          {n_cell, n_input},  // input_to_output_weight tensor
+
+          {n_cell, n_output},  // recurrent_to_input_weight tensor
+          {n_cell, n_output},  // recurrent_to_forget_weight tensor
+          {n_cell, n_output},  // recurrent_to_cell_weight tensor
+          {n_cell, n_output},  // recurrent_to_output_weight tensor
+
+          {0},  // cell_to_input_weight tensor
+          {0},  // cell_to_forget_weight tensor
+          {0},  // cell_to_output_weight tensor
+
+          {n_cell},  // input_gate_bias tensor
+          {n_cell},  // forget_gate_bias tensor
+          {n_cell},  // cell_bias tensor
+          {n_cell},  // output_gate_bias tensor
+
+          {0, 0},  // projection_weight tensor
+          {0},     // projection_bias tensor
+
+          // Backward cell
+          {n_cell, n_input},  // input_to_input_weight tensor
+          {n_cell, n_input},  // input_to_forget_weight tensor
+          {n_cell, n_input},  // input_to_cell_weight tensor
+          {n_cell, n_input},  // input_to_output_weight tensor
+
+          {n_cell, n_output},  // recurrent_to_input_weight tensor
+          {n_cell, n_output},  // recurrent_to_forget_weight tensor
+          {n_cell, n_output},  // recurrent_to_cell_weight tensor
+          {n_cell, n_output},  // recurrent_to_output_weight tensor
+
+          {0},  // cell_to_input_weight tensor
+          {0},  // cell_to_forget_weight tensor
+          {0},  // cell_to_output_weight tensor
+
+          {n_cell},  // input_gate_bias tensor
+          {n_cell},  // forget_gate_bias tensor
+          {n_cell},  // cell_bias tensor
+          {n_cell},  // output_gate_bias tensor
+
+          {0, 0},  // projection_weight tensor
+          {0},     // projection_bias tensor
+      });
+
+  lstm.SetInputToInputWeights({-0.45018822, -0.02338299, -0.0870589,
+                               -0.34550029, 0.04266912, -0.15680569,
+                               -0.34856534, 0.43890524});
+
+  lstm.SetInputToCellWeights({-0.50013041, 0.1370284, 0.11810488, 0.2013163,
+                              -0.20583314, 0.44344562, 0.22077113,
+                              -0.29909778});
+
+  lstm.SetInputToForgetWeights({0.09701663, 0.20334584, -0.50592935,
+                                -0.31343272, -0.40032279, 0.44781327,
+                                0.01387155, -0.35593212});
+
+  lstm.SetInputToOutputWeights({-0.25065863, -0.28290087, 0.04613829,
+                                0.40525138, 0.44272184, 0.03897077, -0.1556896,
+                                0.19487578});
+
+  lstm.SetInputGateBias({0., 0., 0., 0.});
+
+  lstm.SetCellBias({0., 0., 0., 0.});
+
+  lstm.SetForgetGateBias({1., 1., 1., 1.});
+
+  lstm.SetOutputGateBias({0., 0., 0., 0.});
+
+  lstm.SetRecurrentToInputWeights(
+      {-0.0063535, -0.2042388, 0.31454784, -0.35746509, 0.28902304, 0.08183324,
+       -0.16555229, 0.02286911, -0.13566875, 0.03034258, 0.48091322,
+       -0.12528998, 0.24077177, -0.51332325, -0.33502164, 0.10629296});
+
+  lstm.SetRecurrentToCellWeights(
+      {-0.3407414, 0.24443203, -0.2078532, 0.26320225, 0.05695659, -0.00123841,
+       -0.4744786, -0.35869038, -0.06418842, -0.13502428, -0.501764, 0.22830659,
+       -0.46367589, 0.26016325, -0.03894562, -0.16368064});
+
+  lstm.SetRecurrentToForgetWeights(
+      {-0.48684245, -0.06655136, 0.42224967, 0.2112639, 0.27654213, 0.20864892,
+       -0.07646349, 0.45877004, 0.00141793, -0.14609534, 0.36447752, 0.09196436,
+       0.28053468, 0.01560611, -0.20127171, -0.01140004});
+
+  lstm.SetRecurrentToOutputWeights(
+      {0.43385774, -0.17194885, 0.2718237, 0.09215671, 0.24107647, -0.39835793,
+       0.18212086, 0.01301402, 0.48572797, -0.50656658, 0.20047462, -0.20607421,
+       -0.51818722, -0.15390486, 0.0468148, 0.39922136});
+
+  // Input should have n_input * sequence_length values (n_batch is 1).
+  static float lstm_input[] = {2., 3., 3., 4., 1., 1.};
+  static float lstm_fw_golden_output[] = {
+      -0.02973187, 0.1229473,  0.20885126, -0.15358765,
+      -0.03716109, 0.12507336, 0.41193449, -0.20860538,
+      -0.15053082, 0.09120187, 0.24278517, -0.12222792};
+  static float lstm_bw_golden_output[] = {
+      -0.0806187, 0.139077, 0.400476, -0.197842,
+      -0.0332076, 0.123838, 0.309777, -0.17621,
+      -0.0490733, 0.0739237, 0.067706, -0.0208124};
+
+  // Resetting cell_state and output_state
+  lstm.ResetFwOutputAndCellStates();
+  lstm.ResetBwOutputAndCellStates();
+
+  float* batch0_start = lstm_input;
+  float* batch0_end = batch0_start + lstm.num_inputs() * lstm.sequence_length();
+
+  lstm.SetInput(0, batch0_start, batch0_end);
+
+  lstm.Invoke();
+
+  float* fw_golden_start = lstm_fw_golden_output;
+  float* fw_golden_end =
+      fw_golden_start + lstm.num_fw_outputs() * lstm.sequence_length();
+  std::vector<float> fw_expected;
+  fw_expected.insert(fw_expected.end(), fw_golden_start, fw_golden_end);
+  EXPECT_THAT(lstm.GetFwOutput(),
+              ElementsAreArray(ArrayFloatNear(fw_expected)));
+
+  float* bw_golden_start = lstm_bw_golden_output;
+  float* bw_golden_end =
+      bw_golden_start + lstm.num_bw_outputs() * lstm.sequence_length();
+  std::vector<float> bw_expected;
+  bw_expected.insert(bw_expected.end(), bw_golden_start, bw_golden_end);
+  EXPECT_THAT(lstm.GetBwOutput(),
+              ElementsAreArray(ArrayFloatNear(bw_expected)));
+
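+  // Check reversed inputs. The setters give both directions the same weights,
+  // so on time-reversed input the backward output should equal the
+  // time-reversed forward golden output, and vice versa.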
+  static float lstm_input_reversed[] = {1., 1., 3., 4., 2., 3.};
+
+  // Resetting cell_state and output_state
+  lstm.ResetFwOutputAndCellStates();
+  lstm.ResetBwOutputAndCellStates();
+
+  batch0_start = lstm_input_reversed;
+  batch0_end = batch0_start + lstm.num_inputs() * lstm.sequence_length();
+
+  lstm.SetInput(0, batch0_start, batch0_end);
+
+  lstm.Invoke();
+
+  fw_expected.clear();
+  for (int s = 0; s < lstm.sequence_length(); s++) {
+    fw_golden_start = lstm_fw_golden_output + s * lstm.num_fw_outputs();
+    fw_golden_end = fw_golden_start + lstm.num_fw_outputs();
+    fw_expected.insert(fw_expected.begin(), fw_golden_start, fw_golden_end);
+  }
+  EXPECT_THAT(lstm.GetBwOutput(),
+              ElementsAreArray(ArrayFloatNear(fw_expected)));
+
+  bw_expected.clear();
+  for (int s = 0; s < lstm.sequence_length(); s++) {
+    bw_golden_start = lstm_bw_golden_output + s * lstm.num_bw_outputs();
+    bw_golden_end = bw_golden_start + lstm.num_bw_outputs();
+    bw_expected.insert(bw_expected.begin(), bw_golden_start, bw_golden_end);
+  }
+  EXPECT_THAT(lstm.GetFwOutput(),
+              ElementsAreArray(ArrayFloatNear(bw_expected)));
+}
+
+TEST(LSTMOpTest, BlackBoxTestWithCifgWithPeepholeNoProjectionNoClipping) {
+  const int n_batch = 1;
+  const int n_input = 2;
+  // n_cell and n_output have the same size when there is no projection.
+  const int n_cell = 4;
+  const int n_output = 4;
+  const int sequence_length = 3;
+
+  BidirectionalLSTMOpModel lstm(
+      n_batch, n_input, n_cell, n_output, sequence_length, /*use_cifg=*/true,
+      /*use_peephole=*/true, /*use_projection_weights=*/false,
+      /*use_projection_bias=*/false, /*cell_clip=*/0.0, /*proj_clip=*/0.0,
+      {
+          {sequence_length, n_batch, n_input},  // input tensor
+
+          // Forward cell
+          {0, 0},             // input_to_input_weight tensor
+          {n_cell, n_input},  // input_to_forget_weight tensor
+          {n_cell, n_input},  // input_to_cell_weight tensor
+          {n_cell, n_input},  // input_to_output_weight tensor
+
+          {0, 0},              // recurrent_to_input_weight tensor
+          {n_cell, n_output},  // recurrent_to_forget_weight tensor
+          {n_cell, n_output},  // recurrent_to_cell_weight tensor
+          {n_cell, n_output},  // recurrent_to_output_weight tensor
+
+          {0},       // cell_to_input_weight tensor
+          {n_cell},  // cell_to_forget_weight tensor
+          {n_cell},  // cell_to_output_weight tensor
+
+          {0},       // input_gate_bias tensor
+          {n_cell},  // forget_gate_bias tensor
+          {n_cell},  // cell_bias tensor
+          {n_cell},  // output_gate_bias tensor
+
+          {0, 0},  // projection_weight tensor
+          {0},     // projection_bias tensor
+
+          // Backward cell
+          {0, 0},             // input_to_input_weight tensor
+          {n_cell, n_input},  // input_to_forget_weight tensor
+          {n_cell, n_input},  // input_to_cell_weight tensor
+          {n_cell, n_input},  // input_to_output_weight tensor
+
+          {0, 0},              // recurrent_to_input_weight tensor
+          {n_cell, n_output},  // recurrent_to_forget_weight tensor
+          {n_cell, n_output},  // recurrent_to_cell_weight tensor
+          {n_cell, n_output},  // recurrent_to_output_weight tensor
+
+          {0},       // cell_to_input_weight tensor
+          {n_cell},  // cell_to_forget_weight tensor
+          {n_cell},  // cell_to_output_weight tensor
+
+          {0},       // input_gate_bias tensor
+          {n_cell},  // forget_gate_bias tensor
+          {n_cell},  // cell_bias tensor
+          {n_cell},  // output_gate_bias tensor
+
+          {0, 0},  // projection_weight tensor
+          {0},     // projection_bias tensor
+      });
+
+  lstm.SetInputToCellWeights({-0.49770179, -0.27711356, -0.09624726, 0.05100781,
+                              0.04717243, 0.48944736, -0.38535351,
+                              -0.17212132});
+
+  lstm.SetInputToForgetWeights({-0.55291498, -0.42866567, 0.13056988,
+                                -0.3633365, -0.22755712, 0.28253698, 0.24407166,
+                                0.33826375});
+
+  lstm.SetInputToOutputWeights({0.10725588, -0.02335852, -0.55932593,
+                                -0.09426838, -0.44257352, 0.54939759,
+                                0.01533556, 0.42751634});
+
+  lstm.SetCellBias({0., 0., 0., 0.});
+
+  lstm.SetForgetGateBias({1., 1., 1., 1.});
+
+  lstm.SetOutputGateBias({0., 0., 0., 0.});
+
+  lstm.SetRecurrentToCellWeights(
+      {0.54066205, -0.32668582, -0.43562764, -0.56094903, 0.42957711,
+       0.01841056, -0.32764608, -0.33027974, -0.10826075, 0.20675004,
+       0.19069612, -0.03026325, -0.54532051, 0.33003211, 0.44901288,
+       0.21193194});
+
+  lstm.SetRecurrentToForgetWeights(
+      {-0.13832897, -0.0515101, -0.2359007, -0.16661474, -0.14340827,
+       0.36986142, 0.23414481, 0.55899, 0.10798943, -0.41174671, 0.17751795,
+       -0.34484994, -0.35874045, -0.11352962, 0.27268326, 0.54058349});
+
+  lstm.SetRecurrentToOutputWeights(
+      {0.41613156, 0.42610586, -0.16495961, -0.5663873, 0.30579174, -0.05115908,
+       -0.33941799, 0.23364776, 0.11178309, 0.09481031, -0.26424935, 0.46261835,
+       0.50248802, 0.26114327, -0.43736315, 0.33149987});
+
+  lstm.SetCellToForgetWeights(
+      {0.47485286, -0.51955009, -0.24458408, 0.31544167});
+  lstm.SetCellToOutputWeights(
+      {-0.17135078, 0.82760304, 0.85573703, -0.77109635});
+
+  static float lstm_input[] = {2., 3., 3., 4., 1., 1.};
+  static float lstm_fw_golden_output[] = {
+      -0.36444446, -0.00352185, 0.12886585, -0.05163646,
+      -0.42312205, -0.01218222, 0.24201041, -0.08124574,
+      -0.358325,   -0.04621704, 0.21641694, -0.06471302};
+  static float lstm_bw_golden_output[] = {
+      -0.401685, -0.0232794, 0.288642,  -0.123074,   -0.42915,  -0.00871577,
+      0.20912,   -0.103567,  -0.166398, -0.00486649, 0.0697471, -0.0537578};
+
+  // Resetting cell_state and output_state
+  lstm.ResetFwOutputAndCellStates();
+  lstm.ResetBwOutputAndCellStates();
+
+  float* batch0_start = lstm_input;
+  float* batch0_end = batch0_start + lstm.num_inputs() * lstm.sequence_length();
+
+  lstm.SetInput(0, batch0_start, batch0_end);
+
+  lstm.Invoke();
+
+  float* fw_golden_start = lstm_fw_golden_output;
+  float* fw_golden_end =
+      fw_golden_start + lstm.num_fw_outputs() * lstm.sequence_length();
+  std::vector<float> fw_expected;
+  fw_expected.insert(fw_expected.end(), fw_golden_start, fw_golden_end);
+  EXPECT_THAT(lstm.GetFwOutput(),
+              ElementsAreArray(ArrayFloatNear(fw_expected)));
+
+  float* bw_golden_start = lstm_bw_golden_output;
+  float* bw_golden_end =
+      bw_golden_start + lstm.num_bw_outputs() * lstm.sequence_length();
+  std::vector<float> bw_expected;
+  bw_expected.insert(bw_expected.end(), bw_golden_start, bw_golden_end);
+  EXPECT_THAT(lstm.GetBwOutput(),
+              ElementsAreArray(ArrayFloatNear(bw_expected)));
+
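+  // Check reversed inputs. The setters give both directions the same weights,
+  // so on time-reversed input the backward output should equal the
+  // time-reversed forward golden output, and vice versa.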
+  static float lstm_input_reversed[] = {1., 1., 3., 4., 2., 3.};
+
+  // Resetting cell_state and output_state
+  lstm.ResetFwOutputAndCellStates();
+  lstm.ResetBwOutputAndCellStates();
+
+  batch0_start = lstm_input_reversed;
+  batch0_end = batch0_start + lstm.num_inputs() * lstm.sequence_length();
+
+  lstm.SetInput(0, batch0_start, batch0_end);
+
+  lstm.Invoke();
+
+  fw_expected.clear();
+  for (int s = 0; s < lstm.sequence_length(); s++) {
+    fw_golden_start = lstm_fw_golden_output + s * lstm.num_fw_outputs();
+    fw_golden_end = fw_golden_start + lstm.num_fw_outputs();
+    fw_expected.insert(fw_expected.begin(), fw_golden_start, fw_golden_end);
+  }
+  EXPECT_THAT(lstm.GetBwOutput(),
+              ElementsAreArray(ArrayFloatNear(fw_expected)));
+
+  bw_expected.clear();
+  for (int s = 0; s < lstm.sequence_length(); s++) {
+    bw_golden_start = lstm_bw_golden_output + s * lstm.num_bw_outputs();
+    bw_golden_end = bw_golden_start + lstm.num_bw_outputs();
+    bw_expected.insert(bw_expected.begin(), bw_golden_start, bw_golden_end);
+  }
+  EXPECT_THAT(lstm.GetFwOutput(),
+              ElementsAreArray(ArrayFloatNear(bw_expected)));
+}
+
+TEST(LSTMOpTest, BlackBoxTestWithPeepholeWithProjectionNoClipping) {
+  const int n_batch = 2;
+  const int n_input = 5;
+  const int n_cell = 20;
+  const int n_output = 16;
+  const int sequence_length = 4;
+
+  BidirectionalLSTMOpModel lstm(
+      n_batch, n_input, n_cell, n_output, sequence_length, /*use_cifg=*/false,
+      /*use_peephole=*/true, /*use_projection_weights=*/true,
+      /*use_projection_bias=*/false, /*cell_clip=*/0.0, /*proj_clip=*/0.0,
+      {
+          {sequence_length, n_batch, n_input},  // input tensor
+
+          // Forward cell
+          {n_cell, n_input},  // input_to_input_weight tensor
+          {n_cell, n_input},  // input_to_forget_weight tensor
+          {n_cell, n_input},  // input_to_cell_weight tensor
+          {n_cell, n_input},  // input_to_output_weight tensor
+
+          {n_cell, n_output},  // recurrent_to_input_weight tensor
+          {n_cell, n_output},  // recurrent_to_forget_weight tensor
+          {n_cell, n_output},  // recurrent_to_cell_weight tensor
+          {n_cell, n_output},  // recurrent_to_output_weight tensor
+
+          {n_cell},  // cell_to_input_weight tensor
+          {n_cell},  // cell_to_forget_weight tensor
+          {n_cell},  // cell_to_output_weight tensor
+
+          {n_cell},  // input_gate_bias tensor
+          {n_cell},  // forget_gate_bias tensor
+          {n_cell},  // cell_bias tensor
+          {n_cell},  // output_gate_bias tensor
+
+          {n_output, n_cell},  // projection_weight tensor
+          {0},                 // projection_bias tensor
+
+          // Backward cell
+          {n_cell, n_input},  // input_to_input_weight tensor
+          {n_cell, n_input},  // input_to_forget_weight tensor
+          {n_cell, n_input},  // input_to_cell_weight tensor
+          {n_cell, n_input},  // input_to_output_weight tensor
+
+          {n_cell, n_output},  // recurrent_to_input_weight tensor
+          {n_cell, n_output},  // recurrent_to_forget_weight tensor
+          {n_cell, n_output},  // recurrent_to_cell_weight tensor
+          {n_cell, n_output},  // recurrent_to_output_weight tensor
+
+          {n_cell},  // cell_to_input_weight tensor
+          {n_cell},  // cell_to_forget_weight tensor
+          {n_cell},  // cell_to_output_weight tensor
+
+          {n_cell},  // input_gate_bias tensor
+          {n_cell},  // forget_gate_bias tensor
+          {n_cell},  // cell_bias tensor
+          {n_cell},  // output_gate_bias tensor
+
+          {n_output, n_cell},  // projection_weight tensor
+          {0},                 // projection_bias tensor
+      });
+
+  lstm.SetInputToInputWeights(
+      {0.021393683,  0.06124551,    0.046905167,  -0.014657677,  -0.03149463,
+       0.09171803,   0.14647801,    0.10797193,   -0.0057968358, 0.0019193048,
+       -0.2726754,   0.10154029,    -0.018539885, 0.080349885,   -0.10262385,
+       -0.022599787, -0.09121155,   -0.008675967, -0.045206103,  -0.0821282,
+       -0.008045952, 0.015478081,   0.055217247,  0.038719587,   0.044153627,
+       -0.06453243,  0.05031825,    -0.046935108, -0.008164439,  0.014574226,
+       -0.1671009,   -0.15519552,   -0.16819797,  -0.13971269,   -0.11953059,
+       0.25005487,   -0.22790983,   0.009855087,  -0.028140958,  -0.11200698,
+       0.11295408,   -0.0035217577, 0.054485075,  0.05184695,    0.064711206,
+       0.10989193,   0.11674786,    0.03490607,   0.07727357,    0.11390585,
+       -0.1863375,   -0.1034451,    -0.13945189,  -0.049401227,  -0.18767063,
+       0.042483903,  0.14233552,    0.13832581,   0.18350165,    0.14545603,
+       -0.028545704, 0.024939531,   0.050929718,  0.0076203286,  -0.0029723682,
+       -0.042484224, -0.11827596,   -0.09171104,  -0.10808628,   -0.16327988,
+       -0.2273378,   -0.0993647,    -0.017155107, 0.0023917493,  0.049272764,
+       0.0038534778, 0.054764505,   0.089753784,  0.06947234,    0.08014476,
+       -0.04544234,  -0.0497073,    -0.07135631,  -0.048929106,  -0.004042012,
+       -0.009284026, 0.018042054,   0.0036860977, -0.07427302,   -0.11434604,
+       -0.018995456, 0.031487543,   0.012834908,  0.019977754,   0.044256654,
+       -0.39292613,  -0.18519334,   -0.11651281,  -0.06809892,   0.011373677});
+
+  lstm.SetInputToForgetWeights(
+      {-0.0018401089, -0.004852237,  0.03698424,   0.014181704,   0.028273236,
+       -0.016726194,  -0.05249759,   -0.10204261,  0.00861066,    -0.040979505,
+       -0.009899187,  0.01923892,    -0.028177269, -0.08535103,   -0.14585495,
+       0.10662567,    -0.01909731,   -0.017883534, -0.0047269356, -0.045103323,
+       0.0030784295,  0.076784775,   0.07463696,   0.094531395,   0.0814421,
+       -0.12257899,   -0.033945758,  -0.031303465, 0.045630626,   0.06843887,
+       -0.13492945,   -0.012480007,  -0.0811829,   -0.07224499,   -0.09628791,
+       0.045100946,   0.0012300825,  0.013964662,  0.099372394,   0.02543059,
+       0.06958324,    0.034257296,   0.0482646,    0.06267997,    0.052625068,
+       0.12784666,    0.07077897,    0.025725935,  0.04165009,    0.07241905,
+       0.018668644,   -0.037377294,  -0.06277783,  -0.08833636,   -0.040120605,
+       -0.011405586,  -0.007808335,  -0.010301386, -0.005102167,  0.027717464,
+       0.05483423,    0.11449111,    0.11289652,   0.10939839,    0.13396506,
+       -0.08402166,   -0.01901462,   -0.044678304, -0.07720565,   0.014350063,
+       -0.11757958,   -0.0652038,    -0.08185733,  -0.076754324,  -0.092614375,
+       0.10405491,    0.052960336,   0.035755895,  0.035839386,   -0.012540553,
+       0.036881298,   0.02913376,    0.03420159,   0.05448447,    -0.054523353,
+       0.02582715,    0.02327355,    -0.011857179, -0.0011980024, -0.034641717,
+       -0.026125094,  -0.17582615,   -0.15923657,  -0.27486774,   -0.0006143371,
+       0.0001771948,  -8.470171e-05, 0.02651807,   0.045790765,   0.06956496});
+
+  lstm.SetInputToCellWeights(
+      {-0.04580283,   -0.09549462,   -0.032418985,  -0.06454633,
+       -0.043528453,  0.043018587,   -0.049152344,  -0.12418144,
+       -0.078985475,  -0.07596889,   0.019484362,   -0.11434962,
+       -0.0074034138, -0.06314844,   -0.092981495,  0.0062155537,
+       -0.025034338,  -0.0028890965, 0.048929527,   0.06235075,
+       0.10665918,    -0.032036792,  -0.08505916,   -0.10843358,
+       -0.13002433,   -0.036816437,  -0.02130134,   -0.016518239,
+       0.0047691227,  -0.0025825808, 0.066017866,   0.029991534,
+       -0.10652836,   -0.1037554,    -0.13056071,   -0.03266643,
+       -0.033702414,  -0.006473424,  -0.04611692,   0.014419339,
+       -0.025174323,  0.0396852,     0.081777506,   0.06157468,
+       0.10210095,    -0.009658194,  0.046511717,   0.03603906,
+       0.0069369148,  0.015960095,   -0.06507666,   0.09551598,
+       0.053568836,   0.06408714,    0.12835667,    -0.008714329,
+       -0.20211966,   -0.12093674,   0.029450472,   0.2849013,
+       -0.029227901,  0.1164364,     -0.08560263,   0.09941786,
+       -0.036999565,  -0.028842626,  -0.0033637602, -0.017012902,
+       -0.09720865,   -0.11193351,   -0.029155117,  -0.017936034,
+       -0.009768936,  -0.04223324,   -0.036159635,  0.06505112,
+       -0.021742892,  -0.023377212,  -0.07221364,   -0.06430552,
+       0.05453865,    0.091149814,   0.06387331,    0.007518393,
+       0.055960953,   0.069779344,   0.046411168,   0.10509911,
+       0.07463894,    0.0075130584,  0.012850982,   0.04555431,
+       0.056955688,   0.06555285,    0.050801456,   -0.009862683,
+       0.00826772,    -0.026555609,  -0.0073611983, -0.0014897042});
+
+  lstm.SetInputToOutputWeights(
+      {-0.0998932,   -0.07201956,  -0.052803773,  -0.15629593,  -0.15001918,
+       -0.07650751,  0.02359855,   -0.075155355,  -0.08037709,  -0.15093534,
+       0.029517552,  -0.04751393,  0.010350531,   -0.02664851,  -0.016839722,
+       -0.023121163, 0.0077019283, 0.012851257,   -0.05040649,  -0.0129761,
+       -0.021737747, -0.038305793, -0.06870586,   -0.01481247,  -0.001285394,
+       0.10124236,   0.083122835,  0.053313006,   -0.062235646, -0.075637154,
+       -0.027833903, 0.029774971,  0.1130802,     0.09218906,   0.09506135,
+       -0.086665764, -0.037162706, -0.038880914,  -0.035832845, -0.014481564,
+       -0.09825003,  -0.12048569,  -0.097665586,  -0.05287633,  -0.0964047,
+       -0.11366429,  0.035777505,  0.13568819,    0.052451383,  0.050649304,
+       0.05798951,   -0.021852335, -0.099848844,  0.014740475,  -0.078897946,
+       0.04974699,   0.014160473,  0.06973932,    0.04964942,   0.033364646,
+       0.08190124,   0.025535367,  0.050893165,   0.048514254,  0.06945813,
+       -0.078907564, -0.06707616,  -0.11844508,   -0.09986688,  -0.07509403,
+       0.06263226,   0.14925587,   0.20188436,    0.12098451,   0.14639415,
+       0.0015017595, -0.014267382, -0.03417257,   0.012711468,  0.0028300495,
+       -0.024758482, -0.05098548,  -0.0821182,    0.014225672,  0.021544158,
+       0.08949725,   0.07505268,   -0.0020780868, 0.04908258,   0.06476295,
+       -0.022907063, 0.027562456,  0.040185735,   0.019567577,  -0.015598739,
+       -0.049097303, -0.017121866, -0.083368234,  -0.02332002,  -0.0840956});
+
+  lstm.SetInputGateBias(
+      {0.02234832,  0.14757581,   0.18176508,  0.10380666,  0.053110216,
+       -0.06928846, -0.13942584,  -0.11816189, 0.19483899,  0.03652339,
+       -0.10250295, 0.036714908,  -0.18426876, 0.036065217, 0.21810818,
+       0.02383196,  -0.043370757, 0.08690144,  -0.04444982, 0.00030581196});
+
+  lstm.SetForgetGateBias({0.035185695, -0.042891346, -0.03032477, 0.23027696,
+                          0.11098921,  0.15378423,   0.09263801,  0.09790885,
+                          0.09508917,  0.061199076,  0.07665568,  -0.015443159,
+                          -0.03499149, 0.046190713,  0.08895977,  0.10899629,
+                          0.40694186,  0.06030037,   0.012413437, -0.06108739});
+
+  lstm.SetCellBias({-0.024379363, 0.0055531194, 0.23377132,   0.033463873,
+                    -0.1483596,   -0.10639995,  -0.091433935, 0.058573797,
+                    -0.06809782,  -0.07889636,  -0.043246906, -0.09829136,
+                    -0.4279842,   0.034901652,  0.18797937,   0.0075234566,
+                    0.016178843,  0.1749513,    0.13975595,   0.92058027});
+
+  lstm.SetOutputGateBias(
+      {0.046159424,  -0.0012809046, 0.03563469,   0.12648113, 0.027195795,
+       0.35373217,   -0.018957434,  0.008907322,  -0.0762701, 0.12018895,
+       0.04216877,   0.0022856654,  0.040952638,  0.3147856,  0.08225149,
+       -0.057416286, -0.14995944,   -0.008040261, 0.13208859, 0.029760877});
+
+  lstm.SetRecurrentToInputWeights(
+      {-0.001374326,   -0.078856036,   0.10672688,    0.029162422,
+       -0.11585556,    0.02557986,     -0.13446963,   -0.035785314,
+       -0.01244275,    0.025961924,    -0.02337298,   -0.044228926,
+       -0.055839065,   -0.046598054,   -0.010546039,  -0.06900766,
+       0.027239809,    0.022582639,    -0.013296484,  -0.05459212,
+       0.08981,        -0.045407712,   0.08682226,    -0.06867011,
+       -0.14390695,    -0.02916037,    0.000996957,   0.091420636,
+       0.14283475,     -0.07390571,    -0.06402044,   0.062524505,
+       -0.093129106,   0.04860203,     -0.08364217,   -0.08119002,
+       0.009352075,    0.22920375,     0.0016303885,  0.11583097,
+       -0.13732095,    0.012405723,    -0.07551853,   0.06343048,
+       0.12162708,     -0.031923793,   -0.014335606,  0.01790974,
+       -0.10650317,    -0.0724401,     0.08554849,    -0.05727212,
+       0.06556731,     -0.042729504,   -0.043227166,  0.011683251,
+       -0.013082158,   -0.029302018,   -0.010899579,  -0.062036745,
+       -0.022509435,   -0.00964907,    -0.01567329,   0.04260106,
+       -0.07787477,    -0.11576462,    0.017356863,   0.048673786,
+       -0.017577527,   -0.05527947,    -0.082487635,  -0.040137455,
+       -0.10820036,    -0.04666372,    0.022746278,   -0.07851417,
+       0.01068115,     0.032956902,    0.022433773,   0.0026891115,
+       0.08944216,     -0.0685835,     0.010513544,   0.07228705,
+       0.02032331,     -0.059686817,   -0.0005566496, -0.086984694,
+       0.040414046,    -0.1380399,     0.094208956,   -0.05722982,
+       0.012092817,    -0.04989123,    -0.086576,     -0.003399834,
+       -0.04696032,    -0.045747425,   0.10091314,    0.048676282,
+       -0.029037097,   0.031399418,    -0.0040285117, 0.047237843,
+       0.09504992,     0.041799378,    -0.049185462,  -0.031518843,
+       -0.10516937,    0.026374253,    0.10058866,    -0.0033195973,
+       -0.041975245,   0.0073591834,   0.0033782164,  -0.004325073,
+       -0.10167381,    0.042500053,    -0.01447153,   0.06464186,
+       -0.017142897,   0.03312627,     0.009205989,   0.024138335,
+       -0.011337001,   0.035530265,    -0.010912711,  0.0706555,
+       -0.005894094,   0.051841937,    -0.1401738,    -0.02351249,
+       0.0365468,      0.07590991,     0.08838724,    0.021681072,
+       -0.10086113,    0.019608743,    -0.06195883,   0.077335775,
+       0.023646897,    -0.095322326,   0.02233014,    0.09756986,
+       -0.048691444,   -0.009579111,   0.07595467,    0.11480546,
+       -0.09801813,    0.019894179,    0.08502348,    0.004032281,
+       0.037211012,    0.068537936,    -0.048005626,  -0.091520436,
+       -0.028379958,   -0.01556313,    0.06554592,    -0.045599163,
+       -0.01672207,    -0.020169014,   -0.011877351,  -0.20212261,
+       0.010889619,    0.0047078193,   0.038385306,   0.08540671,
+       -0.017140968,   -0.0035865551,  0.016678626,   0.005633034,
+       0.015963363,    0.00871737,     0.060130805,   0.028611384,
+       0.10109069,     -0.015060172,   -0.07894427,   0.06401885,
+       0.011584063,    -0.024466386,   0.0047652307,  -0.09041358,
+       0.030737216,    -0.0046374933,  0.14215417,    -0.11823516,
+       0.019899689,    0.006106124,    -0.027092824,  0.0786356,
+       0.05052217,     -0.058925,      -0.011402121,  -0.024987547,
+       -0.0013661642,  -0.06832946,    -0.015667673,  -0.1083353,
+       -0.00096863037, -0.06988685,    -0.053350925,  -0.027275559,
+       -0.033664223,   -0.07978348,    -0.025200296,  -0.017207067,
+       -0.058403496,   -0.055697463,   0.005798788,   0.12965427,
+       -0.062582195,   0.0013350133,   -0.10482091,   0.0379771,
+       0.072521195,    -0.0029455067,  -0.13797039,   -0.03628521,
+       0.013806405,    -0.017858358,   -0.01008298,   -0.07700066,
+       -0.017081132,   0.019358726,    0.0027079724,  0.004635139,
+       0.062634714,    -0.02338735,    -0.039547626,  -0.02050681,
+       0.03385117,     -0.083611414,   0.002862572,   -0.09421313,
+       0.058618143,    -0.08598433,    0.00972939,    0.023867095,
+       -0.053934585,   -0.023203006,   0.07452513,    -0.048767887,
+       -0.07314807,    -0.056307215,   -0.10433547,   -0.06440842,
+       0.04328182,     0.04389765,     -0.020006588,  -0.09076438,
+       -0.11652589,    -0.021705797,   0.03345259,    -0.010329105,
+       -0.025767034,   0.013057034,    -0.07316461,   -0.10145612,
+       0.06358255,     0.18531723,     0.07759293,    0.12006465,
+       0.1305557,      0.058638252,    -0.03393652,   0.09622831,
+       -0.16253184,    -2.4580743e-06, 0.079869635,   -0.070196845,
+       -0.005644518,   0.06857898,     -0.12598175,   -0.035084512,
+       0.03156317,     -0.12794146,    -0.031963028,  0.04692781,
+       0.030070418,    0.0071660685,   -0.095516115,  -0.004643372,
+       0.040170413,    -0.062104587,   -0.0037324072, 0.0554317,
+       0.08184801,     -0.019164372,   0.06791302,    0.034257166,
+       -0.10307039,    0.021943003,    0.046745934,   0.0790918,
+       -0.0265588,     -0.007824208,   0.042546265,   -0.00977924,
+       -0.0002440307,  -0.017384544,   -0.017990116,  0.12252321,
+       -0.014512694,   -0.08251313,    0.08861942,    0.13589665,
+       0.026351685,    0.012641483,    0.07466548,    0.044301085,
+       -0.045414884,   -0.051112458,   0.03444247,    -0.08502782,
+       -0.04106223,    -0.028126027,   0.028473156,   0.10467447});
+
+  lstm.SetRecurrentToForgetWeights(
+      {-0.057784554,  -0.026057621,  -0.068447545,   -0.022581743,
+       0.14811787,    0.10826372,    0.09471067,     0.03987225,
+       -0.0039523416, 0.00030638507, 0.053185795,    0.10572994,
+       0.08414449,    -0.022036452,  -0.00066928595, -0.09203576,
+       0.032950465,   -0.10985798,   -0.023809856,   0.0021431844,
+       -0.02196096,   -0.00326074,   0.00058621005,  -0.074678116,
+       -0.06193199,   0.055729095,   0.03736828,     0.020123724,
+       0.061878487,   -0.04729229,   0.034919553,    -0.07585433,
+       -0.04421272,   -0.044019096,  0.085488975,    0.04058006,
+       -0.06890133,   -0.030951202,  -0.024628663,   -0.07672815,
+       0.034293607,   0.08556707,    -0.05293577,    -0.033561368,
+       -0.04899627,   0.0241671,     0.015736353,    -0.095442444,
+       -0.029564252,  0.016493602,   -0.035026584,   0.022337519,
+       -0.026871363,  0.004780428,   0.0077918363,   -0.03601621,
+       0.016435321,   -0.03263031,   -0.09543275,    -0.047392778,
+       0.013454138,   0.028934088,   0.01685226,     -0.086110644,
+       -0.046250615,  -0.01847454,   0.047608484,    0.07339695,
+       0.034546845,   -0.04881143,   0.009128804,    -0.08802852,
+       0.03761666,    0.008096139,   -0.014454086,   0.014361001,
+       -0.023502491,  -0.0011840804, -0.07607001,    0.001856849,
+       -0.06509276,   -0.006021153,  -0.08570962,    -0.1451793,
+       0.060212336,   0.055259194,   0.06974018,     0.049454916,
+       -0.027794661,  -0.08077226,   -0.016179763,   0.1169753,
+       0.17213494,    -0.0056326236, -0.053934924,   -0.0124349,
+       -0.11520337,   0.05409887,    0.088759385,    0.0019655675,
+       0.0042065294,  0.03881498,    0.019844765,    0.041858196,
+       -0.05695512,   0.047233116,   0.038937137,    -0.06542224,
+       0.014429736,   -0.09719407,   0.13908425,     -0.05379757,
+       0.012321099,   0.082840554,   -0.029899208,   0.044217527,
+       0.059855383,   0.07711018,    -0.045319796,   0.0948846,
+       -0.011724666,  -0.0033288454, -0.033542685,   -0.04764985,
+       -0.13873616,   0.040668588,   0.034832682,    -0.015319203,
+       -0.018715994,  0.046002675,   0.0599172,      -0.043107376,
+       0.0294216,     -0.002314414,  -0.022424703,   0.0030315618,
+       0.0014641669,  0.0029166266,  -0.11878115,    0.013738511,
+       0.12375372,    -0.0006038222, 0.029104086,    0.087442465,
+       0.052958444,   0.07558703,    0.04817258,     0.044462286,
+       -0.015213451,  -0.08783778,   -0.0561384,     -0.003008196,
+       0.047060397,   -0.002058388,  0.03429439,     -0.018839769,
+       0.024734668,   0.024614193,   -0.042046934,   0.09597743,
+       -0.0043254104, 0.04320769,    0.0064070094,   -0.0019131786,
+       -0.02558259,   -0.022822596,  -0.023273505,   -0.02464396,
+       -0.10991725,   -0.006240552,  0.0074488563,   0.024044557,
+       0.04383914,    -0.046476185,  0.028658995,    0.060410924,
+       0.050786525,   0.009452605,   -0.0073054377,  -0.024810238,
+       0.0052906186,  0.0066939713,  -0.0020913032,  0.014515517,
+       0.015898481,   0.021362653,   -0.030262267,   0.016587038,
+       -0.011442813,  0.041154444,   -0.007631438,   -0.03423484,
+       -0.010977775,  0.036152758,   0.0066366293,   0.11915515,
+       0.02318443,    -0.041350313,  0.021485701,    -0.10906167,
+       -0.028218046,  -0.00954771,   0.020531068,    -0.11995105,
+       -0.03672871,   0.024019798,   0.014255957,    -0.05221243,
+       -0.00661567,   -0.04630967,   0.033188973,    0.10107534,
+       -0.014027541,  0.030796422,   -0.10270911,    -0.035999842,
+       0.15443139,    0.07684145,    0.036571592,    -0.035900835,
+       -0.0034699554, 0.06209149,    0.015920248,    -0.031122351,
+       -0.03858649,   0.01849943,    0.13872518,     0.01503974,
+       0.069941424,   -0.06948533,   -0.0088794185,  0.061282158,
+       -0.047401894,  0.03100163,    -0.041533746,   -0.10430945,
+       0.044574402,   -0.01425562,   -0.024290353,   0.034563623,
+       0.05866852,    0.023947537,   -0.09445152,    0.035450947,
+       0.02247216,    -0.0042998926, 0.061146557,    -0.10250651,
+       0.020881841,   -0.06747029,   0.10062043,     -0.0023941975,
+       0.03532124,    -0.016341697,  0.09685456,     -0.016764693,
+       0.051808182,   0.05875331,    -0.04536488,    0.001626336,
+       -0.028892258,  -0.01048663,   -0.009793449,   -0.017093895,
+       0.010987891,   0.02357273,    -0.00010856845, 0.0099760275,
+       -0.001845119,  -0.03551521,   0.0018358806,   0.05763657,
+       -0.01769146,   0.040995963,   0.02235177,     -0.060430344,
+       0.11475477,    -0.023854522,  0.10071741,     0.0686208,
+       -0.014250481,  0.034261297,   0.047418304,    0.08562733,
+       -0.030519066,  0.0060542435,  0.014653856,    -0.038836084,
+       0.04096551,    0.032249358,   -0.08355519,    -0.026823482,
+       0.056386515,   -0.010401743,  -0.028396193,   0.08507674,
+       0.014410365,   0.020995233,   0.17040324,     0.11511526,
+       0.02459721,    0.0066619175,  0.025853224,    -0.023133837,
+       -0.081302024,  0.017264642,   -0.009585969,   0.09491168,
+       -0.051313367,  0.054532815,   -0.014298593,   0.10657464,
+       0.007076659,   0.10964551,    0.0409152,      0.008275321,
+       -0.07283536,   0.07937492,    0.04192024,     -0.1075027});
+
+  lstm.SetRecurrentToCellWeights(
+      {-0.037322544,   0.018592842,   0.0056175636,  -0.06253426,
+       0.055647098,    -0.05713207,   -0.05626563,   0.005559383,
+       0.03375411,     -0.025757805,  -0.088049285,  0.06017052,
+       -0.06570978,    0.007384076,   0.035123326,   -0.07920549,
+       0.053676967,    0.044480428,   -0.07663568,   0.0071805613,
+       0.08089997,     0.05143358,    0.038261272,   0.03339287,
+       -0.027673481,   0.044746667,   0.028349208,   0.020090483,
+       -0.019443132,   -0.030755889,  -0.0040000007, 0.04465846,
+       -0.021585021,   0.0031670958,  0.0053199246,  -0.056117613,
+       -0.10893326,    0.076739706,   -0.08509834,   -0.027997585,
+       0.037871376,    0.01449768,    -0.09002357,   -0.06111149,
+       -0.046195522,   0.0422062,     -0.005683705,  -0.1253618,
+       -0.012925729,   -0.04890792,   0.06985068,    0.037654128,
+       0.03398274,     -0.004781977,  0.007032333,   -0.031787455,
+       0.010868644,    -0.031489216,  0.09525667,    0.013939797,
+       0.0058680447,   0.0167067,     0.02668468,    -0.04797466,
+       -0.048885044,   -0.12722108,   0.035304096,   0.06554885,
+       0.00972396,     -0.039238118,  -0.05159735,   -0.11329045,
+       0.1613692,      -0.03750952,   0.06529313,    -0.071974665,
+       -0.11769596,    0.015524369,   -0.0013754242, -0.12446318,
+       0.02786344,     -0.014179351,  0.005264273,   0.14376344,
+       0.015983658,    0.03406988,    -0.06939408,   0.040699873,
+       0.02111075,     0.09669095,    0.041345075,   -0.08316494,
+       -0.07684199,    -0.045768797,  0.032298047,   -0.041805092,
+       0.0119405,      0.0061010392,  0.12652606,    0.0064572375,
+       -0.024950314,   0.11574242,    0.04508852,    -0.04335324,
+       0.06760663,     -0.027437469,  0.07216407,    0.06977076,
+       -0.05438599,    0.034033038,   -0.028602652,  0.05346137,
+       0.043184172,    -0.037189785,  0.10420091,    0.00882477,
+       -0.054019816,   -0.074273005,  -0.030617684,  -0.0028467078,
+       0.024302477,    -0.0038869337, 0.005332455,   0.0013399826,
+       0.04361412,     -0.007001822,  0.09631092,    -0.06702025,
+       -0.042049985,   -0.035070654,  -0.04103342,   -0.10273396,
+       0.0544271,      0.037184782,   -0.13150354,   -0.0058036847,
+       -0.008264958,   0.042035464,   0.05891794,    0.029673764,
+       0.0063542654,   0.044788733,   0.054816857,   0.062257513,
+       -0.00093483756, 0.048938446,   -0.004952862,  -0.007730018,
+       -0.04043371,    -0.017094059,  0.07229206,    -0.023670016,
+       -0.052195564,   -0.025616996,  -0.01520939,   0.045104615,
+       -0.007376126,   0.003533447,   0.006570588,   0.056037236,
+       0.12436656,     0.051817212,   0.028532185,   -0.08686856,
+       0.11868599,     0.07663395,    -0.07323171,   0.03463402,
+       -0.050708205,   -0.04458982,   -0.11590894,   0.021273347,
+       0.1251325,      -0.15313013,   -0.12224372,   0.17228661,
+       0.023029093,    0.086124025,   0.006445803,   -0.03496501,
+       0.028332196,    0.04449512,    -0.042436164,  -0.026587414,
+       -0.006041347,   -0.09292539,   -0.05678812,   0.03897832,
+       0.09465633,     0.008115513,   -0.02171956,   0.08304309,
+       0.071401566,    0.019622514,   0.032163795,   -0.004167056,
+       0.02295182,     0.030739572,   0.056506045,   0.004612461,
+       0.06524936,     0.059999723,   0.046395954,   -0.0045512207,
+       -0.1335546,     -0.030136576,  0.11584653,    -0.014678886,
+       0.0020118146,   -0.09688814,   -0.0790206,    0.039770417,
+       -0.0329582,     0.07922767,    0.029322514,   0.026405897,
+       0.04207835,     -0.07073373,   0.063781224,   0.0859677,
+       -0.10925287,    -0.07011058,   0.048005477,   0.03438226,
+       -0.09606514,    -0.006669445,  -0.043381985,  0.04240257,
+       -0.06955775,    -0.06769346,   0.043903265,   -0.026784198,
+       -0.017840602,   0.024307009,   -0.040079936,  -0.019946516,
+       0.045318738,    -0.12233574,   0.026170589,   0.0074471775,
+       0.15978073,     0.10185836,    0.10298046,    -0.015476589,
+       -0.039390966,   -0.072174534,  0.0739445,     -0.1211869,
+       -0.0347889,     -0.07943156,   0.014809798,   -0.12412325,
+       -0.0030663363,  0.039695457,   0.0647603,     -0.08291318,
+       -0.018529687,   -0.004423833,  0.0037507233,  0.084633216,
+       -0.01514876,    -0.056505352,  -0.012800942,  -0.06994386,
+       0.012962922,    -0.031234352,  0.07029052,    0.016418684,
+       0.03618972,     0.055686004,   -0.08663945,   -0.017404709,
+       -0.054761406,   0.029065743,   0.052404847,   0.020238016,
+       0.0048197987,   -0.0214882,    0.07078733,    0.013016777,
+       0.06262858,     0.009184685,   0.020785125,   -0.043904778,
+       -0.0270329,     -0.03299152,   -0.060088247,  -0.015162964,
+       -0.001828936,   0.12642565,    -0.056757294,  0.013586685,
+       0.09232601,     -0.035886683,  0.06000002,    0.05229691,
+       -0.052580316,   -0.082029596,  -0.010794592,  0.012947712,
+       -0.036429964,   -0.085508935,  -0.13127148,   -0.017744139,
+       0.031502828,    0.036232427,   -0.031581745,  0.023051167,
+       -0.05325106,    -0.03421577,   0.028793324,   -0.034633752,
+       -0.009881397,   -0.043551125,  -0.018609839,  0.0019097115,
+       -0.008799762,   0.056595087,   0.0022273948,  0.055752404});
+
+  lstm.SetRecurrentToOutputWeights({
+      0.025825322,   -0.05813119,  0.09495884,   -0.045984812,   -0.01255415,
+      -0.0026479573, -0.08196161,  -0.054914974, -0.0046604523,  -0.029587349,
+      -0.044576716,  -0.07480124,  -0.082868785, 0.023254942,    0.027502948,
+      -0.0039728214, -0.08683098,  -0.08116779,  -0.014675607,   -0.037924774,
+      -0.023314456,  -0.007401714, -0.09255757,  0.029460307,    -0.08829125,
+      -0.005139627,  -0.08989442,  -0.0555066,   0.13596267,     -0.025062224,
+      -0.048351806,  -0.03850004,  0.07266485,   -0.022414139,   0.05940088,
+      0.075114764,   0.09597592,   -0.010211725, -0.0049794707,  -0.011523867,
+      -0.025980417,  0.072999895,  0.11091378,   -0.081685916,   0.014416728,
+      0.043229222,   0.034178585,  -0.07530371,  0.035837382,    -0.085607,
+      -0.007721233,  -0.03287832,  -0.043848954, -0.06404588,    -0.06632928,
+      -0.073643476,  0.008214239,  -0.045984086, 0.039764922,    0.03474462,
+      0.060612556,   -0.080590084, 0.049127717,  0.04151091,     -0.030063879,
+      0.008801774,   -0.023021035, -0.019558564, 0.05158114,     -0.010947698,
+      -0.011825728,  0.0075720972, 0.0699727,    -0.0039981045,  0.069350146,
+      0.08799282,    0.016156472,  0.035502106,  0.11695009,     0.006217345,
+      0.13392477,    -0.037875112, 0.025745004,  0.08940699,     -0.00924166,
+      0.0046702605,  -0.036598757, -0.08811812,  0.10522024,     -0.032441203,
+      0.008176899,   -0.04454919,  0.07058152,   0.0067963637,   0.039206743,
+      0.03259838,    0.03725492,   -0.09515802,  0.013326398,    -0.052055415,
+      -0.025676316,  0.03198509,   -0.015951829, -0.058556724,   0.036879618,
+      0.043357447,   0.028362012,  -0.05908629,  0.0059240665,   -0.04995891,
+      -0.019187413,  0.0276265,    -0.01628143,  0.0025863599,   0.08800015,
+      0.035250366,   -0.022165963, -0.07328642,  -0.009415526,   -0.07455109,
+      0.11690406,    0.0363299,    0.07411125,   0.042103454,    -0.009660886,
+      0.019076364,   0.018299393,  -0.046004917, 0.08891175,     0.0431396,
+      -0.026327137,  -0.051502608, 0.08979574,   -0.051670972,   0.04940282,
+      -0.07491107,   -0.021240504, 0.022596184,  -0.034280192,   0.060163025,
+      -0.058211457,  -0.051837247, -0.01349775,  -0.04639988,    -0.035936575,
+      -0.011681591,  0.064818054,  0.0073146066, -0.021745546,   -0.043124277,
+      -0.06471268,   -0.07053354,  -0.029321948, -0.05330136,    0.016933719,
+      -0.053782392,  0.13747959,   -0.1361751,   -0.11569455,    0.0033329215,
+      0.05693899,    -0.053219706, 0.063698,     0.07977434,     -0.07924483,
+      0.06936997,    0.0034815092, -0.007305279, -0.037325785,   -0.07251102,
+      -0.033633437,  -0.08677009,  0.091591336,  -0.14165086,    0.021752775,
+      0.019683983,   0.0011612234, -0.058154266, 0.049996935,    0.0288841,
+      -0.0024567875, -0.14345716,  0.010955264,  -0.10234828,    0.1183656,
+      -0.0010731248, -0.023590032, -0.072285876, -0.0724771,     -0.026382286,
+      -0.0014920527, 0.042667855,  0.0018776858, 0.02986552,     0.009814309,
+      0.0733756,     0.12289186,   0.018043943,  -0.0458958,     0.049412545,
+      0.033632483,   0.05495232,   0.036686596,  -0.013781798,   -0.010036754,
+      0.02576849,    -0.08307328,  0.010112348,  0.042521734,    -0.05869831,
+      -0.071689695,  0.03876447,   -0.13275425,  -0.0352966,     -0.023077697,
+      0.10285965,    0.084736146,  0.15568255,   -0.00040734606, 0.027835453,
+      -0.10292561,   -0.032401145, 0.10053256,   -0.026142767,   -0.08271222,
+      -0.0030240538, -0.016368777, 0.1070414,    0.042672627,    0.013456989,
+      -0.0437609,    -0.022309763, 0.11576483,   0.04108048,     0.061026827,
+      -0.0190714,    -0.0869359,   0.037901703,  0.0610107,      0.07202949,
+      0.01675338,    0.086139716,  -0.08795751,  -0.014898893,   -0.023771819,
+      -0.01965048,   0.007955471,  -0.043740474, 0.03346837,     -0.10549954,
+      0.090567775,   0.042013682,  -0.03176985,  0.12569028,     -0.02421228,
+      -0.029526481,  0.023851605,  0.031539805,  0.05292009,     -0.02344001,
+      -0.07811758,   -0.08834428,  0.10094801,   0.16594367,     -0.06861939,
+      -0.021256343,  -0.041093912, -0.06669611,  0.035498552,    0.021757556,
+      -0.09302526,   -0.015403468, -0.06614931,  -0.051798206,   -0.013874718,
+      0.03630673,    0.010412845,  -0.08077351,  0.046185967,    0.0035662893,
+      0.03541868,    -0.094149634, -0.034814864, 0.003128424,    -0.020674974,
+      -0.03944324,   -0.008110165, -0.11113267,  0.08484226,     0.043586485,
+      0.040582247,   0.0968012,    -0.065249965, -0.028036479,   0.0050708856,
+      0.0017462453,  0.0326779,    0.041296225,  0.09164146,     -0.047743853,
+      -0.015952192,  -0.034451712, 0.084197424,  -0.05347844,    -0.11768019,
+      0.085926116,   -0.08251791,  -0.045081906, 0.0948852,      0.068401024,
+      0.024856757,   0.06978981,   -0.057309967, -0.012775832,   -0.0032452994,
+      0.01977615,    -0.041040014, -0.024264973, 0.063464895,    0.05431621,
+  });
+
+  lstm.SetCellToInputWeights(
+      {0.040369894, 0.030746894,  0.24704495,  0.018586371,  -0.037586458,
+       -0.15312155, -0.11812848,  -0.11465643, 0.20259799,   0.11418174,
+       -0.10116027, -0.011334949, 0.12411352,  -0.076769054, -0.052169047,
+       0.21198851,  -0.38871562,  -0.09061183, -0.09683246,  -0.21929175});
+
+  lstm.SetCellToForgetWeights(
+      {-0.01998659,  -0.15568835,  -0.24248174,   -0.012770197, 0.041331276,
+       -0.072311886, -0.052123554, -0.0066330447, -0.043891653, 0.036225766,
+       -0.047248036, 0.021479502,  0.033189066,   0.11952997,   -0.020432774,
+       0.64658105,   -0.06650122,  -0.03467612,   0.095340036,  0.23647355});
+
+  lstm.SetCellToOutputWeights(
+      {0.08286371,  -0.08261836, -0.51210177, 0.002913762, 0.17764764,
+       -0.5495371,  -0.08460716, -0.24552552, 0.030037103, 0.04123544,
+       -0.11940523, 0.007358328, 0.1890978,   0.4833202,   -0.34441817,
+       0.36312827,  -0.26375428, 0.1457655,   -0.19724406, 0.15548733});
+
+  lstm.SetProjectionWeights(
+      {-0.009802181,  0.09401916,    0.0717386,     -0.13895074,  0.09641832,
+       0.060420845,   0.08539281,    0.054285463,   0.061395317,  0.034448683,
+       -0.042991187,  0.019801661,   -0.16840284,   -0.015726732, -0.23041931,
+       -0.024478018,  -0.10959692,   -0.013875541,  0.18600968,   -0.061274476,
+       0.0138165,     -0.08160894,   -0.07661644,   0.032372914,  0.16169067,
+       0.22465782,    -0.03993472,   -0.004017731,  0.08633481,   -0.28869787,
+       0.08682067,    0.17240396,    0.014975425,   0.056431185,  0.031037588,
+       0.16702051,    0.0077946745,  0.15140012,    0.29405436,   0.120285,
+       -0.188994,     -0.027265169,  0.043389652,   -0.022061434, 0.014777949,
+       -0.20203483,   0.094781205,   0.19100232,    0.13987629,   -0.036132768,
+       -0.06426278,   -0.05108664,   0.13221376,    0.009441198,  -0.16715929,
+       0.15859416,    -0.040437475,  0.050779544,   -0.022187516, 0.012166504,
+       0.027685808,   -0.07675938,   -0.0055694645, -0.09444123,  0.0046453946,
+       0.050794356,   0.10770313,    -0.20790008,   -0.07149004,  -0.11425117,
+       0.008225835,   -0.035802525,  0.14374903,    0.15262283,   0.048710253,
+       0.1847461,     -0.007487823,  0.11000021,    -0.09542012,  0.22619456,
+       -0.029149994,  0.08527916,    0.009043713,   0.0042746216, 0.016261552,
+       0.022461696,   0.12689082,    -0.043589946,  -0.12035478,  -0.08361797,
+       -0.050666027,  -0.1248618,    -0.1275799,    -0.071875185, 0.07377272,
+       0.09944291,    -0.18897448,   -0.1593054,    -0.06526116,  -0.040107165,
+       -0.004618631,  -0.067624845,  -0.007576253,  0.10727444,   0.041546922,
+       -0.20424393,   0.06907816,    0.050412357,   0.00724631,   0.039827548,
+       0.12449835,    0.10747581,    0.13708383,    0.09134148,   -0.12617786,
+       -0.06428341,   0.09956831,    0.1208086,     -0.14676677,  -0.0727722,
+       0.1126304,     0.010139365,   0.015571211,   -0.038128063, 0.022913318,
+       -0.042050496,  0.16842307,    -0.060597885,  0.10531834,   -0.06411776,
+       -0.07451711,   -0.03410368,   -0.13393489,   0.06534304,   0.003620307,
+       0.04490757,    0.05970546,    0.05197996,    0.02839995,   0.10434969,
+       -0.013699693,  -0.028353551,  -0.07260381,   0.047201227,  -0.024575593,
+       -0.036445823,  0.07155557,    0.009672501,   -0.02328883,  0.009533515,
+       -0.03606021,   -0.07421458,   -0.028082801,  -0.2678904,   -0.13221288,
+       0.18419984,    -0.13012612,   -0.014588381,  -0.035059117, -0.04824723,
+       0.07830115,    -0.056184657,  0.03277091,    0.025466874,  0.14494097,
+       -0.12522776,   -0.098633975,  -0.10766018,   -0.08317623,  0.08594209,
+       0.07749552,    0.039474737,   0.1776665,     -0.07409566,  -0.0477268,
+       0.29323658,    0.10801441,    0.1154011,     0.013952499,  0.10739139,
+       0.10708251,    -0.051456142,  0.0074137426,  -0.10430189,  0.10034707,
+       0.045594677,   0.0635285,     -0.0715442,    -0.089667566, -0.10811871,
+       0.00026344223, 0.08298446,    -0.009525053,  0.006585689,  -0.24567553,
+       -0.09450807,   0.09648481,    0.026996298,   -0.06419476,  -0.04752702,
+       -0.11063944,   -0.23441927,   -0.17608605,   -0.052156363, 0.067035615,
+       0.19271925,    -0.0032889997, -0.043264326,  0.09663576,   -0.057112187,
+       -0.10100678,   0.0628376,     0.04447668,    0.017961001,  -0.10094388,
+       -0.10190601,   0.18335468,    0.10494553,    -0.052095775, -0.0026118709,
+       0.10539724,    -0.04383912,   -0.042349473,  0.08438151,   -0.1947263,
+       0.02251204,    0.11216432,    -0.10307853,   0.17351969,   -0.039091777,
+       0.08066188,    -0.00561982,   0.12633002,    0.11335965,   -0.0088127935,
+       -0.019777594,  0.06864014,    -0.059751723,  0.016233567,  -0.06894641,
+       -0.28651384,   -0.004228674,  0.019708522,   -0.16305895,  -0.07468996,
+       -0.0855457,    0.099339016,   -0.07580735,   -0.13775392,  0.08434318,
+       0.08330512,    -0.12131499,   0.031935584,   0.09180414,   -0.08876437,
+       -0.08049874,   0.008753825,   0.03498998,    0.030215185,  0.03907079,
+       0.089751154,   0.029194152,   -0.03337423,   -0.019092513, 0.04331237,
+       0.04299654,    -0.036394123,  -0.12915532,   0.09793732,   0.07512415,
+       -0.11319543,   -0.032502122,  0.15661901,    0.07671967,   -0.005491124,
+       -0.19379048,   -0.218606,     0.21448623,    0.017840758,  0.1416943,
+       -0.07051762,   0.19488361,    0.02664691,    -0.18104725,  -0.09334311,
+       0.15026465,    -0.15493552,   -0.057762887,  -0.11604192,  -0.262013,
+       -0.01391798,   0.012185008,   0.11156489,    -0.07483202,  0.06693364,
+       -0.26151478,   0.046425626,   0.036540434,   -0.16435726,  0.17338543,
+       -0.21401681,   -0.11385144,   -0.08283257,   -0.069031075, 0.030635102,
+       0.010969227,   0.11109743,    0.010919218,   0.027526086,  0.13519906,
+       0.01891392,    -0.046839405,  -0.040167913,  0.017953383,  -0.09700955,
+       0.0061885654,  -0.07000971,   0.026893595,   -0.038844477, 0.14543656});
+
+  static float lstm_input[][20] = {
+      {// Batch0: 4 (input_sequence_size) * 5 (n_input)
+       0.787926, 0.151646, 0.071352, 0.118426, 0.458058, 0.596268, 0.998386,
+       0.568695, 0.864524, 0.571277, 0.073204, 0.296072, 0.743333, 0.069199,
+       0.045348, 0.867394, 0.291279, 0.013714, 0.482521, 0.626339},
+
+      {// Batch1: 4 (input_sequence_size) * 5 (n_input)
+       0.295743, 0.544053, 0.690064, 0.858138, 0.497181, 0.642421, 0.524260,
+       0.134799, 0.003639, 0.162482, 0.640394, 0.930399, 0.050782, 0.432485,
+       0.988078, 0.082922, 0.563329, 0.865614, 0.333232, 0.259916}};
+
+  static float lstm_fw_golden_output[][64] = {
+      {// Batch0: 4 (input_sequence_size) * 16 (n_output)
+       -0.00396806, 0.029352,     -0.00279226, 0.0159977,   -0.00835576,
+       -0.0211779,  0.0283512,    -0.0114597,  0.00907307,  -0.0244004,
+       -0.0152191,  -0.0259063,   0.00914318,  0.00415118,  0.017147,
+       0.0134203,   -0.0166936,   0.0381209,   0.000889694, 0.0143363,
+       -0.0328911,  -0.0234288,   0.0333051,   -0.012229,   0.0110322,
+       -0.0457725,  -0.000832209, -0.0202817,  0.0327257,   0.0121308,
+       0.0155969,   0.0312091,    -0.0213783,  0.0350169,   0.000324794,
+       0.0276012,   -0.0263374,   -0.0371449,  0.0446149,   -0.0205474,
+       0.0103729,   -0.0576349,   -0.0150052,  -0.0292043,  0.0376827,
+       0.0136115,   0.0243435,    0.0354492,   -0.0189322,  0.0464512,
+       -0.00251373, 0.0225745,    -0.0308346,  -0.0317124,  0.0460407,
+       -0.0189395,  0.0149363,    -0.0530162,  -0.0150767,  -0.0340193,
+       0.0286833,   0.00824207,   0.0264887,   0.0305169},
+      {// Batch1: 4 (input_sequence_size) * 16 (n_output)
+       -0.013869,    0.0287268,   -0.00334693, 0.00733398,  -0.0287926,
+       -0.0186926,   0.0193662,   -0.0115437,  0.00422612,  -0.0345232,
+       0.00223253,   -0.00957321, 0.0210624,   0.013331,    0.0150954,
+       0.02168,      -0.0141913,  0.0322082,   0.00227024,  0.0260507,
+       -0.0188721,   -0.0296489,  0.0399134,   -0.0160509,  0.0116039,
+       -0.0447318,   -0.0150515,  -0.0277406,  0.0316596,   0.0118233,
+       0.0214762,    0.0293641,   -0.0204549,  0.0450315,   -0.00117378,
+       0.0167673,    -0.0375007,  -0.0238314,  0.038784,    -0.0174034,
+       0.0131743,    -0.0506589,  -0.0048447,  -0.0240239,  0.0325789,
+       0.00790065,   0.0220157,   0.0333314,   -0.0264787,  0.0387855,
+       -0.000764675, 0.0217599,   -0.037537,   -0.0335206,  0.0431679,
+       -0.0211424,   0.010203,    -0.062785,   -0.00832363, -0.025181,
+       0.0412031,    0.0118723,   0.0239643,   0.0394009}};
+
+  static float lstm_combined_golden_output[][64] = {
+    {
+      -0.022014,  0.073544, -0.002235,  0.040068, -0.037136, -0.052788,
+      0.075325, -0.029378,  0.024298, -0.07733 , -0.030674, -0.060229,
+      0.040599,  0.011608,  0.042005,  0.045977, -0.039225,  0.076294,
+      0.000735,  0.032852, -0.069869, -0.053312,  0.073527, -0.028136,
+      0.021585, -0.102679, -0.004327, -0.043304,  0.072861,  0.027077,
+      0.034558,  0.068292, -0.036292,  0.069832, -0.003032,  0.053829,
+      -0.043821, -0.072713,  0.085029, -0.040374,  0.020014, -0.104521,
+      -0.034504, -0.059759,  0.062569,  0.025652,  0.049306,  0.061189,
+      -0.025146,  0.079643, -0.005188,  0.033080, -0.048079, -0.048082,
+      0.069369, -0.028900,  0.024572, -0.077547, -0.022517, -0.054477,
+      0.038857,  0.013336,  0.043234,  0.044788},
+    {
+      -0.039186,  0.070792, -0.005913,  0.02642,  -0.068274, -0.05022,
+      0.061444, -0.031241,  0.014996, -0.094544, -0.004146, -0.03464,
+      0.058981,  0.026097,  0.039781,  0.058408, -0.031887,  0.069252,
+      0.00576,   0.054062, -0.042801, -0.059974,  0.085272, -0.034453,
+      0.026097, -0.0959,   -0.031164, -0.058699,  0.06839,   0.020512,
+      0.044727,  0.063609, -0.039863,  0.084819, -0.003909,  0.028666,
+      -0.075677, -0.045125,  0.070379, -0.033895,  0.022111, -0.097184,
+      -0.004921, -0.040851,  0.062316,  0.017435,  0.041437,  0.064568,
+      -0.039656,  0.060726, -0.003402,  0.036854, -0.056503, -0.058554,
+      0.068588, -0.034879,  0.01352,  -0.09962,  -0.01434,  -0.039505,
+      0.065133,  0.024321,  0.038473,  0.062438
+    }};
+
+  // Resetting cell_state and output_state
+  lstm.ResetFwOutputAndCellStates();
+  lstm.ResetBwOutputAndCellStates();
+
+  for (int i = 0; i < lstm.sequence_length(); i++) {
+    float* batch0_start = lstm_input[0] + i * lstm.num_inputs();
+    float* batch0_end = batch0_start + lstm.num_inputs();
+
+    lstm.SetInput(2 * i * lstm.num_inputs(), batch0_start, batch0_end);
+
+    float* batch1_start = lstm_input[1] + i * lstm.num_inputs();
+    float* batch1_end = batch1_start + lstm.num_inputs();
+    lstm.SetInput((2 * i + 1) * lstm.num_inputs(), batch1_start, batch1_end);
+  }
+
+  lstm.Invoke();
+
+  std::vector<float> expected;
+  for (int i = 0; i < lstm.sequence_length(); i++) {
+    float* golden_start_batch0 =
+        lstm_fw_golden_output[0] + i * lstm.num_fw_outputs();
+    float* golden_end_batch0 = golden_start_batch0 + lstm.num_fw_outputs();
+    float* golden_start_batch1 =
+        lstm_fw_golden_output[1] + i * lstm.num_fw_outputs();
+    float* golden_end_batch1 = golden_start_batch1 + lstm.num_fw_outputs();
+    expected.insert(expected.end(), golden_start_batch0, golden_end_batch0);
+    expected.insert(expected.end(), golden_start_batch1, golden_end_batch1);
+  }
+  EXPECT_THAT(lstm.GetFwOutput(), ElementsAreArray(ArrayFloatNear(expected)));
+
+  // Check that the sum of forward and backward outputs matches the golden.
+  expected.clear();
+  for (int i = 0; i < lstm.sequence_length(); i++) {
+    float* golden_start_batch0 =
+        lstm_combined_golden_output[0] + i * lstm.num_fw_outputs();
+    float* golden_end_batch0 = golden_start_batch0 + lstm.num_fw_outputs();
+    float* golden_start_batch1 =
+        lstm_combined_golden_output[1] + i * lstm.num_fw_outputs();
+    float* golden_end_batch1 = golden_start_batch1 + lstm.num_fw_outputs();
+    expected.insert(expected.end(), golden_start_batch0, golden_end_batch0);
+    expected.insert(expected.end(), golden_start_batch1, golden_end_batch1);
+  }
+
+  std::vector<float> combined;
+  for (size_t i = 0; i < lstm.GetFwOutput().size(); ++i) {
+    combined.push_back(lstm.GetFwOutput()[i] + lstm.GetBwOutput()[i]);
+  }
+  EXPECT_THAT(combined, ElementsAreArray(ArrayFloatNear(expected)));
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
diff --git a/tensorflow/contrib/lite/kernels/conv.cc b/tensorflow/contrib/lite/kernels/conv.cc
index 66d2c04bba4a164bbcdcf4b1a097d9aac0b3aeeb..b93a416351cae34b2df8791e382a8a2cd38dcffb 100644
--- a/tensorflow/contrib/lite/kernels/conv.cc
+++ b/tensorflow/contrib/lite/kernels/conv.cc
@@ -51,11 +51,13 @@ enum KernelType {
   kCblasOptimized,
 };
 
+const int kTensorNotAllocated = -1;
+
 struct OpData {
   // IDs are the arbitrary identifiers used by TF Lite to identify and access
   // memory buffers.
-  int im2col_id;
-  int hwcn_weights_id;
+  int im2col_id = kTensorNotAllocated;
+  int hwcn_weights_id = kTensorNotAllocated;
 
   TfLitePaddingValues padding;
   // The scaling factor from input to output (aka the 'real multiplier') can
@@ -80,8 +82,6 @@ void* Init(TfLiteContext* context, const char* buffer, size_t length) {
   // Instead, we allocate a new object to use as scratch space for im2col, and
   // to carry information from Prepare() to Eval().
   auto* data = new OpData;
-  context->AddTensors(context, 1, &data->im2col_id);
-  context->AddTensors(context, 1, &data->hwcn_weights_id);
   gemm_support::IncrementUsageCounter(context);
   return data;
 }
@@ -107,10 +107,66 @@ void TransposeFloatTensor(TfLiteTensor* input, TfLiteTensor* output) {
   }
 }
 
+// Allocate temporary tensors (`im2col`, `hwcn_weights` if necessary).
+// Note: `context->AddTensors` might invalidate pointers to existing tensors.
+// Therefore the logic to add tensors is isolated in this function.
+static TfLiteStatus AllocateTemporaryTensorsIfRequired(TfLiteContext* context,
+                                                       TfLiteNode* node) {
+  auto* params = reinterpret_cast<TfLiteConvParams*>(node->builtin_data);
+  OpData* data = reinterpret_cast<OpData*>(node->user_data);
+
+  TF_LITE_ENSURE(context, node->inputs->size >= 2);
+  TfLiteTensor* input = &context->tensors[node->inputs->data[0]];
+  TfLiteTensor* filter = &context->tensors[node->inputs->data[1]];
+
+  int filter_width = filter->dims->data[2];
+  int filter_height = filter->dims->data[1];
+
+  // We don't always need to allocate im2col. It is only used in some versions
+  // of the optimized Conv. This test just mimics something that happens inside
+  // optimized_ops.h, in order to avoid a DCHECK(!im2col_data).
+  data->need_im2col =
+      (params->stride_width != 1 || params->stride_height != 1 ||
+       filter_width != 1 || filter_height != 1);
+  // If we're using the optimized multithreaded EigenTensor implementation of
+  // convolution, it expects the filter weights to be transposed compared to
+  // the normal TF Lite buffer format. Typical TF Lite weights are
+  // [filter_count, filter_height, filter_width, input_depth], but for the float
+  // implementation we need them as [filter_height, filter_width, input_depth,
+  // filter_count]. We get to that format by transposing, and create a temporary
+  // buffer to store the results.
+  // This path is only used for float processing, so only create the buffer if
+  // we're running with that data type.
+  data->need_hwcn_weights = (input->type == kTfLiteFloat32);
+
+  int temporaries_count = 0;
+  if (data->need_im2col) {
+    data->im2col_index = temporaries_count;
+    if (data->im2col_id == kTensorNotAllocated) {
+      context->AddTensors(context, 1, &data->im2col_id);
+    }
+    ++temporaries_count;
+  }
+  if (data->need_hwcn_weights) {
+    data->hwcn_weights_index = temporaries_count;
+    if (data->hwcn_weights_id == kTensorNotAllocated) {
+      context->AddTensors(context, 1, &data->hwcn_weights_id);
+    }
+    ++temporaries_count;
+  }
+
+  TfLiteIntArrayFree(node->temporaries);
+  node->temporaries = TfLiteIntArrayCreate(temporaries_count);
+
+  return kTfLiteOk;
+}
+
 TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   auto* params = reinterpret_cast<TfLiteConvParams*>(node->builtin_data);
   OpData* data = reinterpret_cast<OpData*>(node->user_data);
 
+  TF_LITE_ENSURE_STATUS(AllocateTemporaryTensorsIfRequired(context, node));
+
   bool hasBias = node->inputs->size == 3;
   // Check number of inputs/outputs
   TF_LITE_ENSURE(context, hasBias || node->inputs->size == 2);
@@ -118,6 +174,7 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
   TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
   TfLiteTensor* input = &context->tensors[node->inputs->data[0]];
   TfLiteTensor* filter = &context->tensors[node->inputs->data[1]];
+
   // Check dimensionality of input, filter
   TF_LITE_ENSURE_EQ(context, input->dims->size, 4);
   TF_LITE_ENSURE_EQ(context, filter->dims->size, 4);
@@ -199,36 +256,6 @@ TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
 
   if (output_status != kTfLiteOk) return output_status;
 
-  // We don't always need to allocate im2col. It is only used in some versions
-  // of the optimized Conv. This test just mimics something that happens inside
-  // optimized_ops.h, in order to avoid a DCHECK(!im2col_data).
-  data->need_im2col =
-      (params->stride_width != 1 || params->stride_height != 1 ||
-       filter_width != 1 || filter_height != 1);
-  // If we're using the optimized multithreaded EigenTensor implementation of
-  // convolution, it expects the filter weights to be transposed compared to
-  // the normal TF Lite buffer format. Typical TF Lite weights are
-  // [filter_count, filter_height, filter_width, input_depth], but for the float
-  // implementation we need them as [filter_height, filter_width, input_depth,
-  // filter_count]. We get to that format by transposing, and create a temporary
-  // buffer to store the results.
-  // This path is only used for float processing, so only create the buffer if
-  // we're running with that data type.
-  data->need_hwcn_weights = (data_type == kTfLiteFloat32);
-
-  int temporaries_count = 0;
-  if (data->need_im2col) {
-    data->im2col_index = temporaries_count;
-    ++temporaries_count;
-  }
-  if (data->need_hwcn_weights) {
-    data->hwcn_weights_index = temporaries_count;
-    ++temporaries_count;
-  }
-
-  TfLiteIntArrayFree(node->temporaries);
-  node->temporaries = TfLiteIntArrayCreate(temporaries_count);
-
   if (data->need_im2col) {
     node->temporaries->data[data->im2col_index] = data->im2col_id;
 
@@ -344,7 +371,7 @@ void EvalFloat(TfLiteContext* context, TfLiteNode* node,
       reference_ops::Conv(GetTensorData<float>(input), GetTensorDims(input),
                           GetTensorData<float>(filter), GetTensorDims(filter),
                           GetTensorData<float>(bias), GetTensorDims(bias),
-                          params->stride_width, params->stride_height,
+                          params->stride_width, params->stride_height, 1, 1,
                           data->padding.width, data->padding.height,
                           output_activation_min, output_activation_max,
                           GetTensorData<float>(output), GetTensorDims(output),
@@ -355,7 +382,7 @@ void EvalFloat(TfLiteContext* context, TfLiteNode* node,
       optimized_ops::Conv(GetTensorData<float>(input), GetTensorDims(input),
                           GetTensorData<float>(filter), GetTensorDims(filter),
                           GetTensorData<float>(bias), GetTensorDims(bias),
-                          params->stride_width, params->stride_height,
+                          params->stride_width, params->stride_height, 1, 1,
                           data->padding.width, data->padding.height,
                           output_activation_min, output_activation_max,
                           GetTensorData<float>(output), GetTensorDims(output),
diff --git a/tensorflow/contrib/lite/kernels/internal/BUILD b/tensorflow/contrib/lite/kernels/internal/BUILD
index f47fb04cbaa688b75e763ff9d3cb7df44ac3f166..6ccad3b1cef9bccfc42dfec0a1dc999da254b492 100644
--- a/tensorflow/contrib/lite/kernels/internal/BUILD
+++ b/tensorflow/contrib/lite/kernels/internal/BUILD
@@ -10,21 +10,25 @@ tflite_deps_intel = [
     "@arm_neon_2_x86_sse",
 ]
 
+HARD_FP_FLAGS_IF_APPLICABLE = select({
+    "//tensorflow:android_arm": ["-mfloat-abi=softfp"],
+    "//tensorflow:android_arm64": ["-mfloat-abi=softfp"],
+    "//tensorflow:android_armeabi": ["-mfloat-abi=softfp"],
+    "//conditions:default": [],
+})
+
 NEON_FLAGS_IF_APPLICABLE = select({
     ":arm": [
         "-O3",
         "-mfpu=neon",
-        "-mfloat-abi=softfp",
     ],
     ":armeabi-v7a": [
         "-O3",
         "-mfpu=neon",
-        "-mfloat-abi=softfp",
     ],
     ":armv7a": [
         "-O3",
         "-mfpu=neon",
-        "-mfloat-abi=softfp",
     ],
     "//conditions:default": [
         "-O3",
@@ -283,7 +287,7 @@ cc_library(
         "optimized/neon_tensor_utils.h",
         "optimized/tensor_utils_impl.h",
     ],
-    copts = NEON_FLAGS_IF_APPLICABLE,
+    copts = NEON_FLAGS_IF_APPLICABLE + HARD_FP_FLAGS_IF_APPLICABLE,
     deps = [
         ":cpu_check",
         ":portable_tensor_utils",
diff --git a/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc b/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
index 510395126ce3785b1d44fec1e0eb994c29ff0db7..f142374269606bdd3d4184af013749102666ab89 100644
--- a/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
+++ b/tensorflow/contrib/lite/kernels/internal/kernel_utils.cc
@@ -40,5 +40,152 @@ void RnnBatchStep(const float* input_ptr_batch, const float* input_weights_ptr,
                                         hidden_state_ptr_batch);
 }
 
+void LstmStep(
+    const float* input_ptr_batch, const float* input_to_input_weights_ptr,
+    const float* input_to_forget_weights_ptr,
+    const float* input_to_cell_weights_ptr,
+    const float* input_to_output_weights_ptr,
+    const float* recurrent_to_input_weights_ptr,
+    const float* recurrent_to_forget_weights_ptr,
+    const float* recurrent_to_cell_weights_ptr,
+    const float* recurrent_to_output_weights_ptr,
+    const float* cell_to_input_weights_ptr,
+    const float* cell_to_forget_weights_ptr,
+    const float* cell_to_output_weights_ptr, const float* input_gate_bias_ptr,
+    const float* forget_gate_bias_ptr, const float* cell_bias_ptr,
+    const float* output_gate_bias_ptr, const float* projection_weights_ptr,
+    const float* projection_bias_ptr, const TfLiteLSTMParams* params,
+    int n_batch, int n_cell, int n_input, int n_output, float* output_state_ptr,
+    float* cell_state_ptr, float* input_gate_scratch,
+    float* forget_gate_scratch, float* cell_scratch, float* output_gate_scratch,
+    float* output_ptr_batch) {
+  // Since we have already checked that weights are all there or none, we can
+  // check the existence of only one to get the condition.
+  const bool use_cifg = (input_to_input_weights_ptr == nullptr);
+  const bool use_peephole = (cell_to_output_weights_ptr != nullptr);
+  // Initialize scratch buffers with bias.
+  if (!use_cifg) {
+    tensor_utils::VectorBatchVectorAssign(input_gate_bias_ptr, n_cell, n_batch,
+                                          input_gate_scratch);
+  }
+  tensor_utils::VectorBatchVectorAssign(forget_gate_bias_ptr, n_cell, n_batch,
+                                        forget_gate_scratch);
+  tensor_utils::VectorBatchVectorAssign(cell_bias_ptr, n_cell, n_batch,
+                                        cell_scratch);
+  tensor_utils::VectorBatchVectorAssign(output_gate_bias_ptr, n_cell, n_batch,
+                                        output_gate_scratch);
+
+  // For each batch and cell: compute input_weight * input.
+  if (!use_cifg) {
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        input_to_input_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
+        input_gate_scratch, /*result_stride=*/1);
+  }
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      input_to_forget_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
+      forget_gate_scratch, /*result_stride=*/1);
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      input_to_cell_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
+      cell_scratch, /*result_stride=*/1);
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      input_to_output_weights_ptr, n_cell, n_input, input_ptr_batch, n_batch,
+      output_gate_scratch, /*result_stride=*/1);
+
+  // For each batch and cell: compute recurrent_weight * output_state.
+  if (!use_cifg) {
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        recurrent_to_input_weights_ptr, n_cell, n_output, output_state_ptr,
+        n_batch, input_gate_scratch,
+        /*result_stride=*/1);
+  }
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      recurrent_to_forget_weights_ptr, n_cell, n_output, output_state_ptr,
+      n_batch, forget_gate_scratch,
+      /*result_stride=*/1);
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      recurrent_to_cell_weights_ptr, n_cell, n_output, output_state_ptr,
+      n_batch, cell_scratch, /*result_stride=*/1);
+  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+      recurrent_to_output_weights_ptr, n_cell, n_output, output_state_ptr,
+      n_batch, output_gate_scratch,
+      /*result_stride=*/1);
+
+  // For each batch and cell: update input gate.
+  if (!use_cifg) {
+    if (use_peephole) {
+      tensor_utils::VectorBatchVectorCwiseProductAccumulate(
+          cell_to_input_weights_ptr, n_cell, cell_state_ptr, n_batch,
+          input_gate_scratch);
+    }
+    tensor_utils::ApplySigmoidToVector(input_gate_scratch, n_cell * n_batch,
+                                       input_gate_scratch);
+  }
+
+  // For each batch and cell: update forget gate.
+  if (use_peephole) {
+    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
+        cell_to_forget_weights_ptr, n_cell, cell_state_ptr, n_batch,
+        forget_gate_scratch);
+  }
+  tensor_utils::ApplySigmoidToVector(forget_gate_scratch, n_cell * n_batch,
+                                     forget_gate_scratch);
+
+  // For each batch and cell: update the cell.
+  tensor_utils::VectorVectorCwiseProduct(forget_gate_scratch, cell_state_ptr,
+                                         n_batch * n_cell, cell_state_ptr);
+  tensor_utils::ApplyActivationToVector(cell_scratch, n_batch * n_cell,
+                                        params->activation, cell_scratch);
+  if (use_cifg) {
+    tensor_utils::Sub1Vector(forget_gate_scratch, n_batch * n_cell,
+                             forget_gate_scratch);
+    tensor_utils::VectorVectorCwiseProductAccumulate(
+        cell_scratch, forget_gate_scratch, n_batch * n_cell, cell_state_ptr);
+  } else {
+    tensor_utils::VectorVectorCwiseProductAccumulate(
+        cell_scratch, input_gate_scratch, n_batch * n_cell, cell_state_ptr);
+  }
+  if (params->cell_clip > 0.0) {
+    tensor_utils::ClipVector(cell_state_ptr, n_batch * n_cell,
+                             params->cell_clip, cell_state_ptr);
+  }
+
+  // For each batch and cell: update the output gate.
+  if (use_peephole) {
+    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
+        cell_to_output_weights_ptr, n_cell, cell_state_ptr, n_batch,
+        output_gate_scratch);
+  }
+  tensor_utils::ApplySigmoidToVector(output_gate_scratch, n_batch * n_cell,
+                                     output_gate_scratch);
+  tensor_utils::ApplyActivationToVector(cell_state_ptr, n_batch * n_cell,
+                                        params->activation, cell_scratch);
+  tensor_utils::VectorVectorCwiseProduct(output_gate_scratch, cell_scratch,
+                                         n_batch * n_cell, output_gate_scratch);
+
+  // For each batch: update the projection and output_state.
+  const bool use_projection_weight = (projection_weights_ptr != nullptr);
+  const bool use_projection_bias = (projection_bias_ptr != nullptr);
+  if (use_projection_weight) {
+    if (use_projection_bias) {
+      tensor_utils::VectorBatchVectorAssign(projection_bias_ptr, n_output,
+                                            n_batch, output_ptr_batch);
+    } else {
+      tensor_utils::ZeroVector(output_ptr_batch, n_batch * n_output);
+    }
+    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
+        projection_weights_ptr, n_output, n_cell, output_gate_scratch, n_batch,
+        output_ptr_batch, /*result_stride=*/1);
+    if (params->proj_clip > 0.0) {
+      tensor_utils::ClipVector(output_ptr_batch, n_batch * n_output,
+                               params->proj_clip, output_ptr_batch);
+    }
+  } else {
+    tensor_utils::CopyVector(output_gate_scratch, n_batch * n_output,
+                             output_ptr_batch);
+  }
+  tensor_utils::CopyVector(output_ptr_batch, n_batch * n_output,
+                           output_state_ptr);
+}
+
 }  // namespace kernel_utils
 }  // namespace tflite
diff --git a/tensorflow/contrib/lite/kernels/internal/kernel_utils.h b/tensorflow/contrib/lite/kernels/internal/kernel_utils.h
index 9872d4500b862388ed4b96c97e3755f548e35d35..3ec60ee57a87833959a34ba95d32df15bea188a4 100644
--- a/tensorflow/contrib/lite/kernels/internal/kernel_utils.h
+++ b/tensorflow/contrib/lite/kernels/internal/kernel_utils.h
@@ -35,6 +35,42 @@ void RnnBatchStep(const float* input_ptr_batch, const float* input_weights_ptr,
                   TfLiteFusedActivation activation,
                   float* hidden_state_ptr_batch, float* output_ptr_batch);
 
+// Performs an LSTM batch inference step for input specified by input_ptr_batch.
+// The LSTM cell is specified by the pointers to its weights (*_weights_ptr) and
+// biases (*_bias_ptr), and buffers (*_scratch), along with additional
+// parameters:
+//  - params: various LSTM params including activation, clipping, etc.,
+//  - n_batch: size of batch,
+//  - n_cell: number of cells (or units),
+//  - n_input: the input size,
+//  - n_output: the output size.
+//
+// The cell state, output state, and output buffers are updated in place. Unless
+// projection is specified, output and output state contain the same data.
+//
+// The pointers with the suffix "_batch" point to data laid out in batch-major
+// order. Each step processes n_batch inputs from input_ptr_batch and
+// updates n_batch cell and output states.
+void LstmStep(
+    const float* input_ptr_batch, const float* input_to_input_weights_ptr,
+    const float* input_to_forget_weights_ptr,
+    const float* input_to_cell_weights_ptr,
+    const float* input_to_output_weights_ptr,
+    const float* recurrent_to_input_weights_ptr,
+    const float* recurrent_to_forget_weights_ptr,
+    const float* recurrent_to_cell_weights_ptr,
+    const float* recurrent_to_output_weights_ptr,
+    const float* cell_to_input_weights_ptr,
+    const float* cell_to_forget_weights_ptr,
+    const float* cell_to_output_weights_ptr, const float* input_gate_bias_ptr,
+    const float* forget_gate_bias_ptr, const float* cell_bias_ptr,
+    const float* output_gate_bias_ptr, const float* projection_weights_ptr,
+    const float* projection_bias_ptr, const TfLiteLSTMParams* params,
+    int n_batch, int n_cell, int n_input, int n_output, float* output_state_ptr,
+    float* cell_state_ptr, float* input_gate_scratch,
+    float* forget_gate_scratch, float* cell_scratch, float* output_gate_scratch,
+    float* output_ptr_batch);
+
 }  // namespace kernel_utils
 }  // namespace tflite
 #endif  // TENSORFLOW_CONTRIB_LITE_KERNELS_INTERNAL_KERNEL_UTILS_H_
diff --git a/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h b/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
index dec58fea4facf0fbd9af6ac0cc916bfe5778e1a1..3866f86d38a6f200e091497cab2972ed92e25c6b 100644
--- a/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
+++ b/tensorflow/contrib/lite/kernels/internal/optimized/optimized_ops.h
@@ -758,14 +758,89 @@ void Im2col(const T* input_data, const Dims<4>& input_dims, int stride,
          kwidth, byte_zero, output_data, output_dims);
 }
 
+inline void DilatedConv(const float* input_data, const Dims<4>& input_dims,
+                        const float* filter_data, const Dims<4>& filter_dims,
+                        const float* bias_data, const Dims<4>& bias_dims,
+                        int stride_width, int stride_height,
+                        int dilation_width_factor, int dilation_height_factor,
+                        int pad_width, int pad_height,
+                        float output_activation_min,
+                        float output_activation_max, float* output_data,
+                        const Dims<4>& output_dims, float* im2col_data,
+                        const Dims<4>& im2col_dims) {
+  // This is a copy of the reference Conv implementation. We do not currently
+  // have an optimized path for dilation.
+  (void)im2col_data;  // only used in optimized code.
+  (void)im2col_dims;  // only used in optimized code.
+  const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
+  const int input_depth = MatchingArraySize(input_dims, 0, filter_dims, 0);
+  const int output_depth = MatchingArraySize(filter_dims, 3, output_dims, 0);
+  if (bias_data) {
+    TFLITE_DCHECK_EQ(ArraySize(filter_dims, 3), ArraySize(bias_dims, 0));
+  }
+  const int input_height = ArraySize(input_dims, 2);
+  const int input_width = ArraySize(input_dims, 1);
+  const int filter_height = ArraySize(filter_dims, 2);
+  const int filter_width = ArraySize(filter_dims, 1);
+  const int output_height = ArraySize(output_dims, 2);
+  const int output_width = ArraySize(output_dims, 1);
+  for (int batch = 0; batch < batches; ++batch) {
+    for (int out_y = 0; out_y < output_height; ++out_y) {
+      for (int out_x = 0; out_x < output_width; ++out_x) {
+        for (int out_channel = 0; out_channel < output_depth; ++out_channel) {
+          const int in_x_origin = (out_x * stride_width) - pad_width;
+          const int in_y_origin = (out_y * stride_height) - pad_height;
+          float total = 0.f;
+          for (int filter_y = 0; filter_y < filter_height; ++filter_y) {
+            for (int filter_x = 0; filter_x < filter_width; ++filter_x) {
+              for (int in_channel = 0; in_channel < input_depth; ++in_channel) {
+                const int in_x = in_x_origin + dilation_width_factor * filter_x;
+                const int in_y =
+                    in_y_origin + dilation_height_factor * filter_y;
+                // If the location is outside the bounds of the input image,
+                // use zero as a default value.
+                if ((in_x >= 0) && (in_x < input_width) && (in_y >= 0) &&
+                    (in_y < input_height)) {
+                  float input_value = input_data[Offset(input_dims, in_channel,
+                                                        in_x, in_y, batch)];
+                  float filter_value =
+                      filter_data[Offset(filter_dims, in_channel, filter_x,
+                                         filter_y, out_channel)];
+                  total += (input_value * filter_value);
+                }
+              }
+            }
+          }
+          float bias_value = 0.0f;
+          if (bias_data) {
+            bias_value = bias_data[Offset(bias_dims, out_channel, 0, 0, 0)];
+          }
+          output_data[Offset(output_dims, out_channel, out_x, out_y, batch)] =
+              ActivationFunctionWithMinMax(total + bias_value,
+                                           output_activation_min,
+                                           output_activation_max);
+        }
+      }
+    }
+  }
+}
+
 inline void Conv(const float* input_data, const Dims<4>& input_dims,
                  const float* filter_data, const Dims<4>& filter_dims,
                  const float* bias_data, const Dims<4>& bias_dims,
-                 int stride_width, int stride_height, int pad_width,
-                 int pad_height, float output_activation_min,
-                 float output_activation_max, float* output_data,
-                 const Dims<4>& output_dims, float* im2col_data,
-                 const Dims<4>& im2col_dims) {
+                 int stride_width, int stride_height, int dilation_width_factor,
+                 int dilation_height_factor, int pad_width, int pad_height,
+                 float output_activation_min, float output_activation_max,
+                 float* output_data, const Dims<4>& output_dims,
+                 float* im2col_data, const Dims<4>& im2col_dims) {
+  if ((dilation_width_factor != 1) || (dilation_height_factor != 1)) {
+    return DilatedConv(input_data, input_dims, filter_data, filter_dims,
+                       bias_data, bias_dims, stride_width, stride_height,
+                       dilation_width_factor, dilation_height_factor, pad_width,
+                       pad_height, output_activation_min, output_activation_max,
+                       output_data, output_dims, im2col_data, im2col_dims);
+  }
+
   (void)im2col_data;
   (void)im2col_dims;
   gemmlowp::ScopedProfilingLabel label("Conv");
@@ -805,6 +880,23 @@ inline void Conv(const float* input_data, const Dims<4>& input_dims,
                                    output_activation_max);
 }
 
+template <FusedActivationFunctionType Ac>
+void Conv(const float* input_data, const Dims<4>& input_dims,
+          const float* filter_data, const Dims<4>& filter_dims,
+          const float* bias_data, const Dims<4>& bias_dims, int stride_width,
+          int stride_height, int dilation_width_factor,
+          int dilation_height_factor, int pad_width, int pad_height,
+          float* output_data, const Dims<4>& output_dims, float* im2col_data,
+          const Dims<4>& im2col_dims) {
+  float output_activation_min, output_activation_max;
+  GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
+  Conv(input_data, input_dims, filter_data, filter_dims, bias_data, bias_dims,
+       stride_width, stride_height, dilation_width_factor,
+       dilation_height_factor, pad_width, pad_height, output_activation_min,
+       output_activation_max, output_data, output_dims, im2col_data,
+       im2col_dims);
+}
+
 // legacy, for compatibility with old checked-in code
 template <FusedActivationFunctionType Ac>
 void Conv(const float* input_data, const Dims<4>& input_dims,
@@ -816,7 +908,7 @@ void Conv(const float* input_data, const Dims<4>& input_dims,
   float output_activation_min, output_activation_max;
   GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
   Conv(input_data, input_dims, filter_data, filter_dims, bias_data, bias_dims,
-       stride_width, stride_height, pad_width, pad_height,
+       stride_width, stride_height, 1, 1, pad_width, pad_height,
        output_activation_min, output_activation_max, output_data, output_dims,
        im2col_data, im2col_dims);
 }
@@ -830,7 +922,7 @@ void Conv(const float* input_data, const Dims<4>& input_dims,
           const Dims<4>& output_dims, float* im2col_data,
           const Dims<4>& im2col_dims) {
   Conv<Ac>(input_data, input_dims, filter_data, filter_dims, bias_data,
-           bias_dims, stride, stride, pad_width, pad_height, output_data,
+           bias_dims, stride, stride, 1, 1, pad_width, pad_height, output_data,
            output_dims, im2col_data, im2col_dims);
 }
 
@@ -2081,6 +2173,198 @@ inline void LstmCell(const float* input_data, const Dims<4>& input_dims,
       output_state_map.tanh();
 }
 
+#ifdef GEMMLOWP_NEON
+// In the common case of batch size 1, a fully-connected node degenerates
+// to a matrix*vector product. LSTM cells contain a fully-connected node;
+// when quantized, this becomes a special type of GEMV operation where
+// the output is 16bit-quantized, thus needs its own special path.
+inline void GEMVForLstmCell(const uint8* input_data, const Dims<4>& input_dims,
+                            const uint8* weights_data,
+                            const Dims<4>& weights_dims,
+                            uint8 weights_zero_point, const int32* bias_data,
+                            const Dims<4>& bias_dims, int32 accum_multiplier,
+                            int accum_shift, int16* output_data,
+                            const Dims<4>& output_dims) {
+  gemmlowp::ScopedProfilingLabel label("GEMVForLstmCell");
+  TFLITE_DCHECK(IsPackedWithoutStrides(input_dims));
+  TFLITE_DCHECK(IsPackedWithoutStrides(weights_dims));
+  TFLITE_DCHECK(IsPackedWithoutStrides(bias_dims));
+  TFLITE_DCHECK(IsPackedWithoutStrides(output_dims));
+  TFLITE_DCHECK_EQ(ArraySize(output_dims, 1) * ArraySize(output_dims, 2) *
+                       ArraySize(output_dims, 3),
+                   1);
+  const int input_size = input_dims.strides[3];
+  const int output_size = MatchingArraySize(weights_dims, 1, output_dims, 0);
+  // This special fast path for quantized LSTM cells does not try to support
+  // odd sizes that we haven't encountered in any LSTM cell, that would
+  // require special code (that would go untested until any LSTM cell
+  // exercises it). We just guard our assumptions about size evenness with
+  // the following assertions.
+  TFLITE_DCHECK(!(output_size % 4));
+  TFLITE_DCHECK(!(input_size % 8));
+  const int32* bias_ptr = bias_data;
+  int16* output_ptr = output_data;
+  for (int out = 0; out < output_size; out += 4) {
+    int32x4_t acc_0 = vdupq_n_s32(0);
+    int32x4_t acc_1 = vdupq_n_s32(0);
+    int32x4_t acc_2 = vdupq_n_s32(0);
+    int32x4_t acc_3 = vdupq_n_s32(0);
+    const int16x8_t input_offset_vec = vdupq_n_s16(-128);
+    const int16x8_t weights_offset_vec = vdupq_n_s16(-weights_zero_point);
+    int in = 0;
+    // Handle 16 levels of depth at a time.
+    for (; in <= input_size - 16; in += 16) {
+      const uint8x16_t input_val_u8 = vld1q_u8(input_data + in);
+      const uint8* weights_ptr = weights_data + in + out * input_size;
+      uint8x16_t weights_val_u8_0 = vld1q_u8(weights_ptr + 0 * input_size);
+      uint8x16_t weights_val_u8_1 = vld1q_u8(weights_ptr + 1 * input_size);
+      uint8x16_t weights_val_u8_2 = vld1q_u8(weights_ptr + 2 * input_size);
+      uint8x16_t weights_val_u8_3 = vld1q_u8(weights_ptr + 3 * input_size);
+      int16x8_t input_val_0, input_val_1;
+      const uint8x8_t low = vget_low_u8(input_val_u8);
+      const uint8x8_t high = vget_high_u8(input_val_u8);
+      input_val_0 = vreinterpretq_s16_u16(vmovl_u8(low));
+      input_val_1 = vreinterpretq_s16_u16(vmovl_u8(high));
+      input_val_0 = vaddq_s16(input_val_0, input_offset_vec);
+      input_val_1 = vaddq_s16(input_val_1, input_offset_vec);
+      int16x8_t weights_val_0_0, weights_val_1_0, weights_val_2_0,
+          weights_val_3_0;
+      int16x8_t weights_val_0_1, weights_val_1_1, weights_val_2_1,
+          weights_val_3_1;
+      weights_val_0_0 = vaddq_s16(
+          vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(weights_val_u8_0))),
+          weights_offset_vec);
+      weights_val_0_1 = vaddq_s16(
+          vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(weights_val_u8_0))),
+          weights_offset_vec);
+      weights_val_1_0 = vaddq_s16(
+          vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(weights_val_u8_1))),
+          weights_offset_vec);
+      weights_val_1_1 = vaddq_s16(
+          vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(weights_val_u8_1))),
+          weights_offset_vec);
+      weights_val_2_0 = vaddq_s16(
+          vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(weights_val_u8_2))),
+          weights_offset_vec);
+      weights_val_2_1 = vaddq_s16(
+          vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(weights_val_u8_2))),
+          weights_offset_vec);
+      weights_val_3_0 = vaddq_s16(
+          vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(weights_val_u8_3))),
+          weights_offset_vec);
+      weights_val_3_1 = vaddq_s16(
+          vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(weights_val_u8_3))),
+          weights_offset_vec);
+      acc_0 = vmlal_s16(acc_0, vget_low_s16(weights_val_0_0),
+                        vget_low_s16(input_val_0));
+      acc_1 = vmlal_s16(acc_1, vget_low_s16(weights_val_1_0),
+                        vget_low_s16(input_val_0));
+      acc_2 = vmlal_s16(acc_2, vget_low_s16(weights_val_2_0),
+                        vget_low_s16(input_val_0));
+      acc_3 = vmlal_s16(acc_3, vget_low_s16(weights_val_3_0),
+                        vget_low_s16(input_val_0));
+      acc_0 = vmlal_s16(acc_0, vget_high_s16(weights_val_0_0),
+                        vget_high_s16(input_val_0));
+      acc_1 = vmlal_s16(acc_1, vget_high_s16(weights_val_1_0),
+                        vget_high_s16(input_val_0));
+      acc_2 = vmlal_s16(acc_2, vget_high_s16(weights_val_2_0),
+                        vget_high_s16(input_val_0));
+      acc_3 = vmlal_s16(acc_3, vget_high_s16(weights_val_3_0),
+                        vget_high_s16(input_val_0));
+      acc_0 = vmlal_s16(acc_0, vget_low_s16(weights_val_0_1),
+                        vget_low_s16(input_val_1));
+      acc_1 = vmlal_s16(acc_1, vget_low_s16(weights_val_1_1),
+                        vget_low_s16(input_val_1));
+      acc_2 = vmlal_s16(acc_2, vget_low_s16(weights_val_2_1),
+                        vget_low_s16(input_val_1));
+      acc_3 = vmlal_s16(acc_3, vget_low_s16(weights_val_3_1),
+                        vget_low_s16(input_val_1));
+      acc_0 = vmlal_s16(acc_0, vget_high_s16(weights_val_0_1),
+                        vget_high_s16(input_val_1));
+      acc_1 = vmlal_s16(acc_1, vget_high_s16(weights_val_1_1),
+                        vget_high_s16(input_val_1));
+      acc_2 = vmlal_s16(acc_2, vget_high_s16(weights_val_2_1),
+                        vget_high_s16(input_val_1));
+      acc_3 = vmlal_s16(acc_3, vget_high_s16(weights_val_3_1),
+                        vget_high_s16(input_val_1));
+    }
+    // Handle 8 levels of depth at a time.
+    for (; in < input_size; in += 8) {
+      const uint8x8_t input_val_u8 = vld1_u8(input_data + in);
+      const uint8* weights_ptr = weights_data + in + out * input_size;
+      uint8x8_t weights_val_u8_0 = vld1_u8(weights_ptr + 0 * input_size);
+      uint8x8_t weights_val_u8_1 = vld1_u8(weights_ptr + 1 * input_size);
+      uint8x8_t weights_val_u8_2 = vld1_u8(weights_ptr + 2 * input_size);
+      uint8x8_t weights_val_u8_3 = vld1_u8(weights_ptr + 3 * input_size);
+      int16x8_t input_val;
+      input_val = vreinterpretq_s16_u16(vmovl_u8(input_val_u8));
+      input_val = vaddq_s16(input_val, input_offset_vec);
+      int16x8_t weights_val_0, weights_val_1, weights_val_2, weights_val_3;
+      weights_val_0 =
+          vaddq_s16(vreinterpretq_s16_u16(vmovl_u8(weights_val_u8_0)),
+                    weights_offset_vec);
+      weights_val_1 =
+          vaddq_s16(vreinterpretq_s16_u16(vmovl_u8(weights_val_u8_1)),
+                    weights_offset_vec);
+      weights_val_2 =
+          vaddq_s16(vreinterpretq_s16_u16(vmovl_u8(weights_val_u8_2)),
+                    weights_offset_vec);
+      weights_val_3 =
+          vaddq_s16(vreinterpretq_s16_u16(vmovl_u8(weights_val_u8_3)),
+                    weights_offset_vec);
+      acc_0 = vmlal_s16(acc_0, vget_low_s16(weights_val_0),
+                        vget_low_s16(input_val));
+      acc_1 = vmlal_s16(acc_1, vget_low_s16(weights_val_1),
+                        vget_low_s16(input_val));
+      acc_2 = vmlal_s16(acc_2, vget_low_s16(weights_val_2),
+                        vget_low_s16(input_val));
+      acc_3 = vmlal_s16(acc_3, vget_low_s16(weights_val_3),
+                        vget_low_s16(input_val));
+      acc_0 = vmlal_s16(acc_0, vget_high_s16(weights_val_0),
+                        vget_high_s16(input_val));
+      acc_1 = vmlal_s16(acc_1, vget_high_s16(weights_val_1),
+                        vget_high_s16(input_val));
+      acc_2 = vmlal_s16(acc_2, vget_high_s16(weights_val_2),
+                        vget_high_s16(input_val));
+      acc_3 = vmlal_s16(acc_3, vget_high_s16(weights_val_3),
+                        vget_high_s16(input_val));
+    }
+    // Horizontally reduce accumulators
+    int32x2_t pairwise_reduced_acc_0, pairwise_reduced_acc_1,
+        pairwise_reduced_acc_2, pairwise_reduced_acc_3;
+    pairwise_reduced_acc_0 =
+        vpadd_s32(vget_low_s32(acc_0), vget_high_s32(acc_0));
+    pairwise_reduced_acc_1 =
+        vpadd_s32(vget_low_s32(acc_1), vget_high_s32(acc_1));
+    pairwise_reduced_acc_2 =
+        vpadd_s32(vget_low_s32(acc_2), vget_high_s32(acc_2));
+    pairwise_reduced_acc_3 =
+        vpadd_s32(vget_low_s32(acc_3), vget_high_s32(acc_3));
+    const int32x2_t reduced_lo =
+        vpadd_s32(pairwise_reduced_acc_0, pairwise_reduced_acc_1);
+    const int32x2_t reduced_hi =
+        vpadd_s32(pairwise_reduced_acc_2, pairwise_reduced_acc_3);
+    int32x4_t reduced = vcombine_s32(reduced_lo, reduced_hi);
+    // Add bias values.
+    int32x4_t bias_vec = vld1q_s32(bias_ptr);
+    bias_ptr += 4;
+    reduced = vaddq_s32(reduced, bias_vec);
+    int left_shift = accum_shift > 0 ? accum_shift : 0;
+    int right_shift = accum_shift > 0 ? 0 : -accum_shift;
+    reduced = vshlq_s32(reduced, vdupq_n_s32(left_shift));
+    // Multiply by the fixed-point multiplier.
+    reduced = vqrdmulhq_n_s32(reduced, accum_multiplier);
+    // Rounding-shift-right.
+    using gemmlowp::RoundingDivideByPOT;
+    reduced = RoundingDivideByPOT(reduced, right_shift);
+    // Narrow values down to 16 bit signed.
+    const int16x4_t res16 = vqmovn_s32(reduced);
+    vst1_s16(output_ptr, res16);
+    output_ptr += 4;
+  }
+}
+#endif
+
 // Quantized LSTM cell. Currently just a copy of the reference impl in
 // reference_ops.h. See the big function comment there, not replicating it
 // here.
@@ -2095,7 +2379,8 @@ void LstmCell(const uint8* input_data_uint8, const Dims<4>& input_dims,
               const Dims<4>& output_activ_dims, uint8* concat_temp_data_uint8,
               const Dims<4>& concat_temp_dims, int16* activ_temp_data_int16,
               const Dims<4>& activ_temp_dims, int32 weights_zero_point,
-              int32 accum_multiplier, int accum_shift) {
+              int32 accum_multiplier, int accum_shift,
+              gemmlowp::GemmContext* gemm_context) {
   gemmlowp::ScopedProfilingLabel label(
       "LstmCell/quantized (8bit external, 16bit internal)");
   // Gather dimensions information, and perform consistency checks.
@@ -2144,42 +2429,131 @@ void LstmCell(const uint8* input_data_uint8, const Dims<4>& input_dims,
   // integers, and the output is 16-bit fixed-point with 3 integer bits so
   // the output range is [-2^3, 2^3] == [-8, 8]. The rationale for that
   // is explained in the function comment above.
-  for (int b = 0; b < fc_batches; ++b) {
-    for (int out_c = 0; out_c < fc_output_depth; ++out_c) {
-      // Internal accumulation.
-      // Initialize accumulator with the bias-value.
-      int32 accum = bias_data_int32[out_c];
-      // Accumulation loop.
-      for (int d = 0; d < fc_accum_depth; ++d) {
-        int16 input_val = concat_temp_data_uint8[b * fc_accum_depth + d] - 128;
-        int16 weights_val =
-            weights_data_uint8[out_c * fc_accum_depth + d] - weights_zero_point;
-        accum += input_val * weights_val;
-      }
-      // Down-scale the final int32 accumulator to the scale used by our
-      // (16-bit, using 3 integer bits) fixed-point format. The quantized
-      // multiplier and shift here have been pre-computed offline
-      // (e.g. by toco).
-      // Note that the implicit assumption here, that this multiplier is smaller
-      // than one, is equivalent to the assumption that the fully-connected
-      // weights min-max is enclosed within [-4, 4] (it may be narrower).
-      // If that eventually fails, offline tools (e.g. toco) will fail early
-      // and that will be easy to support as needed. For now, assuming that
-      // this multiplier is less than one allows us to use a simpler, more
-      // accurate implementation.
-      accum =
-          MultiplyByQuantizedMultiplier(accum, accum_multiplier, accum_shift);
-      // Saturate, cast to int16, and store to the temporary activations array.
-      accum = std::max(-32768, std::min(32767, accum));
-      activ_temp_data_int16[out_c + fc_output_depth * b] = accum;
-    }
+  bool gemm_already_performed = false;
+#ifdef GEMMLOWP_NEON
+  if (fc_batches == 1 && !(fc_output_depth % 4) && !(fc_accum_depth % 8)) {
+    GEMVForLstmCell(concat_temp_data_uint8, concat_temp_dims,
+                    weights_data_uint8, weights_dims, weights_zero_point,
+                    bias_data_int32, bias_dims, accum_multiplier, accum_shift,
+                    activ_temp_data_int16, activ_temp_dims);
+    gemm_already_performed = true;
+  }
+#endif
+  if (!gemm_already_performed) {
+    gemmlowp::MatrixMap<const uint8, gemmlowp::MapOrder::RowMajor>
+        weights_matrix(weights_data_uint8, fc_output_depth, fc_accum_depth);
+    gemmlowp::MatrixMap<const uint8, gemmlowp::MapOrder::ColMajor> input_matrix(
+        concat_temp_data_uint8, fc_accum_depth, fc_batches);
+    gemmlowp::MatrixMap<int16, gemmlowp::MapOrder::ColMajor> output_matrix(
+        activ_temp_data_int16, fc_output_depth, fc_batches);
+    typedef gemmlowp::VectorMap<const int32, gemmlowp::VectorShape::Col>
+        ColVectorMap;
+    ColVectorMap bias_vector(bias_data_int32, fc_output_depth);
+    gemmlowp::OutputStageBiasAddition<ColVectorMap> bias_addition_stage;
+    bias_addition_stage.bias_vector = bias_vector;
+    gemmlowp::OutputStageScaleInt32ByFixedPointAndExponent scale_stage;
+    scale_stage.result_offset_after_shift = 0;
+    scale_stage.result_fixedpoint_multiplier = accum_multiplier;
+    scale_stage.result_exponent = accum_shift;
+    gemmlowp::OutputStageSaturatingCastToInt16 saturating_cast_int16_stage;
+    auto output_pipeline = std::make_tuple(bias_addition_stage, scale_stage,
+                                           saturating_cast_int16_stage);
+    gemmlowp::GemmWithOutputPipeline<
+        uint8, int16, gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
+        gemm_context, weights_matrix, input_matrix, &output_matrix,
+        -weights_zero_point, -128, output_pipeline);
   }
 
   // Rest of the LSTM cell: tanh and logistic math functions, and some adds
   // and muls, all done in 16-bit fixed-point.
   const int outer_size = batches * width * height;
+  const int16* input_gate_input_ptr = activ_temp_data_int16;
+  const int16* input_modulation_gate_input_ptr =
+      activ_temp_data_int16 + output_depth;
+  const int16* forget_gate_input_ptr = activ_temp_data_int16 + 2 * output_depth;
+  const int16* output_gate_input_ptr = activ_temp_data_int16 + 3 * output_depth;
+  const int16* prev_state_ptr = prev_state_data_int16;
+  int16* output_state_data_ptr = output_state_data_int16;
+  uint8* output_activ_data_ptr = output_activ_data_uint8;
+
   for (int b = 0; b < outer_size; ++b) {
-    for (int c = 0; c < output_depth; ++c) {
+    int c = 0;
+#ifdef GEMMLOWP_NEON
+    for (; c <= output_depth - 8; c += 8) {
+      // Define the fixed-point data types that we will use here. All use
+      // int16 as the underlying integer type i.e. all are 16-bit fixed-point.
+      // They only differ by the number of integral vs. fractional bits,
+      // determining the range of values that they can represent.
+      //
+      // F0 uses 0 integer bits, range [-1, 1].
+      // This is the return type of math functions such as tanh, logistic,
+      // whose range is in [-1, 1].
+      using F0 = gemmlowp::FixedPoint<int16x8_t, 0>;
+      // F3 uses 3 integer bits, range [-8, 8].
+      // This is the range of the previous fully-connected node's output,
+      // which is our input here.
+      using F3 = gemmlowp::FixedPoint<int16x8_t, 3>;
+      // FS uses StateIntegerBits integer bits, range [-2^StateIntegerBits,
+      // 2^StateIntegerBits]. It's used to represent the internal state, whose
+      // number of integer bits is currently dictated by the model. See comment
+      // on the StateIntegerBits template parameter above.
+      using FS = gemmlowp::FixedPoint<int16x8_t, StateIntegerBits>;
+      // Implementation of input gate, using fixed-point logistic function.
+      F3 input_gate_input = F3::FromRaw(vld1q_s16(input_gate_input_ptr));
+      input_gate_input_ptr += 8;
+      F0 input_gate_output = gemmlowp::logistic(input_gate_input);
+      // Implementation of input modulation gate, using fixed-point tanh
+      // function.
+      F3 input_modulation_gate_input =
+          F3::FromRaw(vld1q_s16(input_modulation_gate_input_ptr));
+      input_modulation_gate_input_ptr += 8;
+      F0 input_modulation_gate_output =
+          gemmlowp::tanh(input_modulation_gate_input);
+      // Implementation of forget gate, using fixed-point logistic function.
+      F3 forget_gate_input = F3::FromRaw(vld1q_s16(forget_gate_input_ptr));
+      forget_gate_input_ptr += 8;
+      F0 forget_gate_output = gemmlowp::logistic(forget_gate_input);
+      // Implementation of output gate, using fixed-point logistic function.
+      F3 output_gate_input = F3::FromRaw(vld1q_s16(output_gate_input_ptr));
+      output_gate_input_ptr += 8;
+      F0 output_gate_output = gemmlowp::logistic(output_gate_input);
+      // Implementation of internal multiplication nodes, still in fixed-point.
+      F0 input_times_input_modulation =
+          input_gate_output * input_modulation_gate_output;
+      FS prev_state = FS::FromRaw(vld1q_s16(prev_state_ptr));
+      prev_state_ptr += 8;
+      FS prev_state_times_forget_state = forget_gate_output * prev_state;
+      // Implementation of internal addition node, saturating.
+      FS new_state = gemmlowp::SaturatingAdd(
+          gemmlowp::Rescale<StateIntegerBits>(input_times_input_modulation),
+          prev_state_times_forget_state);
+      // Implementation of last internal Tanh node, still in fixed-point.
+      // Since a Tanh fixed-point implementation is specialized for a given
+      // number of integer bits, and each specialization can have a substantial
+      // code size, and we already used above a Tanh on an input with 3 integer
+      // bits, and per the table in the above function comment there is no
+      // significant accuracy to be lost by clamping to [-8, +8] for a
+      // 3-integer-bits representation, let us just do that. This helps people
+      // porting this to targets where code footprint must be minimized.
+      F3 new_state_f3 = gemmlowp::Rescale<3>(new_state);
+      F0 output_activ_int16 = output_gate_output * gemmlowp::tanh(new_state_f3);
+      // Store the new internal state back to memory, as 16-bit integers.
+      // Note: here we store the original value with StateIntegerBits, not
+      // the rescaled 3-integer-bits value fed to tanh.
+      vst1q_s16(output_state_data_ptr, new_state.raw());
+      output_state_data_ptr += 8;
+      // Down-scale the output activations to 8-bit integers, saturating,
+      // and store back to memory.
+      int16x8_t rescaled_output_activ =
+          gemmlowp::RoundingDivideByPOT(output_activ_int16.raw(), 8);
+      int8x8_t int8_output_activ = vqmovn_s16(rescaled_output_activ);
+      uint8x8_t uint8_output_activ =
+          vadd_u8(vdup_n_u8(128), vreinterpret_u8_s8(int8_output_activ));
+      vst1_u8(output_activ_data_ptr, uint8_output_activ);
+      output_activ_data_ptr += 8;
+    }
+#endif
+    for (; c < output_depth; ++c) {
       // Define the fixed-point data types that we will use here. All use
       // int16 as the underlying integer type i.e. all are 16-bit fixed-point.
       // They only differ by the number of integral vs. fractional bits,
@@ -2199,45 +2573,55 @@ void LstmCell(const uint8* input_data_uint8, const Dims<4>& input_dims,
       // on the StateIntegerBits template parameter above.
       using FS = gemmlowp::FixedPoint<std::int16_t, StateIntegerBits>;
       // Implementation of input gate, using fixed-point logistic function.
-      F3 input_gate_input = F3::FromRaw(
-          activ_temp_data_int16[b * fc_output_depth + 0 * output_depth + c]);
+      F3 input_gate_input = F3::FromRaw(*input_gate_input_ptr++);
       F0 input_gate_output = gemmlowp::logistic(input_gate_input);
       // Implementation of input modulation gate, using fixed-point tanh
       // function.
-      F3 input_modulation_gate_input = F3::FromRaw(
-          activ_temp_data_int16[b * fc_output_depth + 1 * output_depth + c]);
+      F3 input_modulation_gate_input =
+          F3::FromRaw(*input_modulation_gate_input_ptr++);
       F0 input_modulation_gate_output =
           gemmlowp::tanh(input_modulation_gate_input);
       // Implementation of forget gate, using fixed-point logistic function.
-      F3 forget_gate_input = F3::FromRaw(
-          activ_temp_data_int16[b * fc_output_depth + 2 * output_depth + c]);
+      F3 forget_gate_input = F3::FromRaw(*forget_gate_input_ptr++);
       F0 forget_gate_output = gemmlowp::logistic(forget_gate_input);
       // Implementation of output gate, using fixed-point logistic function.
-      F3 output_gate_input = F3::FromRaw(
-          activ_temp_data_int16[b * fc_output_depth + 3 * output_depth + c]);
+      F3 output_gate_input = F3::FromRaw(*output_gate_input_ptr++);
       F0 output_gate_output = gemmlowp::logistic(output_gate_input);
       // Implementation of internal multiplication nodes, still in fixed-point.
       F0 input_times_input_modulation =
           input_gate_output * input_modulation_gate_output;
-      FS prev_state = FS::FromRaw(prev_state_data_int16[b * output_depth + c]);
+      FS prev_state = FS::FromRaw(*prev_state_ptr++);
       FS prev_state_times_forget_state = forget_gate_output * prev_state;
       // Implementation of internal addition node, saturating.
       FS new_state = gemmlowp::SaturatingAdd(
           gemmlowp::Rescale<StateIntegerBits>(input_times_input_modulation),
           prev_state_times_forget_state);
-      // Implementation of last internal tanh node, still in fixed-point.
-      F0 output_activ_int16 = output_gate_output * gemmlowp::tanh(new_state);
+      // Implementation of last internal Tanh node, still in fixed-point.
+      // Since a Tanh fixed-point implementation is specialized for a given
+      // number of integer bits, and each specialization can have a substantial
+      // code size, and we already used above a Tanh on an input with 3 integer
+      // bits, and per the table in the above function comment there is no
+      // significant accuracy to be lost by clamping to [-8, +8] for a
+      // 3-integer-bits representation, let us just do that. This helps people
+      // porting this to targets where code footprint must be minimized.
+      F3 new_state_f3 = gemmlowp::Rescale<3>(new_state);
+      F0 output_activ_int16 = output_gate_output * gemmlowp::tanh(new_state_f3);
       // Store the new internal state back to memory, as 16-bit integers.
-      output_state_data_int16[b * output_depth + c] = new_state.raw();
+      // Note: here we store the original value with StateIntegerBits, not
+      // the rescaled 3-integer-bits value fed to tanh.
+      *output_state_data_ptr++ = new_state.raw();
       // Down-scale the output activations to 8-bit integers, saturating,
       // and store back to memory.
       int16 rescaled_output_activ =
           gemmlowp::RoundingDivideByPOT(output_activ_int16.raw(), 8);
       int16 clamped_output_activ =
           std::max<int16>(-128, std::min<int16>(127, rescaled_output_activ));
-      output_activ_data_uint8[b * output_depth + c] =
-          128 + clamped_output_activ;
+      *output_activ_data_ptr++ = 128 + clamped_output_activ;
     }
+    input_gate_input_ptr += 3 * output_depth;
+    input_modulation_gate_input_ptr += 3 * output_depth;
+    forget_gate_input_ptr += 3 * output_depth;
+    output_gate_input_ptr += 3 * output_depth;
   }
 }
 
@@ -3060,6 +3444,43 @@ inline void Softmax(const uint8* input_data, const Dims<4>& input_dims,
   }
 }
 
+// TODO(myenik): This is the same as the reference implementation, not actually
+// optimized yet.
+inline void LogSoftmax(const float* input_data, const Dims<4>& input_dims,
+                       float* output_data, const Dims<4>& output_dims) {
+  const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
+  const int height = MatchingArraySize(input_dims, 2, output_dims, 2);
+  const int width = MatchingArraySize(input_dims, 1, output_dims, 1);
+  const int depth = MatchingArraySize(input_dims, 0, output_dims, 0);
+
+  for (int b = 0; b < batches; ++b) {
+    for (int y = 0; y < height; ++y) {
+      for (int x = 0; x < width; ++x) {
+        // Find max element value which we'll use to ensure numerical stability
+        // taking advantage of the following equality:
+        // log(exp(x[i])/sum(exp(x[i]))) == log(exp(x[i]+C)/sum(exp(x[i]+C)))
+        float max = std::numeric_limits<float>::lowest();
+        for (int c = 0; c < depth; ++c) {
+          max = std::max(max, input_data[Offset(input_dims, c, x, y, b)]);
+        }
+
+        // Compute sum.
+        float sum = 0.f;
+        for (int c = 0; c < depth; ++c) {
+          sum += std::exp(input_data[Offset(input_dims, c, x, y, b)] - max);
+        }
+
+        // Compute result.
+        const float log_sum = std::log(sum);
+        for (int c = 0; c < depth; ++c) {
+          output_data[Offset(output_dims, c, x, y, b)] =
+              input_data[Offset(input_dims, c, x, y, b)] - max - log_sum;
+        }
+      }
+    }
+  }
+}
+
 inline void Logistic(const float* input_data, const Dims<4>& input_dims,
                      float* output_data, const Dims<4>& output_dims) {
   gemmlowp::ScopedProfilingLabel label("Logistic");
@@ -4275,6 +4696,35 @@ void ArgMax(const T3* axis, const T1* input_data, const Dims<4>& input_dims,
   }
 }
 
+template <typename T>
+void Transpose(const T* input, const Dims<4>& input_dims, T* output,
+               const Dims<4>& output_dims, const int* permuted_axes) {
+  int out_sizes[4];
+  // Compute the output sizes for an output-centered transpose: output axis k
+  // maps to input axis permuted_axes[k]. MatchingArraySize checks the dims.
+  for (int k = 0; k < 4; k++) {
+    out_sizes[k] =
+        MatchingArraySize(input_dims, permuted_axes[k], output_dims, k);
+  }
+
+  // Naive transpose loop (iterate on output index and compute input index).
+  int o[4];  // loop index (on output).
+  int i[4];  // matching input index, filled in from o via permuted_axes.
+  for (o[3] = 0; o[3] < out_sizes[3]; o[3]++) {
+    i[permuted_axes[3]] = o[3];
+    for (o[2] = 0; o[2] < out_sizes[2]; o[2]++) {
+      i[permuted_axes[2]] = o[2];
+      for (o[1] = 0; o[1] < out_sizes[1]; o[1]++) {
+        i[permuted_axes[1]] = o[1];
+        for (o[0] = 0; o[0] < out_sizes[0]; o[0]++) {
+          i[permuted_axes[0]] = o[0];
+          output[Offset(output_dims, o)] = input[Offset(input_dims, i)];
+        }
+      }
+    }
+  }
+}
+
 }  // namespace optimized_ops
 }  // namespace tflite
 
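Reviewer note on the Transpose added above: the loop walks the output index and scatters each component into the input index through permuted_axes. Below is a minimal standalone sketch of the same idea, assuming a dense 4-D layout where index 0 varies fastest. FlatOffset and TransposeSketch are hypothetical names for illustration only, not TFLite functions, and the real code relies on Dims<4> strides and Offset() instead.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a TFLite API): flat offset into a dense 4-D array
// in which index 0 is the fastest-varying dimension.
inline int FlatOffset(const std::array<int, 4>& sizes,
                      const std::array<int, 4>& idx) {
  return idx[0] + sizes[0] * (idx[1] + sizes[1] * (idx[2] + sizes[2] * idx[3]));
}

// Output-centered transpose: output axis k takes both its extent and its
// index from input axis perm[k].
template <typename T>
std::vector<T> TransposeSketch(const std::vector<T>& input,
                               const std::array<int, 4>& in_sizes,
                               const std::array<int, 4>& perm) {
  assert(input.size() == static_cast<std::size_t>(in_sizes[0]) * in_sizes[1] *
                             in_sizes[2] * in_sizes[3]);
  std::array<int, 4> out_sizes;
  for (int k = 0; k < 4; ++k) out_sizes[k] = in_sizes[perm[k]];

  std::vector<T> output(input.size());
  std::array<int, 4> o;  // Loop index on the output.
  std::array<int, 4> i;  // Matching index on the input.
  for (o[3] = 0; o[3] < out_sizes[3]; ++o[3]) {
    i[perm[3]] = o[3];
    for (o[2] = 0; o[2] < out_sizes[2]; ++o[2]) {
      i[perm[2]] = o[2];
      for (o[1] = 0; o[1] < out_sizes[1]; ++o[1]) {
        i[perm[1]] = o[1];
        for (o[0] = 0; o[0] < out_sizes[0]; ++o[0]) {
          i[perm[0]] = o[0];
          output[FlatOffset(out_sizes, o)] = input[FlatOffset(in_sizes, i)];
        }
      }
    }
  }
  return output;
}
```

Because every write goes to a distinct output offset, no inverse permutation is needed; the sketch assumes perm is a valid permutation of {0, 1, 2, 3} and does not check it.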
diff --git a/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h b/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
index 5f4d5be3235433787e3eed6d37a7f403348d9eee..53de21697b95039e32383a7a9d99c2e3168068c2 100644
--- a/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
+++ b/tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h
@@ -157,11 +157,11 @@ inline void NdArrayDescsForElementwiseBroadcast(const Dims<N>& input0_dims,
 inline void Conv(const float* input_data, const Dims<4>& input_dims,
                  const float* filter_data, const Dims<4>& filter_dims,
                  const float* bias_data, const Dims<4>& bias_dims,
-                 int stride_width, int stride_height, int pad_width,
-                 int pad_height, float output_activation_min,
-                 float output_activation_max, float* output_data,
-                 const Dims<4>& output_dims, float* im2col_data,
-                 const Dims<4>& im2col_dims) {
+                 int stride_width, int stride_height, int dilation_width_factor,
+                 int dilation_height_factor, int pad_width, int pad_height,
+                 float output_activation_min, float output_activation_max,
+                 float* output_data, const Dims<4>& output_dims,
+                 float* im2col_data, const Dims<4>& im2col_dims) {
   (void)im2col_data;  // only used in optimized code.
   (void)im2col_dims;  // only used in optimized code.
   const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
@@ -186,8 +186,9 @@ inline void Conv(const float* input_data, const Dims<4>& input_dims,
           for (int filter_y = 0; filter_y < filter_height; ++filter_y) {
             for (int filter_x = 0; filter_x < filter_width; ++filter_x) {
               for (int in_channel = 0; in_channel < input_depth; ++in_channel) {
-                const int in_x = in_x_origin + filter_x;
-                const int in_y = in_y_origin + filter_y;
+                const int in_x = in_x_origin + dilation_width_factor * filter_x;
+                const int in_y =
+                    in_y_origin + dilation_height_factor * filter_y;
                 // If the location is outside the bounds of the input image,
                 // use zero as a default value.
                 if ((in_x >= 0) && (in_x < input_width) && (in_y >= 0) &&
@@ -216,6 +217,23 @@ inline void Conv(const float* input_data, const Dims<4>& input_dims,
   }
 }
 
+template <FusedActivationFunctionType Ac>
+void Conv(const float* input_data, const Dims<4>& input_dims,
+          const float* filter_data, const Dims<4>& filter_dims,
+          const float* bias_data, const Dims<4>& bias_dims, int stride_width,
+          int stride_height, int dilation_width_factor,
+          int dilation_height_factor, int pad_width, int pad_height,
+          float* output_data, const Dims<4>& output_dims, float* im2col_data,
+          const Dims<4>& im2col_dims) {
+  float output_activation_min, output_activation_max;
+  GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
+  Conv(input_data, input_dims, filter_data, filter_dims, bias_data, bias_dims,
+       stride_width, stride_height, dilation_width_factor,
+       dilation_height_factor, pad_width, pad_height, output_activation_min,
+       output_activation_max, output_data, output_dims, im2col_data,
+       im2col_dims);
+}
+
 // legacy, for compatibility with old checked-in code
 template <FusedActivationFunctionType Ac>
 void Conv(const float* input_data, const Dims<4>& input_dims,
@@ -227,7 +245,7 @@ void Conv(const float* input_data, const Dims<4>& input_dims,
   float output_activation_min, output_activation_max;
   GetActivationMinMax(Ac, &output_activation_min, &output_activation_max);
   Conv(input_data, input_dims, filter_data, filter_dims, bias_data, bias_dims,
-       stride_width, stride_height, pad_width, pad_height,
+       stride_width, stride_height, 1, 1, pad_width, pad_height,
        output_activation_min, output_activation_max, output_data, output_dims,
        im2col_data, im2col_dims);
 }
@@ -241,7 +259,7 @@ void Conv(const float* input_data, const Dims<4>& input_dims,
           const Dims<4>& output_dims, float* im2col_data,
           const Dims<4>& im2col_dims) {
   Conv<Ac>(input_data, input_dims, filter_data, filter_dims, bias_data,
-           bias_dims, stride, stride, pad_width, pad_height, output_data,
+           bias_dims, stride, stride, 1, 1, pad_width, pad_height, output_data,
            output_dims, im2col_data, im2col_dims);
 }
 
@@ -1453,7 +1471,10 @@ void LstmCell(const uint8* input_data_uint8, const Dims<4>& input_dims,
               const Dims<4>& output_activ_dims, uint8* concat_temp_data_uint8,
               const Dims<4>& concat_temp_dims, int16* activ_temp_data_int16,
               const Dims<4>& activ_temp_dims, int32 weights_zero_point,
-              int32 accum_multiplier, int accum_shift) {
+              int32 accum_multiplier, int accum_shift,
+              gemmlowp::GemmContext* gemm_context) {
+  (void)gemm_context;  // only used in optimized code.
+
   // Gather dimensions information, and perform consistency checks.
   const int batches =
       MatchingArraySize(input_dims, 3, prev_activ_dims, 3, prev_state_dims, 3,
@@ -1574,9 +1595,19 @@ void LstmCell(const uint8* input_data_uint8, const Dims<4>& input_dims,
       FS new_state = gemmlowp::SaturatingAdd(
           gemmlowp::Rescale<StateIntegerBits>(input_times_input_modulation),
           prev_state_times_forget_state);
-      // Implementation of last internal tanh node, still in fixed-point.
-      F0 output_activ_int16 = output_gate_output * gemmlowp::tanh(new_state);
+      // Implementation of last internal Tanh node, still in fixed-point.
+      // Since a Tanh fixed-point implementation is specialized for a given
+      // number of integer bits, and each specialization can have a substantial
+      // code size, and we already used above a Tanh on an input with 3 integer
+      // bits, and per the table in the above function comment there is no
+      // significant accuracy to be lost by clamping to [-8, +8] for a
+      // 3-integer-bits representation, let us just do that. This helps people
+      // porting this to targets where code footprint must be minimized.
+      F3 new_state_f3 = gemmlowp::Rescale<3>(new_state);
+      F0 output_activ_int16 = output_gate_output * gemmlowp::tanh(new_state_f3);
       // Store the new internal state back to memory, as 16-bit integers.
+      // Note: here we store the original value with StateIntegerBits, not
+      // the rescaled 3-integer-bits value fed to tanh.
       output_state_data_int16[b * output_depth + c] = new_state.raw();
       // Down-scale the output activations to 8-bit integers, saturating,
       // and store back to memory.
@@ -2203,6 +2234,41 @@ inline void Softmax(const uint8* input_data, const Dims<4>& input_dims,
   }
 }
 
+inline void LogSoftmax(const float* input_data, const Dims<4>& input_dims,
+                       float* output_data, const Dims<4>& output_dims) {
+  const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
+  const int height = MatchingArraySize(input_dims, 2, output_dims, 2);
+  const int width = MatchingArraySize(input_dims, 1, output_dims, 1);
+  const int depth = MatchingArraySize(input_dims, 0, output_dims, 0);
+
+  for (int b = 0; b < batches; ++b) {
+    for (int y = 0; y < height; ++y) {
+      for (int x = 0; x < width; ++x) {
+        // Find max element value which we'll use to ensure numerical stability
+        // taking advantage of the following equality:
+        // log(exp(x[i])/sum(exp(x[i]))) == log(exp(x[i]+C)/sum(exp(x[i]+C)))
+        float max = std::numeric_limits<float>::lowest();
+        for (int c = 0; c < depth; ++c) {
+          max = std::max(max, input_data[Offset(input_dims, c, x, y, b)]);
+        }
+
+        // Compute sum.
+        float sum = 0.f;
+        for (int c = 0; c < depth; ++c) {
+          sum += std::exp(input_data[Offset(input_dims, c, x, y, b)] - max);
+        }
+
+        // Compute result.
+        const float log_sum = std::log(sum);
+        for (int c = 0; c < depth; ++c) {
+          output_data[Offset(output_dims, c, x, y, b)] =
+              input_data[Offset(input_dims, c, x, y, b)] - max - log_sum;
+        }
+      }
+    }
+  }
+}
+
 inline void Logistic(const float* input_data, const Dims<4>& input_dims,
                      float* output_data, const Dims<4>& output_dims) {
   const int batches = MatchingArraySize(input_dims, 3, output_dims, 3);
@@ -2833,9 +2899,11 @@ inline void Mean(T* input_data, const int* input_dims, const int input_num_dims,
   for (int idx = 0; idx < num_resolved_axis; ++idx) {
     num_elements_in_axis *= static_cast<size_t>(input_dims[resolved_axis[idx]]);
   }
-  for (size_t idx = 0; idx < num_outputs; ++idx) {
-    output_data[idx] = static_cast<T>(static_cast<float>(output_data[idx]) /
-                                      num_elements_in_axis);
+  if (num_elements_in_axis > 0) {
+    for (size_t idx = 0; idx < num_outputs; ++idx) {
+      output_data[idx] = static_cast<T>(static_cast<float>(output_data[idx]) /
+                                        num_elements_in_axis);
+    }
   }
 }
 
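Reviewer note on the dilation parameters added to the reference Conv in this file: the only change to the indexing is that the filter tap is multiplied by the dilation factor before being added to the input origin. A 1-D, single-channel sketch of that indexing follows. DilatedConv1D and its parameters are hypothetical names for illustration, and the origin formula out_x * stride - pad is an assumption about the surrounding code, which is not shown in these hunks.

```cpp
#include <cassert>
#include <vector>

// 1-D, single-channel analogue of the inner loops of a dilated convolution.
// All names are illustrative; this is not a TFLite function.
std::vector<float> DilatedConv1D(const std::vector<float>& input,
                                 const std::vector<float>& filter, int stride,
                                 int dilation, int pad, int output_size) {
  assert(stride > 0 && dilation > 0 && output_size >= 0);
  const int input_size = static_cast<int>(input.size());
  const int filter_size = static_cast<int>(filter.size());
  std::vector<float> output(output_size, 0.f);
  for (int out_x = 0; out_x < output_size; ++out_x) {
    // Assumed origin formula (not visible in the patch hunks).
    const int in_x_origin = out_x * stride - pad;
    float total = 0.f;
    for (int filter_x = 0; filter_x < filter_size; ++filter_x) {
      // Same shape as the patched line: origin + dilation factor * tap.
      const int in_x = in_x_origin + dilation * filter_x;
      // Out-of-bounds taps contribute zero, as in the reference code.
      if (in_x >= 0 && in_x < input_size) {
        total += input[in_x] * filter[filter_x];
      }
    }
    output[out_x] = total;
  }
  return output;
}
```

With dilation set to 1 the index reduces to in_x_origin + filter_x, which is why the legacy overloads in this patch can forward 1, 1 without changing behavior.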
diff --git a/tensorflow/contrib/lite/kernels/log_softmax_test.cc b/tensorflow/contrib/lite/kernels/log_softmax_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..62820a2f5113cb6ae252386aaf3842135383b79f
--- /dev/null
+++ b/tensorflow/contrib/lite/kernels/log_softmax_test.cc
@@ -0,0 +1,112 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+// Unit test for TFLite LOG_SOFTMAX op.
+
+#include <iomanip>
+#include <memory>
+#include <vector>
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+#include "tensorflow/contrib/lite/interpreter.h"
+#include "tensorflow/contrib/lite/kernels/internal/reference/reference_ops.h"
+#include "tensorflow/contrib/lite/kernels/register.h"
+#include "tensorflow/contrib/lite/kernels/test_util.h"
+#include "tensorflow/contrib/lite/model.h"
+
+namespace tflite {
+namespace {
+
+class LogSoftmaxOpModel : public SingleOpModel {
+ public:
+  LogSoftmaxOpModel(int batches, int size)
+      : batches_(batches), input_size_(size) {
+    input_ = AddInput(TensorType_FLOAT32);
+    output_ = AddOutput(TensorType_FLOAT32);
+    SetBuiltinOp(BuiltinOperator_LOG_SOFTMAX, BuiltinOptions_LogSoftmaxOptions,
+                 CreateLogSoftmaxOptions(builder_).Union());
+    BuildInterpreter({{batches_, input_size_}});
+  }
+
+  void SetInput(std::initializer_list<float> data) {
+    PopulateTensor(input_, data);
+  }
+
+  void SetInput(int offset, float* begin, float* end) {
+    PopulateTensor(input_, offset, begin, end);
+  }
+
+  std::vector<float> GetOutput() { return ExtractVector<float>(output_); }
+
+ private:
+  int input_;
+  int output_;
+
+  int batches_;
+  int input_size_;
+};
+
+TEST(LogSoftmaxOpTest, SimpleTest) {
+  LogSoftmaxOpModel m(/*batches=*/2, /*size=*/5);
+  m.SetInput({
+      1.0, 2.0, 3.0, 4.0, 5.0,       // b = 0
+      -1.0, -2.0, -3.0, -4.0, -5.0,  // b = 1
+  });
+
+  m.Invoke();
+
+  EXPECT_THAT(
+      m.GetOutput(),
+      ElementsAreArray(ArrayFloatNear(
+          {-4.45191431, -3.45191431, -2.45191431, -1.45191443, -0.4519144,
+           -0.4519144, -1.45191443, -2.45191431, -3.45191431, -4.45191431},
+          1e-6)));
+}
+
+TEST(LogSoftmaxOpTest, CompareWithTFmini) {
+  const int batch_size = 2;
+  const int input_size = 5;
+  static float input_buffer[] = {
+      1.0,  2.0,  3.0,  4.0,  5.0,   // b = 0
+      -1.0, -2.0, -3.0, -4.0, -5.0,  // b = 1
+  };
+
+  LogSoftmaxOpModel m(batch_size, input_size);
+
+  m.SetInput(0, input_buffer, input_buffer + input_size * batch_size);
+
+  m.Invoke();
+
+  std::unique_ptr<float[]> output_buffer(new float[input_size * batch_size]);
+  static tflite::Dims<4> input_dims = {{input_size, 1, 1, batch_size},
+                                       {1, 0, 0, input_size}};
+  tflite::reference_ops::LogSoftmax(input_buffer, input_dims,
+                                    output_buffer.get(), input_dims);
+
+  std::vector<float> expected;
+  expected.insert(expected.end(), output_buffer.get(),
+                  output_buffer.get() + input_size * batch_size);
+
+  EXPECT_THAT(m.GetOutput(), ElementsAreArray(ArrayFloatNear(expected, 1e-6)));
+}
+
+}  // namespace
+}  // namespace tflite
+
+int main(int argc, char** argv) {
+  ::tflite::LogToStderr();
+  ::testing::InitGoogleTest(&argc, argv);
+  return RUN_ALL_TESTS();
+}
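
Reviewer note on the expected values in SimpleTest above: they can be reproduced independently of TFLite with a plain row-wise log-softmax. The sketch below is only an illustration; `LogSoftmaxRows` is a hypothetical helper, not a TFLite function, and it assumes a row-major [batches, size] layout and the usual max-subtraction formulation. It is not the kernel added by this change.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of TFLite): row-wise log-softmax over a
// row-major [batches, size] buffer, computed as
//   out[i] = x[i] - max(x) - log(sum_j exp(x[j] - max(x))).
std::vector<float> LogSoftmaxRows(const std::vector<float>& in, int batches,
                                  int size) {
  assert(batches > 0 && size > 0);
  assert(in.size() == static_cast<std::size_t>(batches) * size);
  std::vector<float> out(in.size());
  for (int b = 0; b < batches; ++b) {
    const float* row = in.data() + b * size;
    const float max_val = *std::max_element(row, row + size);
    // Accumulate in double; this is a choice made for the sketch only.
    double sum = 0.0;
    for (int i = 0; i < size; ++i) {
      sum += std::exp(static_cast<double>(row[i]) - max_val);
    }
    const double log_sum = std::log(sum);
    for (int i = 0; i < size; ++i) {
      out[b * size + i] = static_cast<float>(row[i] - max_val - log_sum);
    }
  }
  return out;
}
```

Calling `LogSoftmaxRows({1, 2, 3, 4, 5, -1, -2, -3, -4, -5}, 2, 5)` gives values that agree with the constants in the test to within its 1e-6 tolerance (about -4.4519 up to -0.4519 for the first row, and the reverse order for the second).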
diff --git a/tensorflow/contrib/lite/kernels/lstm.cc b/tensorflow/contrib/lite/kernels/lstm.cc
index 6c06264d845c24e71647b6fd2374734be32383ef..b9255b23a5573788e6290633723313b6db7b4f76 100644
--- a/tensorflow/contrib/lite/kernels/lstm.cc
+++ b/tensorflow/contrib/lite/kernels/lstm.cc
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/kernels/activation_functor.h"
+#include "tensorflow/contrib/lite/kernels/internal/kernel_utils.h"
 #include "tensorflow/contrib/lite/kernels/internal/tensor_utils.h"
 #include "tensorflow/contrib/lite/kernels/kernel_util.h"
 #include "tensorflow/contrib/lite/kernels/op_macros.h"
@@ -377,127 +378,54 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
     output_gate_scratch = scratch_buffer->data.f + 3 * n_cell * n_batch;
   }
 
-  // Initialize scratch buffers with bias.
-  if (!use_cifg) {
-    tensor_utils::VectorBatchVectorAssign(input_gate_bias->data.f, n_cell,
-                                          n_batch, input_gate_scratch);
-  }
-  tensor_utils::VectorBatchVectorAssign(forget_gate_bias->data.f, n_cell,
-                                        n_batch, forget_gate_scratch);
-  tensor_utils::VectorBatchVectorAssign(cell_bias->data.f, n_cell, n_batch,
-                                        cell_scratch);
-  tensor_utils::VectorBatchVectorAssign(output_gate_bias->data.f, n_cell,
-                                        n_batch, output_gate_scratch);
-
-  // For each batch and cell: compute input_weight * input.
-  if (!use_cifg) {
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        input_to_input_weights->data.f, n_cell, n_input, input->data.f, n_batch,
-        input_gate_scratch, /*result_stride=*/1);
-  }
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      input_to_forget_weights->data.f, n_cell, n_input, input->data.f, n_batch,
-      forget_gate_scratch, /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      input_to_cell_weights->data.f, n_cell, n_input, input->data.f, n_batch,
-      cell_scratch, /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      input_to_output_weights->data.f, n_cell, n_input, input->data.f, n_batch,
-      output_gate_scratch, /*result_stride=*/1);
-
-  // For each batch and cell: compute recurrent_weight * output_state.
-  if (!use_cifg) {
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        recurrent_to_input_weights->data.f, n_cell, n_output,
-        output_state->data.f, n_batch, input_gate_scratch, /*result_stride=*/1);
-  }
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      recurrent_to_forget_weights->data.f, n_cell, n_output,
-      output_state->data.f, n_batch, forget_gate_scratch, /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      recurrent_to_cell_weights->data.f, n_cell, n_output, output_state->data.f,
-      n_batch, cell_scratch, /*result_stride=*/1);
-  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-      recurrent_to_output_weights->data.f, n_cell, n_output,
-      output_state->data.f, n_batch, output_gate_scratch, /*result_stride=*/1);
-
-  // For each batch and cell: update input gate.
-  if (!use_cifg) {
-    if (use_peephole) {
-      tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-          cell_to_input_weights->data.f, n_cell, cell_state->data.f, n_batch,
-          input_gate_scratch);
-    }
-    tensor_utils::ApplySigmoidToVector(input_gate_scratch, n_cell * n_batch,
-                                       input_gate_scratch);
-  }
-
-  // For each batch and cell: update forget gate.
-  if (use_peephole) {
-    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-        cell_to_forget_weights->data.f, n_cell, cell_state->data.f, n_batch,
-        forget_gate_scratch);
-  }
-  tensor_utils::ApplySigmoidToVector(forget_gate_scratch, n_cell * n_batch,
-                                     forget_gate_scratch);
-
-  // For each batch and cell: update the cell.
-  tensor_utils::VectorVectorCwiseProduct(forget_gate_scratch,
-                                         cell_state->data.f, n_batch * n_cell,
-                                         cell_state->data.f);
-  tensor_utils::ApplyActivationToVector(cell_scratch, n_batch * n_cell,
-                                        params->activation, cell_scratch);
-  if (use_cifg) {
-    tensor_utils::Sub1Vector(forget_gate_scratch, n_batch * n_cell,
-                             forget_gate_scratch);
-    tensor_utils::VectorVectorCwiseProductAccumulate(
-        cell_scratch, forget_gate_scratch, n_batch * n_cell,
-        cell_state->data.f);
-  } else {
-    tensor_utils::VectorVectorCwiseProductAccumulate(
-        cell_scratch, input_gate_scratch, n_batch * n_cell, cell_state->data.f);
-  }
-  if (params->cell_clip > 0.0) {
-    tensor_utils::ClipVector(cell_state->data.f, n_batch * n_cell,
-                             params->cell_clip, cell_state->data.f);
-  }
-
-  // For each batch and cell: update the output gate.
-  if (use_peephole) {
-    tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-        cell_to_output_weights->data.f, n_cell, cell_state->data.f, n_batch,
-        output_gate_scratch);
-  }
-  tensor_utils::ApplySigmoidToVector(output_gate_scratch, n_batch * n_cell,
-                                     output_gate_scratch);
-  tensor_utils::ApplyActivationToVector(cell_state->data.f, n_batch * n_cell,
-                                        params->activation, cell_scratch);
-  tensor_utils::VectorVectorCwiseProduct(output_gate_scratch, cell_scratch,
-                                         n_batch * n_cell, output_gate_scratch);
-
-  // For each batch: update the projection and output_state.
-  const bool use_projection_weight = (projection_weights != nullptr);
-  const bool use_projection_bias = (projection_bias != nullptr);
-  if (use_projection_weight) {
-    if (use_projection_bias) {
-      tensor_utils::VectorBatchVectorAssign(projection_bias->data.f, n_output,
-                                            n_batch, output->data.f);
-    } else {
-      tensor_utils::ZeroVector(output->data.f, n_batch * n_output);
-    }
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        projection_weights->data.f, n_output, n_cell, output_gate_scratch,
-        n_batch, output->data.f, /*result_stride=*/1);
-    if (params->proj_clip > 0.0) {
-      tensor_utils::ClipVector(output->data.f, n_batch * n_output,
-                               params->proj_clip, output->data.f);
-    }
-  } else {
-    tensor_utils::CopyVector(output_gate_scratch, n_batch * n_output,
-                             output->data.f);
-  }
-  tensor_utils::CopyVector(output->data.f, n_batch * n_output,
-                           output_state->data.f);
+  // Check optional tensors; the respective pointers can be null.
+  const float* input_to_input_weights_ptr =
+      (use_cifg) ? nullptr : input_to_input_weights->data.f;
+  const float* recurrent_to_input_weights_ptr =
+      (use_cifg) ? nullptr : recurrent_to_input_weights->data.f;
+  const float* input_gate_bias_ptr =
+      (use_cifg) ? nullptr : input_gate_bias->data.f;
+  const float* cell_to_input_weights_ptr =
+      (use_peephole && !use_cifg) ? cell_to_input_weights->data.f : nullptr;
+  const float* cell_to_forget_weights_ptr =
+      (use_peephole) ? cell_to_forget_weights->data.f : nullptr;
+  const float* cell_to_output_weights_ptr =
+      (use_peephole) ? cell_to_output_weights->data.f : nullptr;
+  const float* projection_weights_ptr =
+      (projection_weights == nullptr) ? nullptr : projection_weights->data.f;
+  const float* projection_bias_ptr =
+      (projection_bias == nullptr) ? nullptr : projection_bias->data.f;
+
+  // Required tensors, pointers are non-null.
+  const float* input_ptr_batch = input->data.f;
+  const float* input_to_forget_weights_ptr = input_to_forget_weights->data.f;
+  const float* input_to_cell_weights_ptr = input_to_cell_weights->data.f;
+  const float* input_to_output_weights_ptr = input_to_output_weights->data.f;
+  const float* recurrent_to_forget_weights_ptr =
+      recurrent_to_forget_weights->data.f;
+  const float* recurrent_to_cell_weights_ptr =
+      recurrent_to_cell_weights->data.f;
+  const float* recurrent_to_output_weights_ptr =
+      recurrent_to_output_weights->data.f;
+  const float* forget_gate_bias_ptr = forget_gate_bias->data.f;
+  const float* cell_bias_ptr = cell_bias->data.f;
+  const float* output_gate_bias_ptr = output_gate_bias->data.f;
+
+  float* output_state_ptr = output_state->data.f;
+  float* cell_state_ptr = cell_state->data.f;
+  float* output_ptr_batch = output->data.f;
+
+  kernel_utils::LstmStep(
+      input_ptr_batch, input_to_input_weights_ptr, input_to_forget_weights_ptr,
+      input_to_cell_weights_ptr, input_to_output_weights_ptr,
+      recurrent_to_input_weights_ptr, recurrent_to_forget_weights_ptr,
+      recurrent_to_cell_weights_ptr, recurrent_to_output_weights_ptr,
+      cell_to_input_weights_ptr, cell_to_forget_weights_ptr,
+      cell_to_output_weights_ptr, input_gate_bias_ptr, forget_gate_bias_ptr,
+      cell_bias_ptr, output_gate_bias_ptr, projection_weights_ptr,
+      projection_bias_ptr, params, n_batch, n_cell, n_input, n_output,
+      output_state_ptr, cell_state_ptr, input_gate_scratch, forget_gate_scratch,
+      cell_scratch, output_gate_scratch, output_ptr_batch);
 
   return kTfLiteOk;
 }
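
Reviewer note on the refactoring above: the inline code removed from Eval treated a missing input gate (CIFG) by coupling it to the forget gate, and the new call passes null pointers for the same case. The scalar sketch below mirrors only the cell-update step of the removed inline code. `UpdateCell` is a hypothetical name, and whether `kernel_utils::LstmStep` is implemented this way is an assumption, since its body is not part of this section of the diff.

```cpp
#include <cassert>

// Hypothetical scalar sketch of the cell update from the removed inline code.
// forget_gate and *input_gate are assumed to be post-sigmoid values, and
// cell_candidate is assumed to be post-activation.
// A null input_gate stands for CIFG: the input gate becomes 1 - forget_gate.
float UpdateCell(float cell_state, float forget_gate, float cell_candidate,
                 const float* input_gate) {
  assert(forget_gate >= 0.0f && forget_gate <= 1.0f);
  const float in_gate =
      (input_gate == nullptr) ? 1.0f - forget_gate : *input_gate;
  // Same order as the removed code: scale the old state, then accumulate.
  return forget_gate * cell_state + cell_candidate * in_gate;
}
```

For example, with cell_state 2.0, forget_gate 0.25 and cell_candidate 0.5, the CIFG path gives 0.5 + 0.5 * 0.75 = 0.875, while an explicit input gate of 0.5 gives 0.5 + 0.25 = 0.75. Cell clipping (params->cell_clip) is left out of the sketch.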
diff --git a/tensorflow/contrib/lite/kernels/mean_test.cc b/tensorflow/contrib/lite/kernels/mean_test.cc
index c4c53c2ded351849e7c458fc754c36395a25ebd0..2d6d4bc2da4b75289ee27c3f2a12787216716d44 100644
--- a/tensorflow/contrib/lite/kernels/mean_test.cc
+++ b/tensorflow/contrib/lite/kernels/mean_test.cc
@@ -74,7 +74,7 @@ class MeanOpDynamicModel : public BaseMeanOpModel {
   }
 };
 
-TEST(ConstMeanOpTest, NotKeepDims) {
+TEST(ConstFloatMeanOpTest, NotKeepDims) {
   std::initializer_list<float> data = {
       1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
       13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
@@ -86,7 +86,7 @@ TEST(ConstMeanOpTest, NotKeepDims) {
   EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({12, 13})));
 }
 
-TEST(ConstMeanOpTest, KeepDims) {
+TEST(ConstFloatMeanOpTest, KeepDims) {
   std::initializer_list<float> data = {
       1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
       13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
@@ -99,7 +99,7 @@ TEST(ConstMeanOpTest, KeepDims) {
               ElementsAreArray(ArrayFloatNear({10.5, 12.5, 14.5})));
 }
 
-TEST(DynamicMeanOpTest, NotKeepDims) {
+TEST(DynamicFloatMeanOpTest, NotKeepDims) {
   std::initializer_list<float> data = {
       1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
       13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
@@ -114,7 +114,7 @@ TEST(DynamicMeanOpTest, NotKeepDims) {
   EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({12, 13})));
 }
 
-TEST(DynamicMeanOpTest, KeepDims) {
+TEST(DynamicFloatMeanOpTest, KeepDims) {
   std::initializer_list<float> data = {
       1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0,  10.0, 11.0, 12.0,
       13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0};
@@ -130,6 +130,70 @@ TEST(DynamicMeanOpTest, KeepDims) {
               ElementsAreArray(ArrayFloatNear({10.5, 12.5, 14.5})));
 }
 
+TEST(DynamicFloatMeanOpTest, Scale) {
+  std::initializer_list<float> data = {9.527};
+  MeanOpDynamicModel m({TensorType_FLOAT32, {1}}, {TensorType_FLOAT32, {1}},
+                       {TensorType_INT32, {1}}, true);
+  std::initializer_list<int> axis = {0};
+  m.SetAxis(axis);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1}));
+  EXPECT_THAT(m.GetOutput<float>(), ElementsAreArray(ArrayFloatNear({9.527})));
+}
+
+TEST(ConstUint8MeanOpTest, NotKeepDims) {
+  std::initializer_list<uint8_t> data = {1,  2,  3,  4,  5,  6,  7,  8,
+                                         9,  10, 11, 12, 13, 14, 15, 16,
+                                         17, 18, 19, 20, 21, 22, 23, 24};
+  MeanOpConstModel m({TensorType_UINT8, {4, 3, 2}}, {TensorType_UINT8, {2}},
+                     {4}, {1, 0, -3, -3}, false);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2}));
+  EXPECT_THAT(m.GetOutput<uint8_t>(), ElementsAreArray({12, 13}));
+}
+
+TEST(ConstUint8MeanOpTest, KeepDims) {
+  std::initializer_list<uint8_t> data = {1,  2,  3,  4,  5,  6,  7,  8,
+                                         9,  10, 11, 12, 13, 14, 15, 16,
+                                         17, 18, 19, 20, 21, 22, 23, 24};
+  MeanOpConstModel m({TensorType_UINT8, {4, 3, 2}}, {TensorType_UINT8, {3}},
+                     {2}, {0, 2}, true);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 3, 1}));
+  EXPECT_THAT(m.GetOutput<uint8_t>(), ElementsAreArray({10, 12, 14}));
+}
+
+TEST(DynamicUint8MeanOpTest, NotKeepDims) {
+  std::initializer_list<uint8_t> data = {1,  2,  3,  4,  5,  6,  7,  8,
+                                         9,  10, 11, 12, 13, 14, 15, 16,
+                                         17, 18, 19, 20, 21, 22, 23, 24};
+  MeanOpDynamicModel m({TensorType_UINT8, {4, 3, 2}}, {TensorType_UINT8, {2}},
+                       {TensorType_INT32, {4}}, false);
+  std::initializer_list<int> axis = {1, 0, -3, -3};
+  m.SetAxis(axis);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({2}));
+  EXPECT_THAT(m.GetOutput<uint8_t>(), ElementsAreArray({12, 13}));
+}
+
+TEST(DynamicUint8MeanOpTest, KeepDims) {
+  std::initializer_list<uint8_t> data = {1,  2,  3,  4,  5,  6,  7,  8,
+                                         9,  10, 11, 12, 13, 14, 15, 16,
+                                         17, 18, 19, 20, 21, 22, 23, 24};
+  MeanOpDynamicModel m({TensorType_UINT8, {4, 3, 2}}, {TensorType_UINT8, {3}},
+                       {TensorType_INT32, {2}}, true);
+  std::initializer_list<int> axis = {0, 2};
+  m.SetAxis(axis);
+  m.SetInput(data);
+  m.Invoke();
+  EXPECT_THAT(m.GetOutputShape(), ElementsAreArray({1, 3, 1}));
+  EXPECT_THAT(m.GetOutput<uint8_t>(), ElementsAreArray({10, 12, 14}));
+}
+
 }  // namespace
 }  // namespace tflite
 
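
Reviewer note on the uint8 expectations above: the float KeepDims tests expect {10.5, 12.5, 14.5}, while the uint8 variants expect {10, 12, 14}. That is what a truncating integer division of the per-slice sums would produce. The sketch below reproduces the numbers under that assumption; `MeanOverAxes0And2` is a hypothetical helper, and the rounding behaviour of the real kernel is not shown in this section of the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: mean over axes 0 and 2 of a row-major [d0, d1, d2]
// uint8 tensor, using truncating integer division (an assumption).
std::vector<uint8_t> MeanOverAxes0And2(const std::vector<uint8_t>& in, int d0,
                                       int d1, int d2) {
  assert(in.size() == static_cast<std::size_t>(d0) * d1 * d2);
  std::vector<uint8_t> out(d1);
  for (int j = 0; j < d1; ++j) {
    int sum = 0;
    for (int i = 0; i < d0; ++i) {
      for (int k = 0; k < d2; ++k) {
        sum += in[(i * d1 + j) * d2 + k];  // row-major offset of (i, j, k)
      }
    }
    out[j] = static_cast<uint8_t>(sum / (d0 * d2));
  }
  return out;
}
```

With the values 1..24 in a {4, 3, 2} tensor, the three sums are 84, 100 and 116 over 8 elements each, so the truncated means are 10, 12 and 14.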
diff --git a/tensorflow/contrib/lite/kernels/register.cc b/tensorflow/contrib/lite/kernels/register.cc
index edc4e26edbd44784f8604e7da32156a8e695d2e2..aea6f8d9d34420363cc1045425f3d27b12af449e 100644
--- a/tensorflow/contrib/lite/kernels/register.cc
+++ b/tensorflow/contrib/lite/kernels/register.cc
@@ -49,6 +49,7 @@ TfLiteRegistration* Register_MUL();
 TfLiteRegistration* Register_L2_NORMALIZATION();
 TfLiteRegistration* Register_LOCAL_RESPONSE_NORMALIZATION();
 TfLiteRegistration* Register_LSTM();
+TfLiteRegistration* Register_BIDIRECTIONAL_SEQUENCE_LSTM();
 TfLiteRegistration* Register_UNIDIRECTIONAL_SEQUENCE_LSTM();
 TfLiteRegistration* Register_PAD();
 TfLiteRegistration* Register_RESHAPE();
@@ -63,6 +64,7 @@ TfLiteRegistration* Register_SQUEEZE();
 TfLiteRegistration* Register_STRIDED_SLICE();
 TfLiteRegistration* Register_EXP();
 TfLiteRegistration* Register_TOPK_V2();
+TfLiteRegistration* Register_LOG_SOFTMAX();
 
 BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_RELU, Register_RELU());
@@ -97,6 +99,8 @@ BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_LOCAL_RESPONSE_NORMALIZATION,
              Register_LOCAL_RESPONSE_NORMALIZATION());
   AddBuiltin(BuiltinOperator_LSTM, Register_LSTM());
+  AddBuiltin(BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM,
+             Register_BIDIRECTIONAL_SEQUENCE_LSTM());
   AddBuiltin(BuiltinOperator_UNIDIRECTIONAL_SEQUENCE_LSTM,
              Register_UNIDIRECTIONAL_SEQUENCE_LSTM());
   AddBuiltin(BuiltinOperator_PAD, Register_PAD());
@@ -114,6 +118,7 @@ BuiltinOpResolver::BuiltinOpResolver() {
   AddBuiltin(BuiltinOperator_STRIDED_SLICE, Register_STRIDED_SLICE());
   AddBuiltin(BuiltinOperator_EXP, Register_EXP());
   AddBuiltin(BuiltinOperator_TOPK_V2, Register_TOPK_V2());
+  AddBuiltin(BuiltinOperator_LOG_SOFTMAX, Register_LOG_SOFTMAX());
 }
 
 TfLiteRegistration* BuiltinOpResolver::FindOp(
diff --git a/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc b/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
index 9cdb58714edb5fee771fc45f3c53a570f8fb28d1..508a570e2e5fd52cfad28baf45824ef25061e13f 100644
--- a/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
+++ b/tensorflow/contrib/lite/kernels/unidirectional_sequence_lstm.cc
@@ -24,6 +24,7 @@ limitations under the License.
 #include "tensorflow/contrib/lite/builtin_op_data.h"
 #include "tensorflow/contrib/lite/context.h"
 #include "tensorflow/contrib/lite/kernels/activation_functor.h"
+#include "tensorflow/contrib/lite/kernels/internal/kernel_utils.h"
 #include "tensorflow/contrib/lite/kernels/internal/tensor_utils.h"
 #include "tensorflow/contrib/lite/kernels/kernel_util.h"
 #include "tensorflow/contrib/lite/kernels/op_macros.h"
@@ -380,135 +381,57 @@ TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
     output_gate_scratch = scratch_buffer->data.f + 3 * n_cell * n_batch;
   }
 
+  // Check optional tensors; the respective pointers can be null.
+  const float* input_to_input_weights_ptr =
+      (use_cifg) ? nullptr : input_to_input_weights->data.f;
+  const float* recurrent_to_input_weights_ptr =
+      (use_cifg) ? nullptr : recurrent_to_input_weights->data.f;
+  const float* input_gate_bias_ptr =
+      (use_cifg) ? nullptr : input_gate_bias->data.f;
+  const float* cell_to_input_weights_ptr =
+      (use_peephole && !use_cifg) ? cell_to_input_weights->data.f : nullptr;
+  const float* cell_to_forget_weights_ptr =
+      (use_peephole) ? cell_to_forget_weights->data.f : nullptr;
+  const float* cell_to_output_weights_ptr =
+      (use_peephole) ? cell_to_output_weights->data.f : nullptr;
+  const float* projection_weights_ptr =
+      (projection_weights == nullptr) ? nullptr : projection_weights->data.f;
+  const float* projection_bias_ptr =
+      (projection_bias == nullptr) ? nullptr : projection_bias->data.f;
+
+  // Required tensors, pointers are non-null.
+  const float* input_to_forget_weights_ptr = input_to_forget_weights->data.f;
+  const float* input_to_cell_weights_ptr = input_to_cell_weights->data.f;
+  const float* input_to_output_weights_ptr = input_to_output_weights->data.f;
+  const float* recurrent_to_forget_weights_ptr =
+      recurrent_to_forget_weights->data.f;
+  const float* recurrent_to_cell_weights_ptr =
+      recurrent_to_cell_weights->data.f;
+  const float* recurrent_to_output_weights_ptr =
+      recurrent_to_output_weights->data.f;
+  const float* forget_gate_bias_ptr = forget_gate_bias->data.f;
+  const float* cell_bias_ptr = cell_bias->data.f;
+  const float* output_gate_bias_ptr = output_gate_bias->data.f;
+
+  float* output_state_ptr = output_state->data.f;
+  float* cell_state_ptr = cell_state->data.f;
+
   for (int t = 0; t < max_time; t++) {
-    const float* input_ptr_time = input->data.f + t * n_batch * n_input;
-    // Initialize scratch buffers with bias.
-    if (!use_cifg) {
-      tensor_utils::VectorBatchVectorAssign(input_gate_bias->data.f, n_cell,
-                                            n_batch, input_gate_scratch);
-    }
-    tensor_utils::VectorBatchVectorAssign(forget_gate_bias->data.f, n_cell,
-                                          n_batch, forget_gate_scratch);
-    tensor_utils::VectorBatchVectorAssign(cell_bias->data.f, n_cell, n_batch,
-                                          cell_scratch);
-    tensor_utils::VectorBatchVectorAssign(output_gate_bias->data.f, n_cell,
-                                          n_batch, output_gate_scratch);
-
-    // For each batch and cell: compute input_weight * input.
-    if (!use_cifg) {
-      tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-          input_to_input_weights->data.f, n_cell, n_input, input_ptr_time,
-          n_batch, input_gate_scratch, /*result_stride=*/1);
-    }
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        input_to_forget_weights->data.f, n_cell, n_input, input_ptr_time,
-        n_batch, forget_gate_scratch, /*result_stride=*/1);
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        input_to_cell_weights->data.f, n_cell, n_input, input_ptr_time, n_batch,
-        cell_scratch, /*result_stride=*/1);
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        input_to_output_weights->data.f, n_cell, n_input, input_ptr_time,
-        n_batch, output_gate_scratch, /*result_stride=*/1);
-
-    // For each batch and cell: compute recurrent_weight * output_state.
-    if (!use_cifg) {
-      tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-          recurrent_to_input_weights->data.f, n_cell, n_output,
-          output_state->data.f, n_batch, input_gate_scratch,
-          /*result_stride=*/1);
-    }
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        recurrent_to_forget_weights->data.f, n_cell, n_output,
-        output_state->data.f, n_batch, forget_gate_scratch,
-        /*result_stride=*/1);
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        recurrent_to_cell_weights->data.f, n_cell, n_output,
-        output_state->data.f, n_batch, cell_scratch, /*result_stride=*/1);
-    tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-        recurrent_to_output_weights->data.f, n_cell, n_output,
-        output_state->data.f, n_batch, output_gate_scratch,
-        /*result_stride=*/1);
-
-    // For each batch and cell: update input gate.
-    if (!use_cifg) {
-      if (use_peephole) {
-        tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-            cell_to_input_weights->data.f, n_cell, cell_state->data.f, n_batch,
-            input_gate_scratch);
-      }
-      tensor_utils::ApplySigmoidToVector(input_gate_scratch, n_cell * n_batch,
-                                         input_gate_scratch);
-    }
-
-    // For each batch and cell: update forget gate.
-    if (use_peephole) {
-      tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-          cell_to_forget_weights->data.f, n_cell, cell_state->data.f, n_batch,
-          forget_gate_scratch);
-    }
-    tensor_utils::ApplySigmoidToVector(forget_gate_scratch, n_cell * n_batch,
-                                       forget_gate_scratch);
-
-    // For each batch and cell: update the cell.
-    tensor_utils::VectorVectorCwiseProduct(forget_gate_scratch,
-                                           cell_state->data.f, n_batch * n_cell,
-                                           cell_state->data.f);
-    tensor_utils::ApplyActivationToVector(cell_scratch, n_batch * n_cell,
-                                          params->activation, cell_scratch);
-    if (use_cifg) {
-      tensor_utils::Sub1Vector(forget_gate_scratch, n_batch * n_cell,
-                               forget_gate_scratch);
-      tensor_utils::VectorVectorCwiseProductAccumulate(
-          cell_scratch, forget_gate_scratch, n_batch * n_cell,
-          cell_state->data.f);
-    } else {
-      tensor_utils::VectorVectorCwiseProductAccumulate(
-          cell_scratch, input_gate_scratch, n_batch * n_cell,
-          cell_state->data.f);
-    }
-    if (params->cell_clip > 0.0) {
-      tensor_utils::ClipVector(cell_state->data.f, n_batch * n_cell,
-                               params->cell_clip, cell_state->data.f);
-    }
-
-    // For each batch and cell: update the output gate.
-    if (use_peephole) {
-      tensor_utils::VectorBatchVectorCwiseProductAccumulate(
-          cell_to_output_weights->data.f, n_cell, cell_state->data.f, n_batch,
-          output_gate_scratch);
-    }
-    tensor_utils::ApplySigmoidToVector(output_gate_scratch, n_batch * n_cell,
-                                       output_gate_scratch);
-    tensor_utils::ApplyActivationToVector(cell_state->data.f, n_batch * n_cell,
-                                          params->activation, cell_scratch);
-    tensor_utils::VectorVectorCwiseProduct(output_gate_scratch, cell_scratch,
-                                           n_batch * n_cell,
-                                           output_gate_scratch);
-
-    // For each batch: update the projection and output_state.
-    const bool use_projection_weight = (projection_weights != nullptr);
-    const bool use_projection_bias = (projection_bias != nullptr);
-    float* output_ptr_time = output->data.f + t * n_batch * n_output;
-    if (use_projection_weight) {
-      if (use_projection_bias) {
-        tensor_utils::VectorBatchVectorAssign(projection_bias->data.f, n_output,
-                                              n_batch, output_ptr_time);
-      } else {
-        tensor_utils::ZeroVector(output_ptr_time, n_batch * n_output);
-      }
-      tensor_utils::MatrixBatchVectorMultiplyAccumulate(
-          projection_weights->data.f, n_output, n_cell, output_gate_scratch,
-          n_batch, output_ptr_time, /*result_stride=*/1);
-      if (params->proj_clip > 0.0) {
-        tensor_utils::ClipVector(output_ptr_time, n_batch * n_output,
-                                 params->proj_clip, output_ptr_time);
-      }
-    } else {
-      tensor_utils::CopyVector(output_gate_scratch, n_batch * n_output,
-                               output_ptr_time);
-    }
-    tensor_utils::CopyVector(output_ptr_time, n_batch * n_output,
-                             output_state->data.f);
+    const float* input_ptr_batch = input->data.f + t * n_batch * n_input;
+    float* output_ptr_batch = output->data.f + t * n_batch * n_output;
+
+    kernel_utils::LstmStep(
+        input_ptr_batch, input_to_input_weights_ptr,
+        input_to_forget_weights_ptr, input_to_cell_weights_ptr,
+        input_to_output_weights_ptr, recurrent_to_input_weights_ptr,
+        recurrent_to_forget_weights_ptr, recurrent_to_cell_weights_ptr,
+        recurrent_to_output_weights_ptr, cell_to_input_weights_ptr,
+        cell_to_forget_weights_ptr, cell_to_output_weights_ptr,
+        input_gate_bias_ptr, forget_gate_bias_ptr, cell_bias_ptr,
+        output_gate_bias_ptr, projection_weights_ptr, projection_bias_ptr,
+        params, n_batch, n_cell, n_input, n_output, output_state_ptr,
+        cell_state_ptr, input_gate_scratch, forget_gate_scratch, cell_scratch,
+        output_gate_scratch, output_ptr_batch);
   }
   return kTfLiteOk;
 }
diff --git a/tensorflow/contrib/lite/model.cc b/tensorflow/contrib/lite/model.cc
index d6522fc077d03bb49fe54b7a04fa6341ccf4cf3a..725f2838c574fcc2ba389401f92575279ebc144c 100644
--- a/tensorflow/contrib/lite/model.cc
+++ b/tensorflow/contrib/lite/model.cc
@@ -124,14 +124,20 @@ TfLiteStatus InterpreterBuilder::BuildLocalIndexToRegistrationMapping() {
   auto opcodes = model_->operator_codes();
   for (const OperatorCode* opcode : *opcodes) {
     TfLiteRegistration* registration = nullptr;
-
-    if (opcode->builtin_code() != BuiltinOperator_CUSTOM) {
-      auto x = opcode->builtin_code();
-      flatbuffer_op_index_to_registration_types_.push_back(x);
-      registration = op_resolver_.FindOp(x);
+    auto builtin_code = opcode->builtin_code();
+    if (builtin_code > BuiltinOperator_MAX ||
+        builtin_code < BuiltinOperator_MIN) {
+      error_reporter_->Report(
+          "Op builtin_code out of range: %d. Are you using an old TFLite "
+          "binary with a newer model?",
+          builtin_code);
+      status = kTfLiteError;
+    } else if (builtin_code != BuiltinOperator_CUSTOM) {
+      flatbuffer_op_index_to_registration_types_.push_back(builtin_code);
+      registration = op_resolver_.FindOp(builtin_code);
       if (registration == nullptr) {
         error_reporter_->Report("Didn't find op for builtin opcode '%s'\n",
-                                EnumNameBuiltinOperator(x));
+                                EnumNameBuiltinOperator(builtin_code));
         status = kTfLiteError;
       }
     } else if (!opcode->custom_code()) {
@@ -280,6 +286,7 @@ void* ParseOpData(const Operator* op, BuiltinOperator op_type,
     case BuiltinOperator_CONCAT_EMBEDDINGS:
     case BuiltinOperator_EXP:
     case BuiltinOperator_TOPK_V2:
+    case BuiltinOperator_LOG_SOFTMAX:
       break;
     case BuiltinOperator_LSH_PROJECTION: {
       TfLiteLSHProjectionParams* params =
@@ -455,6 +462,7 @@ void* ParseOpData(const Operator* op, BuiltinOperator op_type,
       builtin_data = reinterpret_cast<void*>(params);
       break;
     }
+    case BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM:
     case BuiltinOperator_UNIDIRECTIONAL_SEQUENCE_LSTM:
     case BuiltinOperator_LSTM: {
       TfLiteLSTMParams* params = MallocPOD<TfLiteLSTMParams>();
@@ -566,6 +574,11 @@ void* ParseOpData(const Operator* op, BuiltinOperator op_type,
       builtin_data = reinterpret_cast<void*>(params);
       break;
     }
+    case BuiltinOperator_DELEGATE: {
+      // TODO(ycling): Revisit when supporting saving delegated models.
+      error_reporter->Report("DELEGATE op shouldn't exist in model.");
+      break;
+    }
   }
   return builtin_data;
 }
@@ -779,6 +792,8 @@ TfLiteStatus InterpreterBuilder::operator()(
     return cleanup_and_error();
   }
 
+  (**interpreter).set_model(model_);
+
   // Parse inputs/outputs
   (**interpreter).SetInputs(FlatBufferIntArrayToVector(subgraph->inputs()));
   (**interpreter).SetOutputs(FlatBufferIntArrayToVector(subgraph->outputs()));
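
Reviewer note on the builtin_code check added in BuildLocalIndexToRegistrationMapping above: the code is validated against the enum bounds before it is looked up, so a model written against a newer schema fails with a clear error rather than an unknown enum value reaching the resolver. A minimal sketch of that three-way decision follows. The `ToyBuiltinCode` enum, `OpKind` and `Classify` are hypothetical stand-ins, not the generated schema types.

```cpp
#include <cassert>

// Hypothetical stand-in for the generated BuiltinOperator enum and its
// _MIN/_MAX bounds; the real values come from the TFLite schema.
enum ToyBuiltinCode {
  kToyMin = 0,
  kToyAdd = 0,
  kToyCustom = 1,
  kToyLogSoftmax = 2,
  kToyMax = 2
};

enum class OpKind { kOutOfRange, kBuiltin, kCustom };

// Same order of checks as the diff: range first, then builtin vs. custom.
OpKind Classify(int code) {
  if (code > kToyMax || code < kToyMin) return OpKind::kOutOfRange;
  if (code != kToyCustom) return OpKind::kBuiltin;
  return OpKind::kCustom;
}

// Usage examples for the sketch.
void ClassifyExamples() {
  assert(Classify(99) == OpKind::kOutOfRange);  // newer model, older binary
  assert(Classify(kToyLogSoftmax) == OpKind::kBuiltin);
  assert(Classify(kToyCustom) == OpKind::kCustom);
}
```

Only the builtin branch would go on to a resolver lookup; the out-of-range branch corresponds to the new error report.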
diff --git a/tensorflow/contrib/lite/models/speech_test.cc b/tensorflow/contrib/lite/models/speech_test.cc
index daa8c3100b64e9290256aa14a6ab641f19174a0a..a354179a9480c136d65f83836d81f69c2089fdbe 100644
--- a/tensorflow/contrib/lite/models/speech_test.cc
+++ b/tensorflow/contrib/lite/models/speech_test.cc
@@ -97,7 +97,12 @@ bool ConvertCsvData(const string& model_name, const string& in_name,
   return true;
 }
 
-TEST(SpeechTest, HotwordOkGoogleRank1Test) {
+class SpeechTest : public ::testing::TestWithParam<int> {
+ protected:
+  int GetMaxInvocations() { return GetParam(); }
+};
+
+TEST_P(SpeechTest, HotwordOkGoogleRank1Test) {
   std::stringstream os;
   ASSERT_TRUE(ConvertCsvData(
       "speech_hotword_model_rank1.tflite", "speech_hotword_model_in.csv",
@@ -105,11 +110,11 @@ TEST(SpeechTest, HotwordOkGoogleRank1Test) {
       /*output_tensor=*/"18", /*persistent_tensors=*/"4",
       /*sequence_size=*/40, &os));
   testing::TfLiteDriver test_driver(/*use_nnapi=*/false);
-  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver))
+  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver, GetMaxInvocations()))
       << test_driver.GetErrorMessage();
 }
 
-TEST(SpeechTest, HotwordOkGoogleRank2Test) {
+TEST_P(SpeechTest, HotwordOkGoogleRank2Test) {
   std::stringstream os;
   ASSERT_TRUE(ConvertCsvData(
       "speech_hotword_model_rank2.tflite", "speech_hotword_model_in.csv",
@@ -117,11 +122,11 @@ TEST(SpeechTest, HotwordOkGoogleRank2Test) {
       /*output_tensor=*/"18", /*persistent_tensors=*/"1",
       /*sequence_size=*/40, &os));
   testing::TfLiteDriver test_driver(/*use_nnapi=*/false);
-  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver))
+  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver, GetMaxInvocations()))
       << test_driver.GetErrorMessage();
 }
 
-TEST(SpeechTest, SpeakerIdOkGoogleTest) {
+TEST_P(SpeechTest, SpeakerIdOkGoogleTest) {
   std::stringstream os;
   ASSERT_TRUE(ConvertCsvData(
       "speech_speakerid_model.tflite", "speech_speakerid_model_in.csv",
@@ -130,11 +135,11 @@ TEST(SpeechTest, SpeakerIdOkGoogleTest) {
       /*persistent_tensors=*/"19,20,40,41,61,62",
       /*sequence_size=*/80, &os));
   testing::TfLiteDriver test_driver(/*use_nnapi=*/false);
-  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver))
+  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver, GetMaxInvocations()))
       << test_driver.GetErrorMessage();
 }
 
-TEST(SpeechTest, AsrAmTest) {
+TEST_P(SpeechTest, AsrAmTest) {
   std::stringstream os;
   ASSERT_TRUE(
       ConvertCsvData("speech_asr_am_model.tflite", "speech_asr_am_model_in.csv",
@@ -143,7 +148,7 @@ TEST(SpeechTest, AsrAmTest) {
                      /*persistent_tensors=*/"19,20,40,41,61,62,82,83,103,104",
                      /*sequence_size=*/320, &os));
   testing::TfLiteDriver test_driver(/*use_nnapi=*/false);
-  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver))
+  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver, GetMaxInvocations()))
       << test_driver.GetErrorMessage();
 }
 
@@ -151,15 +156,16 @@ TEST(SpeechTest, AsrAmTest) {
 // through the interpreter and stored the sum of all the output, which was them
 // compared for correctness. In this test we are comparing all the intermediate
 // results.
-TEST(SpeechTest, AsrLmTest) {
+TEST_P(SpeechTest, AsrLmTest) {
   std::ifstream in_file;
   testing::TfLiteDriver test_driver(/*use_nnapi=*/false);
   ASSERT_TRUE(Init("speech_asr_lm_model.test_spec", &test_driver, &in_file));
-  ASSERT_TRUE(testing::ParseAndRunTests(&in_file, &test_driver))
+  ASSERT_TRUE(
+      testing::ParseAndRunTests(&in_file, &test_driver, GetMaxInvocations()))
       << test_driver.GetErrorMessage();
 }
 
-TEST(SpeechTest, EndpointerTest) {
+TEST_P(SpeechTest, EndpointerTest) {
   std::stringstream os;
   ASSERT_TRUE(ConvertCsvData(
       "speech_endpointer_model.tflite", "speech_endpointer_model_in.csv",
@@ -168,11 +174,11 @@ TEST(SpeechTest, EndpointerTest) {
       /*persistent_tensors=*/"28,29,49,50",
       /*sequence_size=*/320, &os));
   testing::TfLiteDriver test_driver(/*use_nnapi=*/false);
-  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver))
+  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver, GetMaxInvocations()))
       << test_driver.GetErrorMessage();
 }
 
-TEST(SpeechTest, TtsTest) {
+TEST_P(SpeechTest, TtsTest) {
   std::stringstream os;
   ASSERT_TRUE(ConvertCsvData("speech_tts_model.tflite",
                              "speech_tts_model_in.csv",
@@ -181,9 +187,19 @@ TEST(SpeechTest, TtsTest) {
                              /*persistent_tensors=*/"25,26,46,47,67,68,73",
                              /*sequence_size=*/334, &os));
   testing::TfLiteDriver test_driver(/*use_nnapi=*/false);
-  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver))
+  ASSERT_TRUE(testing::ParseAndRunTests(&os, &test_driver, GetMaxInvocations()))
       << test_driver.GetErrorMessage();
 }
 
+// Define two instantiations. The "ShortTests" instantiation is used when
+// running the tests on Android, in order to prevent timeouts (It takes about
+// 200s just to bring up the Android emulator.)
+static const int kAllInvocations = -1;
+static const int kFirstFewInvocations = 10;
+INSTANTIATE_TEST_CASE_P(LongTests, SpeechTest,
+                        ::testing::Values(kAllInvocations));
+INSTANTIATE_TEST_CASE_P(ShortTests, SpeechTest,
+                        ::testing::Values(kFirstFewInvocations));
+
 }  // namespace
 }  // namespace tflite
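Reviewer note: the new third argument to `testing::ParseAndRunTests` is visible in this patch only at its call sites. Below is a minimal sketch of the capping behavior these tests appear to rely on, assuming a negative value (as in `kAllInvocations = -1`) means "no limit" and a non-negative value stops after that many invocations. `run_invocations` is a hypothetical name, and the sketch is in Python rather than the C++ of the real helper.

```python
def run_invocations(invocations, max_invocations=-1):
    """Run callables in order, stopping once the cap is reached.

    Assumption: a negative cap means "run everything", mirroring
    kAllInvocations = -1 in the test above.
    """
    results = []
    for index, invoke in enumerate(invocations):
        # Only a non-negative cap can stop the loop early.
        if 0 <= max_invocations <= index:
            break
        results.append(invoke())
    return results


# Twenty stand-in invocations; each returns the square of its index.
steps = [lambda i=i: i * i for i in range(20)]
all_results = run_invocations(steps, -1)
short_results = run_invocations(steps, 10)
print(len(all_results), len(short_results))
```

Under that assumption, the "ShortTests" instantiation would exercise only the first ten invocations of each model, which keeps emulator runs short at the cost of coverage.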
diff --git a/tensorflow/contrib/lite/nnapi_delegate.cc b/tensorflow/contrib/lite/nnapi_delegate.cc
index 02e8499f61c6a3d5fceb978aa0e63a4ee90cf19a..e631ffd845d3b31232070b935c12aa8a2e8ce05e 100644
--- a/tensorflow/contrib/lite/nnapi_delegate.cc
+++ b/tensorflow/contrib/lite/nnapi_delegate.cc
@@ -323,6 +323,7 @@ void AddOpsAndParams(tflite::Interpreter* interpreter,
       case tflite::BuiltinOperator_UNIDIRECTIONAL_SEQUENCE_RNN:
       case tflite::BuiltinOperator_EMBEDDING_LOOKUP:
       case tflite::BuiltinOperator_EMBEDDING_LOOKUP_SPARSE:
+      case tflite::BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM:
       case tflite::BuiltinOperator_UNIDIRECTIONAL_SEQUENCE_LSTM:
       case tflite::BuiltinOperator_L2_NORMALIZATION:
       case tflite::BuiltinOperator_LOCAL_RESPONSE_NORMALIZATION:
@@ -344,6 +345,8 @@ void AddOpsAndParams(tflite::Interpreter* interpreter,
       case tflite::BuiltinOperator_SQUEEZE:
       case tflite::BuiltinOperator_STRIDED_SLICE:
       case tflite::BuiltinOperator_EXP:
+      case tflite::BuiltinOperator_LOG_SOFTMAX:
+      case tflite::BuiltinOperator_DELEGATE:
         FATAL("Op code %d is currently not delegated to NNAPI", builtin);
         nn_op_type = -1;  // set to invalid
         break;
diff --git a/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc b/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc
index b983d59d85955b241a22012b4e9adbeea346f80d..08bcfe451685f488be2c3bc180f2dfc43dfe4f05 100644
--- a/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc
+++ b/tensorflow/contrib/lite/schema/builtin_ops_header/generator.cc
@@ -45,6 +45,9 @@ limitations under the License.
 extern "C" {
 #endif  // __cplusplus
 
+// The enum for builtin operators.
+// Note: CUSTOM and DELEGATE are 2 special ops which are not real builtin
+// ops.
 typedef enum {
 )";
 
diff --git a/tensorflow/contrib/lite/schema/schema.fbs b/tensorflow/contrib/lite/schema/schema.fbs
index 75970b41267613058199c22a0fcb0c80a1c8f04f..98ac0469d1b885aa8047d35c8d814da4b61eff0c 100644
--- a/tensorflow/contrib/lite/schema/schema.fbs
+++ b/tensorflow/contrib/lite/schema/schema.fbs
@@ -123,6 +123,12 @@ enum BuiltinOperator : byte {
   EXP = 47,
   TOPK_V2 = 48,
   SPLIT = 49,
+  LOG_SOFTMAX = 50,
+  // DELEGATE is a special op type for the operations which are delegated to
+  // other backends.
+  // WARNING: Experimental interface, subject to change
+  DELEGATE = 51,
+  BIDIRECTIONAL_SEQUENCE_LSTM = 52,
 }
 
 // Options for the builtin operators.
@@ -162,6 +168,7 @@ union BuiltinOptions {
   ExpOptions,
   TopKV2Options,
   SplitOptions,
+  LogSoftmaxOptions,
 }
 
 enum Padding : byte { SAME, VALID }
@@ -364,6 +371,9 @@ table StridedSliceOptions {
   shrink_axis_mask: int;
 }
 
+table LogSoftmaxOptions {
+}
+
 // An OperatorCode can be an enum value (BuiltinOperator) if the operator is a
 // builtin, or a string if the operator is custom.
 table OperatorCode {
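Reviewer note: `LOG_SOFTMAX = 50` is added to the enum with an empty options table, so the op takes no parameters. For reference, here is a minimal sketch of the usual numerically stable formulation, `x - max(x) - log(sum(exp(x - max(x))))`, applied to a single row. This is the textbook definition, not the TFLite kernel, and `log_softmax` here is an illustrative helper.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of floats."""
    peak = max(row)
    # Subtracting the row maximum keeps exp() from overflowing.
    log_sum = math.log(sum(math.exp(x - peak) for x in row))
    return [x - peak - log_sum for x in row]


# Values chosen inside the [-100, 9] range used by the generated tests.
row = [-100.0, 0.0, 9.0]
out = log_softmax(row)
print(out)
```

Exponentiating the outputs should give probabilities that sum to 1, which is a convenient sanity check when comparing converter output against TensorFlow.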
diff --git a/tensorflow/contrib/lite/schema/schema_generated.h b/tensorflow/contrib/lite/schema/schema_generated.h
index 06989c7b61dc9904bff380e7f1cdc11097cb340d..99e1accaa71ffc92514595a745fcb60115ef61a0 100755
--- a/tensorflow/contrib/lite/schema/schema_generated.h
+++ b/tensorflow/contrib/lite/schema/schema_generated.h
@@ -136,6 +136,9 @@ struct SplitOptionsT;
 struct StridedSliceOptions;
 struct StridedSliceOptionsT;
 
+struct LogSoftmaxOptions;
+struct LogSoftmaxOptionsT;
+
 struct OperatorCode;
 struct OperatorCodeT;
 
@@ -240,11 +243,14 @@ enum BuiltinOperator {
   BuiltinOperator_EXP = 47,
   BuiltinOperator_TOPK_V2 = 48,
   BuiltinOperator_SPLIT = 49,
+  BuiltinOperator_LOG_SOFTMAX = 50,
+  BuiltinOperator_DELEGATE = 51,
+  BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM = 52,
   BuiltinOperator_MIN = BuiltinOperator_ADD,
-  BuiltinOperator_MAX = BuiltinOperator_SPLIT
+  BuiltinOperator_MAX = BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM
 };
 
-inline BuiltinOperator (&EnumValuesBuiltinOperator())[47] {
+inline BuiltinOperator (&EnumValuesBuiltinOperator())[50] {
   static BuiltinOperator values[] = {
     BuiltinOperator_ADD,
     BuiltinOperator_AVERAGE_POOL_2D,
@@ -292,7 +298,10 @@ inline BuiltinOperator (&EnumValuesBuiltinOperator())[47] {
     BuiltinOperator_BIDIRECTIONAL_SEQUENCE_RNN,
     BuiltinOperator_EXP,
     BuiltinOperator_TOPK_V2,
-    BuiltinOperator_SPLIT
+    BuiltinOperator_SPLIT,
+    BuiltinOperator_LOG_SOFTMAX,
+    BuiltinOperator_DELEGATE,
+    BuiltinOperator_BIDIRECTIONAL_SEQUENCE_LSTM
   };
   return values;
 }
@@ -349,6 +358,9 @@ inline const char **EnumNamesBuiltinOperator() {
     "EXP",
     "TOPK_V2",
     "SPLIT",
+    "LOG_SOFTMAX",
+    "DELEGATE",
+    "BIDIRECTIONAL_SEQUENCE_LSTM",
     nullptr
   };
   return names;
@@ -396,11 +408,12 @@ enum BuiltinOptions {
   BuiltinOptions_ExpOptions = 33,
   BuiltinOptions_TopKV2Options = 34,
   BuiltinOptions_SplitOptions = 35,
+  BuiltinOptions_LogSoftmaxOptions = 36,
   BuiltinOptions_MIN = BuiltinOptions_NONE,
-  BuiltinOptions_MAX = BuiltinOptions_SplitOptions
+  BuiltinOptions_MAX = BuiltinOptions_LogSoftmaxOptions
 };
 
-inline BuiltinOptions (&EnumValuesBuiltinOptions())[36] {
+inline BuiltinOptions (&EnumValuesBuiltinOptions())[37] {
   static BuiltinOptions values[] = {
     BuiltinOptions_NONE,
     BuiltinOptions_Conv2DOptions,
@@ -437,7 +450,8 @@ inline BuiltinOptions (&EnumValuesBuiltinOptions())[36] {
     BuiltinOptions_StridedSliceOptions,
     BuiltinOptions_ExpOptions,
     BuiltinOptions_TopKV2Options,
-    BuiltinOptions_SplitOptions
+    BuiltinOptions_SplitOptions,
+    BuiltinOptions_LogSoftmaxOptions
   };
   return values;
 }
@@ -480,6 +494,7 @@ inline const char **EnumNamesBuiltinOptions() {
     "ExpOptions",
     "TopKV2Options",
     "SplitOptions",
+    "LogSoftmaxOptions",
     nullptr
   };
   return names;
@@ -634,6 +649,10 @@ template<> struct BuiltinOptionsTraits<SplitOptions> {
   static const BuiltinOptions enum_value = BuiltinOptions_SplitOptions;
 };
 
+template<> struct BuiltinOptionsTraits<LogSoftmaxOptions> {
+  static const BuiltinOptions enum_value = BuiltinOptions_LogSoftmaxOptions;
+};
+
 struct BuiltinOptionsUnion {
   BuiltinOptions type;
   void *value;
@@ -945,6 +964,14 @@ struct BuiltinOptionsUnion {
     return type == BuiltinOptions_SplitOptions ?
       reinterpret_cast<const SplitOptionsT *>(value) : nullptr;
   }
+  LogSoftmaxOptionsT *AsLogSoftmaxOptions() {
+    return type == BuiltinOptions_LogSoftmaxOptions ?
+      reinterpret_cast<LogSoftmaxOptionsT *>(value) : nullptr;
+  }
+  const LogSoftmaxOptionsT *AsLogSoftmaxOptions() const {
+    return type == BuiltinOptions_LogSoftmaxOptions ?
+      reinterpret_cast<const LogSoftmaxOptionsT *>(value) : nullptr;
+  }
 };
 
 bool VerifyBuiltinOptions(flatbuffers::Verifier &verifier, const void *obj, BuiltinOptions type);
@@ -3568,6 +3595,46 @@ inline flatbuffers::Offset<StridedSliceOptions> CreateStridedSliceOptions(
 
 flatbuffers::Offset<StridedSliceOptions> CreateStridedSliceOptions(flatbuffers::FlatBufferBuilder &_fbb, const StridedSliceOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
 
+struct LogSoftmaxOptionsT : public flatbuffers::NativeTable {
+  typedef LogSoftmaxOptions TableType;
+  LogSoftmaxOptionsT() {
+  }
+};
+
+struct LogSoftmaxOptions FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
+  typedef LogSoftmaxOptionsT NativeTableType;
+  bool Verify(flatbuffers::Verifier &verifier) const {
+    return VerifyTableStart(verifier) &&
+           verifier.EndTable();
+  }
+  LogSoftmaxOptionsT *UnPack(const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  void UnPackTo(LogSoftmaxOptionsT *_o, const flatbuffers::resolver_function_t *_resolver = nullptr) const;
+  static flatbuffers::Offset<LogSoftmaxOptions> Pack(flatbuffers::FlatBufferBuilder &_fbb, const LogSoftmaxOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+};
+
+struct LogSoftmaxOptionsBuilder {
+  flatbuffers::FlatBufferBuilder &fbb_;
+  flatbuffers::uoffset_t start_;
+  explicit LogSoftmaxOptionsBuilder(flatbuffers::FlatBufferBuilder &_fbb)
+        : fbb_(_fbb) {
+    start_ = fbb_.StartTable();
+  }
+  LogSoftmaxOptionsBuilder &operator=(const LogSoftmaxOptionsBuilder &);
+  flatbuffers::Offset<LogSoftmaxOptions> Finish() {
+    const auto end = fbb_.EndTable(start_);
+    auto o = flatbuffers::Offset<LogSoftmaxOptions>(end);
+    return o;
+  }
+};
+
+inline flatbuffers::Offset<LogSoftmaxOptions> CreateLogSoftmaxOptions(
+    flatbuffers::FlatBufferBuilder &_fbb) {
+  LogSoftmaxOptionsBuilder builder_(_fbb);
+  return builder_.Finish();
+}
+
+flatbuffers::Offset<LogSoftmaxOptions> CreateLogSoftmaxOptions(flatbuffers::FlatBufferBuilder &_fbb, const LogSoftmaxOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher = nullptr);
+
 struct OperatorCodeT : public flatbuffers::NativeTable {
   typedef OperatorCode TableType;
   BuiltinOperator builtin_code;
@@ -3790,6 +3857,9 @@ struct Operator FLATBUFFERS_FINAL_CLASS : private flatbuffers::Table {
   const SplitOptions *builtin_options_as_SplitOptions() const {
     return builtin_options_type() == BuiltinOptions_SplitOptions ? static_cast<const SplitOptions *>(builtin_options()) : nullptr;
   }
+  const LogSoftmaxOptions *builtin_options_as_LogSoftmaxOptions() const {
+    return builtin_options_type() == BuiltinOptions_LogSoftmaxOptions ? static_cast<const LogSoftmaxOptions *>(builtin_options()) : nullptr;
+  }
   const flatbuffers::Vector<uint8_t> *custom_options() const {
     return GetPointer<const flatbuffers::Vector<uint8_t> *>(VT_CUSTOM_OPTIONS);
   }
@@ -3956,6 +4026,10 @@ template<> inline const SplitOptions *Operator::builtin_options_as<SplitOptions>
   return builtin_options_as_SplitOptions();
 }
 
+template<> inline const LogSoftmaxOptions *Operator::builtin_options_as<LogSoftmaxOptions>() const {
+  return builtin_options_as_LogSoftmaxOptions();
+}
+
 struct OperatorBuilder {
   flatbuffers::FlatBufferBuilder &fbb_;
   flatbuffers::uoffset_t start_;
@@ -5415,6 +5489,29 @@ inline flatbuffers::Offset<StridedSliceOptions> CreateStridedSliceOptions(flatbu
       _shrink_axis_mask);
 }
 
+inline LogSoftmaxOptionsT *LogSoftmaxOptions::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
+  auto _o = new LogSoftmaxOptionsT();
+  UnPackTo(_o, _resolver);
+  return _o;
+}
+
+inline void LogSoftmaxOptions::UnPackTo(LogSoftmaxOptionsT *_o, const flatbuffers::resolver_function_t *_resolver) const {
+  (void)_o;
+  (void)_resolver;
+}
+
+inline flatbuffers::Offset<LogSoftmaxOptions> LogSoftmaxOptions::Pack(flatbuffers::FlatBufferBuilder &_fbb, const LogSoftmaxOptionsT* _o, const flatbuffers::rehasher_function_t *_rehasher) {
+  return CreateLogSoftmaxOptions(_fbb, _o, _rehasher);
+}
+
+inline flatbuffers::Offset<LogSoftmaxOptions> CreateLogSoftmaxOptions(flatbuffers::FlatBufferBuilder &_fbb, const LogSoftmaxOptionsT *_o, const flatbuffers::rehasher_function_t *_rehasher) {
+  (void)_rehasher;
+  (void)_o;
+  struct _VectorArgs { flatbuffers::FlatBufferBuilder *__fbb; const LogSoftmaxOptionsT* __o; const flatbuffers::rehasher_function_t *__rehasher; } _va = { &_fbb, _o, _rehasher}; (void)_va;
+  return tflite::CreateLogSoftmaxOptions(
+      _fbb);
+}
+
 inline OperatorCodeT *OperatorCode::UnPack(const flatbuffers::resolver_function_t *_resolver) const {
   auto _o = new OperatorCodeT();
   UnPackTo(_o, _resolver);
@@ -5735,6 +5832,10 @@ inline bool VerifyBuiltinOptions(flatbuffers::Verifier &verifier, const void *ob
       auto ptr = reinterpret_cast<const SplitOptions *>(obj);
       return verifier.VerifyTable(ptr);
     }
+    case BuiltinOptions_LogSoftmaxOptions: {
+      auto ptr = reinterpret_cast<const LogSoftmaxOptions *>(obj);
+      return verifier.VerifyTable(ptr);
+    }
     default: return false;
   }
 }
@@ -5893,6 +5994,10 @@ inline void *BuiltinOptionsUnion::UnPack(const void *obj, BuiltinOptions type, c
       auto ptr = reinterpret_cast<const SplitOptions *>(obj);
       return ptr->UnPack(resolver);
     }
+    case BuiltinOptions_LogSoftmaxOptions: {
+      auto ptr = reinterpret_cast<const LogSoftmaxOptions *>(obj);
+      return ptr->UnPack(resolver);
+    }
     default: return nullptr;
   }
 }
@@ -6039,6 +6144,10 @@ inline flatbuffers::Offset<void> BuiltinOptionsUnion::Pack(flatbuffers::FlatBuff
       auto ptr = reinterpret_cast<const SplitOptionsT *>(value);
       return CreateSplitOptions(_fbb, ptr, _rehasher).Union();
     }
+    case BuiltinOptions_LogSoftmaxOptions: {
+      auto ptr = reinterpret_cast<const LogSoftmaxOptionsT *>(value);
+      return CreateLogSoftmaxOptions(_fbb, ptr, _rehasher).Union();
+    }
     default: return 0;
   }
 }
@@ -6185,6 +6294,10 @@ inline BuiltinOptionsUnion::BuiltinOptionsUnion(const BuiltinOptionsUnion &u) FL
       value = new SplitOptionsT(*reinterpret_cast<SplitOptionsT *>(u.value));
       break;
     }
+    case BuiltinOptions_LogSoftmaxOptions: {
+      value = new LogSoftmaxOptionsT(*reinterpret_cast<LogSoftmaxOptionsT *>(u.value));
+      break;
+    }
     default:
       break;
   }
@@ -6367,6 +6480,11 @@ inline void BuiltinOptionsUnion::Reset() {
       delete ptr;
       break;
     }
+    case BuiltinOptions_LogSoftmaxOptions: {
+      auto ptr = reinterpret_cast<LogSoftmaxOptionsT *>(value);
+      delete ptr;
+      break;
+    }
     default: break;
   }
   value = nullptr;
diff --git a/tensorflow/contrib/lite/testing/BUILD b/tensorflow/contrib/lite/testing/BUILD
index 06570ae9aa3d10c3cb73ab362e30244ec0b78a35..83b9e2142798c685cbc8e1fd4d1db5c40b70389f 100644
--- a/tensorflow/contrib/lite/testing/BUILD
+++ b/tensorflow/contrib/lite/testing/BUILD
@@ -33,6 +33,8 @@ gen_zipped_test_files(
         "l2_pool.zip",
         "l2norm.zip",
         "local_response_norm.zip",
+        "log_softmax.zip",
+        "lstm.zip",
         "max_pool.zip",
         "mean.zip",
         "mul.zip",
@@ -240,6 +242,91 @@ cc_test(
     ],
 )
 
+cc_library(
+    name = "generate_testspec",
+    testonly = 1,
+    srcs = ["generate_testspec.cc"],
+    hdrs = ["generate_testspec.h"],
+    deps = [
+        ":join",
+        ":split",
+        ":tf_driver",
+        "//tensorflow/core:framework",
+    ],
+)
+
+cc_test(
+    name = "generate_testspec_test",
+    size = "small",
+    srcs = ["generate_testspec_test.cc"],
+    deps = [
+        ":generate_testspec",
+        "@com_google_googletest//:gtest_main",
+    ],
+)
+
+cc_library(
+    name = "tflite_diff_util",
+    testonly = 1,
+    srcs = ["tflite_diff_util.cc"],
+    hdrs = ["tflite_diff_util.h"],
+    deps = [
+        ":generate_testspec",
+        ":parse_testdata_lib",
+        ":split",
+        ":tflite_driver",
+        ":util",
+        "//tensorflow/contrib/lite:builtin_op_data",
+        "//tensorflow/contrib/lite:framework",
+        "//tensorflow/contrib/lite:string",
+        "//tensorflow/contrib/lite/kernels:builtin_ops",
+    ],
+)
+
+cc_library(
+    name = "tflite_diff_flags",
+    testonly = 1,
+    hdrs = ["tflite_diff_flags.h"],
+    deps = [
+        ":split",
+        ":tflite_diff_util",
+    ] + select({
+        "//conditions:default": [
+            "//tensorflow/core:framework_internal",
+            "//tensorflow/core:lib",
+        ],
+        "//tensorflow:android": [
+            "//tensorflow/core:android_tensorflow_lib",
+        ],
+    }),
+)
+
+tf_cc_test(
+    name = "tflite_diff_example_test",
+    size = "medium",
+    srcs = ["tflite_diff_example_test.cc"],
+    args = [
+        "--tensorflow_model=third_party/tensorflow/contrib/lite/testdata/multi_add.pb",
+        "--tflite_model=third_party/tensorflow/contrib/lite/testdata/multi_add.bin",
+        "--input_layer=a,b,c,d",
+        "--input_layer_type=float,float,float,float",
+        "--input_layer_shape=1,3,4,3:1,3,4,3:1,3,4,3:1,3,4,3",
+        "--output_layer=x,y",
+    ],
+    data = [
+        "//tensorflow/contrib/lite:testdata/multi_add.bin",
+        "//tensorflow/contrib/lite:testdata/multi_add.pb",
+    ],
+    tags = [
+        "no_cuda_on_cpu_tap",
+        "no_oss",
+    ],
+    deps = [
+        ":tflite_diff_flags",
+        ":tflite_diff_util",
+    ],
+)
+
 tf_cc_test(
     name = "generated_examples_zip_test",
     size = "large",
diff --git a/tensorflow/contrib/lite/testing/generate_examples.py b/tensorflow/contrib/lite/testing/generate_examples.py
index b6c09306d6adb8e54d5108dac850f0249ffcb838..5488b71fcf644070710acc4b2b2886e9a96facb6 100644
--- a/tensorflow/contrib/lite/testing/generate_examples.py
+++ b/tensorflow/contrib/lite/testing/generate_examples.py
@@ -36,6 +36,7 @@ import traceback
 import zipfile
 import numpy as np
 from six import StringIO
+from six.moves import xrange
 
 # TODO(aselle): Disable GPU for now
 os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
@@ -46,6 +47,7 @@ from google.protobuf import text_format
 # TODO(aselle): switch to TensorFlow's resource_loader
 from tensorflow.contrib.lite.testing import generate_examples_report as report_lib
 from tensorflow.python.framework import graph_util as tf_graph_util
+from tensorflow.python.ops import rnn
 
 parser = argparse.ArgumentParser(description="Script to generate TFLite tests.")
 parser.add_argument("output_path",
@@ -108,11 +110,23 @@ KNOWN_BUGS = {
 }
 
 
+class ExtraTocoOptions(object):
+  """Additional toco options besides input, output, shape."""
+
+  def __init__(self):
+    # Whether to ignore control dependency nodes.
+    self.drop_control_dependency = False
+    # Allow custom ops in the toco conversion.
+    self.allow_custom_ops = False
+    # Rnn states that are used to support rnn / lstm cells.
+    self.rnn_states = None
+
+
 def toco_options(data_types,
                  input_arrays,
                  output_arrays,
                  shapes,
-                 drop_control_dependency):
+                 extra_toco_options=ExtraTocoOptions()):
   """Create TOCO options to process a model.
 
   Args:
@@ -120,8 +134,7 @@ def toco_options(data_types,
     input_arrays: names of the input tensors
     output_arrays: name of the output tensors
     shapes: shapes of the input tensors
-    drop_control_dependency: whether to ignore control dependency nodes.
-
+    extra_toco_options: additional toco options
   Returns:
     the options in a string.
   """
@@ -137,37 +150,15 @@ def toco_options(data_types,
        " --input_arrays=%s" % ",".join(input_arrays) +
        " --input_shapes=%s" % shape_str +
        " --output_arrays=%s" % ",".join(output_arrays))
-  if drop_control_dependency:
+  if extra_toco_options.drop_control_dependency:
     s += " --drop_control_dependency"
+  if extra_toco_options.allow_custom_ops:
+    s += " --allow_custom_ops"
+  if extra_toco_options.rnn_states:
+    s += (" --rnn_states='" + extra_toco_options.rnn_states + "'")
   return s
 
 
-def write_toco_options(filename,
-                       data_types,
-                       input_arrays,
-                       output_arrays,
-                       shapes,
-                       drop_control_dependency=False):
-  """Create TOCO options to process a model.
-
-  Args:
-    filename: Filename to write the options to.
-    data_types: input and inference types used by TOCO.
-    input_arrays: names of the input tensors
-    output_arrays: names of the output tensors
-    shapes: shapes of the input tensors
-    drop_control_dependency: whether to ignore control dependency nodes.
-  """
-  with open(filename, "w") as fp:
-    fp.write(
-        toco_options(
-            data_types=data_types,
-            input_arrays=input_arrays,
-            output_arrays=output_arrays,
-            shapes=shapes,
-            drop_control_dependency=drop_control_dependency))
-
-
 def write_examples(fp, examples):
   """Given a list `examples`, write a text format representation.
 
@@ -285,12 +276,14 @@ def make_control_dep_tests(zip_path):
     return [input_values], sess.run(
         outputs, feed_dict=dict(zip(inputs, [input_values])))
 
+  extra_toco_options = ExtraTocoOptions()
+  extra_toco_options.drop_control_dependency = True
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs,
-                    drop_control_dependency=True)
+                    extra_toco_options)
 
 
 def toco_convert(graph_def_str, input_tensors, output_tensors,
-                 drop_control_dependency=False):
+                 extra_toco_options):
   """Convert a model's graph def into a tflite model.
 
   NOTE: this currently shells out to the toco binary, but we would like
@@ -298,9 +291,9 @@ def toco_convert(graph_def_str, input_tensors, output_tensors,
 
   Args:
     graph_def_str: Graph def proto in serialized string format.
-    input_tensors: List of input tensor tuples `(name, shape, type)`
-    output_tensors: List of output tensors (names)
-    drop_control_dependency: whether to ignore control dependency nodes.
+    input_tensors: List of input tensor tuples `(name, shape, type)`.
+    output_tensors: List of output tensors (names).
+    extra_toco_options: Additional toco options.
 
   Returns:
     output tflite model, log_txt from conversion
@@ -312,7 +305,7 @@ def toco_convert(graph_def_str, input_tensors, output_tensors,
       input_arrays=[x[0] for x in input_tensors],
       shapes=[x[1] for x in input_tensors],
       output_arrays=output_tensors,
-      drop_control_dependency=drop_control_dependency)
+      extra_toco_options=extra_toco_options)
 
   with tempfile.NamedTemporaryFile() as graphdef_file, \
        tempfile.NamedTemporaryFile() as output_file, \
@@ -341,7 +334,8 @@ def make_zip_of_tests(zip_path,
                       test_parameters,
                       make_graph,
                       make_test_inputs,
-                      drop_control_dependency=False):
+                      extra_toco_options=ExtraTocoOptions(),
+                      use_frozen_graph=False):
   """Helper to make a zip file of a bunch of TensorFlow models.
 
   This does a cartestian product of the dictionary of test_parameters and
@@ -359,7 +353,9 @@ def make_zip_of_tests(zip_path,
       `[input1, input2, ...], [output1, output2, ...]`
     make_test_inputs: function taking `curr_params`, `session`, `input_tensors`,
       `output_tensors` and returns tuple `(input_values, output_values)`.
-    drop_control_dependency: whether to ignore control dependency nodes.
+    extra_toco_options: Additional toco options.
+    use_frozen_graph: Whether to freeze the graph before running toco.
+
   Raises:
     RuntimeError: if there are toco errors that can't be ignored.
   """
@@ -419,21 +415,25 @@ def make_zip_of_tests(zip_path,
           return None, report
         report["toco"] = report_lib.FAILED
         report["tf"] = report_lib.SUCCESS
-
         # Convert graph to toco
+        input_tensors = [(input_tensor.name.split(":")[0],
+                          input_tensor.get_shape(), input_tensor.dtype)
+                         for input_tensor in inputs]
+        output_tensors = [normalize_output_name(out.name) for out in outputs]
+        graph_def = freeze_graph(
+            sess,
+            tf.global_variables() + inputs +
+            outputs) if use_frozen_graph else sess.graph_def
         tflite_model_binary, toco_log = toco_convert(
-            sess.graph_def.SerializeToString(),
-            [(input_tensor.name.split(":")[0], input_tensor.get_shape(),
-              input_tensor.dtype) for input_tensor in inputs],
-            [normalize_output_name(out.name) for out in outputs],
-            drop_control_dependency)
+            graph_def.SerializeToString(), input_tensors, output_tensors,
+            extra_toco_options)
         report["toco"] = (report_lib.SUCCESS if tflite_model_binary is not None
                           else report_lib.FAILED)
         report["toco_log"] = toco_log
 
         if FLAGS.save_graphdefs:
           archive.writestr(label + ".pb",
-                           text_format.MessageToString(sess.graph_def),
+                           text_format.MessageToString(graph_def),
                            zipfile.ZIP_DEFLATED)
 
         if tflite_model_binary:
@@ -783,6 +783,37 @@ def make_exp_tests(zip_path):
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
 
+def make_log_softmax_tests(zip_path):
+  """Make a set of tests to do log_softmax."""
+
+  test_parameters = [{
+      "input_dtype": [tf.float32],
+      "input_shape": [[1, 100], [4, 2], [5, 224]],
+  }]
+
+  def build_graph(parameters):
+    """Build the log_softmax op testing graph."""
+    input_tensor = tf.placeholder(
+        dtype=parameters["input_dtype"],
+        name="input",
+        shape=parameters["input_shape"])
+
+    out = tf.nn.log_softmax(input_tensor)
+    return [input_tensor], [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    values = [
+        create_tensor_data(
+            parameters["input_dtype"],
+            parameters["input_shape"],
+            min_value=-100,
+            max_value=9)
+    ]
+    return values, sess.run(outputs, feed_dict=dict(zip(inputs, values)))
+
+  make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
+
+
 def make_binary_op_tests_func(binary_operator):
   """Return a function that does a test on a binary operator."""
   return lambda zip_path: make_binary_op_tests(zip_path, binary_operator)
@@ -1730,6 +1761,84 @@ def make_strided_slice_tests(zip_path):
   make_zip_of_tests(zip_path, test_parameters, build_graph, build_inputs)
 
 
+def make_lstm_tests(zip_path):
+  """Make a set of tests for a basic LSTM cell."""
+
+  test_parameters = [
+      {
+          "dtype": [tf.float32],
+          "num_batchs": [1],
+          "time_step_size": [1],
+          "input_vec_size": [3],
+          "num_cells": [4],
+      },
+  ]
+
+  def build_graph(parameters):
+    """Build a simple graph with BasicLSTMCell."""
+
+    num_batchs = parameters["num_batchs"]
+    time_step_size = parameters["time_step_size"]
+    input_vec_size = parameters["input_vec_size"]
+    num_cells = parameters["num_cells"]
+    inputs_after_split = []
+    for i in xrange(time_step_size):
+      one_timestamp_input = tf.placeholder(
+          dtype=parameters["dtype"],
+          name="split_{}".format(i),
+          shape=[num_batchs, input_vec_size])
+      inputs_after_split.append(one_timestamp_input)
+    # Currently lstm identifier has a few limitations: only supports
+    # forget_bias == 0, inner state activation == tanh.
+    # TODO(zhixianyan): Add another test with forget_bias == 1.
+    # TODO(zhixianyan): Add another test with relu as activation.
+    lstm_cell = tf.contrib.rnn.BasicLSTMCell(
+        num_cells, forget_bias=0.0, state_is_tuple=True)
+    cell_outputs, _ = rnn.static_rnn(
+        lstm_cell, inputs_after_split, dtype=tf.float32)
+    out = cell_outputs[-1]
+    return inputs_after_split, [out]
+
+  def build_inputs(parameters, sess, inputs, outputs):
+    """Feed inputs, assign variables, and freeze graph."""
+
+    with tf.variable_scope("", reuse=True):
+      kernel = tf.get_variable("rnn/basic_lstm_cell/kernel")
+      bias = tf.get_variable("rnn/basic_lstm_cell/bias")
+      kernel_values = create_tensor_data(
+          parameters["dtype"], [kernel.shape[0], kernel.shape[1]], -1, 1)
+      bias_values = create_tensor_data(parameters["dtype"], [bias.shape[0]], 0,
+                                       1)
+      sess.run(tf.group(kernel.assign(kernel_values), bias.assign(bias_values)))
+
+    num_batchs = parameters["num_batchs"]
+    time_step_size = parameters["time_step_size"]
+    input_vec_size = parameters["input_vec_size"]
+    input_values = []
+    for _ in xrange(time_step_size):
+      tensor_data = create_tensor_data(parameters["dtype"],
+                                       [num_batchs, input_vec_size], 0, 1)
+      input_values.append(tensor_data)
+    out = sess.run(outputs, feed_dict=dict(zip(inputs, input_values)))
+    return input_values, out
+
+  # TODO(zhixianyan): Automatically generate rnn_states for lstm cell.
+  extra_toco_options = ExtraTocoOptions()
+  extra_toco_options.rnn_states = (
+      "{state_array:rnn/BasicLSTMCellZeroState/zeros,"
+      "back_edge_source_array:rnn/basic_lstm_cell/Add_1,size:4},"
+      "{state_array:rnn/BasicLSTMCellZeroState/zeros_1,"
+      "back_edge_source_array:rnn/basic_lstm_cell/Mul_2,size:4}")
+
+  make_zip_of_tests(
+      zip_path,
+      test_parameters,
+      build_graph,
+      build_inputs,
+      extra_toco_options,
+      use_frozen_graph=True)
+
+
 def make_l2_pool(input_tensor, ksize, strides, padding, data_format):
   """Given an input perform a sequence of TensorFlow ops to produce l2pool."""
   return tf.sqrt(tf.nn.avg_pool(
@@ -1818,6 +1927,8 @@ def main(unused_args):
         "squeeze.zip": make_squeeze_tests,
         "strided_slice.zip": make_strided_slice_tests,
         "exp.zip": make_exp_tests,
+        "log_softmax.zip": make_log_softmax_tests,
+        "lstm.zip": make_lstm_tests,
     }
     out = FLAGS.zip_to_output
     bin_path = FLAGS.toco
diff --git a/tensorflow/contrib/lite/testing/generate_testspec.cc b/tensorflow/contrib/lite/testing/generate_testspec.cc
new file mode 100644
index 0000000000000000000000000000000000000000..eb3deafb6986e877f0a553a8b6f712102af4caca
--- /dev/null
+++ b/tensorflow/contrib/lite/testing/generate_testspec.cc
@@ -0,0 +1,88 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/lite/testing/generate_testspec.h"
+#include "tensorflow/contrib/lite/testing/join.h"
+#include "tensorflow/contrib/lite/testing/split.h"
+#include "tensorflow/contrib/lite/testing/tf_driver.h"
+#include "tensorflow/core/framework/types.h"
+
+namespace tflite {
+namespace testing {
+
+void GenerateTestSpecFromTensorflowModel(
+    std::iostream& stream, const string& tensorflow_model_path,
+    const string& tflite_model_path, const std::vector<string>& input_layer,
+    const std::vector<string>& input_layer_type,
+    const std::vector<string>& input_layer_shape,
+    const std::vector<string>& output_layer) {
+  CHECK_EQ(input_layer.size(), input_layer_type.size());
+  CHECK_EQ(input_layer.size(), input_layer_shape.size());
+
+  // Initialize random functions.
+  static unsigned int seed = 0;
+  std::function<float(int)> float_rand = [](int idx) {
+    return static_cast<float>(rand_r(&seed)) / RAND_MAX - 0.5f;
+  };
+
+  // Generate inputs.
+  std::vector<string> input_values;
+  input_values.resize(input_layer.size());
+  for (int i = 0; i < input_layer.size(); i++) {
+    tensorflow::DataType type;
+    CHECK(DataTypeFromString(input_layer_type[i], &type));
+    auto shape = Split<int>(input_layer_shape[i], ",");
+
+    switch (type) {
+      case tensorflow::DT_FLOAT: {
+        const auto& data = GenerateRandomTensor<float>(shape, float_rand);
+        input_values[i] = Join(data.data(), data.size(), ",");
+        break;
+      }
+      default:
+
+        fprintf(stderr, "Unsupported type %d when generating testspec\n", type);
+        return;
+    }
+  }
+
+  // Invoke tensorflow model.
+  TfDriver runner(input_layer, input_layer_type, input_layer_shape,
+                  output_layer);
+  runner.LoadModel(tensorflow_model_path);
+  for (int i = 0; i < input_values.size(); i++) {
+    runner.SetInput(i, input_values[i]);
+  }
+  runner.Invoke();
+
+  // Write test spec.
+  stream << "load_model: " << tflite_model_path << "\n";
+  stream << "reshape {\n";
+  for (const auto& shape : input_layer_shape) {
+    stream << "  input: \"" << shape << "\"\n";
+  }
+  stream << "}\n";
+  stream << "invoke {\n";
+  for (const auto& value : input_values) {
+    stream << "  input: \"" << value << "\"\n";
+  }
+  for (int i = 0; i < output_layer.size(); i++) {
+    stream << "  output: \"" << runner.ReadOutput(i) << "\"\n";
+  }
+  stream << "}\n";
+}
+
+}  // namespace testing
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/testing/generate_testspec.h b/tensorflow/contrib/lite/testing/generate_testspec.h
new file mode 100644
index 0000000000000000000000000000000000000000..3529ee709b66625fff6e2a35b78e47f3778f0fe7
--- /dev/null
+++ b/tensorflow/contrib/lite/testing/generate_testspec.h
@@ -0,0 +1,64 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_TESTING_GENERATE_TESTSPEC_H_
+#define TENSORFLOW_CONTRIB_LITE_TESTING_GENERATE_TESTSPEC_H_
+
+#include <functional>
+#include <iostream>
+#include <vector>
+
+namespace tflite {
+namespace testing {
+
+// Generate test spec by executing TensorFlow model on random inputs.
+// The test spec can be consumed by ParseAndRunTests.
+// See test spec format in parse_testdata.h
+//
+// Inputs:
+//   stream: mutable iostream that contains the contents of test spec.
+//   tensorflow_model_path: path to TensorFlow model.
+//   tflite_model_path: path to the TFLite model the test spec runs against.
+//   input_layer: names of input tensors. Example: input1
+//   input_layer_type: datatypes of input tensors. Example: float
+//   input_layer_shape: comma-separated shapes of input tensors. Example: 1,3,4
+//   output_layer: names of output tensors. Example: output
+void GenerateTestSpecFromTensorflowModel(
+    std::iostream& stream, const string& tensorflow_model_path,
+    const string& tflite_model_path, const std::vector<string>& input_layer,
+    const std::vector<string>& input_layer_type,
+    const std::vector<string>& input_layer_shape,
+    const std::vector<string>& output_layer);
+
+// Generates random values that are filled into the tensor.
+// random_func returns the generated random element at given index.
+template <typename T>
+std::vector<T> GenerateRandomTensor(const std::vector<int>& shape,
+                                    const std::function<T(int)>& random_func) {
+  int64_t num_elements = 1;
+  for (const int dim : shape) {
+    num_elements *= dim;
+  }
+
+  std::vector<T> result(num_elements);
+  for (int i = 0; i < num_elements; i++) {
+    result[i] = random_func(i);
+  }
+  return result;
+}
+
+}  // namespace testing
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_TESTING_GENERATE_TESTSPEC_H_
diff --git a/tensorflow/contrib/lite/testing/generate_testspec_test.cc b/tensorflow/contrib/lite/testing/generate_testspec_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..2a97b757a413246c9ad9b5f453741b13e381c903
--- /dev/null
+++ b/tensorflow/contrib/lite/testing/generate_testspec_test.cc
@@ -0,0 +1,54 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/contrib/lite/testing/generate_testspec.h"
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+
+namespace tflite {
+namespace testing {
+namespace {
+
+TEST(GenerateRandomTensor, FloatValue) {
+  static unsigned int seed = 0;
+  std::function<float(int)> float_rand = [](int idx) {
+    return static_cast<float>(rand_r(&seed)) / RAND_MAX - 0.5f;
+  };
+
+  std::set<float> values;
+  float sum_x_square = 0.0f;
+  float sum_x = 0.0f;
+  for (int i = 0; i < 100; i++) {
+    const auto& data = GenerateRandomTensor<float>({1, 3, 4}, float_rand);
+    for (float value : data) {
+      values.insert(value);
+      sum_x_square += value * value;
+      sum_x += value;
+    }
+  }
+
+  // Each round, the generated tensor has different values.
+  EXPECT_GT(values.size(), 200);
+  int num = 1 * 3 * 4 * 100;
+  float variance = sum_x_square / num - (sum_x / num) * (sum_x / num);
+
+  // Variance exceeds 1/2 the variance of a uniform distribution: (B-A)^2 / 12
+  float min_variance = 1.0f / 12 / 2;
+  EXPECT_GT(variance, min_variance);
+}
+
+}  // namespace
+}  // namespace testing
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc b/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
index 49766cedac8d1acd96f9b38665119e99f8bb9ac0..86606d12393b94567fbe1fceb6d708b266efe4a8 100644
--- a/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
+++ b/tensorflow/contrib/lite/testing/generated_examples_zip_test.cc
@@ -92,6 +92,9 @@ std::map<string, string> kBrokenTests = {
 
     // Transpose only supports 1D-4D input tensors.
     {R"(^\/transpose.*input_shape=\[.,.,.,.,.\])", "71545879"},
+
+    // Lstm kernel gets different results on tsan, asan, msan.
+    {R"(^\/lstmdtype=tf.float32.*)", "73830845"},
 };
 
 // Allows test data to be unzipped into a temporary directory and makes
@@ -250,6 +253,7 @@ INSTANTIATE_TESTS(global_batch_norm)
 INSTANTIATE_TESTS(l2norm)
 INSTANTIATE_TESTS(l2_pool)
 INSTANTIATE_TESTS(local_response_norm)
+INSTANTIATE_TESTS(log_softmax)
 INSTANTIATE_TESTS(max_pool)
 INSTANTIATE_TESTS(mul)
 INSTANTIATE_TESTS(pad)
@@ -265,6 +269,7 @@ INSTANTIATE_TESTS(sub)
 INSTANTIATE_TESTS(split)
 INSTANTIATE_TESTS(div)
 INSTANTIATE_TESTS(transpose)
+INSTANTIATE_TESTS(lstm)
 INSTANTIATE_TESTS(mean)
 INSTANTIATE_TESTS(squeeze)
 INSTANTIATE_TESTS(strided_slice)
diff --git a/tensorflow/contrib/lite/testing/parse_testdata.cc b/tensorflow/contrib/lite/testing/parse_testdata.cc
index 0caef0fe2201a668b2235a98304eb353072a3c2f..389688d552051ea735ce71533943af33df5059ef 100644
--- a/tensorflow/contrib/lite/testing/parse_testdata.cc
+++ b/tensorflow/contrib/lite/testing/parse_testdata.cc
@@ -192,27 +192,25 @@ TfLiteStatus CheckOutputs(tflite::Interpreter* interpreter,
   int model_outputs = interpreter->outputs().size();
   TF_LITE_ENSURE_EQ(context, model_outputs, example.outputs.size());
   for (size_t i = 0; i < interpreter->outputs().size(); i++) {
+    bool tensors_differ = false;
     int output_index = interpreter->outputs()[i];
     if (const float* data = interpreter->typed_tensor<float>(output_index)) {
       for (size_t idx = 0; idx < example.outputs[i].flat_data.size(); idx++) {
         float computed = data[idx];
         float reference = example.outputs[0].flat_data[idx];
         float diff = std::abs(computed - reference);
-        bool error_is_large = false;
         // For very small numbers, try absolute error, otherwise go with
         // relative.
-        if (std::abs(reference) < kRelativeThreshold) {
-          error_is_large = (diff > kAbsoluteThreshold);
-        } else {
-          error_is_large = (diff > kRelativeThreshold * std::abs(reference));
-        }
-        if (error_is_large) {
+        bool local_tensors_differ =
+            std::abs(reference) < kRelativeThreshold
+                ? diff > kAbsoluteThreshold
+                : diff > kRelativeThreshold * std::abs(reference);
+        if (local_tensors_differ) {
           fprintf(stdout, "output[%zu][%zu] did not match %f vs reference %f\n",
                   i, idx, data[idx], reference);
-          return kTfLiteError;
+          tensors_differ = local_tensors_differ;
         }
       }
-      fprintf(stderr, "\n");
     } else if (const int32_t* data =
                    interpreter->typed_tensor<int32_t>(output_index)) {
       for (size_t idx = 0; idx < example.outputs[i].flat_data.size(); idx++) {
@@ -221,10 +219,9 @@ TfLiteStatus CheckOutputs(tflite::Interpreter* interpreter,
         if (std::abs(computed - reference) > 0) {
           fprintf(stderr, "output[%zu][%zu] did not match %d vs reference %d\n",
                   i, idx, computed, reference);
-          return kTfLiteError;
+          tensors_differ = true;
         }
       }
-      fprintf(stderr, "\n");
     } else if (const int64_t* data =
                    interpreter->typed_tensor<int64_t>(output_index)) {
       for (size_t idx = 0; idx < example.outputs[i].flat_data.size(); idx++) {
@@ -235,14 +232,15 @@ TfLiteStatus CheckOutputs(tflite::Interpreter* interpreter,
                   "output[%zu][%zu] did not match %" PRId64
                   " vs reference %" PRId64 "\n",
                   i, idx, computed, reference);
-          return kTfLiteError;
+          tensors_differ = true;
         }
       }
-      fprintf(stderr, "\n");
     } else {
       fprintf(stderr, "output[%zu] was not float or int data\n", i);
       return kTfLiteError;
     }
+    fprintf(stderr, "\n");
+    if (tensors_differ) return kTfLiteError;
   }
   return kTfLiteOk;
 }
@@ -319,8 +317,9 @@ class Reshape : public Message {
 // This is the top-level message in a test file.
 class TestData : public Message {
  public:
-  explicit TestData(TestRunner* test_runner) : test_runner_(test_runner) {}
-
+  explicit TestData(TestRunner* test_runner)
+      : test_runner_(test_runner), num_invocations_(0), max_invocations_(-1) {}
+  void SetMaxInvocations(int max) { max_invocations_ = max; }
   void SetField(const std::string& name, const std::string& value) override {
     if (name == "load_model") {
       test_runner_->LoadModel(value);
@@ -334,7 +333,12 @@ class TestData : public Message {
   Message* AddChild(const std::string& s) override {
     if (s == "invoke") {
       test_runner_->AllocateTensors();
-      return Store(new Invoke(test_runner_));
+      if (max_invocations_ == -1 || num_invocations_ < max_invocations_) {
+        ++num_invocations_;
+        return Store(new Invoke(test_runner_));
+      } else {
+        return nullptr;
+      }
     } else if (s == "reshape") {
       return Store(new Reshape(test_runner_));
     }
@@ -343,10 +347,14 @@ class TestData : public Message {
 
  private:
   TestRunner* test_runner_;
+  int num_invocations_;
+  int max_invocations_;
 };
 
-bool ParseAndRunTests(std::istream* input, TestRunner* test_runner) {
+bool ParseAndRunTests(std::istream* input, TestRunner* test_runner,
+                      int max_invocations) {
   TestData test_data(test_runner);
+  test_data.SetMaxInvocations(max_invocations);
   Message::Read(input, &test_data);
   return test_runner->IsValid() && test_runner->GetOverallSuccess();
 }
diff --git a/tensorflow/contrib/lite/testing/parse_testdata.h b/tensorflow/contrib/lite/testing/parse_testdata.h
index 7ebf362eb99c5f4cf6ea3654cf71e13ff1de99b3..d94361d735e2be8dc130dc8d6bf0bb5c822ebb7c 100644
--- a/tensorflow/contrib/lite/testing/parse_testdata.h
+++ b/tensorflow/contrib/lite/testing/parse_testdata.h
@@ -66,7 +66,8 @@ TfLiteStatus CheckOutputs(tflite::Interpreter* interpreter, const Example&);
 //     output: "12,3,4,545,3"
 //     output: "0.01,0.02"
 //   }
-bool ParseAndRunTests(std::istream* input, TestRunner* test_runner);
+bool ParseAndRunTests(std::istream* input, TestRunner* test_runner,
+                      int max_invocations = -1);
 
 }  // namespace testing
 }  // namespace tflite
diff --git a/tensorflow/compiler/xla/array2d.cc b/tensorflow/contrib/lite/testing/tflite_diff_example_test.cc
similarity index 51%
rename from tensorflow/compiler/xla/array2d.cc
rename to tensorflow/contrib/lite/testing/tflite_diff_example_test.cc
index 418587c1f75c7249f92e925455d40685d870c57a..3817e68111dbaaf2a38ceff9fbc38f30f303cb5f 100644
--- a/tensorflow/compiler/xla/array2d.cc
+++ b/tensorflow/contrib/lite/testing/tflite_diff_example_test.cc
@@ -13,24 +13,16 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/compiler/xla/array2d.h"
-#include "tensorflow/compiler/xla/ptr_util.h"
+#include "tensorflow/contrib/lite/testing/tflite_diff_flags.h"
+#include "tensorflow/contrib/lite/testing/tflite_diff_util.h"
 
-namespace xla {
-
-std::unique_ptr<Array2D<float>> MakeLinspaceArray2D(float from, float to,
-                                                    int64 n1, int64 n2) {
-  auto array = MakeUnique<Array2D<float>>(n1, n2);
-  int64 count = n1 * n2;
-  float step = (count > 1) ? (to - from) / (count - 1) : 0.0f;
-  auto set = [&array, n1, n2](int64 index, float value) {
-    (*array)(index / n2, index % n2) = value;
-  };
-  for (int64 i = 0; i < count - 1; ++i) {
-    set(i, from + i * step);
+int main(int argc, char** argv) {
+  ::tflite::testing::DiffOptions options =
+      ::tflite::testing::ParseTfliteDiffFlags(&argc, argv);
+  for (int i = 0; i < 100; i++) {
+    if (!tflite::testing::RunDiffTest(options)) {
+      return 1;
+    }
   }
-  set(count - 1, to);
-  return array;
+  return 0;
 }
-
-}  // namespace xla
diff --git a/tensorflow/contrib/lite/testing/tflite_diff_flags.h b/tensorflow/contrib/lite/testing/tflite_diff_flags.h
new file mode 100644
index 0000000000000000000000000000000000000000..5f1129d501b7235f1202b704cf36904e07b8720e
--- /dev/null
+++ b/tensorflow/contrib/lite/testing/tflite_diff_flags.h
@@ -0,0 +1,70 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_TESTING_TFLITE_DIFF_FLAGS_H_
+#define TENSORFLOW_CONTRIB_LITE_TESTING_TFLITE_DIFF_FLAGS_H_
+
+#include "tensorflow/contrib/lite/testing/split.h"
+#include "tensorflow/contrib/lite/testing/tflite_diff_util.h"
+#include "tensorflow/core/util/command_line_flags.h"
+
+namespace tflite {
+namespace testing {
+
+DiffOptions ParseTfliteDiffFlags(int* argc, char** argv) {
+  struct {
+    string tensorflow_model;
+    string tflite_model;
+    string input_layer;
+    string input_layer_type;
+    string input_layer_shape;
+    string output_layer;
+  } values;
+
+  std::vector<tensorflow::Flag> flags = {
+      tensorflow::Flag("tensorflow_model", &values.tensorflow_model,
+                       "Path of tensorflow model."),
+      tensorflow::Flag("tflite_model", &values.tflite_model,
+                       "Path of tensorflow lite model."),
+      tensorflow::Flag("input_layer", &values.input_layer,
+                       "Names of input tensors, separated by comma. Example: "
+                       "input_1,input_2"),
+      tensorflow::Flag("input_layer_type", &values.input_layer_type,
+                       "Data types of input tensors, separated by comma. "
+                       "Example: float,int"),
+      tensorflow::Flag(
+          "input_layer_shape", &values.input_layer_shape,
+          "Shapes of input tensors, separated by colon. Example: 1,3,4,1:2"),
+      tensorflow::Flag("output_layer", &values.output_layer,
+                       "Names of output tensors, separated by comma. Example: "
+                       "output_1,output_2"),
+  };
+
+  bool success = tensorflow::Flags::Parse(argc, argv, flags);
+  if (!success || (*argc == 2 && !strcmp(argv[1], "--helpfull"))) {
+    fprintf(stderr, "%s", tensorflow::Flags::Usage(argv[0], flags).c_str());
+  }
+
+  return {values.tensorflow_model,
+          values.tflite_model,
+          Split<string>(values.input_layer, ","),
+          Split<string>(values.input_layer_type, ","),
+          Split<string>(values.input_layer_shape, ":"),
+          Split<string>(values.output_layer, ",")};
+}
+
+}  // namespace testing
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_TESTING_TFLITE_DIFF_FLAGS_H_
diff --git a/tensorflow/contrib/lite/testing/tflite_diff_util.cc b/tensorflow/contrib/lite/testing/tflite_diff_util.cc
new file mode 100644
index 0000000000000000000000000000000000000000..9ef4e1f66c7d31c746c18d63495e760585d4af9e
--- /dev/null
+++ b/tensorflow/contrib/lite/testing/tflite_diff_util.cc
@@ -0,0 +1,41 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <cstdarg>
+#include <cstdio>
+#include <cstdlib>
+#include <sstream>
+
+#include "tensorflow/contrib/lite/testing/generate_testspec.h"
+#include "tensorflow/contrib/lite/testing/parse_testdata.h"
+#include "tensorflow/contrib/lite/testing/tflite_diff_util.h"
+#include "tensorflow/contrib/lite/testing/tflite_driver.h"
+
+namespace tflite {
+namespace testing {
+
+bool RunDiffTest(const DiffOptions& options) {
+  std::stringstream tflite_stream;
+  GenerateTestSpecFromTensorflowModel(
+      tflite_stream, options.tensorflow_model, options.tflite_model,
+      options.input_layer, options.input_layer_type, options.input_layer_shape,
+      options.output_layer);
+  TfLiteDriver tflite_driver(/*use_nnapi=*/true);
+  tflite_driver.LoadModel(options.tflite_model);
+  std::cout << tflite_stream.str();
+  return tflite::testing::ParseAndRunTests(&tflite_stream, &tflite_driver);
+}
+
+}  // namespace testing
+}  // namespace tflite
diff --git a/tensorflow/contrib/lite/testing/tflite_diff_util.h b/tensorflow/contrib/lite/testing/tflite_diff_util.h
new file mode 100644
index 0000000000000000000000000000000000000000..326fa6c3e28000dee9b6eb9cc5b3a6c5c87e28d0
--- /dev/null
+++ b/tensorflow/contrib/lite/testing/tflite_diff_util.h
@@ -0,0 +1,51 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CONTRIB_LITE_TESTING_TFLITE_DIFF_UTIL_H_
+#define TENSORFLOW_CONTRIB_LITE_TESTING_TFLITE_DIFF_UTIL_H_
+
+#include <vector>
+
+#include "tensorflow/contrib/lite/string.h"
+
+namespace tflite {
+namespace testing {
+
+// Configurations to run Tflite diff test.
+struct DiffOptions {
+  // Path of tensorflow model.
+  string tensorflow_model;
+  // Path of tensorflow lite model.
+  string tflite_model;
+  // Names of input tensors.
+  // Example: input_1,input_2
+  std::vector<string> input_layer;
+  // Data types of input tensors.
+  // Example: float,int
+  std::vector<string> input_layer_type;
+  // Shapes of input tensors; dimensions within a shape are comma-separated.
+  // Example: 1,3,4,1
+  std::vector<string> input_layer_shape;
+  // Names of output tensors.
+  // Example: output_1,output_2
+  std::vector<string> output_layer;
+};
+
+// Run a single TensorFlow Lite diff test with the given options.
+bool RunDiffTest(const DiffOptions& options);
+
+}  // namespace testing
+}  // namespace tflite
+
+#endif  // TENSORFLOW_CONTRIB_LITE_TESTING_TFLITE_DIFF_UTIL_H_
diff --git a/tensorflow/contrib/lite/toco/BUILD b/tensorflow/contrib/lite/toco/BUILD
index e2879fad327d965799d84da5c9092a12a36aa65b..845bc0460f45cb225111e1c71d6d5196415fea82 100644
--- a/tensorflow/contrib/lite/toco/BUILD
+++ b/tensorflow/contrib/lite/toco/BUILD
@@ -186,6 +186,7 @@ cc_library(
         "graph_transformations/fuse_binary_into_preceding_affine.cc",
         "graph_transformations/graph_transformations.cc",
         "graph_transformations/hardcode_min_max.cc",
+        "graph_transformations/identify_dilated_conv.cc",
         "graph_transformations/identify_l2_normalization.cc",
         "graph_transformations/identify_l2_pool.cc",
         "graph_transformations/identify_lstm.cc",
@@ -239,6 +240,7 @@ cc_library(
         "graph_transformations/resolve_tensorflow_tile.cc",
         "graph_transformations/resolve_transpose_attributes.cc",
         "graph_transformations/unfuse_activation_functions.cc",
+        "graph_transformations/unpartition_embedding_lookup.cc",
         "graph_transformations/unroll_batch_matmul.cc",
     ],
     hdrs = [
diff --git a/tensorflow/contrib/lite/toco/args.h b/tensorflow/contrib/lite/toco/args.h
index b97a4720a7c4e69f8b69574475d19e0522cfe86d..59a6115920614d38900c0370708324c122384420 100644
--- a/tensorflow/contrib/lite/toco/args.h
+++ b/tensorflow/contrib/lite/toco/args.h
@@ -229,6 +229,7 @@ struct ParsedTocoFlags {
   // Deprecated flags
   Arg<string> input_type;
   Arg<string> input_types;
+  Arg<bool> debug_disable_recurrent_cell_fusion = Arg<bool>(false);
   Arg<bool> drop_control_dependency = Arg<bool>(false);
 };
 
diff --git a/tensorflow/contrib/lite/toco/dump_graphviz.cc b/tensorflow/contrib/lite/toco/dump_graphviz.cc
index c726eb6d8678e2703f5acba8b3d8d740186939f5..c8352741b44cd627ff9edb9c4677b994c4cb9a09 100644
--- a/tensorflow/contrib/lite/toco/dump_graphviz.cc
+++ b/tensorflow/contrib/lite/toco/dump_graphviz.cc
@@ -142,14 +142,8 @@ NodeProperties GetPropertiesForArray(const Model& model,
 
   // Append array shape to the label.
   auto& array = model.GetArray(array_name);
-
-  if (array.data_type == ArrayDataType::kFloat) {
-    AppendF(&node_properties.label, "\\nType: float");
-  } else if (array.data_type == ArrayDataType::kInt32) {
-    AppendF(&node_properties.label, "\\nType: int32");
-  } else if (array.data_type == ArrayDataType::kUint8) {
-    AppendF(&node_properties.label, "\\nType: uint8");
-  }
+  AppendF(&node_properties.label, "\\nType: %s",
+          ArrayDataTypeName(array.data_type));
 
   if (array.has_shape()) {
     auto& array_shape = array.shape();
@@ -199,12 +193,12 @@ NodeProperties GetPropertiesForArray(const Model& model,
   }
 
   if (array.minmax) {
-    AppendF(&node_properties.label, "\\nMinMax: [%.3g, %.3g]",
+    AppendF(&node_properties.label, "\\nMinMax: [%.7g, %.7g]",
             array.minmax->min, array.minmax->max);
   }
 
   if (array.quantization_params) {
-    AppendF(&node_properties.label, "\\nQuantization: %.3g * (x - %d)",
+    AppendF(&node_properties.label, "\\nQuantization: %.7g * (x - %d)",
             array.quantization_params->scale,
             array.quantization_params->zero_point);
   }
diff --git a/tensorflow/contrib/lite/toco/export_tensorflow.cc b/tensorflow/contrib/lite/toco/export_tensorflow.cc
index 570cc7943b2926136f6fdb21d20b8aa6acf8cd26..6900468ec6484d5c1896752286a2fa72f4d38c07 100644
--- a/tensorflow/contrib/lite/toco/export_tensorflow.cc
+++ b/tensorflow/contrib/lite/toco/export_tensorflow.cc
@@ -239,6 +239,7 @@ void ConvertIntTensorConst(const Model& model, const string& name,
 }
 
 void CreateIntTensorConst(const string& name, const std::vector<int32>& data,
+                          const std::vector<int32>& shape,
                           GraphDef* tensorflow_graph) {
   if (HasAlreadyExportedConst(name, *tensorflow_graph)) {
     return;
@@ -252,8 +253,13 @@ void CreateIntTensorConst(const string& name, const std::vector<int32>& data,
   for (auto index : data) {
     tensor->add_int_val(index);
   }
-  auto* shape = tensor->mutable_tensor_shape();
-  shape->add_dim()->set_size(data.size());
+  auto* tensor_shape = tensor->mutable_tensor_shape();
+  int num_elements = 1;
+  for (int size : shape) {
+    tensor_shape->add_dim()->set_size(size);
+    num_elements *= size;
+  }
+  CHECK_EQ(num_elements, data.size());
 }
 
 void CreateMatrixShapeTensorConst(const string& name, int rows, int cols,
@@ -385,6 +391,84 @@ void ConvertConvOperator(const Model& model, const ConvOperator& src_op,
   }
 }
 
+void ConvertDilatedConvOperator(const Model& model, const ConvOperator& src_op,
+                                GraphDef* tensorflow_graph) {
+  CHECK((src_op.dilation_width_factor > 1) ||
+        (src_op.dilation_height_factor > 1))
+      << "Conv operator must have height or width dilation factor > 1. "
+         "Otherwise, use regular conv op.";
+  CHECK_EQ(src_op.stride_width, 1)
+      << "Dilated AND strided convolution is unsupported";
+  CHECK_EQ(src_op.stride_height, 1)
+      << "Dilated AND strided convolution is unsupported";
+
+  // Emulate dilated convolution with a chain of SpaceToBatchND -> Conv ->
+  // BatchToSpaceND ops.
+
+  // Compute padding
+  const auto& input_array = model.GetArray(src_op.inputs[0]);
+  const auto& input_shape = input_array.shape();
+  CHECK_EQ(input_shape.dimensions_count(), 4);
+  int height_mod_dilation = input_shape.dims(1) % src_op.dilation_height_factor;
+  int pad_height;
+  if (height_mod_dilation) {
+    pad_height = src_op.dilation_height_factor - height_mod_dilation;
+  } else {
+    pad_height = 0;
+  }
+  int pad_width;
+  int width_mod_dilation = input_shape.dims(2) % src_op.dilation_width_factor;
+  if (width_mod_dilation) {
+    pad_width = src_op.dilation_width_factor - width_mod_dilation;
+  } else {
+    pad_width = 0;
+  }
+
+  // SpaceToBatchND op "collapses" the spatially separated elements together
+  string stb_output = src_op.outputs[0] + "/dilated_conv_SpaceToBatch";
+  auto* stb_op = tensorflow_graph->add_node();
+  stb_op->set_op("SpaceToBatchND");
+  stb_op->set_name(stb_output);
+  *stb_op->add_input() = src_op.inputs[0];
+  (*stb_op->mutable_attr())["T"].set_type(DT_FLOAT);
+  string block_shape = src_op.outputs[0] + "/dilated_conv_block_shape";
+  CreateIntTensorConst(
+      block_shape,
+      {src_op.dilation_height_factor, src_op.dilation_width_factor}, {2},
+      tensorflow_graph);
+  *stb_op->add_input() = block_shape;
+  (*stb_op->mutable_attr())["Tblock_shape"].set_type(DT_INT32);
+  string stb_paddings = src_op.outputs[0] + "/dilated_conv_paddings";
+  CreateIntTensorConst(stb_paddings, {0, pad_height, pad_width, 0}, {2, 2},
+                       tensorflow_graph);
+  *stb_op->add_input() = stb_paddings;
+  (*stb_op->mutable_attr())["Tpaddings"].set_type(DT_INT32);
+
+  // Perform a regular conv on the "collapsed" elements
+  ConvOperator conv_op;
+  string conv_output = src_op.outputs[0] + "/dilated_conv_Conv2D";
+  conv_op.inputs = src_op.inputs;
+  conv_op.inputs[0] = stb_output;
+  conv_op.outputs = {conv_output};
+  conv_op.padding.type = src_op.padding.type;
+  conv_op.stride_width = src_op.stride_width;
+  conv_op.stride_height = src_op.stride_height;
+  conv_op.dilation_width_factor = 1;
+  conv_op.dilation_height_factor = 1;
+  ConvertConvOperator(model, conv_op, tensorflow_graph);
+
+  // BatchToSpaceND op restores elements to their original layout
+  auto* bts_op = tensorflow_graph->add_node();
+  bts_op->set_op("BatchToSpaceND");
+  bts_op->set_name(src_op.outputs[0]);
+  *bts_op->add_input() = conv_output;
+  (*bts_op->mutable_attr())["T"].set_type(DT_FLOAT);
+  *bts_op->add_input() = block_shape;
+  (*bts_op->mutable_attr())["Tblock_shape"].set_type(DT_INT32);
+  *bts_op->add_input() = stb_paddings;
+  (*bts_op->mutable_attr())["Tcrops"].set_type(DT_INT32);
+}
+
 void ConvertDepthwiseConvOperator(const Model& model,
                                   const DepthwiseConvOperator& src_op,
                                   GraphDef* tensorflow_graph) {
@@ -520,7 +604,7 @@ void ConvertFullyConnectedOperator(const Model& model,
       AvailableArrayName(model, matmul_output + "/transpose_weights");
   const string transpose_perm =
       AvailableArrayName(model, transpose_output + "/perm");
-  CreateIntTensorConst(transpose_perm, {1, 0}, tensorflow_graph);
+  CreateIntTensorConst(transpose_perm, {1, 0}, {2}, tensorflow_graph);
   auto transpose_op = tensorflow_graph->add_node();
   transpose_op->set_op("Transpose");
   transpose_op->set_name(transpose_output);
@@ -720,7 +804,8 @@ void ConvertLogSoftmaxOperator(const Model& model,
                                GraphDef* tensorflow_graph) {
   string softmax_input;
   Operator* providing_op = GetOpWithOutput(model, src_op.inputs[0]);
-  if (providing_op->type == OperatorType::kTensorFlowReshape) {
+  if (providing_op != nullptr &&
+      providing_op->type == OperatorType::kTensorFlowReshape) {
     softmax_input = src_op.inputs[0];
   } else {
     // Insert a reshape operator that reduces the dimensions down to the 2 that
@@ -1600,8 +1685,13 @@ void ConvertOperator(const Model& model, const Operator& src_op,
   }
 
   if (src_op.type == OperatorType::kConv) {
-    ConvertConvOperator(model, static_cast<const ConvOperator&>(src_op),
-                        tensorflow_graph);
+    const ConvOperator& conv_op = static_cast<const ConvOperator&>(src_op);
+    if ((conv_op.dilation_width_factor != 1) ||
+        (conv_op.dilation_height_factor != 1)) {
+      return ConvertDilatedConvOperator(model, conv_op, tensorflow_graph);
+    } else {
+      ConvertConvOperator(model, conv_op, tensorflow_graph);
+    }
   } else if (src_op.type == OperatorType::kDepthwiseConv) {
     ConvertDepthwiseConvOperator(
         model, static_cast<const DepthwiseConvOperator&>(src_op),
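
Note on the padding computed in ConvertDilatedConvOperator: each spatial dimension is padded up to the next multiple of its dilation factor before SpaceToBatchND is applied. The sketch below isolates that arithmetic. The helper name PaddingToMultiple is hypothetical and is not part of this patch; it only mirrors the if/else logic used for pad_height and pad_width.

```cpp
#include <cassert>

// Hypothetical helper (not in the patch): returns how many elements must be
// appended to `size` so that the result is divisible by `factor`.
int PaddingToMultiple(int size, int factor) {
  assert(size >= 0 && factor >= 1);
  const int remainder = size % factor;
  // No padding is needed when the size is already a multiple of the factor.
  return remainder == 0 ? 0 : factor - remainder;
}
```

For example, an input height of 10 with a dilation factor of 4 would need 2 rows of padding, while a height of 8 would need none.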
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h b/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
index 616bdac268c41d29135368d685729c961f44132b..f0739990adc988a1b178a374aa542d728c6415b5 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
+++ b/tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h
@@ -128,6 +128,7 @@ DECLARE_GRAPH_TRANSFORMATION(IdentifyLstmCell)
 DECLARE_GRAPH_TRANSFORMATION(SplitLstmCellInputs)
 DECLARE_GRAPH_TRANSFORMATION(MergeLstmCellInputs)
 DECLARE_GRAPH_TRANSFORMATION(IdentifyRelu1)
+DECLARE_GRAPH_TRANSFORMATION(IdentifyDilatedConv)
 DECLARE_GRAPH_TRANSFORMATION(MakeInitialDequantizeOperator)
 DECLARE_GRAPH_TRANSFORMATION(PropagateArrayDataTypes)
 DECLARE_GRAPH_TRANSFORMATION(PropagateFixedSizes)
@@ -176,6 +177,7 @@ DECLARE_GRAPH_TRANSFORMATION(ResolveConstantStridedSlice)
 DECLARE_GRAPH_TRANSFORMATION(ResolveConstantFill)
 DECLARE_GRAPH_TRANSFORMATION(ResolveMultiplyByZero)
 DECLARE_GRAPH_TRANSFORMATION(Dequantize)
+DECLARE_GRAPH_TRANSFORMATION(UnpartitionEmbeddingLookup)
 
 class ResolveReshapeAttributes : public GraphTransformation {
  public:
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc b/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
index 1b0be858107b54f5a6ecd2a1cb87c9dbde1c06bb..938d76386d6f315abfe6fe55b133cb4d19014f01 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/hardcode_min_max.cc
@@ -125,6 +125,27 @@ bool HardcodeMinMaxForConcatenation(Model* model, Operator* op) {
   return changed;
 }
 
+bool HardcodeMinMaxForSplit(Model* model, Operator* op) {
+  for (const auto& output : op->outputs) {
+    if (model->GetArray(output).minmax) {
+      LOG(WARNING) << "Skipping min-max setting for " << LogName(*op)
+                   << " because output " << output << " already has min-max.";
+      return false;
+    }
+  }
+  // Data is in second input.
+  auto& input_array = model->GetArray(op->inputs[1]);
+  if (!input_array.minmax) {
+    return false;
+  } else {
+    for (const auto& output : op->outputs) {
+      auto& array = model->GetArray(output);
+      array.GetOrCreateMinMax() = *input_array.minmax;
+    }
+    return true;
+  }
+}
+
 // The output of average or max pooling is within the same range as its input.
 bool HardcodeMinMaxForAverageOrMaxPool(Model* model, Operator* op) {
   auto& output_array = model->GetArray(op->outputs[0]);
@@ -296,6 +317,10 @@ bool HardcodeMinMax::Run(Model* model, std::size_t op_index) {
       changed = HardcodeMinMaxForConcatenation(model, op);
       break;
 
+    case OperatorType::kTensorFlowSplit:
+      changed = HardcodeMinMaxForSplit(model, op);
+      break;
+
     case OperatorType::kAveragePool:
     case OperatorType::kMaxPool:
       changed = HardcodeMinMaxForAverageOrMaxPool(model, op);
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/identify_dilated_conv.cc b/tensorflow/contrib/lite/toco/graph_transformations/identify_dilated_conv.cc
new file mode 100644
index 0000000000000000000000000000000000000000..ae3301f467de5714230e731b4bab87ddc1637201
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/identify_dilated_conv.cc
@@ -0,0 +1,213 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <string>
+#include <vector>
+
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace toco {
+
+// A dilated convolution can be emulated with a regular convolution by chaining
+// SpaceToBatch and BatchToSpace ops before and after it:
+//
+//     SpaceToBatchND -> Conv2D -> BatchToSpaceND
+//
+// This method was common before Conv2D fully supported dilated convolution in
+// TensorFlow. This transformation detects this "emulation", and replaces it
+// with a true dilated convolution, eliminating the SpaceToBatch and
+// BatchToSpace ops.
+//
+// Detecting this alone would be relatively easy. However, in practice some
+// extra ops are used, so we detect the following patterns:
+//
+//
+//   SpaceToBatchND -> Expand -> Conv2D -> Squeeze -> BatchToSpaceND -> BiasAdd
+//
+//   SpaceToBatchND -> Expand -> Conv2D -> Squeeze -> Pad -> BatchToSpaceND ->
+//   BiasAdd
+//
+//   SpaceToBatchND -> Expand -> Conv2D -> Squeeze -> BiasAdd -> BatchToSpaceND
+//
+//   SpaceToBatchND -> Conv2D -> Pad -> BatchToSpaceND -> BiasAdd
+//
+//   SpaceToBatchND -> Conv2D -> BatchToSpaceND -> BiasAdd
+//
+//
+// The Expand/Squeeze combination is used to adapt a 3D array (such as in
+// WaveNet) to the 4D arrays that Conv2D requires. Padding and BiasAdd are
+// thrown in just for the extra headache. Padding adapts non-conforming input
+// sizes, and can be discarded. The bias is necessary, so is kept.
+
+bool IdentifyDilatedConv::Run(Model* model, std::size_t op_index) {
+  const auto it = model->operators.begin() + op_index;
+  auto* stb_op = it->get();
+
+  // 1. IDENTIFY OPERATORS
+  // ***************************************************************************
+  // SpaceToBatch Op.
+  if (stb_op->type != OperatorType::kSpaceToBatchND) {
+    return false;
+  }
+  if (stb_op->inputs.size() != 3) {
+    return false;
+  }
+  CHECK_EQ(stb_op->outputs.size(), 1);
+  // Extract the dilation factor from Input[1] of SpaceToBatch
+  // TODO(mjmatthews): Support 2D dilation factors.
+  const auto& block_shape_array = model->GetArray(stb_op->inputs[1]);
+  if (!block_shape_array.buffer) {
+    return false;
+  }
+  CHECK_EQ(block_shape_array.shape().dimensions_count(), 1);
+  int dilation_factor =
+      block_shape_array.Array::GetBuffer<ArrayDataType::kInt32>().data[0];
+
+  // Expand Op
+  auto* post_stb_op = GetOpWithInput(*model, stb_op->outputs[0]);
+  if (!post_stb_op) {
+    return false;
+  }
+  bool has_expand_op = false;
+  if (post_stb_op->type == OperatorType::kExpandDims) {
+    has_expand_op = true;
+    CHECK_EQ(post_stb_op->inputs.size(), 2);
+    CHECK_EQ(post_stb_op->outputs.size(), 1);
+  }
+
+  // Conv Op
+  ConvOperator* conv_op = dynamic_cast<ConvOperator*>(
+      has_expand_op ? GetOpWithInput(*model, post_stb_op->outputs[0])
+                    : GetOpWithInput(*model, stb_op->outputs[0]));
+  if (!conv_op || conv_op->type != OperatorType::kConv) {
+    return false;
+  }
+  if (conv_op->inputs.size() != 2) {
+    // The conv op must only have weights, no bias.
+    return false;
+  }
+  CHECK_EQ(conv_op->outputs.size(), 1);
+
+  // Squeeze Op
+  auto* post_conv_op = GetOpWithInput(*model, conv_op->outputs[0]);
+  if (!post_conv_op) {
+    return false;
+  }
+  if (has_expand_op) {
+    if (post_conv_op->type != OperatorType::kSqueeze) {
+      // If an expand op was used, the post-conv op must be a squeeze op
+      return false;
+    }
+    CHECK_EQ(post_conv_op->inputs.size(), 1);
+    CHECK_EQ(post_conv_op->outputs.size(), 1);
+  }
+
+  // Pad Op
+  const auto* pad_op = has_expand_op
+                           ? GetOpWithInput(*model, post_conv_op->outputs[0])
+                           : GetOpWithInput(*model, conv_op->outputs[0]);
+  bool has_pad_op = false;
+  if (pad_op && pad_op->type == OperatorType::kPad) {
+    has_pad_op = true;
+    CHECK_EQ(pad_op->inputs.size(), 2);
+    CHECK_EQ(pad_op->outputs.size(), 1);
+  }
+  // TODO(mjmatthews): Perform validity checking on padding dimensions.
+
+  // Pre-BatchToSpace Bias Op
+  auto* next_op = has_pad_op
+                      ? GetOpWithInput(*model, pad_op->outputs[0])
+                      : has_expand_op
+                            ? GetOpWithInput(*model, post_conv_op->outputs[0])
+                            : GetOpWithInput(*model, conv_op->outputs[0]);
+  if (!next_op) {
+    return false;
+  }
+  const bool has_bias_before_bts = next_op->type == OperatorType::kAdd;
+  auto final_op = GetOpWithInput(*model, next_op->outputs[0]);
+
+  // BatchToSpace Op
+  const auto* bts_op = has_bias_before_bts ? final_op : next_op;
+  if (!bts_op || bts_op->type != OperatorType::kBatchToSpaceND) {
+    return false;
+  }
+  CHECK_EQ(bts_op->inputs.size(), 3);
+  CHECK_EQ(bts_op->outputs.size(), 1);
+
+  // Post-BatchToSpace Bias Op
+  Operator* bias_add_op = !has_bias_before_bts ? final_op : next_op;
+  if (!bias_add_op || bias_add_op->type != OperatorType::kAdd) {
+    // Bias op is required before or after BatchToSpace
+    return false;
+  }
+  CHECK_EQ(bias_add_op->inputs.size(), 2);
+  CHECK_EQ(bias_add_op->outputs.size(), 1);
+
+  LOG(INFO) << "Identified sub-network emulating dilated convolution.";
+
+  // 2. RE-WIRE OPERATORS
+  // ***************************************************************************
+  // Re-use the existing Conv2D op.
+  conv_op->dilation_width_factor = dilation_factor;
+  conv_op->dilation_height_factor = dilation_factor;
+  conv_op->padding.type = PaddingType::kSame;
+
+  // Rewire the ops to bypass SpaceToBatch, BatchToSpace, and Pad.
+  bias_add_op->outputs[0] = final_op->outputs[0];
+  if (has_expand_op) {
+    bias_add_op->inputs[0] = post_conv_op->outputs[0];
+    post_conv_op->inputs[0] = conv_op->outputs[0];
+    conv_op->inputs[0] = post_stb_op->outputs[0];
+    post_stb_op->inputs[0] = stb_op->inputs[0];
+  } else {
+    bias_add_op->inputs[0] = conv_op->outputs[0];
+    conv_op->inputs[0] = stb_op->inputs[0];
+  }
+  // TODO(mjmatthews): Connect bias directly into the Conv2D?
+
+  // 3. DELETE LEFTOVER OPERATORS
+  // ***************************************************************************
+  // Order is important. Delete the output array first, then the op, then its
+  // redundant inputs.
+  // BatchToSpace Op
+  DeleteArrayIfUnused(bts_op->outputs[0], model);
+  std::vector<string> bts_op_inputs = bts_op->inputs;
+  model->operators.erase(FindOp(*model, bts_op));
+  DeleteArrayIfUnused(bts_op_inputs[1], model);
+  DeleteArrayIfUnused(bts_op_inputs[2], model);
+
+  // Pad Op if present
+  if (has_pad_op) {
+    DeleteArrayIfUnused(pad_op->outputs[0], model);
+    std::vector<string> pad_op_inputs = pad_op->inputs;
+    model->operators.erase(FindOp(*model, pad_op));
+    DeleteArrayIfUnused(pad_op_inputs[1], model);
+  }
+
+  // SpaceToBatch Op
+  DeleteArrayIfUnused(stb_op->outputs[0], model);
+  std::vector<string> stb_op_inputs = stb_op->inputs;
+  model->operators.erase(FindOp(*model, stb_op));
+  DeleteArrayIfUnused(stb_op_inputs[1], model);
+  DeleteArrayIfUnused(stb_op_inputs[2], model);
+
+  LOG(INFO) << "Replaced with Dilated Conv2D op outputting \""
+            << conv_op->outputs[0] << "\".";
+  return true;
+}
+
+}  // namespace toco
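
Note on why the SpaceToBatchND -> Conv2D -> BatchToSpaceND chain is equivalent to a dilated convolution: the following is a minimal 1-D sketch, assuming VALID padding, integer data, and an input length that is already a multiple of the dilation factor. The function names are illustrative only and do not exist in toco. Each "batch" holds every dilation-th input element, a regular convolution runs on each batch, and the results are interleaved back.

```cpp
#include <cassert>
#include <vector>

// Direct 1-D dilated convolution (cross-correlation form, VALID padding):
// y[j] = sum over t of w[t] * x[j + t * dilation].
std::vector<int> DilatedConv1D(const std::vector<int>& x,
                               const std::vector<int>& w, int dilation) {
  assert(dilation >= 1 && !w.empty());
  const int n = static_cast<int>(x.size());
  const int k = static_cast<int>(w.size());
  const int out_len = n - dilation * (k - 1);
  std::vector<int> y(out_len > 0 ? out_len : 0, 0);
  for (int j = 0; j < out_len; ++j) {
    for (int t = 0; t < k; ++t) {
      y[j] += w[t] * x[j + t * dilation];
    }
  }
  return y;
}

// Same result via the emulation: split into `dilation` interleaved
// sub-sequences, run a regular (dilation 1) convolution on each, and
// interleave the outputs back into their original positions.
std::vector<int> EmulatedDilatedConv1D(const std::vector<int>& x,
                                       const std::vector<int>& w,
                                       int dilation) {
  assert(dilation >= 1 && x.size() % dilation == 0);
  const int sub_len = static_cast<int>(x.size()) / dilation;
  const int sub_out = sub_len - (static_cast<int>(w.size()) - 1);
  std::vector<int> y(sub_out > 0 ? sub_out * dilation : 0, 0);
  for (int b = 0; b < dilation; ++b) {
    // "Space to batch": batch b holds x[b], x[b + dilation], ...
    std::vector<int> sub(sub_len);
    for (int i = 0; i < sub_len; ++i) {
      sub[i] = x[b + i * dilation];
    }
    const std::vector<int> sub_y = DilatedConv1D(sub, w, 1);
    // "Batch to space": scatter the batch outputs back.
    for (int i = 0; i < static_cast<int>(sub_y.size()); ++i) {
      y[i * dilation + b] = sub_y[i];
    }
  }
  return y;
}
```

The transformation in this file goes in the opposite direction: it recognizes the emulated form in a graph and collapses it into a single Conv op carrying the dilation factor.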
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc b/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
index f0d107232b4517115aa3f64b39b825dbaffb83ce..bde947f78d2eb890aab77839f93923ee9593815f 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/propagate_array_data_types.cc
@@ -97,10 +97,13 @@ bool PropagateArrayDataTypes::Run(Model* model, std::size_t op_index) {
     SetDataTypeForAllOutputs(model, op, data_type);
   } else if (op->type == OperatorType::kTensorFlowUnsupported) {
     auto* unsupported_op = static_cast<TensorFlowUnsupportedOperator*>(op);
-    if (unsupported_op->output_data_types.size() != op->outputs.size()) {
+    // Some output tensors from the op could be eliminated by optimization.
+    // This can make unsupported_op->output_data_types have more elements than
+    // op->outputs.
+    if (unsupported_op->output_data_types.size() < op->outputs.size()) {
       return false;
     }
-    for (int i = 0; i < unsupported_op->output_data_types.size(); ++i) {
+    for (int i = 0; i < op->outputs.size(); ++i) {
       auto output = op->outputs[i];
       auto data_type = unsupported_op->output_data_types[i];
       model->GetArray(output).data_type = data_type;
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc b/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
index 0cf0994b43bb048616bab1abe79db1aae2223d37..fc26f997a6920d5ef382817b64ac83ec9ee9bc44 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/propagate_fixed_sizes.cc
@@ -31,17 +31,22 @@ namespace {
 
 void ComputeConvSizes(const Shape& input_shape, int output_depth, int kwidth,
                       int kheight, int stride_width, int stride_height,
+                      int dilation_width_factor, int dilation_height_factor,
                       PaddingType padding_type, Shape* output_shape,
                       FixedPadding* fixed_padding) {
   const int input_width = input_shape.dims(2);
   const int input_height = input_shape.dims(1);
   const int batch = input_shape.dims(0);
 
+  int dilated_kwidth = dilation_width_factor * (kwidth - 1) + 1;
+  int dilated_kheight = dilation_height_factor * (kheight - 1) + 1;
+
   int output_height = 0;
   int output_width = 0;
   if (padding_type == PaddingType::kValid) {
-    output_height = (input_height + stride_height - kheight) / stride_height;
-    output_width = (input_width + stride_width - kwidth) / stride_width;
+    output_height =
+        (input_height + stride_height - dilated_kheight) / stride_height;
+    output_width = (input_width + stride_width - dilated_kwidth) / stride_width;
   } else if (padding_type == PaddingType::kSame) {
     output_height = (input_height + stride_height - 1) / stride_height;
     output_width = (input_width + stride_width - 1) / stride_width;
@@ -49,10 +54,12 @@ void ComputeConvSizes(const Shape& input_shape, int output_depth, int kwidth,
     LOG(FATAL) << "Only supporting SAME or VALID padding";
   }
 
-  fixed_padding->height = std::max(
-      0, ((output_height - 1) * stride_height + kheight - input_height) / 2);
+  fixed_padding->height = std::max(0, ((output_height - 1) * stride_height +
+                                       dilated_kheight - input_height) /
+                                          2);
   fixed_padding->width = std::max(
-      0, ((output_width - 1) * stride_width + kwidth - input_width) / 2);
+      0,
+      ((output_width - 1) * stride_width + dilated_kwidth - input_width) / 2);
 
   // Actually had to debug a situation where those were negative due to bad
   // propagation of placeholder -1 sizes in TensorFlowReshape.
@@ -166,7 +173,8 @@ void ProcessConvOperator(Model* model, ConvOperator* op) {
   const int kheight = weights_shape.dims(1);
   const int kwidth = weights_shape.dims(2);
   ComputeConvSizes(input_shape, output_depth, kwidth, kheight, op->stride_width,
-                   op->stride_height, op->padding.type,
+                   op->stride_height, op->dilation_width_factor,
+                   op->dilation_height_factor, op->padding.type,
                    output_array.mutable_shape(),
                    &op->padding.GetOrCreateFixedPadding());
   CHECK_EQ(output_array.shape().dimensions_count(), 4);
@@ -222,7 +230,7 @@ void ProcessDepthwiseConvOperator(Model* model, DepthwiseConvOperator* op) {
   const int kheight = weights_shape.dims(1);
   const int kwidth = weights_shape.dims(2);
   ComputeConvSizes(input_shape, output_depth, kwidth, kheight, op->stride_width,
-                   op->stride_height, op->padding.type,
+                   op->stride_height, 1, 1, op->padding.type,
                    model->GetArray(output_name).mutable_shape(),
                    &op->padding.GetOrCreateFixedPadding());
 }
@@ -697,7 +705,7 @@ void ProcessAveragePoolOperator(Model* model, AveragePoolOperator* op) {
   const string& output_name = op->outputs[0];
   const int output_depth = input_shape.dims(3);
   ComputeConvSizes(input_shape, output_depth, op->kwidth, op->kheight,
-                   op->stride_width, op->stride_height, op->padding.type,
+                   op->stride_width, op->stride_height, 1, 1, op->padding.type,
                    model->GetArray(output_name).mutable_shape(),
                    &op->padding.GetOrCreateFixedPadding());
 }
@@ -714,7 +722,7 @@ void ProcessMaxPoolOperator(Model* model, MaxPoolOperator* op) {
   const string& output_name = op->outputs[0];
   const int output_depth = input_shape.dims(3);
   ComputeConvSizes(input_shape, output_depth, op->kwidth, op->kheight,
-                   op->stride_width, op->stride_height, op->padding.type,
+                   op->stride_width, op->stride_height, 1, 1, op->padding.type,
                    model->GetArray(output_name).mutable_shape(),
                    &op->padding.GetOrCreateFixedPadding());
 }
@@ -733,7 +741,7 @@ void ProcessL2PoolOperator(Model* model, L2PoolOperator* op) {
   const string& output_name = op->outputs[0];
   const int output_depth = input_shape.dims(3);
   ComputeConvSizes(input_shape, output_depth, op->kwidth, op->kheight,
-                   op->stride_width, op->stride_height, op->padding.type,
+                   op->stride_width, op->stride_height, 1, 1, op->padding.type,
                    model->GetArray(output_name).mutable_shape(),
                    &op->padding.GetOrCreateFixedPadding());
 }
@@ -1534,6 +1542,12 @@ bool PropagateFixedSizes::Run(Model* model, std::size_t op_index) {
     case OperatorType::kTranspose:
       ProcessTransposeOperator(model, static_cast<TransposeOperator*>(op));
       break;
+    case OperatorType::kDynamicPartition:
+    case OperatorType::kDynamicStitch:
+      // DynamicPartition/DynamicStitch are currently only supported for
+      // transforms that remove them, so we avoid propagating shapes through
+      // them and let things settle once they've been removed.
+      break;
     default:
       // Unimplemented, another graph transformation should drop it.
       LOG(FATAL) << "Unhandled operator type " << OperatorTypeName(op->type);
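
Note on the ComputeConvSizes change: dilation enlarges the effective footprint of a kernel to dilation * (k - 1) + 1, and that effective size replaces the raw kernel size in the VALID output size and fixed padding formulas. The sketch below restates the arithmetic from the patch as standalone functions. The function names are illustrative assumptions; the formulas are copied from the hunks above.

```cpp
#include <algorithm>
#include <cassert>

// Effective kernel extent once gaps of (dilation_factor - 1) are inserted
// between kernel taps.
int DilatedKernelSize(int kernel_size, int dilation_factor) {
  assert(kernel_size >= 1 && dilation_factor >= 1);
  return dilation_factor * (kernel_size - 1) + 1;
}

// Output size for VALID padding, using the dilated kernel extent.
int ValidOutputSize(int input_size, int stride, int kernel_size,
                    int dilation_factor) {
  const int dilated = DilatedKernelSize(kernel_size, dilation_factor);
  return (input_size + stride - dilated) / stride;
}

// Output size for SAME padding; independent of kernel size and dilation.
int SameOutputSize(int input_size, int stride) {
  return (input_size + stride - 1) / stride;
}

// Padding applied before the input along one dimension, clamped at zero.
int FixedPaddingSize(int input_size, int output_size, int stride,
                     int kernel_size, int dilation_factor) {
  const int dilated = DilatedKernelSize(kernel_size, dilation_factor);
  return std::max(0, ((output_size - 1) * stride + dilated - input_size) / 2);
}
```

Passing a dilation factor of 1, as the depthwise conv and pooling call sites do, reduces every formula to its previous undilated form.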
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc b/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
index d7f804ee432598cafe6b6c05d03219aa7d2783fa..77316751bc2642a0c974d16f694aeebe1cd53a9f 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/quantize.cc
@@ -100,7 +100,13 @@ void QuantizeArray(GraphTransformation* transformation, Model* model,
 void QuantizeArray(GraphTransformation* transformation, Model* model,
                    const string& name, ArrayDataType quantized_data_type,
                    const QuantizationParams& quantization_params) {
-  switch (quantized_data_type) {
+  ArrayDataType adjusted_data_type = quantized_data_type;
+  auto& array = model->GetArray(name);
+  if (array.final_data_type == ArrayDataType::kInt16) {
+    adjusted_data_type = array.final_data_type;
+  }
+
+  switch (adjusted_data_type) {
     case ArrayDataType::kUint8:
       return QuantizeArray<ArrayDataType::kUint8>(transformation, model, name,
                                                   quantization_params);
@@ -166,6 +172,60 @@ const MinMax& GetOrComputeMinMax(Model* model, const string& array_name) {
                 "proceed with quantization.";
 }
 
+struct QuantizationPoints {
+  int64 min_value;
+  int64 max_value;
+  int64 central_value;
+};
+
+template <ArrayDataType A>
+QuantizationPoints GetQuantizationPoints() {
+  QuantizationPoints qp;
+  using Integer = DataType<A>;
+  qp.min_value = std::numeric_limits<Integer>::min();
+  qp.max_value = std::numeric_limits<Integer>::max();
+  // e.g. [-128, 127]...
+  qp.central_value = (qp.min_value / 2 +        // -128 -> -64.
+                      (qp.max_value - 1) / 2 +  // 127 -> 63.
+                      1);
+  return qp;
+}
+
+QuantizationPoints GetQuantizationPoints(ArrayDataType data_type) {
+  switch (data_type) {
+    case ArrayDataType::kUint8:
+      return GetQuantizationPoints<ArrayDataType::kUint8>();
+    case ArrayDataType::kInt16:
+      return GetQuantizationPoints<ArrayDataType::kInt16>();
+    case ArrayDataType::kInt32:
+      return GetQuantizationPoints<ArrayDataType::kInt32>();
+    default:
+      LOG(FATAL) << "Unhandled case.";
+  }
+}
+
+ArrayDataType GetQuantizedDataType(const Array& array,
+                                   ArrayDataType default_type) {
+  switch (array.final_data_type) {
+    case ArrayDataType::kInt8:
+    case ArrayDataType::kUint8:
+    case ArrayDataType::kInt16:
+    case ArrayDataType::kUint16:
+    case ArrayDataType::kInt32:
+    case ArrayDataType::kUint32:
+    case ArrayDataType::kInt64:
+    case ArrayDataType::kUint64:
+      return array.final_data_type;
+    case ArrayDataType::kFloat:
+    case ArrayDataType::kNone:
+      return default_type;
+    default:
+      LOG(FATAL) << "Unhandled final quantization type "
+                 << static_cast<int>(array.final_data_type);
+      return default_type;
+  }
+}
+
 bool ChooseQuantizationForOperatorInput(
     GraphTransformation* transformation, Model* model, const Operator& op,
     std::size_t input_index, ArrayDataType* quantized_data_type,
@@ -212,7 +272,7 @@ bool ChooseQuantizationForOperatorInput(
     const auto input_weights_scale = input_weights.quantization_params->scale;
     quantization_params->scale = input_activations_scale * input_weights_scale;
     quantization_params->zero_point = 0;
-    *quantized_data_type = ArrayDataType::kInt32;
+    *quantized_data_type = GetQuantizedDataType(array, ArrayDataType::kInt32);
     transformation->AddMessageF(
         "Input array %s is a bias vector. Choosing quantization params "
         "accordingly.",
@@ -233,14 +293,14 @@ bool ChooseQuantizationForOperatorInput(
 
   GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(model->flags, minmax,
                                                          quantization_params);
+  *quantized_data_type = GetQuantizedDataType(array, ArrayDataType::kUint8);
   transformation->AddMessageF(
       "For input array %s with min=%g"
       ", max=%g"
-      ", chose to quantize as uint8 with zero_point=%d"
+      ", chose to quantize as %s with zero_point=%d"
       ", scale=%g",
-      input, minmax.min, minmax.max, quantization_params->zero_point,
-      quantization_params->scale);
-  *quantized_data_type = ArrayDataType::kUint8;
+      input, minmax.min, minmax.max, ArrayDataTypeName(*quantized_data_type),
+      quantization_params->zero_point, quantization_params->scale);
   return true;
 }
 
@@ -262,16 +322,18 @@ bool IsExactlyRepresentable(double real_value, ArrayDataType data_type,
   return true;
 }
 
+// Quantized data type is preset to the type of the input before this function.
 bool ChooseHardcodedQuantizationForOperatorOutput(
-    const Operator& op, ArrayDataType* quantized_data_type,
+    const Operator& op, const Array& array, ArrayDataType* quantized_data_type,
     QuantizationParams* quantization_params) {
   if (op.type == OperatorType::kL2Normalization) {
     // L2Normalization has range: [-1, 1].
     // 0 should be exactly representable, as values will typically be centered
     // around 0, with many values near 0.
-    *quantized_data_type = ArrayDataType::kUint8;
-    quantization_params->zero_point = 128;
-    quantization_params->scale = 1. / 128.;
+    *quantized_data_type = GetQuantizedDataType(array, *quantized_data_type);
+    const QuantizationPoints qp = GetQuantizationPoints(*quantized_data_type);
+    quantization_params->zero_point = qp.central_value;
+    quantization_params->scale = 1. / (qp.central_value - qp.min_value);
     CHECK(
         IsExactlyRepresentable(0., *quantized_data_type, *quantization_params));
     return true;
@@ -284,18 +346,20 @@ bool ChooseHardcodedQuantizationForOperatorOutput(
     // will typically exploit the symmetry logistic(-x) = 1 - logistic(x), and
     // the glueing of the two halves of the graph will only be seamless if we
     // are accurately representing logistic(0) == 0.5.
-    *quantized_data_type = ArrayDataType::kUint8;
+    *quantized_data_type = GetQuantizedDataType(array, *quantized_data_type);
+    const QuantizationPoints qp = GetQuantizationPoints(*quantized_data_type);
     quantization_params->zero_point = 0;
-    quantization_params->scale = 1. / 256.;
+    quantization_params->scale = 1. / (qp.max_value + 1);
     CHECK(IsExactlyRepresentable(0.5, *quantized_data_type,
                                  *quantization_params));
     return true;
   }
   if (op.type == OperatorType::kTanh) {
     // Tanh has the range: [-1, 1].
-    *quantized_data_type = ArrayDataType::kUint8;
-    quantization_params->zero_point = 128;
-    quantization_params->scale = 1. / 128.;
+    *quantized_data_type = GetQuantizedDataType(array, *quantized_data_type);
+    const QuantizationPoints qp = GetQuantizationPoints(*quantized_data_type);
+    quantization_params->zero_point = qp.central_value;
+    quantization_params->scale = 1. / (qp.central_value - qp.min_value);
     // 0 should be exactly representable, as values will typically be centered
     // around 0, with many values near 0.
     CHECK(
@@ -314,8 +378,9 @@ bool ChooseQuantizationForOperatorOutput(
   if (array.data_type != ArrayDataType::kFloat) {
     return false;
   }
-  if (ChooseHardcodedQuantizationForOperatorOutput(op, quantized_data_type,
-                                                   quantization_params)) {
+  *quantized_data_type = model->GetArray(op.inputs[0]).data_type;
+  if (ChooseHardcodedQuantizationForOperatorOutput(
+          op, array, quantized_data_type, quantization_params)) {
     transformation->AddMessageF(
         "Output array %s is produced by a %s operator. Choosing fixed "
         "quantization params accordingly.",
@@ -323,12 +388,21 @@ bool ChooseQuantizationForOperatorOutput(
     return true;
   }
   if ((op.type == OperatorType::kDepthToSpace) ||
-      (op.type == OperatorType::kSpaceToDepth)) {
-    // DepthToSpace and SpaceToDepth should preserve the quantization parameters
-    // of the input array, as these are simple reshape operations.
-    const auto& input_quantization_params =
-        model->GetArray(op.inputs[0]).GetQuantizationParams();
-    *quantized_data_type = ArrayDataType::kUint8;
+      (op.type == OperatorType::kSpaceToDepth) ||
+      (op.type == OperatorType::kTensorFlowReshape) ||
+      (op.type == OperatorType::kTensorFlowSplit) ||
+      (op.type == OperatorType::kConcatenation)) {
+    int data_input_index = 0;
+    if (op.type == OperatorType::kTensorFlowSplit) {
+      data_input_index = 1;
+    }
+    // Copying and rearrangement ops should preserve the quantization parameters
+    // of the input array.
+    const auto& input_array = model->GetArray(op.inputs[data_input_index]);
+    const auto& input_quantization_params = input_array.GetQuantizationParams();
+    *quantized_data_type =
+        GetQuantizedDataType(input_array, ArrayDataType::kUint8);
+    *quantized_data_type = GetQuantizedDataType(array, *quantized_data_type);
     quantization_params->zero_point = input_quantization_params.zero_point;
     quantization_params->scale = input_quantization_params.scale;
 
@@ -350,13 +424,13 @@ bool ChooseQuantizationForOperatorOutput(
   }
   GetQuantizationParamsFromMinMax<ArrayDataType::kUint8>(model->flags, minmax,
                                                          quantization_params);
-  *quantized_data_type = ArrayDataType::kUint8;
+  *quantized_data_type = GetQuantizedDataType(array, ArrayDataType::kUint8);
   transformation->AddMessageF(
       "For output array %s with min=%g, max=%g"
-      ", chose to quantize as uint8 with zero_point=%d"
+      ", chose to quantize as %s with zero_point=%d"
       ", scale=%g",
-      output, minmax.min, minmax.max, quantization_params->zero_point,
-      quantization_params->scale);
+      output, minmax.min, minmax.max, ArrayDataTypeName(*quantized_data_type),
+      quantization_params->zero_point, quantization_params->scale);
 
   return true;
 }
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_passthrough.cc b/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_passthrough.cc
index 587f171bbf823408a45083c36d52f1d38c300123..aa93ace03af300f9cbd3f9c6620a6a58b9329aa4 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_passthrough.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/remove_trivial_passthrough.cc
@@ -60,7 +60,9 @@ bool RemoveTrivialPassthroughOp(GraphTransformation* transformation,
   for (int i = 0; i < passthru_op->inputs.size(); i++) {
     if (!model->GetArray(passthru_op->inputs[i]).buffer) {
       count_nonconstant_input_arrays++;
-      main_input_array_index = i;
+      if (count_nonconstant_input_arrays == 1) {
+        main_input_array_index = i;
+      }
     }
   }
 
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc
index f227554bc505efe6a758fdd9894fee43f2500641..d96b3d522d3d3475496cc8a2ad3c1752daa0c842 100644
--- a/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc
+++ b/tensorflow/contrib/lite/toco/graph_transformations/resolve_constant_unary.cc
@@ -138,12 +138,32 @@ bool ResolveConstantUnaryOperator::Run(Model* model, std::size_t op_index) {
     memcpy(output_float_data.data(), (*input_float_data).data(),
            output_buffer_size * sizeof(output_float_data[0]));
   } else if (unary_op->type == OperatorType::kTensorFlowSum) {
-    // At the moment only full reduction across all dimensions is supported.
-    float sum = 0.f;
-    for (int i = 0; i < input_buffer_size; i++) {
-      sum += (*input_float_data)[i];
+    CHECK_EQ(unary_op->inputs.size(), 2) << "Sum needs 2 inputs";
+    if (!IsConstantParameterArray(*model, unary_op->inputs[1])) {
+      AddMessageF("Axis input is non-constant");
+      return false;
     }
-    for (int i = 0; i < output_buffer_size; ++i) {
+    auto& axis_array = model->GetArray(unary_op->inputs[1]);
+    CHECK(axis_array.data_type == ArrayDataType::kInt32);
+    int axis = axis_array.GetBuffer<ArrayDataType::kInt32>().data[0];
+    CHECK_LT(axis, input_shape.dimensions_count()) << "Axis out of bounds";
+
+    // We currently only handle reduction on axis 0.
+    CHECK_EQ(axis, 0) << "Only reduction along axis 0 is supported";
+    // We currently only handle 1-D and 2-D input tensors.
+    CHECK_LE(input_shape.dimensions_count(), 2) << "Rank >2 not yet supported";
+    // We only support keep_dims=true; shape prop will need to change otherwise.
+    auto sum_op = static_cast<const TensorFlowSumOperator*>(unary_op);
+    CHECK(sum_op->keep_dims) << "Only keep_dims=true is supported";
+
+    std::vector<int> indices(input_shape.dimensions_count());
+    for (int i = 0; i < input_shape.dims(1); ++i) {
+      indices[1] = i;
+      float sum = 0.f;
+      for (int j = 0; j < input_shape.dims(0); ++j) {
+        indices[0] = j;
+        sum += (*input_float_data)[Offset(input_shape, indices)];
+      }
       output_float_data[i] = sum;
     }
   } else if (unary_op->type == OperatorType::kTensorFlowMin) {
diff --git a/tensorflow/contrib/lite/toco/graph_transformations/unpartition_embedding_lookup.cc b/tensorflow/contrib/lite/toco/graph_transformations/unpartition_embedding_lookup.cc
new file mode 100644
index 0000000000000000000000000000000000000000..419fb9a79928985860ec0378fd3f33045fab0ff7
--- /dev/null
+++ b/tensorflow/contrib/lite/toco/graph_transformations/unpartition_embedding_lookup.cc
@@ -0,0 +1,237 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.h"
+#include "tensorflow/contrib/lite/toco/model.h"
+#include "tensorflow/contrib/lite/toco/tooling_util.h"
+
+namespace toco {
+
+bool UnpartitionEmbeddingLookup::Run(Model* model, std::size_t op_index) {
+  // Collapses a partitioned tf.nn.embedding_lookup back into a single Gather.
+  // https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup
+  // This transform attempts to identify the len(params) > 1 case and collapse
+  // it to the len(params) = 1 case by concatenating the original params and
+  // reversing the partitioning.
+  //
+  // If len(params) to the tf.nn.embedding_lookup == 1, the whole op becomes
+  // simply a gather:
+  // https://github.com/tensorflow/tensorflow/blob/r1.5/tensorflow/python/ops/embedding_ops.py#L150
+  //
+  // Notes on this implementation:
+  // - only supports partition_strategy='mod'
+  //
+  // A rough graph of a partitioned embedding_lookup looks like:
+  //   (ids)--+-->FloorDiv--+-->DynamicPartition-->[[Gather]]--\
+  //          \-->FloorMod--/                                  |
+  //                 V                                         |
+  //   Range-->DynamicPartition-------->DynamicStitch<---------/
+  //  (const)                                V
+  //                                     (embeddings)
+
+  // First look for the final DynamicStitch.
+  auto op_it = model->operators.begin() + op_index;
+  if (op_it->get()->type != OperatorType::kDynamicStitch) {
+    return false;
+  }
+  auto* stitch_op = static_cast<DynamicStitchOperator*>(op_it->get());
+
+  // Split up the DynamicStitch inputs into the indices and data.
+  std::vector<string> stitch_indices_inputs;
+  std::vector<string> stitch_data_inputs;
+  for (size_t i = 0; i < stitch_op->num_partitions; ++i) {
+    stitch_indices_inputs.push_back(stitch_op->inputs[i]);
+  }
+  for (size_t i = stitch_op->num_partitions; i < stitch_op->num_partitions * 2;
+       ++i) {
+    stitch_data_inputs.push_back(stitch_op->inputs[i]);
+  }
+
+  // Validate all indices come from the same DynamicPartition.
+  DynamicPartitionOperator* indices_partition_op = nullptr;
+  for (const string& indices_partition_output_name : stitch_indices_inputs) {
+    auto* op = GetOpWithOutput(*model, indices_partition_output_name);
+    CHECK(op) << "Source of " << indices_partition_output_name << " not found";
+    if (op->type != OperatorType::kDynamicPartition) {
+      AddMessageF(
+          "Skipping because indices input %s into "
+          "%s is unexpected",
+          LogName(*op), LogName(*stitch_op));
+      return false;
+    }
+    if (!indices_partition_op) {
+      indices_partition_op = static_cast<DynamicPartitionOperator*>(op);
+    } else {
+      // Ensure this is the same op as previous ones.
+      if (op != indices_partition_op) {
+        AddMessageF(
+            "Skipping because indices input %s into "
+            "%s is from a different source op than others",
+            LogName(*op), LogName(*stitch_op));
+        return false;
+      }
+    }
+  }
+  CHECK(indices_partition_op) << "No indices inputs";
+
+  // The data for the indices must be a constant range of the array shape.
+  if (!IsConstantParameterArray(*model, indices_partition_op->inputs[0])) {
+    AddMessageF("Skipping because indices partition data is non-constant");
+    return false;
+  }
+  auto& indices_data_array = model->GetArray(indices_partition_op->inputs[0]);
+  if (indices_data_array.data_type == ArrayDataType::kNone) {
+    // Yield until data types are propagated.
+    return false;
+  }
+  CHECK(indices_data_array.data_type == ArrayDataType::kInt32)
+      << "Indices partition inputs must be int32";
+  const auto& indices_data_buffer =
+      indices_data_array.GetBuffer<ArrayDataType::kInt32>().data;
+  for (size_t i = 0; i < indices_data_buffer.size(); ++i) {
+    CHECK_EQ(indices_data_buffer[i], i) << "Indices range must be identity";
+  }
+
+  // Find all of the gathers used for the data inputs.
+  std::vector<GatherOperator*> gather_ops;
+  for (const string& gather_output_name : stitch_data_inputs) {
+    auto* op = GetOpWithOutput(*model, gather_output_name);
+    CHECK(op) << "Source of " << gather_output_name << " not found";
+    if (op->type != OperatorType::kGather) {
+      AddMessageF(
+          "Skipping because data input %s into %s "
+          "is unexpected",
+          LogName(*op), LogName(*stitch_op));
+      return false;
+    }
+    gather_ops.push_back(static_cast<GatherOperator*>(op));
+  }
+
+  // Validate all gathers come from the same DynamicPartition.
+  DynamicPartitionOperator* data_partition_op = nullptr;
+  for (auto* gather_op : gather_ops) {
+    auto* op = GetOpWithOutput(*model, gather_op->inputs[1]);
+    CHECK(op) << "Source of " << gather_op->inputs[1] << " not found";
+    if (op->type != OperatorType::kDynamicPartition) {
+      AddMessageF(
+          "Skipping because data input %s into "
+          "%s is unexpected",
+          LogName(*op), LogName(*gather_op));
+      return false;
+    }
+    if (!data_partition_op) {
+      data_partition_op = static_cast<DynamicPartitionOperator*>(op);
+    } else {
+      // Ensure this is the same op as previous ones.
+      if (op != data_partition_op) {
+        AddMessageF(
+            "Skipping because data input %s into "
+            "%s is from a different source op than others",
+            LogName(*op), LogName(*gather_op));
+        return false;
+      }
+    }
+  }
+  CHECK(data_partition_op) << "No data inputs";
+
+  // Validate the partition ops have the same sizes.
+  CHECK_EQ(indices_partition_op->num_partitions,
+           data_partition_op->num_partitions)
+      << "Indices and data partition ops have differing dimensions";
+  int num_partitions = indices_partition_op->num_partitions;
+
+  // Partition strategy of 'mod' gives us a FloorMod and FloorDiv.
+  // The gather partition uses the FloorDiv as the data and FloorMod as the
+  // partitions, and the indices use the FloorMod as their partitions.
+  Operator* div_op = GetOpWithOutput(*model, data_partition_op->inputs[0]);
+  Operator* mod_op = GetOpWithOutput(*model, data_partition_op->inputs[1]);
+  CHECK(div_op && div_op->type == OperatorType::kFloorDiv)
+      << "Unsupported partition strategy";
+  CHECK(mod_op && mod_op->type == OperatorType::kFloorMod)
+      << "Unsupported partition strategy";
+  CHECK_EQ(mod_op, GetOpWithOutput(*model, indices_partition_op->inputs[1]))
+      << "Indices and data partition ops require the same partition strategy "
+         "and inputs";
+
+  // Glob together all of the gather data. This is not yet in the correct order.
+  auto* gather_params_concat_op = new ConcatenationOperator;
+  for (const auto& gather_op : gather_ops) {
+    gather_params_concat_op->inputs.push_back(gather_op->inputs[0]);
+  }
+  gather_params_concat_op->outputs.push_back(
+      AvailableArrayName(*model, gather_ops[0]->inputs[0] + "_unpartitioned"));
+  op_it = model->operators.emplace(op_it, gather_params_concat_op) + 1;
+  model->GetOrCreateArray(gather_params_concat_op->outputs[0]);
+
+  // Permute the gather params to undo the partitioning that was originally
+  // done.
+  auto* gather_params_permute_op = new GatherOperator;
+  gather_params_permute_op->inputs.push_back(
+      gather_params_concat_op->outputs[0]);
+  gather_params_permute_op->inputs.push_back(
+      AvailableArrayName(*model, gather_ops[0]->inputs[0] + "_permuted/perm"));
+  gather_params_permute_op->outputs.push_back(
+      AvailableArrayName(*model, gather_ops[0]->inputs[0] + "_permuted"));
+  op_it = model->operators.emplace(op_it, gather_params_permute_op) + 1;
+  model->GetOrCreateArray(gather_params_permute_op->outputs[0]);
+  const auto& partition_array = model->GetArray(gather_ops[0]->inputs[0]);
+  const auto& partition_array_dims = partition_array.shape().dims();
+  auto& perm_array =
+      model->GetOrCreateArray(gather_params_permute_op->inputs[1]);
+  perm_array.data_type = ArrayDataType::kInt32;
+  perm_array.mutable_shape()->ReplaceDims(
+      {num_partitions * partition_array_dims[0]});
+  auto& perm_data = perm_array.GetMutableBuffer<ArrayDataType::kInt32>().data;
+  perm_data.resize(RequiredBufferSizeForShape(perm_array.shape()));
+  // NOTE: this is what relies on the partition_strategy.
+  for (int i = 0; i < num_partitions * partition_array_dims[0]; ++i) {
+    int p = i % num_partitions;
+    perm_data[i] = p * partition_array_dims[0] + i / num_partitions;
+  }
+
+  // Insert the new unpartitioned gather op.
+  auto* merged_gather_op = new GatherOperator;
+  merged_gather_op->inputs = {gather_params_permute_op->outputs[0],
+                              mod_op->inputs[0]};
+  merged_gather_op->outputs = {stitch_op->outputs[0]};
+  model->operators.emplace(op_it, merged_gather_op);
+
+  AddMessageF(
+      "Replacing suspected partitioned tf.nn.embedding_lookup (starting at %s "
+      "+ %s and ending at %s) with a single unpartitioned gather %s",
+      LogName(*div_op), LogName(*mod_op), LogName(*stitch_op),
+      LogName(*merged_gather_op));
+
+  // Ensure the stitch output array is dead, as we don't want whatever was in it
+  // previously now that we've redefined it. It'll be recreated when needed.
+  model->EraseArray(stitch_op->outputs[0]);
+  model->GetOrCreateArray(merged_gather_op->outputs[0]);
+
+  // Erase all the original ops.
+  DeleteOpAndArraysIfUnused(model, div_op);
+  DeleteOpAndArraysIfUnused(model, mod_op);
+  for (auto* gather_op : gather_ops) {
+    DeleteOpAndArraysIfUnused(model, gather_op);
+  }
+  DeleteOpAndArraysIfUnused(model, indices_partition_op);
+  DeleteOpAndArraysIfUnused(model, data_partition_op);
+  DeleteOpAndArraysIfUnused(model, stitch_op);
+  return true;
+}
+
+}  // namespace toco
diff --git a/tensorflow/contrib/lite/toco/import_tensorflow.cc b/tensorflow/contrib/lite/toco/import_tensorflow.cc
index 9c01b67420603a0d3c0e095dafe6a3359f2514b5..41abca864d49cd25e83d740ac21e926e59695836 100644
--- a/tensorflow/contrib/lite/toco/import_tensorflow.cc
+++ b/tensorflow/contrib/lite/toco/import_tensorflow.cc
@@ -272,6 +272,39 @@ void ImportInt64Array(const TensorProto& input_tensor, Array* output_array) {
   }
 }
 
+void ImportBoolArray(const TensorProto& input_tensor, Array* output_array) {
+  CHECK_EQ(input_tensor.dtype(), DT_BOOL);
+  const auto& input_shape = input_tensor.tensor_shape();
+  CHECK_LE(input_shape.dim_size(), 4);
+  ImportShape(input_shape.dim(), output_array->mutable_shape());
+  int input_flat_size = 1;
+  for (int k = 0; k < input_shape.dim_size(); k++) {
+    input_flat_size *= input_shape.dim(k).size();
+  }
+  auto& output_bool_data =
+      output_array->GetMutableBuffer<ArrayDataType::kBool>().data;
+  output_bool_data.resize(RequiredBufferSizeForShape(output_array->shape()),
+                          false);
+  if (input_tensor.bool_val_size()) {
+    for (int i = 0; i < input_tensor.bool_val_size(); i++) {
+      output_bool_data[i] = input_tensor.bool_val(i);
+    }
+  } else if (input_tensor.tensor_content().size() == input_flat_size) {
+    std::vector<char> buf(input_tensor.tensor_content().size());
+    toco::port::CopyToBuffer(input_tensor.tensor_content(), buf.data());
+    for (int i = 0; i < input_tensor.tensor_content().size(); i++) {
+      output_bool_data[i] = static_cast<bool>(buf[i]);
+    }
+  } else {
+    // Some graphs have bool const nodes without an actual value; we assume
+    // that 'false' is implied. So far this has only been encountered in
+    // arrays with 1 entry, so require that until we encounter a graph where
+    // that's not the case.
+    CHECK_EQ(output_bool_data.size(), 1);
+    output_bool_data[0] = false;
+  }
+}
+
 void ImportStringArray(const TensorProto& input_tensor, Array* output_array) {
   CHECK_EQ(input_tensor.dtype(), DT_STRING);
   const auto& input_shape = input_tensor.tensor_shape();
@@ -347,6 +380,10 @@ void ConvertConstOperator(const NodeDef& node,
       array.data_type = ArrayDataType::kString;
       ImportStringArray(tensor, &array);
       break;
+    case DT_BOOL:
+      array.data_type = ArrayDataType::kBool;
+      ImportBoolArray(tensor, &array);
+      break;
     default:
       array.data_type = ArrayDataType::kNone;
       // do nothing, silently ignore the Const data.
@@ -365,7 +402,7 @@ void ConvertConvOperator(const NodeDef& node,
 
   // We only support NHWC, which is the default data_format.
   // So if data_format is not defined, we're all good.
-  if (node.attr().count("data_format")) {
+  if (HasAttr(node, "data_format")) {
     CHECK_EQ(GetStringAttr(node, "data_format"), "NHWC");
   }
   CHECK_EQ(GetDataTypeAttr(node, "T"), DT_FLOAT);
@@ -399,6 +436,17 @@ void ConvertConvOperator(const NodeDef& node,
   CHECK_EQ(strides.i(3), 1);
   conv->stride_height = strides.i(1);
   conv->stride_width = strides.i(2);
+  if (HasAttr(node, "dilations")) {
+    const auto& dilations = GetListAttr(node, "dilations");
+    CHECK_EQ(dilations.i_size(), 4);
+    CHECK_EQ(dilations.i(0), 1);
+    CHECK_EQ(dilations.i(3), 1);
+    conv->dilation_height_factor = dilations.i(1);
+    conv->dilation_width_factor = dilations.i(2);
+  } else {
+    conv->dilation_height_factor = 1;
+    conv->dilation_width_factor = 1;
+  }
   const auto& padding = GetStringAttr(node, "padding");
   if (padding == "SAME") {
     conv->padding.type = PaddingType::kSame;
@@ -418,7 +466,7 @@ void ConvertDepthwiseConvOperator(const NodeDef& node,
 
   // We only support NHWC, which is the default data_format.
   // So if data_format is not defined, we're all good.
-  if (node.attr().count("data_format")) {
+  if (HasAttr(node, "data_format")) {
     CHECK_EQ(GetStringAttr(node, "data_format"), "NHWC");
   }
   CHECK_EQ(GetDataTypeAttr(node, "T"), DT_FLOAT);
@@ -1848,6 +1896,42 @@ void ConvertTopKV2Operator(const NodeDef& node,
   op->outputs.push_back(node.name() + ":1");
   model->operators.emplace_back(op.release());
 }
+
+void ConvertDynamicPartitionOperator(
+    const NodeDef& node, const TensorFlowImportFlags& tf_import_flags,
+    Model* model) {
+  auto op = absl::make_unique<DynamicPartitionOperator>();
+  CHECK(HasAttr(node, "num_partitions"));
+  op->num_partitions = GetIntAttr(node, "num_partitions");
+  CheckInputsCount(node, tf_import_flags, 2);
+  op->inputs.push_back(node.input(0));
+  op->inputs.push_back(node.input(1));
+  CHECK_GT(op->num_partitions, 1);
+  op->outputs.push_back(node.name());  // Implicit :0.
+  for (int i = 1; i < op->num_partitions; ++i) {
+    op->outputs.push_back(node.name() + ":" + std::to_string(i));
+  }
+  model->operators.emplace_back(op.release());
+}
+
+void ConvertDynamicStitchOperator(const NodeDef& node,
+                                  const TensorFlowImportFlags& tf_import_flags,
+                                  Model* model) {
+  // The parallel and non-parallel variants are the same besides whether they
+  // have a parallel loop; there are no behavioral differences.
+  CHECK(node.op() == "DynamicStitch" || node.op() == "ParallelDynamicStitch");
+  auto op = absl::make_unique<DynamicStitchOperator>();
+  CHECK(HasAttr(node, "N"));
+  op->num_partitions = GetIntAttr(node, "N");
+  // Expect all ID partitions + all value partitions.
+  CheckInputsCount(node, tf_import_flags, op->num_partitions * 2);
+  for (int i = 0; i < op->num_partitions * 2; ++i) {
+    op->inputs.push_back(node.input(i));
+  }
+  op->outputs.push_back(node.name());
+  model->operators.emplace_back(op.release());
+}
+
 }  // namespace
 
 std::unique_ptr<Model> ImportTensorFlowGraphDef(
@@ -2033,6 +2117,11 @@ std::unique_ptr<Model> ImportTensorFlowGraphDef(
       ConvertExpOperator(node, tf_import_flags, model);
     } else if (node.op() == "TopK" || node.op() == "TopKV2") {
       ConvertTopKV2Operator(node, tf_import_flags, model);
+    } else if (node.op() == "DynamicPartition") {
+      ConvertDynamicPartitionOperator(node, tf_import_flags, model);
+    } else if (node.op() == "DynamicStitch" ||
+               node.op() == "ParallelDynamicStitch") {
+      ConvertDynamicStitchOperator(node, tf_import_flags, model);
     } else {
       ConvertUnsupportedOperator(node, tf_import_flags, model);
     }
diff --git a/tensorflow/contrib/lite/toco/model.h b/tensorflow/contrib/lite/toco/model.h
index c55bf664f8b65c6eb53ff9ae926bed11adc7b183..ed0dedc00360ab12a8e6e0350726d9c80414af51 100644
--- a/tensorflow/contrib/lite/toco/model.h
+++ b/tensorflow/contrib/lite/toco/model.h
@@ -115,6 +115,8 @@ enum class OperatorType {
   kTensorFlowTile,
   kTranspose,
   kTopK_V2,
+  kDynamicPartition,
+  kDynamicStitch,
   // An unsupported TF operation. It's only needed to be able to represent TF
   // graph internally and is expected to be dropped by graph transformations.
   kTensorFlowUnsupported,
@@ -244,6 +246,8 @@ struct GenericBuffer {
   // in containers and have the containers call the right subclass destructor.
   virtual ~GenericBuffer() {}
 
+  virtual int Length() const = 0;
+
   const ArrayDataType type;
 
  protected:
@@ -256,6 +260,8 @@ template <ArrayDataType A>
 struct Buffer : GenericBuffer {
   Buffer() : GenericBuffer(A) {}
 
+  int Length() const override { return data.size(); }
+
   std::vector<DataType<A>> data;
 };
 
@@ -359,7 +365,8 @@ struct ConvOperator : Operator {
   // A dilation_rate of 0 is invalid and this field is an optional attribute.
   // Thus initializing it to 1 to allow default conv behavior when the
   // attribute is not present.
-  int dilation_rate = 1;
+  int dilation_width_factor = 1;
+  int dilation_height_factor = 1;
 };
 
 // Depthwise-separable convolution operator.
@@ -1409,6 +1416,30 @@ struct TopKV2Operator : Operator {
   TopKV2Operator() : Operator(OperatorType::kTopK_V2) {}
 };
 
+// DynamicPartition operator:
+//
+// Inputs:
+//  inputs[0]: required: data.
+//  inputs[1]: required: partitions.
+//
+// TensorFlow equivalent: DynamicPartition
+struct DynamicPartitionOperator : Operator {
+  DynamicPartitionOperator() : Operator(OperatorType::kDynamicPartition) {}
+  int num_partitions;
+};
+
+// DynamicStitch operator:
+//
+// Inputs:
+//  inputs[0,N): required: indices.
+//  inputs[N,2N): required: data.
+//
+// TensorFlow equivalent: DynamicStitch/ParallelDynamicStitch
+struct DynamicStitchOperator : Operator {
+  DynamicStitchOperator() : Operator(OperatorType::kDynamicStitch) {}
+  int num_partitions;
+};
+
 // Alloc's are used for transient arrays only. An Alloc specifies which interval
 // of the "transient_data" workspace buffer passed to inference functions, is to
 // be used for the transient array at hand. The 'start' and 'end' values are
diff --git a/tensorflow/contrib/lite/toco/model_flags.proto b/tensorflow/contrib/lite/toco/model_flags.proto
index e4b39b34e85e4d703c1b41cb68f8139abd1f6279..867b86f31d16b502a7aeb92cb3d8c96117630cd2 100644
--- a/tensorflow/contrib/lite/toco/model_flags.proto
+++ b/tensorflow/contrib/lite/toco/model_flags.proto
@@ -96,9 +96,11 @@ message RnnState {
 // model that does not already contain such MinMax information.
 message ArraysExtraInfo {
   message Entry {
+    // Next ID to use: 5.
     optional string name = 1;
     optional float min = 2;
     optional float max = 3;
+    optional IODataType data_type = 4;
   }
   repeated Entry entries = 1;
 }
diff --git a/tensorflow/contrib/lite/toco/tflite/import.cc b/tensorflow/contrib/lite/toco/tflite/import.cc
index 5b1ab514b23248cd98e66847185d0e8b9fe2d6aa..d2aeb7811497aa52da133eead9effa9d52a7f5c3 100644
--- a/tensorflow/contrib/lite/toco/tflite/import.cc
+++ b/tensorflow/contrib/lite/toco/tflite/import.cc
@@ -64,6 +64,9 @@ void ImportTensors(const ::tflite::Model& input_model, Model* model) {
 
     auto shape = input_tensor->shape();
     if (shape) {
+      // If the shape is 0-dimensional, make sure to record it as such,
+      // as opposed to leaving the array without a shape.
+      array.mutable_shape()->mutable_dims()->clear();
       for (int i = 0; i < shape->Length(); ++i) {
         auto d = shape->Get(i);
         array.mutable_shape()->mutable_dims()->push_back(d);
diff --git a/tensorflow/contrib/lite/toco/tflite/operator.cc b/tensorflow/contrib/lite/toco/tflite/operator.cc
index aabc7c5109ddc205a7862c3ee2253390dae25095..f2cc4ef71f71902e363ac4cddd3695446af30c7d 100644
--- a/tensorflow/contrib/lite/toco/tflite/operator.cc
+++ b/tensorflow/contrib/lite/toco/tflite/operator.cc
@@ -859,6 +859,8 @@ std::vector<std::unique_ptr<BaseOperator>> BuildOperatorList() {
   ops.emplace_back(
       new SimpleOperator<TanhOperator>("TANH", OperatorType::kTanh));
   ops.emplace_back(new SimpleOperator<ExpOperator>("EXP", OperatorType::kExp));
+  ops.emplace_back(new SimpleOperator<LogSoftmaxOperator>(
+      "LOG_SOFTMAX", OperatorType::kLogSoftmax));
 
   return ops;
 }
diff --git a/tensorflow/contrib/lite/toco/tflite/operator_test.cc b/tensorflow/contrib/lite/toco/tflite/operator_test.cc
index 5c486f72ade9ec5f366f075fcc39274bb7b12679..9c19f8d4649acf40fdd85b78874f7b18798533f2 100644
--- a/tensorflow/contrib/lite/toco/tflite/operator_test.cc
+++ b/tensorflow/contrib/lite/toco/tflite/operator_test.cc
@@ -107,6 +107,8 @@ TEST_F(OperatorTest, SimpleOperators) {
   CheckSimpleOperator<LogisticOperator>("LOGISTIC", OperatorType::kLogistic);
   CheckSimpleOperator<TanhOperator>("TANH", OperatorType::kTanh);
   CheckSimpleOperator<ExpOperator>("EXP", OperatorType::kExp);
+  CheckSimpleOperator<LogSoftmaxOperator>("LOG_SOFTMAX",
+                                          OperatorType::kLogSoftmax);
 }
 
 TEST_F(OperatorTest, BuiltinAdd) {
diff --git a/tensorflow/contrib/lite/toco/toco_cmdline_flags.cc b/tensorflow/contrib/lite/toco/toco_cmdline_flags.cc
index c5a62fdb620ee7d6b7195f6e8e2bc3cb208feb10..0f67c2de728532b5b8101b3514811a78a3b3bc38 100644
--- a/tensorflow/contrib/lite/toco/toco_cmdline_flags.cc
+++ b/tensorflow/contrib/lite/toco/toco_cmdline_flags.cc
@@ -112,6 +112,11 @@ bool ParseTocoFlagsFromCommandLineFlags(
           "If true, ignore control dependency requirements in input TensorFlow "
           "GraphDef. Otherwise an error will be raised upon control dependency "
           "inputs."),
+      Flag("debug_disable_recurrent_cell_fusion",
+           parsed_flags.debug_disable_recurrent_cell_fusion.bind(),
+           parsed_flags.debug_disable_recurrent_cell_fusion.default_value(),
+           "If true, disable fusion of known identifiable cell subgraphs into "
+           "cells. This includes, for example, specific forms of LSTM cell."),
   };
   bool asked_for_help =
       *argc == 2 && (!strcmp(argv[1], "--help") || !strcmp(argv[1], "-help"));
diff --git a/tensorflow/contrib/lite/toco/toco_flags.proto b/tensorflow/contrib/lite/toco/toco_flags.proto
index 3b9d7e22570b66aef2c9fc819e5ab4ec38e179f5..3237147a736f97f65953ca965420fcea934820a4 100644
--- a/tensorflow/contrib/lite/toco/toco_flags.proto
+++ b/tensorflow/contrib/lite/toco/toco_flags.proto
@@ -36,7 +36,8 @@ enum FileFormat {
 // are not normally encoded in model files and in general may not be thought
 // of as properties of models, instead describing how models are to be
 // processed in the context of the present tooling job.
-// Next Id: 13
+//
+// Next ID to use: 14.
 message TocoFlags {
   // Input file format
   optional FileFormat input_format = 1;
@@ -136,4 +137,8 @@ message TocoFlags {
   //    - Default to false if the output format is TENSORFLOW_GRAPHDEF.
   //    - Default to true in all other cases.
   optional bool drop_control_dependency = 12;
+
+  // Disables transformations that fuse subgraphs such as known LSTMs (not all
+  // LSTMs are identified).
+  optional bool debug_disable_recurrent_cell_fusion = 13;
 }
diff --git a/tensorflow/contrib/lite/toco/toco_tooling.cc b/tensorflow/contrib/lite/toco/toco_tooling.cc
index 1b836fbc151db2141ad64d5370f15a43246fdd8b..42e0a89017e1510a805469a493a769116607c03f 100644
--- a/tensorflow/contrib/lite/toco/toco_tooling.cc
+++ b/tensorflow/contrib/lite/toco/toco_tooling.cc
@@ -87,6 +87,7 @@ void MakeGeneralGraphTransformationsSet(
   transformations->Add(new ResolveTensorFlowTile);
   transformations->Add(new ResolveTensorFlowConcat);
   transformations->Add(new ResolveMultiplyByZero);
+  transformations->Add(new IdentifyDilatedConv);
   transformations->Add(new IdentifyL2Normalization);
   transformations->Add(new IdentifyL2Pool);
   transformations->Add(new IdentifyRelu1);
@@ -101,6 +102,7 @@ void MakeGeneralGraphTransformationsSet(
   transformations->Add(new ResolveConstantShapeOrRank);
   transformations->Add(new MakeInitialDequantizeOperator);
   transformations->Add(new ResolveConstantFakeQuant);
+  transformations->Add(new UnpartitionEmbeddingLookup);
 }
 
 bool SupportsQuantization(FileFormat format) {
@@ -198,7 +200,8 @@ void Transform(const TocoFlags& toco_flags, Model* model) {
   const IODataType inference_type = toco_flags.inference_type();
 
   const bool quantize_output =
-      SupportsQuantization(output_format) && inference_type == QUANTIZED_UINT8;
+      SupportsQuantization(output_format) &&
+      (inference_type == QUANTIZED_UINT8 || inference_type == QUANTIZED_INT16);
 
   if (quantize_output) {
     QCHECK_NE(toco_flags.inference_input_type(), FLOAT)
@@ -234,7 +237,9 @@ void Transform(const TocoFlags& toco_flags, Model* model) {
   }
   transformations.Add(new ConvertPureConvToDepthwise);
   if (SupportsLstmCell(output_format)) {
-    transformations.Add(new IdentifyLstmCell);
+    if (!toco_flags.debug_disable_recurrent_cell_fusion()) {
+      transformations.Add(new IdentifyLstmCell);
+    }
     if (output_format == TFLITE) {
       transformations.Add(new toco::SplitLstmCellInputs);
     } else {
diff --git a/tensorflow/contrib/lite/toco/tooling_util.cc b/tensorflow/contrib/lite/toco/tooling_util.cc
index dcb409c84d8d80790f3c5e41a6eb7bce1b1efd2e..f92e10752de6d61d844d397a551358fbfbf05881 100644
--- a/tensorflow/contrib/lite/toco/tooling_util.cc
+++ b/tensorflow/contrib/lite/toco/tooling_util.cc
@@ -62,6 +62,37 @@ string LogName(const Operator& op) {
   }
 }
 
+string ArrayDataTypeName(ArrayDataType data_type) {
+  switch (data_type) {
+    case ArrayDataType::kFloat:
+      return "Float";
+    case ArrayDataType::kInt8:
+      return "Int8";
+    case ArrayDataType::kUint8:
+      return "Uint8";
+    case ArrayDataType::kInt16:
+      return "Int16";
+    case ArrayDataType::kUint16:
+      return "Uint16";
+    case ArrayDataType::kInt32:
+      return "Int32";
+    case ArrayDataType::kUint32:
+      return "Uint32";
+    case ArrayDataType::kInt64:
+      return "Int64";
+    case ArrayDataType::kUint64:
+      return "Uint64";
+    case ArrayDataType::kString:
+      return "String";
+    case ArrayDataType::kBool:
+      return "Bool";
+    case ArrayDataType::kNone:
+      return "None";
+    default:
+      LOG(FATAL) << "Unhandled array data type " << static_cast<int>(data_type);
+  }
+}
+
 bool IsInputArray(const Model& model, const string& name) {
   for (const auto& input_array : model.flags.input_arrays()) {
     if (input_array.name() == name) {
@@ -128,6 +159,15 @@ bool DeleteArrayIfUsedOnce(const string& array_name, Model* model) {
   return false;
 }
 
+void DeleteOpAndArraysIfUnused(Model* model, Operator* op) {
+  for (const string& array_name : op->inputs) {
+    DeleteArrayIfUsedOnce(array_name, model);
+  }
+  auto op_it = FindOp(*model, op);
+  CHECK(op_it != model->operators.end());
+  model->operators.erase(op_it);
+}
+
 std::vector<std::unique_ptr<Operator>>::const_iterator FindOpWithOutput(
     const Model& model, const string& array_name) {
   for (auto it = model.operators.begin(); it != model.operators.end(); ++it) {
@@ -316,6 +356,8 @@ const char* OperatorTypeName(OperatorType type) {
     HANDLE_OPERATORTYPENAME_CASE(TopK_V2)
     HANDLE_OPERATORTYPENAME_CASE(TensorFlowUnsupported)
     HANDLE_OPERATORTYPENAME_CASE(Exp)
+    HANDLE_OPERATORTYPENAME_CASE(DynamicPartition)
+    HANDLE_OPERATORTYPENAME_CASE(DynamicStitch)
     default:
       LOG(FATAL) << "Unhandled op type";
 #undef HANDLE_OPERATORTYPENAME_CASE
@@ -363,48 +405,9 @@ void LogSummary(int log_level, const Model& model) {
 void LogArray(int log_level, const Model& model, const string& name) {
   const auto& array = model.GetArray(name);
   VLOG(log_level) << "Array: " << name;
-  switch (array.data_type) {
-    case ArrayDataType::kNone:
-      VLOG(log_level) << "  Data type:";
-      break;
-    case ArrayDataType::kFloat:
-      VLOG(log_level) << "  Data type: kFloat";
-      break;
-    case ArrayDataType::kInt32:
-      VLOG(log_level) << "  Data type: kInt32";
-      break;
-    case ArrayDataType::kUint8:
-      VLOG(log_level) << "  Data type: kUint8";
-      break;
-    case ArrayDataType::kString:
-      VLOG(log_level) << "  Data type: kString";
-      break;
-    default:
-      VLOG(log_level) << "  Data type: other (numerical value: "
-                      << static_cast<int>(array.data_type) << ")";
-      break;
-  }
-  switch (array.final_data_type) {
-    case ArrayDataType::kNone:
-      VLOG(log_level) << "  Final type:";
-      break;
-    case ArrayDataType::kFloat:
-      VLOG(log_level) << "  Final type: kFloat";
-      break;
-    case ArrayDataType::kInt32:
-      VLOG(log_level) << "  Final type: kInt32";
-      break;
-    case ArrayDataType::kUint8:
-      VLOG(log_level) << "  Final type: kUint8";
-      break;
-    case ArrayDataType::kString:
-      VLOG(log_level) << "  Final type: kString";
-      break;
-    default:
-      VLOG(log_level) << "  Final type: other (numerical value: "
-                      << static_cast<int>(array.data_type) << ")";
-      break;
-  }
+  VLOG(log_level) << "  Data type: " << ArrayDataTypeName(array.data_type);
+  VLOG(log_level) << "  Final type: "
+                  << ArrayDataTypeName(array.final_data_type);
   if (array.buffer) {
     VLOG(log_level) << "  Constant Buffer";
   }
@@ -819,9 +822,15 @@ void CheckEachArray(const Model& model) {
     // It's OK to have a buffer or an alloc, but not both.
     // (Since allocs are for transient arrays without a buffer).
     CHECK(!array->buffer || !array->alloc);
-    // If there is a buffer, its type should be consistent with data_type.
     if (array->buffer) {
+      // If there is a buffer, its type should be consistent with data_type.
       CHECK(array->buffer->type == array->data_type);
+      // The presence of a fixed buffer should imply the presence of a fixed
+      // shape.
+      CHECK(array->has_shape());
+      // The shape flat-size should agree with the buffer length.
+      CHECK_EQ(array->buffer->Length(),
+               RequiredBufferSizeForShape(array->shape()));
     }
 
     // Check name.  Either "name_with_suffix_8", "name_with_port:3", but not
@@ -1201,7 +1210,7 @@ void ResolveModelFlags(const ModelFlags& model_flags, Model* model) {
       << "This model does not define output arrays, so a "
          "--output_arrays flag must be given on the command-line.";
 
-  for (const auto& input_array_proto : model->flags.input_arrays()) {
+  for (auto& input_array_proto : *model->flags.mutable_input_arrays()) {
     auto& input_array = model->GetOrCreateArray(input_array_proto.name());
     if (input_array_proto.has_data_type()) {
       const ArrayDataType specified_type =
@@ -1245,6 +1254,11 @@ void ResolveModelFlags(const ModelFlags& model_flags, Model* model) {
         for (int i = 0; i < input_array_dims.size(); i++) {
           CHECK_EQ(input_array_dims[i], input_array_proto.shape().dims(i));
         }
+      } else {
+        for (int i = 0; i < input_array.shape().dimensions_count(); i++) {
+          input_array_proto.mutable_shape()->add_dims(
+              input_array.shape().dims(i));
+        }
       }
     }
 
@@ -1811,6 +1825,8 @@ ArrayDataType ConvertIODataTypeToArrayDataType(IODataType type) {
       return ArrayDataType::kFloat;
     case QUANTIZED_UINT8:
       return ArrayDataType::kUint8;
+    case QUANTIZED_INT16:
+      return ArrayDataType::kInt16;
     case INT32:
       return ArrayDataType::kInt32;
     case INT64:
@@ -1842,9 +1858,17 @@ void UseArraysExtraInfo(Model* model) {
     QCHECK(model->HasArray(entry.name()))
         << "ArraysExtraInfo refers to non-existent array name: "
         << entry.name();
-    auto& minmax = model->GetArray(entry.name()).GetOrCreateMinMax();
-    minmax.min = entry.min();
-    minmax.max = entry.max();
+    auto& array = model->GetArray(entry.name());
+    auto& minmax = array.GetOrCreateMinMax();
+    if (entry.has_min() || entry.has_max()) {
+      CHECK_EQ(entry.has_min(), entry.has_max());
+      minmax.min = entry.min();
+      minmax.max = entry.max();
+    }
+    if (entry.has_data_type()) {
+      array.final_data_type =
+          ConvertIODataTypeToArrayDataType(entry.data_type());
+    }
   }
 }
 
diff --git a/tensorflow/contrib/lite/toco/tooling_util.h b/tensorflow/contrib/lite/toco/tooling_util.h
index 0aaa0f6a215288430dfdde7d5042012730c3be4c..01917b29def13c11a3d3df304ac7b40090af307c 100644
--- a/tensorflow/contrib/lite/toco/tooling_util.h
+++ b/tensorflow/contrib/lite/toco/tooling_util.h
@@ -54,6 +54,8 @@ absl::string_view FindLongestCommonPrefix(absl::string_view a,
                                           absl::string_view b);
 string LogName(const Operator& op);
 
+string ArrayDataTypeName(ArrayDataType data_type);
+
 bool IsInputArray(const Model& model, const string& name);
 bool IsArrayConsumed(const Model& model, const string& name);
 int CountTrueOutputs(const Model& model, const Operator& op);
@@ -62,6 +64,10 @@ int CountOpsWithInput(const Model& model, const string& array_name);
 bool DeleteArrayIfUnused(const string& array_name, Model* model);
 bool DeleteArrayIfUsedOnce(const string& array_name, Model* model);
 
+// Deletes the op and any of its input and output arrays if they are unused
+// after the op has been deleted.
+void DeleteOpAndArraysIfUnused(Model* model, Operator* op);
+
 std::vector<std::unique_ptr<Operator>>::const_iterator FindOpWithOutput(
     const Model& model, const string& array_name);
 Operator* GetOpWithOutput(const Model& model, const string& array_name);
@@ -69,8 +75,6 @@ Operator* GetOpWithOutput(const Model& model, const string& array_name);
 std::vector<std::unique_ptr<Operator>>::iterator FindOpWithOutput(
     Model& model, const string& array_name);
 
-Operator* GetOpWithOutput(const Model& model, const string& array_name);
-
 std::vector<std::unique_ptr<Operator>>::const_iterator FindOpWithInput(
     const Model& model, const string& array_name);
 
diff --git a/tensorflow/contrib/lite/toco/types.proto b/tensorflow/contrib/lite/toco/types.proto
index 318fd4b7b2c2df093562e73c3fe707675ee98876..03bd6150bc86bb27221814cd191b17f1a09585fa 100644
--- a/tensorflow/contrib/lite/toco/types.proto
+++ b/tensorflow/contrib/lite/toco/types.proto
@@ -34,4 +34,7 @@ enum IODataType {
 
   // String, not quantized
   STRING = 5;
+
+  // Int16, quantized
+  QUANTIZED_INT16 = 6;
 }
diff --git a/tensorflow/contrib/lookup/lookup_ops.py b/tensorflow/contrib/lookup/lookup_ops.py
index a430dac4ec43ce31f0b5aaae5e7b0b51d25c9632..62f1c810fc72ba7e27c4553006c947f8fa0ef629 100644
--- a/tensorflow/contrib/lookup/lookup_ops.py
+++ b/tensorflow/contrib/lookup/lookup_ops.py
@@ -341,23 +341,21 @@ class MutableHashTable(LookupInterface):
     # training to work correctly. Use the node name if no shared_name has been
     # explicitly specified.
     use_node_name_sharing = checkpoint and shared_name is None
-    # pylint: disable=protected-access
     if self._default_value.get_shape().ndims == 0:
-      self._table_ref = gen_lookup_ops._mutable_hash_table_v2(
+      self._table_ref = gen_lookup_ops.mutable_hash_table_v2(
           shared_name=shared_name,
           use_node_name_sharing=use_node_name_sharing,
           key_dtype=key_dtype,
           value_dtype=value_dtype,
           name=name)
     else:
-      self._table_ref = gen_lookup_ops._mutable_hash_table_of_tensors_v2(
+      self._table_ref = gen_lookup_ops.mutable_hash_table_of_tensors_v2(
           shared_name=shared_name,
           use_node_name_sharing=use_node_name_sharing,
           key_dtype=key_dtype,
           value_dtype=value_dtype,
           value_shape=self._default_value.get_shape(),
           name=name)
-    # pylint: enable=protected-access
     super(MutableHashTable, self).__init__(key_dtype, value_dtype,
                                            self._table_ref.op.name.split(
                                                "/")[-1])
@@ -378,9 +376,7 @@ class MutableHashTable(LookupInterface):
     with ops.name_scope(name, "%s_Size" % self._name,
                         [self._table_ref]) as name:
       with ops.colocate_with(self._table_ref):
-
-        # pylint: disable=protected-access
-        return gen_lookup_ops._lookup_table_size_v2(self._table_ref, name=name)
+        return gen_lookup_ops.lookup_table_size_v2(self._table_ref, name=name)
 
   def lookup(self, keys, name=None):
     """Looks up `keys` in a table, outputs the corresponding values.
@@ -406,8 +402,7 @@ class MutableHashTable(LookupInterface):
     with ops.name_scope(name, "%s_lookup_table_find" % self._name,
                         (self._table_ref, keys, self._default_value)) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        values = gen_lookup_ops._lookup_table_find_v2(
+        values = gen_lookup_ops.lookup_table_find_v2(
             self._table_ref, keys, self._default_value, name=name)
 
         values.set_shape(keys.get_shape().concatenate(self._value_shape))
@@ -437,7 +432,7 @@ class MutableHashTable(LookupInterface):
                         [self._table_ref, keys, values]) as name:
       with ops.colocate_with(self._table_ref):
         # pylint: disable=protected-access
-        op = gen_lookup_ops._lookup_table_insert_v2(
+        op = gen_lookup_ops.lookup_table_insert_v2(
             self._table_ref, keys, values, name=name)
     return op
 
@@ -454,8 +449,7 @@ class MutableHashTable(LookupInterface):
     with ops.name_scope(name, "%s_lookup_table_export_values" % self._name,
                         [self._table_ref]) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        exported_keys, exported_values = gen_lookup_ops._lookup_table_export_v2(
+        exported_keys, exported_values = gen_lookup_ops.lookup_table_export_v2(
             self._table_ref, self._key_dtype, self._value_dtype, name=name)
 
     exported_values.set_shape(exported_keys.get_shape().concatenate(
@@ -477,7 +471,7 @@ class MutableHashTable(LookupInterface):
     def restore(self, restored_tensors, unused_restored_shapes):
       # pylint: disable=protected-access
       with ops.colocate_with(self.op._table_ref):
-        return gen_lookup_ops._lookup_table_import_v2(
+        return gen_lookup_ops.lookup_table_import_v2(
             self.op._table_ref, restored_tensors[0], restored_tensors[1])
 
 
@@ -551,8 +545,7 @@ class MutableDenseHashTable(LookupInterface):
     # explicitly specified.
     use_node_name_sharing = checkpoint and shared_name is None
     empty_key = ops.convert_to_tensor(empty_key, dtype=key_dtype)
-    # pylint: disable=protected-access
-    self._table_ref = gen_lookup_ops._mutable_dense_hash_table_v2(
+    self._table_ref = gen_lookup_ops.mutable_dense_hash_table_v2(
         empty_key=empty_key,
         shared_name=shared_name,
         use_node_name_sharing=use_node_name_sharing,
@@ -560,7 +553,6 @@ class MutableDenseHashTable(LookupInterface):
         value_shape=self._value_shape,
         initial_num_buckets=initial_num_buckets,
         name=name)
-    # pylint: enable=protected-access
     super(MutableDenseHashTable, self).__init__(
         key_dtype, value_dtype, self._table_ref.op.name.split("/")[-1])
 
@@ -580,8 +572,7 @@ class MutableDenseHashTable(LookupInterface):
     with ops.name_scope(name, "%s_Size" % self._name,
                         [self._table_ref]) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        return gen_lookup_ops._lookup_table_size_v2(self._table_ref, name=name)
+        return gen_lookup_ops.lookup_table_size_v2(self._table_ref, name=name)
 
   def lookup(self, keys, name=None):
     """Looks up `keys` in a table, outputs the corresponding values.
@@ -607,8 +598,7 @@ class MutableDenseHashTable(LookupInterface):
     with ops.name_scope(name, "%s_lookup_table_find" % self._name,
                         [self._table_ref, keys]) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        values = gen_lookup_ops._lookup_table_find_v2(
+        values = gen_lookup_ops.lookup_table_find_v2(
             self._table_ref, keys, self._default_value, name=name)
 
     if keys.get_shape().ndims is not None and keys.get_shape().ndims > 0:
@@ -640,8 +630,7 @@ class MutableDenseHashTable(LookupInterface):
     with ops.name_scope(name, "%s_lookup_table_insert" % self._name,
                         [self._table_ref, keys, values]) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        op = gen_lookup_ops._lookup_table_insert_v2(
+        op = gen_lookup_ops.lookup_table_insert_v2(
             self._table_ref, keys, values, name=name)
       return op
 
@@ -658,8 +647,7 @@ class MutableDenseHashTable(LookupInterface):
     with ops.name_scope(name, "%s_lookup_table_export_values" % self._name,
                         [self._table_ref]) as name:
       with ops.colocate_with(self._table_ref):
-        # pylint: disable=protected-access
-        exported_keys, exported_values = gen_lookup_ops._lookup_table_export_v2(
+        exported_keys, exported_values = gen_lookup_ops.lookup_table_export_v2(
             self._table_ref, self._key_dtype, self._value_dtype, name=name)
 
     exported_values.set_shape(exported_keys.get_shape().concatenate(
@@ -681,5 +669,5 @@ class MutableDenseHashTable(LookupInterface):
     def restore(self, restored_tensors, unused_restored_shapes):
       # pylint: disable=protected-access
       with ops.colocate_with(self.op._table_ref):
-        return gen_lookup_ops._lookup_table_import_v2(
+        return gen_lookup_ops.lookup_table_import_v2(
             self.op._table_ref, restored_tensors[0], restored_tensors[1])
diff --git a/tensorflow/contrib/losses/python/metric_learning/metric_loss_ops.py b/tensorflow/contrib/losses/python/metric_learning/metric_loss_ops.py
index 6842bc38eb108b46cc3eff715c9cbc74f991308b..2b9eee4ef7b418e2b90d388d2f165537b8660a9a 100644
--- a/tensorflow/contrib/losses/python/metric_learning/metric_loss_ops.py
+++ b/tensorflow/contrib/losses/python/metric_learning/metric_loss_ops.py
@@ -50,16 +50,12 @@ def pairwise_distance(feature, squared=False):
     pairwise_distances: 2-D Tensor of size [number of data, number of data].
   """
   pairwise_distances_squared = math_ops.add(
+      math_ops.reduce_sum(math_ops.square(feature), axis=[1], keepdims=True),
       math_ops.reduce_sum(
-          math_ops.square(feature),
-          axis=[1],
-          keepdims=True),
-      math_ops.reduce_sum(
-          math_ops.square(
-              array_ops.transpose(feature)),
+          math_ops.square(array_ops.transpose(feature)),
           axis=[0],
-          keepdims=True)) - 2.0 * math_ops.matmul(
-              feature, array_ops.transpose(feature))
+          keepdims=True)) - 2.0 * math_ops.matmul(feature,
+                                                  array_ops.transpose(feature))
 
   # Deal with numerical inaccuracies. Set small negatives to zero.
   pairwise_distances_squared = math_ops.maximum(pairwise_distances_squared, 0.0)
@@ -134,8 +130,8 @@ def masked_maximum(data, mask, dim=1):
   """
   axis_minimums = math_ops.reduce_min(data, dim, keepdims=True)
   masked_maximums = math_ops.reduce_max(
-      math_ops.multiply(
-          data - axis_minimums, mask), dim, keepdims=True) + axis_minimums
+      math_ops.multiply(data - axis_minimums, mask), dim,
+      keepdims=True) + axis_minimums
   return masked_maximums
 
 
@@ -153,8 +149,8 @@ def masked_minimum(data, mask, dim=1):
   """
   axis_maximums = math_ops.reduce_max(data, dim, keepdims=True)
   masked_minimums = math_ops.reduce_min(
-      math_ops.multiply(
-          data - axis_maximums, mask), dim, keepdims=True) + axis_maximums
+      math_ops.multiply(data - axis_maximums, mask), dim,
+      keepdims=True) + axis_maximums
   return masked_minimums
 
 
@@ -202,8 +198,7 @@ def triplet_semihard_loss(labels, embeddings, margin=1.0):
   mask_final = array_ops.reshape(
       math_ops.greater(
           math_ops.reduce_sum(
-              math_ops.cast(
-                  mask, dtype=dtypes.float32), 1, keepdims=True),
+              math_ops.cast(mask, dtype=dtypes.float32), 1, keepdims=True),
           0.0), [batch_size, batch_size])
   mask_final = array_ops.transpose(mask_final)
 
@@ -450,8 +445,8 @@ def lifted_struct_loss(labels, embeddings, margin=1.0):
   #     this is to take the max only among negatives.
   row_minimums = math_ops.reduce_min(diff, 1, keepdims=True)
   row_negative_maximums = math_ops.reduce_max(
-      math_ops.multiply(
-          diff - row_minimums, mask), 1, keepdims=True) + row_minimums
+      math_ops.multiply(diff - row_minimums, mask), 1,
+      keepdims=True) + row_minimums
 
   # Compute the loss.
   # Keep track of matrix of maximums where M_ij = max(m_i, m_j)
@@ -467,10 +462,11 @@ def lifted_struct_loss(labels, embeddings, margin=1.0):
       array_ops.transpose(max_elements), [-1, 1])
 
   loss_exp_left = array_ops.reshape(
-      math_ops.reduce_sum(math_ops.multiply(
-          math_ops.exp(
-              diff_tiled - max_elements_vect),
-          mask_tiled), 1, keepdims=True), [batch_size, batch_size])
+      math_ops.reduce_sum(
+          math_ops.multiply(
+              math_ops.exp(diff_tiled - max_elements_vect), mask_tiled),
+          1,
+          keepdims=True), [batch_size, batch_size])
 
   loss_mat = max_elements + math_ops.log(
       loss_exp_left + array_ops.transpose(loss_exp_left))
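
Reviewer note on the metric_loss_ops.py hunks above: the reformatting should leave the arithmetic unchanged. As a rough cross-check, the two expressions can be sketched in plain Python. This is a hedged illustration, not the TensorFlow implementation: it uses nested lists instead of tensors, the helper names simply mirror the functions in the diff, and `masked_maximum` here returns a flat list rather than a `keepdims` column.

```python
import math


def pairwise_distance(feature, squared=False):
    """Pairwise Euclidean distances between the rows of `feature`.

    Mirrors the expression in the diff: ||a||^2 + ||b||^2 - 2 * a.b,
    with small negative values clamped to zero.
    """
    sq_norms = [sum(x * x for x in row) for row in feature]
    result = []
    for i, row_i in enumerate(feature):
        out_row = []
        for j, row_j in enumerate(feature):
            dot = sum(a * b for a, b in zip(row_i, row_j))
            # Clamp to zero to absorb numerical inaccuracies.
            d2 = max(sq_norms[i] + sq_norms[j] - 2.0 * dot, 0.0)
            out_row.append(d2 if squared else math.sqrt(d2))
        result.append(out_row)
    return result


def masked_maximum(data, mask):
    """Row-wise maximum over the entries selected by a 0/1 `mask`.

    Shifting by the row minimum makes every entry non-negative, so
    multiplying by the mask cannot promote a masked-out entry.
    """
    out = []
    for row, mask_row in zip(data, mask):
        lo = min(row)
        out.append(max((x - lo) * m for x, m in zip(row, mask_row)) + lo)
    return out
```

For example, `pairwise_distance([[0.0, 0.0], [3.0, 4.0]])` gives `[[0.0, 5.0], [5.0, 0.0]]`, and `masked_maximum([[1.0, 5.0, 3.0]], [[1.0, 0.0, 1.0]])` gives `[3.0]` because the 5.0 entry is masked out.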
diff --git a/tensorflow/contrib/makefile/Makefile b/tensorflow/contrib/makefile/Makefile
index 81327407d44b4317b7aecb964a689a35aa35c163..05e8d9064bea748c935859f5f9b4c7e646f504cf 100644
--- a/tensorflow/contrib/makefile/Makefile
+++ b/tensorflow/contrib/makefile/Makefile
@@ -677,6 +677,7 @@ endif  # TEGRA
 TF_CC_SRCS := $(filter-out $(CORE_CC_EXCLUDE_SRCS), $(CORE_CC_ALL_SRCS))
 # Add in any extra files that don't fit the patterns easily
 TF_CC_SRCS += tensorflow/contrib/makefile/downloads/fft2d/fftsg.c
+TF_CC_SRCS += tensorflow/core/common_runtime/gpu/gpu_id_manager.cc
 # Also include the op and kernel definitions.
 TF_CC_SRCS += $(shell cat $(MAKEFILE_DIR)/tf_op_files.txt)
 PBT_CC_SRCS := $(shell cat $(MAKEFILE_DIR)/tf_pb_text_files.txt)
diff --git a/tensorflow/contrib/makefile/README.md b/tensorflow/contrib/makefile/README.md
index b0228c543505c3d14e41bf1dd540b027b00489e6..995230dfa848532dc2a50b85f58d19ba264f293e 100644
--- a/tensorflow/contrib/makefile/README.md
+++ b/tensorflow/contrib/makefile/README.md
@@ -155,7 +155,7 @@ CC_PREFIX=ccache tensorflow/contrib/makefile/build_all_android.sh -s tensorflow/
 (add -T on subsequent builds to skip protobuf downloading/building)
 
 
-#### Testing the the CUDA-enabled benchmark via adb:
+#### Testing the CUDA-enabled benchmark via adb:
 Build binaries first as above, then run:
 
 ```bash
diff --git a/tensorflow/contrib/model_pruning/README.md b/tensorflow/contrib/model_pruning/README.md
index d286750c257e9a78a82c95c1fc872b3ca6972203..52b659c69fdfc507e6259e928d79c65471f2f025 100644
--- a/tensorflow/contrib/model_pruning/README.md
+++ b/tensorflow/contrib/model_pruning/README.md
@@ -134,7 +134,7 @@ $ bazel-bin/$examples_dir/cifar10/cifar10_eval --run_once
 
 ### Block Sparsity
 
-For some hardware architectures, it may be beneficial to induce spatially correlated sparsity. To train models in which the weight tensors have block sparse structure, set *block_height* and *block_width* hyperparameters to the desired block configuration (2x2, 4x4, 4x1, 1x8, etc). Currently, block sparsity is supported for weight tensors with rank 2 only. The matrix is partitioned into non-overlapping blocks of size *[block_height, block_dim]* and the either the average or max absolute value in this block is taken as a proxy for the entire block (set by *block_pooling_function* hyperparameter).
+For some hardware architectures, it may be beneficial to induce spatially correlated sparsity. To train models in which the weight tensors have block sparse structure, set *block_height* and *block_width* hyperparameters to the desired block configuration (2x2, 4x4, 4x1, 1x8, etc). Currently, block sparsity is only supported for weight tensors which can be squeezed to rank 2. The matrix is partitioned into non-overlapping blocks of size *[block_height, block_width]* and either the average or max absolute value in this block is taken as a proxy for the entire block (set by *block_pooling_function* hyperparameter).
 The convolution layer tensors are always pruned used block dimensions of [1,1].
 
 ## References
diff --git a/tensorflow/contrib/model_pruning/python/pruning.py b/tensorflow/contrib/model_pruning/python/pruning.py
index d16af9da19816211ee22f6ea48a347f0b9a4e612..86963be4b8aee396704752bab87e0a6b49ab1a49 100644
--- a/tensorflow/contrib/model_pruning/python/pruning.py
+++ b/tensorflow/contrib/model_pruning/python/pruning.py
@@ -523,7 +523,8 @@ class Pruning(object):
     """Performs block-granular masking of the weights.
 
     Block pruning occurs only if the block_height or block_width is > 1 and
-    if the weight tensor has ndims = 2. Otherwise, elementwise pruning occurs.
+    if the weight tensor, when squeezed, has ndims = 2. Otherwise, elementwise
+    pruning occurs.
     Args:
       weights: The weight tensor that needs to be masked.
       threshold: The current threshold value. The function will compute a new
@@ -540,7 +541,8 @@ class Pruning(object):
     Raises:
       ValueError: if block pooling function is not AVG or MAX
     """
-    if weights.get_shape().ndims != 2 or self._block_dim == [1, 1]:
+    squeezed_weights = array_ops.squeeze(weights)
+    if squeezed_weights.get_shape().ndims != 2 or self._block_dim == [1, 1]:
       return self._update_mask(weights, threshold)
 
     if self._block_pooling_function not in ['AVG', 'MAX']:
@@ -549,9 +551,11 @@ class Pruning(object):
 
     with ops.name_scope(weights.op.name + '_pruning_ops'):
       abs_weights = math_ops.abs(
-          array_ops.reshape(
-              weights, [1, weights.get_shape()[0],
-                        weights.get_shape()[1], 1]))
+          array_ops.reshape(weights, [
+              1,
+              squeezed_weights.get_shape()[0],
+              squeezed_weights.get_shape()[1], 1
+          ]))
       pool_window = [self._block_dim[0], self._block_dim[1]]
       pooled_weights = nn_ops.pool(
           abs_weights,
@@ -572,9 +576,10 @@ class Pruning(object):
                                         array_ops.ones(self._block_dim))
       sliced_mask = array_ops.slice(
           updated_mask, [0, 0],
-          [weights.get_shape()[0],
-           weights.get_shape()[1]])
-    return smoothed_threshold, sliced_mask
+          [squeezed_weights.get_shape()[0],
+           squeezed_weights.get_shape()[1]])
+    return smoothed_threshold, array_ops.reshape(sliced_mask,
+                                                 array_ops.shape(weights))
 
   def _get_mask_assign_ops(self):
     # Make sure the assignment ops have not already been added to the list
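
Reviewer note on the pruning.py change above: the idea is to squeeze the weights to rank 2, pool absolute values per block, threshold the pooled values, and reshape the mask back to the original shape. The following plain-Python sketch illustrates that flow under stated assumptions. It is not the library code: `block_mask` and `block_mask_2d` are hypothetical helpers, the threshold is passed in explicitly rather than derived from a sparsity target, and only leading singleton dimensions are peeled off (whereas `array_ops.squeeze` drops all size-1 dimensions).

```python
def block_mask_2d(weights, block_h, block_w, threshold, pooling="AVG"):
    """Returns a 0/1 mask where each block shares a single decision."""
    if pooling not in ("AVG", "MAX"):
        raise ValueError("pooling must be AVG or MAX")
    rows, cols = len(weights), len(weights[0])
    mask = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, block_h):
        for c0 in range(0, cols, block_w):
            # Cells of this block, clipped at the matrix edge.
            cells = [(r, c)
                     for r in range(r0, min(r0 + block_h, rows))
                     for c in range(c0, min(c0 + block_w, cols))]
            vals = [abs(weights[r][c]) for r, c in cells]
            pooled = max(vals) if pooling == "MAX" else sum(vals) / len(vals)
            keep = 1 if pooled > threshold else 0
            for r, c in cells:
                mask[r][c] = keep
    return mask


def block_mask(weights, block_h, block_w, threshold, pooling="AVG"):
    """Peels leading size-1 dims, masks in 2-D, then restores the shape."""
    depth = 0
    inner = weights
    while (len(inner) == 1 and isinstance(inner[0], list)
           and isinstance(inner[0][0], list)):
        inner = inner[0]
        depth += 1
    mask = block_mask_2d(inner, block_h, block_w, threshold, pooling)
    for _ in range(depth):
        mask = [mask]  # Re-wrap so the mask matches the input shape.
    return mask
```

With the `[1, 4, 4]` weights from `testBlockMaskingWithHigherDimensions`, 2x2 blocks, and an assumed threshold of 0.25, both pooling modes zero out the top two rows and keep the bottom two, and the returned mask is again `[1, 4, 4]`.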
diff --git a/tensorflow/contrib/model_pruning/python/pruning_test.py b/tensorflow/contrib/model_pruning/python/pruning_test.py
index 1767b4bb94a9bb56bc6a4933423ad27d8cf3ed35..89e65713197afc6ed37346cb67a6e9be3fa9290f 100644
--- a/tensorflow/contrib/model_pruning/python/pruning_test.py
+++ b/tensorflow/contrib/model_pruning/python/pruning_test.py
@@ -140,6 +140,23 @@ class PruningTest(test.TestCase):
          [0.0, -0.3, 0.0, -0.4]])
     expected_mask = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]
 
+    self._blockMasking(param_list + ["block_pooling_function=MAX"], weights_max,
+                       expected_mask)
+    self._blockMasking(param_list + ["block_pooling_function=AVG"], weights_avg,
+                       expected_mask)
+
+  def testBlockMaskingWithHigherDimensions(self):
+    param_list = ["block_height=2", "block_width=2", "threshold_decay=0"]
+
+    # Weights as in testBlockMasking, but with one extra dimension.
+    weights_avg = constant_op.constant(
+        [[[0.1, 0.1, 0.2, 0.2], [0.1, 0.1, 0.2, 0.2], [0.3, 0.3, 0.4, 0.4],
+          [0.3, 0.3, 0.4, 0.4]]])
+    weights_max = constant_op.constant(
+        [[[0.1, 0.0, 0.2, 0.0], [0.0, -0.1, 0.0, -0.2], [0.3, 0.0, 0.4, 0.0],
+          [0.0, -0.3, 0.0, -0.4]]])
+    expected_mask = [[[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]]
+
     self._blockMasking(param_list + ["block_pooling_function=MAX"], weights_max,
                        expected_mask)
     self._blockMasking(param_list + ["block_pooling_function=AVG"],
diff --git a/tensorflow/contrib/opt/python/training/variable_clipping_optimizer.py b/tensorflow/contrib/opt/python/training/variable_clipping_optimizer.py
index 74036082f0ca2bae23b30deb1b1986befd6601d8..3c0b8394be51e8744b5461a00a99ead5e45d90b2 100644
--- a/tensorflow/contrib/opt/python/training/variable_clipping_optimizer.py
+++ b/tensorflow/contrib/opt/python/training/variable_clipping_optimizer.py
@@ -109,7 +109,7 @@ class VariableClippingOptimizer(optimizer.Optimizer):
 
   def _clip_dense(self, var):
     with self._maybe_colocate_with(var):
-      updated_var_value = var._ref()  # pylint: disable=protected-access
+      updated_var_value = var.read_value()
       normalized_var = clip_ops.clip_by_norm(
           updated_var_value, self._max_norm, self._vars_to_clip_dims[var])
       delta = updated_var_value - normalized_var
diff --git a/tensorflow/contrib/py2tf/__init__.py b/tensorflow/contrib/py2tf/__init__.py
index 379fa7fd5c2a22b5b16a21cca8c2ea8afdcaeefa..6531183cb59af774299eb767cce111d2ec6f32b4 100644
--- a/tensorflow/contrib/py2tf/__init__.py
+++ b/tensorflow/contrib/py2tf/__init__.py
@@ -23,6 +23,7 @@ from __future__ import print_function
 
 from tensorflow.contrib.py2tf import utils
 from tensorflow.contrib.py2tf.impl.api import convert
+from tensorflow.contrib.py2tf.impl.api import converted_call
 from tensorflow.contrib.py2tf.impl.api import graph_ready
 from tensorflow.contrib.py2tf.impl.api import to_code
 from tensorflow.contrib.py2tf.impl.api import to_graph
@@ -30,7 +31,8 @@ from tensorflow.contrib.py2tf.pyct.transformer import PyFlowParseError
 from tensorflow.python.util.all_util import remove_undocumented
 
 _allowed_symbols = [
-    'to_graph', 'to_code', 'convert', 'graph_ready', 'utils', 'PyFlowParseError'
+    'to_graph', 'to_code', 'convert', 'graph_ready', 'converted_call', 'utils',
+    'PyFlowParseError'
 ]
 
 remove_undocumented(__name__, _allowed_symbols)
diff --git a/tensorflow/contrib/py2tf/converters/BUILD b/tensorflow/contrib/py2tf/converters/BUILD
index 93c751b28dae3aa480aed839029bd37a2f47056b..78f46bc05f2e6f4c5e0b6868ce93dbdeb8c7625a 100644
--- a/tensorflow/contrib/py2tf/converters/BUILD
+++ b/tensorflow/contrib/py2tf/converters/BUILD
@@ -27,6 +27,7 @@ py_library(
         "for_loops.py",
         "list_comprehension.py",
         "logical_expressions.py",
+        "name_scopes.py",
         "side_effect_guards.py",
     ],
     srcs_version = "PY2AND3",
@@ -45,6 +46,7 @@ py_library(
     visibility = ["//tensorflow:__subpackages__"],
     deps = [
         ":converters",
+        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/contrib/py2tf/pyct/static_analysis",
         "//tensorflow/contrib/py2tf/utils",
         "@gast_archive//:gast",
@@ -58,7 +60,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -69,7 +70,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -80,7 +80,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -91,7 +90,7 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
+        "//tensorflow/contrib/py2tf/impl",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -102,7 +101,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -113,7 +111,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -124,7 +121,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -132,6 +128,16 @@ py_test(
 py_test(
     name = "for_loops_test",
     srcs = ["for_loops_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":test_lib",
+        "//tensorflow/python:client_testlib",
+    ],
+)
+
+py_test(
+    name = "name_scopes_test",
+    srcs = ["name_scopes_test.py"],
     deps = [
         ":test_lib",
         "//tensorflow/contrib/py2tf/pyct",
@@ -145,7 +151,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -156,7 +161,6 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
@@ -165,9 +169,13 @@ py_test(
     name = "side_effect_guards_test",
     srcs = ["side_effect_guards_test.py"],
     srcs_version = "PY2AND3",
+    tags = [
+        # TODO(mdan): Fix.
+        "flaky",
+        "notap",
+    ],
     deps = [
         ":test_lib",
-        "//tensorflow/contrib/py2tf/pyct",
         "//tensorflow/python:client_testlib",
     ],
 )
diff --git a/tensorflow/contrib/py2tf/converters/builtin_functions.py b/tensorflow/contrib/py2tf/converters/builtin_functions.py
index 2eb00f90575920ac948e799b0e97a9cfccb42fad..b5aa9756da6a139e542e9a0ead86cf4cc8207449 100644
--- a/tensorflow/contrib/py2tf/converters/builtin_functions.py
+++ b/tensorflow/contrib/py2tf/converters/builtin_functions.py
@@ -36,23 +36,24 @@ class BuiltinFunctionTransformer(transformer.Base):
 
   # pylint:disable=invalid-name
 
-  def _convert_len(self, node):
+  def _convert_builtin(self, node):
     template = """
-      tf.shape(args)[0]
+      py2tf_utils.dynamic_builtin(func, args)
     """
-    return templates.replace(template, args=node.args)[0].value
+    return templates.replace(template, func=node.func, args=node.args)[0].value
 
   def _convert_print(self, node):
     template = """
-      py2tf_utils.call_print(args)
+      py2tf_utils.dynamic_print(args)
     """
     return templates.replace(template, args=node.args)[0].value
 
   def visit_Call(self, node):
     self.generic_visit(node)
     # TODO(mdan): This won't work if the function was hidden.
-    if isinstance(node.func, gast.Name) and node.func.id == 'len':
-      return self._convert_len(node)
+    if isinstance(node.func, gast.Name) and node.func.id in ('len',):
+      return self._convert_builtin(node)
+    # Print is handled separately because it can be parsed as a statement.
     if isinstance(node.func, gast.Name) and node.func.id == 'print':
       return self._convert_print(node)
     return node
diff --git a/tensorflow/contrib/py2tf/converters/builtin_functions_test.py b/tensorflow/contrib/py2tf/converters/builtin_functions_test.py
index b279ff77ef10b96586d3d68585adb0d5424afb90..eb60a1d8ae2b56907df8f3ffafe7604883cfc2a9 100644
--- a/tensorflow/contrib/py2tf/converters/builtin_functions_test.py
+++ b/tensorflow/contrib/py2tf/converters/builtin_functions_test.py
@@ -47,6 +47,8 @@ class BuiltinFunctionsTest(converter_test_base.TestCase):
                          sess.run(
                              result.test_fn(constant_op.constant([0, 0, 0]))))
 
+        self.assertEqual(3, result.test_fn([0, 0, 0]))
+
   def test_print_with_op(self):
 
     def test_fn(a):
diff --git a/tensorflow/contrib/py2tf/converters/call_trees.py b/tensorflow/contrib/py2tf/converters/call_trees.py
index 1050ba654c63bb52c1c5e71c981a6a0baa3fc987..ca8726f9160d106ebd82e01e399e65fb77b02aab 100644
--- a/tensorflow/contrib/py2tf/converters/call_trees.py
+++ b/tensorflow/contrib/py2tf/converters/call_trees.py
@@ -27,6 +27,7 @@ import types
 import gast
 
 from tensorflow.contrib.py2tf.pyct import anno
+from tensorflow.contrib.py2tf.pyct import inspect_utils
 from tensorflow.contrib.py2tf.pyct import parser
 from tensorflow.contrib.py2tf.pyct import templates
 from tensorflow.contrib.py2tf.pyct import transformer
@@ -72,9 +73,8 @@ class CallTreeTransformer(transformer.Base):
     self.uncompiled_modules = uncompiled_modules
     self.nocompile_decorators = nocompile_decorators
 
-  # pylint:disable=invalid-name
-
   def _resolve_name(self, node):
+    """Used to resolve decorator info."""
     if isinstance(node, gast.Call):
       return self._resolve_name(node.func)
     if isinstance(node, gast.Name):
@@ -99,7 +99,13 @@ class CallTreeTransformer(transformer.Base):
                          (owner_type, node.attr))
     return None
 
+  def _function_is_compilable(self, target_entity):
+    """Determines whether an entity can be compiled at all."""
+    # TODO(mdan): This is just a placeholder. Implement.
+    return not isinstance(target_entity, types.BuiltinFunctionType)
+
   def _should_compile(self, node, fqn):
+    """Determines whether an entity should be compiled in the context."""
     for i in range(1, len(fqn)):
       if fqn[:i] in self.uncompiled_modules:
         return False
@@ -141,33 +147,6 @@ class CallTreeTransformer(transformer.Base):
 
     return True
 
-  def _determine_function_owner(self, m):
-    # TODO(mdan): The parent type should be known at analysis. Use that instead.
-    if hasattr(m, 'im_class'):  # Python 2
-      return m.im_class
-    if hasattr(m, '__qualname__'):  # Python 3
-      # Object attributes: should be bound to "self".
-      if hasattr(m, '__self__'):
-        return type(m.__self__)
-
-      # Class attributes: should have the owner name in their namespace.
-      qn = m.__qualname__.split('.')
-      if len(qn) < 2:
-        return None
-      owner_name, func_name = qn[-2:]
-      if func_name != m.__name__:
-        raise ValueError('Inconsistent names detected '
-                         '(__qualname__[1] = "%s", __name__ = "%s") for %s.' %
-                         (func_name, m.__name__, m))
-      if owner_name == '<locals>':
-        return None
-      if owner_name not in self.context.namespace:
-        raise ValueError(
-            'Could not resolve name "%s" while analyzing %s. Namespace:\n%s' %
-            (owner_name, m, self.context.namespace))
-      return self.context.namespace[owner_name]
-    return None
-
   def _rename_compilable_function(self, node):
     assert anno.hasanno(node.func, 'live_val')
     assert anno.hasanno(node.func, 'fqn')
@@ -182,7 +161,11 @@ class CallTreeTransformer(transformer.Base):
           target_fqn, live_entity=target_entity)
       do_rename = True
     else:
-      owner_type = self._determine_function_owner(target_entity)
+      if anno.hasanno(node.func, 'parent_type'):
+        owner_type = anno.getanno(node.func, 'parent_type')
+      else:
+        # Fallback - not reliable.
+        owner_type = inspect_utils.getmethodclass(target_entity)
       new_name, do_rename = self.context.namer.compiled_function_name(
           target_fqn, live_entity=target_entity, owner_type=owner_type)
 
@@ -202,9 +185,35 @@ class CallTreeTransformer(transformer.Base):
     """
     return templates.replace(template, func=node.func, original_args=node.args)
 
-  def _function_is_compilable(self, target_entity):
-    # TODO(mdan): This is just a placeholder. Implement.
-    return not isinstance(target_entity, types.BuiltinFunctionType)
+  def _insert_dynamic_conversion(self, node):
+    """Inlines a dynamic conversion for a dynamic function."""
+    # TODO(mdan): Pass information on the statically compiled functions.
+    # Having access to the statically compiled functions can help avoid
+    # unnecessary compilation.
+    # For example, this would lead to function `a` being compiled twice:
+    #
+    #   def a():
+    #     v = b
+    #     b()
+    #   def b():
+    #     a()
+    #
+    # This is really a problem with recursive calls, which currently can
+    # only be gated by a static condition, and should be rare.
+    # TODO(mdan): It probably makes sense to use dynamic conversion every time.
+    # Before we could convert all the time though, we'd need a reasonable
+    # caching mechanism.
+    template = """
+      py2tf_api.converted_call(func, True, False, {}, original_args)
+    """
+    call_expr = templates.replace(
+        template, func=node.func, original_args=node.args)
+    new_call = call_expr[0].value
+    # TODO(mdan): Improve the template mechanism to better support this.
+    new_call.keywords = node.keywords
+    return new_call
+
+  # pylint:disable=invalid-name
 
   def visit_Expr(self, node):
     if isinstance(node.value, gast.Call):
@@ -245,9 +254,9 @@ class CallTreeTransformer(transformer.Base):
         raise NotImplementedError('py_func with return values')
     else:
       if self.context.recursive:
-        raise NotImplementedError('Could not resolve target function.')
+        node = self._insert_dynamic_conversion(node)
       else:
-        # TODO(mdan): Double check. Is this reachable code?
+        # Unresolved functions are allowed in non-recursive mode.
         pass
     return node
 
diff --git a/tensorflow/contrib/py2tf/converters/call_trees_test.py b/tensorflow/contrib/py2tf/converters/call_trees_test.py
index 777648dc0b31863227262fbf931aba680bb4ed98..d482a9ef7897388839bbf8f9e4bfc5839d42b2d7 100644
--- a/tensorflow/contrib/py2tf/converters/call_trees_test.py
+++ b/tensorflow/contrib/py2tf/converters/call_trees_test.py
@@ -47,6 +47,21 @@ class CallTreesTest(converter_test_base.TestCase):
       result.renamed_test_fn_1 = renamed_test_fn_1
       self.assertEquals(3, result.test_fn_2(1))
 
+  def test_dynamic_function(self):
+
+    def test_fn_1():
+      raise ValueError('This should be masked by the mock.')
+
+    def test_fn_2(f):
+      return f() + 3
+
+    node = self.parse_and_analyze(test_fn_2, {})
+    node = call_trees.transform(node, self.ctx, (), ())
+
+    with self.compiled(node) as result:
+      # 10 = 7 (from the mock) + 3 (from test_fn_2)
+      self.assertEquals(10, result.test_fn_2(test_fn_1))
+
   def test_simple_methods(self):
 
     class TestClass(object):
@@ -59,6 +74,7 @@ class CallTreesTest(converter_test_base.TestCase):
 
     node = self.parse_and_analyze(
         TestClass.test_fn_2, {'TestClass': TestClass},
+        namer=converter_test_base.FakeNoRenameNamer(),
         arg_types={'self': (TestClass.__name__, TestClass)})
     node = call_trees.transform(node, self.ctx, (), ())
 
diff --git a/tensorflow/contrib/py2tf/converters/converter_test_base.py b/tensorflow/contrib/py2tf/converters/converter_test_base.py
index 67747183dd323a799a04943ce4c7fe8c4093d002..1f98d8469c1b3032fe6babb5a63dde1747027f21 100644
--- a/tensorflow/contrib/py2tf/converters/converter_test_base.py
+++ b/tensorflow/contrib/py2tf/converters/converter_test_base.py
@@ -25,6 +25,7 @@ from tensorflow.contrib.py2tf import utils
 from tensorflow.contrib.py2tf.pyct import compiler
 from tensorflow.contrib.py2tf.pyct import context
 from tensorflow.contrib.py2tf.pyct import parser
+from tensorflow.contrib.py2tf.pyct import pretty_printer
 from tensorflow.contrib.py2tf.pyct import qual_names
 from tensorflow.contrib.py2tf.pyct.static_analysis import activity
 from tensorflow.contrib.py2tf.pyct.static_analysis import live_values
@@ -52,26 +53,43 @@ class FakeNamer(object):
     return ('renamed_%s' % '_'.join(original_fqn)), True
 
 
+class FakeNoRenameNamer(FakeNamer):
+
+  def compiled_function_name(self, original_fqn, **_):
+    return str(original_fqn), False
+
+
 class TestCase(test.TestCase):
   """Base class for unit tests in this module. Contains relevant utilities."""
 
   @contextlib.contextmanager
   def compiled(self, node, *symbols):
-    source = '<compile failed>'
+    source = None
+
+    self.dynamic_calls = []
+    def converted_call(*args):
+      """Mock version of api.converted_call."""
+      self.dynamic_calls.append(args)
+      return 7
+
     try:
       result, source = compiler.ast_to_object(node)
-      result.tf = self.make_fake_tf(*symbols)
+      result.tf = self.make_fake_mod('fake_tf', *symbols)
       result.py2tf_utils = utils
+      result.py2tf_api = self.make_fake_mod('fake_api', converted_call)
       yield result
     except Exception:  # pylint:disable=broad-except
-      print('Offending compiled code:\n%s' % source)
+      if source is None:
+        print('Offending AST:\n%s' % pretty_printer.fmt(node, color=False))
+      else:
+        print('Offending compiled code:\n%s' % source)
       raise
 
-  def make_fake_tf(self, *symbols):
-    fake_tf = imp.new_module('fake_tf')
+  def make_fake_mod(self, name, *symbols):
+    fake_mod = imp.new_module(name)
     for s in symbols:
-      setattr(fake_tf, s.__name__, s)
-    return fake_tf
+      setattr(fake_mod, s.__name__, s)
+    return fake_mod
 
   def attach_namespace(self, module, **ns):
     for k, v in ns.items():
@@ -83,6 +101,7 @@ class TestCase(test.TestCase):
                         namer=None,
                         arg_types=None,
                         include_type_analysis=True,
+                        owner_type=None,
                         recursive=True):
     node, source = parser.parse_entity(test_fn)
     ctx = context.EntityContext(
@@ -92,6 +111,7 @@ class TestCase(test.TestCase):
         namespace=namespace,
         arg_values=None,
         arg_types=arg_types,
+        owner_type=owner_type,
         recursive=recursive)
     node = qual_names.resolve(node)
     node = activity.resolve(node, ctx)
diff --git a/tensorflow/contrib/py2tf/converters/decorators.py b/tensorflow/contrib/py2tf/converters/decorators.py
index 3f620c1cd2d9b75f82410754a7e812e13eabe3ae..68bf241ef33292f0581ccb3c44f313f853c92ba7 100644
--- a/tensorflow/contrib/py2tf/converters/decorators.py
+++ b/tensorflow/contrib/py2tf/converters/decorators.py
@@ -33,6 +33,7 @@ class DecoratorsTransformer(gast.NodeTransformer):
 
   def __init__(self, remove_decorators):
     self.remove_decorators = remove_decorators
+    self.additional_dependencies = set()
 
   # pylint:disable=invalid-name
 
@@ -44,13 +45,38 @@ class DecoratorsTransformer(gast.NodeTransformer):
         dec_func = dec.func
       else:
         dec_func = dec
+
+      # Special cases.
+      # TODO(mdan): Is there any way we can treat these more generically?
+      # We may want to forego using decorators altogether if we can't
+      # properly support them.
+      if isinstance(dec_func, gast.Name) and dec_func.id in ('classmethod',):
+        # Assumption: decorators are only visible in the AST when converting
+        # a function inline (via another decorator).
+        # In that case, the converted function is no longer part of the
+        # original object that it was declared in.
+        # This is currently verified by tests.
+        continue
+
       if not anno.hasanno(dec_func, 'live_val'):
         raise ValueError(
             'Could not resolve decorator: %s' % pretty_printer.fmt(dec_func))
+
       dec_value = anno.getanno(dec_func, 'live_val')
       if dec_value not in self.remove_decorators:
-        kept_decorators.append(dec)
-    node.decorator_list = kept_decorators
+        kept_decorators.append((dec, dec_value))
+
+    for _, dec_value in kept_decorators:
+      if dec_value.__module__ == '__main__':
+        raise ValueError(
+            'decorator "%s" was not allowed because it is declared '
+            'in the module "%s". To fix this, declare it in a separate '
+            'module that we can import it from.' % (dec_value,
+                                                    dec_value.__module__))
+      else:
+        self.additional_dependencies.add(dec_value)
+
+    node.decorator_list = [dec for dec, _ in kept_decorators]
     return node
 
   # pylint:enable=invalid-name
@@ -59,4 +85,4 @@ class DecoratorsTransformer(gast.NodeTransformer):
 def transform(node, remove_decorators):
   transformer = DecoratorsTransformer(remove_decorators)
   node = transformer.visit(node)
-  return node
+  return node, transformer.additional_dependencies
diff --git a/tensorflow/contrib/py2tf/converters/decorators_test.py b/tensorflow/contrib/py2tf/converters/decorators_test.py
index 402fa0dda28e696f70d0354ca4abf3a6c83506d9..c75e5461746f27d14a54b7ac06e7f77d868372c8 100644
--- a/tensorflow/contrib/py2tf/converters/decorators_test.py
+++ b/tensorflow/contrib/py2tf/converters/decorators_test.py
@@ -18,84 +18,121 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import textwrap
+from functools import wraps
 
 from tensorflow.contrib.py2tf.converters import converter_test_base
 from tensorflow.contrib.py2tf.converters import decorators
 from tensorflow.contrib.py2tf.pyct import compiler
 from tensorflow.python.platform import test
-from tensorflow.python.util import tf_inspect
+
+
+# The Python parser only briefly captures decorators into the AST.
+# The interpreter desugars them on load, and the decorated function loses any
+# trace of the decorator (which is normally what you would expect, since
+# they are meant to be transparent).
+# However, decorators are still visible when you analyze the function
+# from inside a decorator, before it was applied - as is the case
+# with our conversion decorators.
+
+
+def simple_decorator(f):
+  return lambda a: f(a) + 1
+
+
+def self_removing_decorator(removing_wrapper):
+  def decorator(f):
+    @wraps(f)
+    def wrapper(*args):
+      # This removing wrapper is defined in the test below. This setup is so
+      # intricate just to simulate how we use the transformer in practice.
+      transformed_f = removing_wrapper(f, (self_removing_decorator,))
+      return transformed_f(*args) + 1
+    return wrapper
+  return decorator
 
 
 class DecoratorsTest(converter_test_base.TestCase):
 
-  def test_function_decorator(self):
+  def _remover_wrapper(self, f, remove_decorators):
+    namespace = {
+        'self_removing_decorator': self_removing_decorator,
+        'simple_decorator': simple_decorator
+    }
+    node = self.parse_and_analyze(f, namespace)
+    node, _ = decorators.transform(node, remove_decorators=remove_decorators)
+    result, _ = compiler.ast_to_object(node)
+    return getattr(result, f.__name__)
 
-    def function_decorator():
+  def test_noop(self):
 
-      def decorator(f):
-        return lambda a: f(a) + 1
+    def test_fn(a):
+      return a
 
-      return decorator
+    node = self.parse_and_analyze(test_fn, {})
+    node, deps = decorators.transform(node, remove_decorators=())
+    result, _ = compiler.ast_to_object(node)
 
-    # The Python parser does capture decorators into the AST.
-    # However, the interpreter desugars them on load, and refering to the
-    # decorated function at runtime usually loses any trace of the decorator.
-    # Below is an example when that doesn't happen.
-    def static_wrapper():
+    self.assertFalse(deps)
+    self.assertEqual(1, result.test_fn(1))
 
-      @function_decorator()
-      def test_fn(a):  # pylint:disable=unused-variable
-        return a
+  def test_function(self):
 
-    node = self.parse_and_analyze(static_wrapper,
-                                  {'function_decorator': function_decorator})
-    node = node.body[0].body[0]
+    @self_removing_decorator(self._remover_wrapper)
+    def test_fn(a):
+      return a
 
-    node = decorators.transform(node, remove_decorators=())
-    # Since the decorator is not removed, we need to include its source
-    # code. We cannot do it after the fact because decorators are executed
-    # on load.
-    result, _ = compiler.ast_to_object(
-        node,
-        source_prefix=textwrap.dedent(tf_inspect.getsource(function_decorator)))
-    self.assertEqual(2, result.test_fn(1))
+    # 2 = 1 (a) + 1 (decorator applied exactly once)
+    self.assertEqual(2, test_fn(1))
 
-    node = decorators.transform(node, remove_decorators=(function_decorator,))
-    with self.compiled(node) as result:
-      self.assertEqual(1, result.test_fn(1))
+  def test_method(self):
 
-  def test_simple_decorator(self):
+    class TestClass(object):
 
-    def simple_decorator(f):
-      return lambda a: f(a) + 1
+      @self_removing_decorator(self._remover_wrapper)
+      def test_fn(self, a):
+        return a
 
-    # The Python parser does capture decorators into the AST.
-    # However, the interpreter desugars them upon load, and refering to the
-    # decorated function at runtime usually loses any trace of the decorator.
-    # Below is an example when that doesn't happen.
-    def static_wrapper():
+    # 2 = 1 (a) + 1 (decorator applied exactly once)
+    self.assertEqual(2, TestClass().test_fn(1))
 
-      @simple_decorator
-      def test_fn(a):  # pylint:disable=unused-variable
+  def test_multiple_decorators(self):
+
+    class TestClass(object):
+
+      # Note that reversing the order of these two doesn't work.
+      @classmethod
+      @self_removing_decorator(self._remover_wrapper)
+      def test_fn(cls, a):
         return a
 
-    node = self.parse_and_analyze(static_wrapper,
-                                  {'simple_decorator': simple_decorator})
-    node = node.body[0].body[0]
-
-    node = decorators.transform(node, remove_decorators=())
-    # Since the decorator is not removed, we need to include its source
-    # code. We cannot do it after the fact because decorators are executed
-    # on load.
-    result, _ = compiler.ast_to_object(
-        node,
-        source_prefix=textwrap.dedent(tf_inspect.getsource(simple_decorator)))
-    self.assertEqual(2, result.test_fn(1))
-
-    node = decorators.transform(node, remove_decorators=(simple_decorator,))
-    with self.compiled(node) as result:
-      self.assertEqual(1, result.test_fn(1))
+    # 2 = 1 (a) + 1 (decorator applied exactly once)
+    self.assertEqual(2, TestClass.test_fn(1))
+
+  def test_nested_decorators(self):
+
+    @self_removing_decorator(self._remover_wrapper)
+    def test_fn(a):
+      @simple_decorator
+      def inner_fn(b):
+        return b + 11
+      return inner_fn(a)
+
+    with self.assertRaises(ValueError):
+      test_fn(1)
+
+  # TODO(mdan): Uncomment this test once converter_test_base is updated.
+  # (can't do it now because it has unrelated pending changes)
+  # def test_nested_decorators(self):
+  #
+  #   @self_removing_decorator(self._remover_wrapper)
+  #   def test_fn(a):
+  #     @imported_decorator
+  #     def inner_fn(b):
+  #       return b + 11
+  #     return inner_fn(a)
+  #
+  #   # 14 = 1 (a) + 11 (inner_fn) + 1 (inner decorator) + 1 (outer decorator)
+  #   self.assertEqual(14, test_fn(1))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/contrib/py2tf/converters/name_scopes.py b/tensorflow/contrib/py2tf/converters/name_scopes.py
new file mode 100644
index 0000000000000000000000000000000000000000..c702823fcf047fcad3254318bd323d2b8fddd700
--- /dev/null
+++ b/tensorflow/contrib/py2tf/converters/name_scopes.py
@@ -0,0 +1,52 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Wraps a function body with a `name_scope` of the function name.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import gast
+
+from tensorflow.contrib.py2tf.pyct import templates
+from tensorflow.contrib.py2tf.pyct import transformer
+
+
+class FunctionNameScopeTransformer(transformer.Base):
+  """Wrap a function body with a `name_scope` of the function name."""
+
+  def __init__(self, context):
+    super(FunctionNameScopeTransformer, self).__init__(context)
+    self._function_level = 0
+
+  def visit_FunctionDef(self, node):
+    self._function_level += 1
+    try:
+      self.generic_visit(node)
+    finally:
+      self._function_level -= 1
+    scope_name = node.name
+    if self._function_level == 0 and self.context.owner_type is not None:
+      scope_name = '{}/{}'.format(self.context.owner_type.__name__, scope_name)
+    node.body = templates.replace(
+        'with tf.name_scope(scope_name): body',
+        scope_name=gast.Str(scope_name),
+        body=node.body)
+    return node
+
+
+def transform(node, context):
+  return FunctionNameScopeTransformer(context).visit(node)
diff --git a/tensorflow/contrib/py2tf/converters/name_scopes_test.py b/tensorflow/contrib/py2tf/converters/name_scopes_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..a8ca341602ee5f06dbb812643a58794339d98afe
--- /dev/null
+++ b/tensorflow/contrib/py2tf/converters/name_scopes_test.py
@@ -0,0 +1,92 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for name_scopes module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.py2tf.converters import converter_test_base
+from tensorflow.contrib.py2tf.converters import name_scopes
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import ops
+from tensorflow.python.platform import test
+
+
+class FunctionNameScopeTransformer(converter_test_base.TestCase):
+
+  def test_basic_name(self):
+
+    def test_fn(l):
+      a = 5
+      l += a
+      return l
+
+    node = self.parse_and_analyze(test_fn, {})
+    node = name_scopes.transform(node, self.ctx)
+
+    with self.compiled(node, ops.name_scope) as result:
+      result_op = result.test_fn(constant_op.constant([1, 2, 3]))
+      self.assertIn('test_fn/', result_op.op.name)
+
+  def test_nested_name(self):
+
+    def test_fn(l):
+
+      def body(i):
+        return i**2
+
+      l += [4]
+      return body(l)
+
+    node = self.parse_and_analyze(test_fn, {})
+    node = name_scopes.transform(node, self.ctx)
+
+    with self.compiled(node, ops.name_scope) as result:
+      result_op = result.test_fn(constant_op.constant([1, 2, 3]))
+      first_result_input_name = result_op.op.inputs[0].name
+      second_result_input_name = result_op.op.inputs[1].name
+      self.assertIn('test_fn/', first_result_input_name)
+      self.assertNotIn('body/', first_result_input_name)
+      self.assertIn('test_fn/body/', second_result_input_name)
+
+  def test_class_name(self):
+
+    class TestClass(object):
+
+      def test_fn(self, l):
+
+        def body(i):
+          return i**2
+
+        l += [4]
+        return body(l)
+
+    # Note that 'TestClass' was needed in the namespace here.
+    node = self.parse_and_analyze(
+        TestClass, {'TestClass': TestClass}, owner_type=TestClass)
+    node = name_scopes.transform(node, self.ctx)
+
+    with self.compiled(node, ops.name_scope) as result:
+      result_op = result.TestClass().test_fn(constant_op.constant([1, 2, 3]))
+      first_result_input_name = result_op.op.inputs[0].name
+      second_result_input_name = result_op.op.inputs[1].name
+      self.assertIn('TestClass/test_fn/', first_result_input_name)
+      self.assertNotIn('body/', first_result_input_name)
+      self.assertIn('TestClass/test_fn/body/', second_result_input_name)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/py2tf/impl/api.py b/tensorflow/contrib/py2tf/impl/api.py
index 8ae1c701698ae9a4efbde45222ff6c3db6e92521..48100aac32844f5f10604b9c7a544c76d0b04eed 100644
--- a/tensorflow/contrib/py2tf/impl/api.py
+++ b/tensorflow/contrib/py2tf/impl/api.py
@@ -26,7 +26,9 @@ import six
 from tensorflow.contrib.py2tf.impl import config
 from tensorflow.contrib.py2tf.impl import conversion
 from tensorflow.contrib.py2tf.pyct import compiler
+from tensorflow.contrib.py2tf.pyct import inspect_utils
 from tensorflow.contrib.py2tf.pyct import parser
+from tensorflow.contrib.py2tf.utils import builtins
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util import tf_inspect
 
@@ -110,28 +112,7 @@ def convert(recursive=False, verbose=False, arg_types=None):
 
     @wraps(f)
     def wrapper(*args, **kwargs):
-      """Wrapper that calls the compiled version of the wrapped function."""
-      partial_types = ()
-      arg_values = {}
-      arg_names = tf_inspect.getargspec(f)[0]
-      for name, arg in zip(arg_names, args):
-        arg_values[name] = arg
-        arg_class = arg.__class__
-        # If arg_value_hints specifies any name, use that instead.
-        if name not in arg_types:
-          arg_types[name] = (arg_class.__name__, arg_class)
-        if name == 'self' and tf_inspect.isclass(arg_class):
-          # Annotated methods need to specify that their owner type is partial,
-          # otherwise other members they call will not be converted.
-          partial_types = (arg_class,)
-      wrapped = to_graph(
-          f,
-          recursive=recursive,
-          verbose=verbose,
-          arg_values=arg_values,
-          arg_types=arg_types,
-          partial_types=partial_types)
-      return wrapped(*args, **kwargs)
+      return converted_call(f, recursive, verbose, arg_types, *args, **kwargs)
 
     # Sometimes the decorator is just desugared, making it impossible to detect.
     # This attribute makes detection easier.
@@ -141,6 +122,78 @@ def convert(recursive=False, verbose=False, arg_types=None):
   return decorator
 
 
+def converted_call(f, recursive, verbose, arg_types, *args, **kwargs):
+  """Compiles a function call inline."""
+  # TODO(mdan): This needs cleanup.
+  # In particular, we may want to avoid renaming functions altogether.
+
+  if conversion.is_whitelisted_for_graph(f):
+    return f(*args, **kwargs)
+
+  unknown_arg_value = object()  # Sentinel for arguments of unknown value
+
+  if tf_inspect.isbuiltin(f):
+    return builtins.dynamic_builtin(f, *args, **kwargs)
+
+  if tf_inspect.isfunction(f) or tf_inspect.ismethod(f):
+    # Regular functions
+    target_entity = f
+    arg_map_target = f
+    effective_args = args
+    f_class = inspect_utils.getmethodclass(f)
+
+    if f_class is not None:
+      partial_types = (f_class,)
+    else:
+      partial_types = ()
+
+  elif tf_inspect.isclass(f):
+    # Constructors
+    target_entity = f
+    arg_map_target = f.__init__
+    effective_args = (unknown_arg_value,) + args
+    partial_types = ()
+
+  elif hasattr(f, '__call__') and hasattr(f, '__class__'):
+    # Callable objects
+    target_entity = f.__call__
+    arg_map_target = f.__call__
+    effective_args = (f,) + args
+    partial_types = (f.__class__,)
+
+  else:
+    raise NotImplementedError('unknown callable type "%s"' % type(f))
+
+  arg_values = tf_inspect.getcallargs(arg_map_target, *args, **kwargs)
+  for name, arg in arg_values.items():
+    if arg is unknown_arg_value:
+      continue
+    arg_class = arg.__class__
+    # If arg_types already has an entry for this name, keep it.
+    if name not in arg_types:
+      arg_types[name] = (arg_class.__name__, arg_class)
+
+  # When called from within a decorator, this is the only indication that
+  # the function is a method - it appears that the decorator is applied
+  # before the method is bound.
+  if not partial_types:
+    if 'self' in arg_values:
+      if tf_inspect.isclass(arg_values['self'].__class__):
+        partial_types = (arg_values['self'].__class__,)
+    elif 'cls' in arg_values:
+      if tf_inspect.isclass(arg_values['cls']):
+        partial_types = (arg_values['cls'],)
+
+  converted_f = to_graph(
+      target_entity,
+      recursive=recursive,
+      verbose=verbose,
+      arg_values=arg_values,
+      arg_types=arg_types,
+      partial_types=partial_types)
+  return converted_f(*effective_args, **kwargs)
+
+
 def to_graph(e,
              recursive=True,
              verbose=False,
@@ -175,7 +228,8 @@ def to_graph(e,
   conversion_map = conversion.ConversionMap(
       recursive=recursive,
       nocompile_decorators=(convert, graph_ready, convert_inline),
-      partial_types=partial_types)
+      partial_types=partial_types,
+      api_module=tf_inspect.getmodule(to_graph))
   _, name = conversion.entity_to_graph(e, conversion_map, arg_values, arg_types)
 
   module = gast.Module([])
@@ -188,7 +242,7 @@ def to_graph(e,
   # The compiled code should see everything the entry function saw.
   # TODO(mdan): This might not work well if the call tree spans modules?
   if tf_inspect.isfunction(e):
-    compiled_node.__dict__.update(six.get_function_globals(e))
+    compiled_node.__dict__.update(inspect_utils.getnamespace(e))
   compiled_fn = getattr(compiled_node, name)
 
   if verbose:
@@ -221,7 +275,8 @@ def to_code(e,
   conversion_map = conversion.ConversionMap(
       recursive=recursive,
       nocompile_decorators=(convert, graph_ready, convert_inline),
-      partial_types=partial_types)
+      partial_types=partial_types,
+      api_module=tf_inspect.getmodule(to_graph))
   conversion.entity_to_graph(e, conversion_map, arg_values, arg_types)
 
   imports = '\n'.join(config.COMPILED_IMPORT_STATEMENTS)
diff --git a/tensorflow/contrib/py2tf/impl/api_test.py b/tensorflow/contrib/py2tf/impl/api_test.py
index 02cd8ed2d0ffee8ef2d31ea65902d2b493df9d64..13f8e66018920a5b13f8bd3f00c67d3bbdd519aa 100644
--- a/tensorflow/contrib/py2tf/impl/api_test.py
+++ b/tensorflow/contrib/py2tf/impl/api_test.py
@@ -18,23 +18,26 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.contrib.py2tf import utils
 from tensorflow.contrib.py2tf.impl import api
 from tensorflow.contrib.py2tf.impl import config
 from tensorflow.contrib.py2tf.pyct import parser
 from tensorflow.python.framework import constant_op
-from tensorflow.python.ops import math_ops
 from tensorflow.python.platform import test
 
 
+tf = utils.fake_tf()
+
+
 class ApiTest(test.TestCase):
 
   def setUp(self):
-    config.DEFAULT_UNCOMPILED_MODULES.add((math_ops.__name__,))
     config.COMPILED_IMPORT_STATEMENTS = (
-        'from tensorflow.python.ops '
-        'import control_flow_ops as tf',
+        'from __future__ import print_function',
         'from tensorflow.contrib.py2tf import utils as '
-        'py2tf_utils')
+        'py2tf_utils',
+        'tf = py2tf_utils.fake_tf()'
+    )
 
   def test_decorator_recurses(self):
 
@@ -47,7 +50,7 @@ class ApiTest(test.TestCase):
 
       @api.convert(recursive=True)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
+        while tf.reduce_sum(x) > s:
           x //= self.called_member(a)
         return x
 
@@ -63,11 +66,11 @@ class ApiTest(test.TestCase):
     class TestClass(object):
 
       def called_member(self, a):
-        return math_ops.negative(a)
+        return tf.negative(a)
 
       @api.convert(recursive=False)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
+        while tf.reduce_sum(x) > s:
           x //= self.called_member(a)
         return x
 
@@ -84,11 +87,11 @@ class ApiTest(test.TestCase):
 
       @api.graph_ready
       def called_member(self, a):
-        return math_ops.negative(a)
+        return tf.negative(a)
 
       @api.convert(recursive=True)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
+        while tf.reduce_sum(x) > s:
           x //= self.called_member(a)
         return x
 
@@ -111,7 +114,7 @@ class ApiTest(test.TestCase):
 
       @api.convert(recursive=True)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
+        while tf.reduce_sum(x) > s:
           x //= self.called_member(a)
         return x
 
@@ -133,7 +136,7 @@ class ApiTest(test.TestCase):
 
       @api.convert(recursive=True)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
+        while tf.reduce_sum(x) > s:
           x //= api.convert_inline(self.called_member, a)
         return x
 
@@ -149,11 +152,11 @@ class ApiTest(test.TestCase):
     class TestClass(object):
 
       def called_member(self, a):
-        return math_ops.negative(a)
+        return tf.negative(a)
 
       @api.convert(recursive=True)
       def test_method(self, x, s, a):
-        while math_ops.reduce_sum(x) > s:
+        while tf.reduce_sum(x) > s:
           x //= api.graph_ready(self.called_member(a))
         return x
 
@@ -166,7 +169,7 @@ class ApiTest(test.TestCase):
 
   def test_to_graph_basic(self):
     def test_fn(x, s):
-      while math_ops.reduce_sum(x) > s:
+      while tf.reduce_sum(x) > s:
         x //= 2
       return x
 
@@ -178,7 +181,7 @@ class ApiTest(test.TestCase):
 
   def test_to_code_basic(self):
     def test_fn(x, s):
-      while math_ops.reduce_sum(x) > s:
+      while tf.reduce_sum(x) > s:
         x /= 2
       return x
 
diff --git a/tensorflow/contrib/py2tf/impl/config.py b/tensorflow/contrib/py2tf/impl/config.py
index 7c3ecefff0f8858d5505ff30e1270b2fd42c9ad8..bdbc6663dd65ed66c55ad2d2e52428084bbea219 100644
--- a/tensorflow/contrib/py2tf/impl/config.py
+++ b/tensorflow/contrib/py2tf/impl/config.py
@@ -31,15 +31,20 @@ PYTHON_LITERALS = {
 DEFAULT_UNCOMPILED_MODULES = set((
     ('tensorflow',),
     (utils.__name__,),
+
+    # All of tensorflow's subpackages. Unlike the root tf module, they don't
+    # have well-known names. Not referring to the module directly to avoid
+    # circular imports.
+    (utils.__name__[:-len('.contrib.py2tf.utils')],),
 ))
 
 NO_SIDE_EFFECT_CONSTRUCTORS = set(('tensorflow',))
 
 # TODO(mdan): Also allow controlling the generated names (for testability).
-# TODO(mdan): Verify that these names are not hidden by generated code.
-# TODO(mdan): Make sure copybara renames the reference below.
 COMPILED_IMPORT_STATEMENTS = (
     'from __future__ import print_function',
     'import tensorflow as tf',
+    'from tensorflow.contrib.py2tf.impl import api as '
+    'py2tf_api',
     'from tensorflow.contrib.py2tf import utils as '
     'py2tf_utils')
diff --git a/tensorflow/contrib/py2tf/impl/conversion.py b/tensorflow/contrib/py2tf/impl/conversion.py
index 3d5624b187ed47e9eed8afbb2e101e1098f81c15..d95469ea532d5c3acc44d1e65b852f27714b8049 100644
--- a/tensorflow/contrib/py2tf/impl/conversion.py
+++ b/tensorflow/contrib/py2tf/impl/conversion.py
@@ -19,7 +19,6 @@ from __future__ import division
 from __future__ import print_function
 
 import gast
-import six
 
 from tensorflow.contrib.py2tf import utils
 from tensorflow.contrib.py2tf.converters import asserts
@@ -31,10 +30,12 @@ from tensorflow.contrib.py2tf.converters import control_flow
 from tensorflow.contrib.py2tf.converters import decorators
 from tensorflow.contrib.py2tf.converters import for_loops
 from tensorflow.contrib.py2tf.converters import logical_expressions
+from tensorflow.contrib.py2tf.converters import name_scopes
 from tensorflow.contrib.py2tf.converters import side_effect_guards
 from tensorflow.contrib.py2tf.impl import config
 from tensorflow.contrib.py2tf.impl import naming
 from tensorflow.contrib.py2tf.pyct import context
+from tensorflow.contrib.py2tf.pyct import inspect_utils
 from tensorflow.contrib.py2tf.pyct import parser
 from tensorflow.contrib.py2tf.pyct import qual_names
 from tensorflow.contrib.py2tf.pyct.static_analysis import activity
@@ -56,18 +57,26 @@ class ConversionMap(object):
         off.
     dependency_cache: dict[object]: ast; maps original entities to their
         converted AST
+    additional_imports: set(object); additional entities which for any reason
+        cannot be attached after loading and need to be explicitly imported
+        in the generated code
     name_map: dict[string]: string; maps original entities to the name of
         their converted counterparts
+    api_module: A reference to the api module. The reference needs to be passed
+        to avoid circular dependencies.
   """
 
   # TODO(mdan): Rename to ConversionContext, and pull in additional flags.
 
-  def __init__(self, recursive, nocompile_decorators, partial_types):
+  def __init__(self, recursive, nocompile_decorators, partial_types,
+               api_module):
     self.recursive = recursive
     self.nocompile_decorators = nocompile_decorators
     self.partial_types = partial_types if partial_types else ()
     self.dependency_cache = {}
+    self.additional_imports = set()
     self.name_map = {}
+    self.api_module = api_module
 
   def new_namer(self, namespace):
     return naming.Namer(namespace, self.recursive, self.name_map,
@@ -88,6 +97,24 @@ class ConversionMap(object):
     self.dependency_cache[original_entity] = converted_ast
 
 
+def is_whitelisted_for_graph(o):
+  """Check whether an entity is whitelisted for use in graph mode.
+
+  Examples of whitelisted entities include all members of the tensorflow
+  package.
+
+  Args:
+    o: A Python entity.
+  Returns:
+    Boolean
+  """
+  m = tf_inspect.getmodule(o)
+  for prefix, in config.DEFAULT_UNCOMPILED_MODULES:
+    if m.__name__.startswith(prefix):
+      return True
+  return False
+
+
 def entity_to_graph(o, conversion_map, arg_values, arg_types):
   """Compile a Python entity into equivalent TensorFlow.
 
@@ -146,7 +173,7 @@ def class_to_graph(c, conversion_map):
   if not members:
     raise ValueError('Cannot convert %s: it has no member methods.')
 
-  class_globals = None
+  class_namespace = None
   for _, m in members:
     node, _ = function_to_graph(
         m,
@@ -155,10 +182,10 @@ def class_to_graph(c, conversion_map):
         arg_types={'self': (c.__name__, c)},
         owner_type=c)
     # TODO(mdan): Do not assume all members have the same view of globals.
-    if class_globals is None:
-      class_globals = six.get_function_globals(m)
+    if class_namespace is None:
+      class_namespace = inspect_utils.getnamespace(m)
     converted_members[m] = node
-  namer = conversion_map.new_namer(class_globals)
+  namer = conversion_map.new_namer(class_namespace)
   class_name = namer.compiled_class_name(c.__name__, c)
   node = gast.ClassDef(
       class_name,
@@ -170,29 +197,34 @@ def class_to_graph(c, conversion_map):
   return node, class_name
 
 
-def function_to_graph(f, conversion_map, arg_values, arg_types,
-                      owner_type=None):
-  """Specialization of `entity_to_graph` for callable functions."""
-  node, source = parser.parse_entity(f)
-  node = node.body[0]
-  namespace = six.get_function_globals(f)
-
-  # This is needed for non-global functions.
-  closure = six.get_function_closure(f)
-  if closure:
-    for e in closure:
-      if callable(e.cell_contents):
-        fn = e.cell_contents
-        namespace[fn.__name__] = fn
-
+def _add_self_references(namespace, api_module):
+  """Self refs are only required for analysis and are not used directly."""
   # Manually add the utils namespace which may be used from generated code.
   if 'py2tf_util' not in namespace:
     namespace['py2tf_utils'] = utils
   elif namespace['py2tf_utils'] != utils:
     raise ValueError(
-        'The module name py2tf_utils is reserved and may not be used.')
+        'The module name "py2tf_utils" is reserved and may not be used.')
+
+  # We also make reference to the api module for dynamic conversion, but
+  # to avoid circular references we don't import it here.
+  if 'py2tf_api' not in namespace:
+    namespace['py2tf_api'] = api_module
+  elif namespace['py2tf_api'] != api_module:
+    raise ValueError(
+        'The module name "py2tf_api" is reserved and may not be used.')
+
 
+def function_to_graph(f, conversion_map, arg_values, arg_types,
+                      owner_type=None):
+  """Specialization of `entity_to_graph` for callable functions."""
+  node, source = parser.parse_entity(f)
+  node = node.body[0]
+
+  namespace = inspect_utils.getnamespace(f)
+  _add_self_references(namespace, conversion_map.api_module)
   namer = conversion_map.new_namer(namespace)
+
   ctx = context.EntityContext(
       namer=namer,
       source_code=source,
@@ -200,8 +232,9 @@ def function_to_graph(f, conversion_map, arg_values, arg_types,
       namespace=namespace,
       arg_values=arg_values,
       arg_types=arg_types,
+      owner_type=owner_type,
       recursive=conversion_map.recursive)
-  node = node_to_graph(node, ctx, conversion_map.nocompile_decorators)
+  node, deps = node_to_graph(node, ctx, conversion_map.nocompile_decorators)
 
   # TODO(mdan): This somewhat duplicates the call rename logic in call_treest.py
   new_name, did_rename = namer.compiled_function_name(f.__name__, f, owner_type)
@@ -212,6 +245,9 @@ def function_to_graph(f, conversion_map, arg_values, arg_types,
 
   node.name = new_name
   conversion_map.update_name_map(namer)
+  # TODO(mdan): Use this at compilation.
+  conversion_map.additional_imports.update(deps)
+
   return node, new_name
 
 
@@ -254,7 +290,7 @@ def node_to_graph(node, ctx, nocompile_decorators):
   # source.
   # TODO(mdan): Is it feasible to reconstruct intermediate source code?
   ctx.source_code = None
-  node = decorators.transform(node, nocompile_decorators)
+  node, deps = decorators.transform(node, nocompile_decorators)
   node = break_statements.transform(node, ctx)
   node = asserts.transform(node, ctx)
 
@@ -278,5 +314,6 @@ def node_to_graph(node, ctx, nocompile_decorators):
   node = _static_analysis_pass(node, ctx)
   node = logical_expressions.transform(node)
   node = side_effect_guards.transform(node, ctx)
+  node = name_scopes.transform(node, ctx)
 
-  return node
+  return node, deps
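
Note on `is_whitelisted_for_graph` (added earlier in this file): the check is a prefix match of the entity's module name against `config.DEFAULT_UNCOMPILED_MODULES`. Below is a minimal standalone sketch of the same idea, using only the standard library. The names `UNCOMPILED_MODULE_PREFIXES` and `is_whitelisted` are hypothetical, and `json` stands in for the tensorflow package. Unlike the code in this change, the sketch guards against `inspect.getmodule` returning `None` and only matches on a package boundary; both are assumptions of the sketch, not behavior of the patch.

```python
import inspect
import json

# Hypothetical stand-in for config.DEFAULT_UNCOMPILED_MODULES: a set of
# 1-tuples, each holding a module name prefix.
UNCOMPILED_MODULE_PREFIXES = {('json',)}


def is_whitelisted(o, prefixes=UNCOMPILED_MODULE_PREFIXES):
  """Returns True if o's module is one of the prefixes or a submodule."""
  m = inspect.getmodule(o)
  if m is None:
    # getmodule may fail to resolve a module, e.g. for plain values.
    return False
  for (prefix,) in prefixes:
    # Match the package itself or anything below it, but not a sibling
    # package that merely shares the leading characters.
    if m.__name__ == prefix or m.__name__.startswith(prefix + '.'):
      return True
  return False


in_package = is_whitelisted(json.dumps)          # defined in 'json'
in_subpackage = is_whitelisted(json.JSONDecoder)  # defined in 'json.decoder'
builtin = is_whitelisted(len)                     # defined in 'builtins'
```

A plain `startswith(prefix)` test, as in the patch, would also accept a module whose name merely begins with the same characters; whether that matters depends on which prefixes are registered.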
diff --git a/tensorflow/contrib/py2tf/impl/conversion_test.py b/tensorflow/contrib/py2tf/impl/conversion_test.py
index 3888958f19b9fa13b759924c5188722e500e30a1..9ff256aace7a0e7ac5e7ac07e580b8bed7d8df6f 100644
--- a/tensorflow/contrib/py2tf/impl/conversion_test.py
+++ b/tensorflow/contrib/py2tf/impl/conversion_test.py
@@ -20,15 +20,26 @@ from __future__ import print_function
 
 import gast
 
+from tensorflow.contrib.py2tf import utils
 from tensorflow.contrib.py2tf.impl import conversion
+from tensorflow.python.framework import constant_op
 from tensorflow.python.platform import test
 
 
 class ConversionTest(test.TestCase):
 
+  def test_is_whitelisted_for_graph(self):
+
+    def test_fn():
+      return constant_op.constant(1)
+
+    self.assertFalse(conversion.is_whitelisted_for_graph(test_fn))
+    self.assertTrue(conversion.is_whitelisted_for_graph(utils))
+    self.assertTrue(conversion.is_whitelisted_for_graph(constant_op.constant))
+
   def test_entity_to_graph_unsupported_types(self):
     with self.assertRaises(ValueError):
-      conversion_map = conversion.ConversionMap(True, (), ())
+      conversion_map = conversion.ConversionMap(True, (), (), None)
       conversion.entity_to_graph('dummy', conversion_map, None, None)
 
   def test_entity_to_graph_callable(self):
@@ -36,7 +47,7 @@ class ConversionTest(test.TestCase):
     def f(a):
       return a
 
-    conversion_map = conversion.ConversionMap(True, (), ())
+    conversion_map = conversion.ConversionMap(True, (), (), None)
     ast, new_name = conversion.entity_to_graph(f, conversion_map, None, None)
     self.assertTrue(isinstance(ast, gast.FunctionDef), ast)
     self.assertEqual('tf__f', new_name)
@@ -49,14 +60,17 @@ class ConversionTest(test.TestCase):
     def f(a):
       return g(a)
 
-    conversion_map = conversion.ConversionMap(True, (), ())
+    conversion_map = conversion.ConversionMap(True, (), (), None)
     conversion.entity_to_graph(f, conversion_map, None, None)
 
     self.assertTrue(f in conversion_map.dependency_cache)
     self.assertTrue(g in conversion_map.dependency_cache)
     self.assertEqual('tf__f', conversion_map.dependency_cache[f].name)
+    # The extra .body[0] steps past the `with tf.name_scope('f')` block
+    # that is added automatically.
     self.assertEqual(
-        'tf__g', conversion_map.dependency_cache[f].body[0].value.func.id)
+        'tf__g',
+        conversion_map.dependency_cache[f].body[0].body[0].value.func.id)
     self.assertEqual('tf__g', conversion_map.dependency_cache[g].name)
 
 
diff --git a/tensorflow/contrib/py2tf/pyct/BUILD b/tensorflow/contrib/py2tf/pyct/BUILD
index e3c0da4b10f9ffbee1b2a906b64d4762f41d97b4..edec5f7712d08247437c9e95d743e59dafffcd7b 100644
--- a/tensorflow/contrib/py2tf/pyct/BUILD
+++ b/tensorflow/contrib/py2tf/pyct/BUILD
@@ -24,6 +24,7 @@ py_library(
         "ast_util.py",
         "compiler.py",
         "context.py",
+        "inspect_utils.py",
         "parser.py",
         "pretty_printer.py",
         "qual_names.py",
@@ -72,6 +73,17 @@ py_test(
     ],
 )
 
+py_test(
+    name = "inspect_utils_test",
+    srcs = ["inspect_utils_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":pyct",
+        "//tensorflow/python:client_testlib",
+        "@gast_archive//:gast",
+    ],
+)
+
 py_test(
     name = "parser_test",
     srcs = ["parser_test.py"],
diff --git a/tensorflow/contrib/py2tf/pyct/context.py b/tensorflow/contrib/py2tf/pyct/context.py
index fef74ebefa290369c7310af6d7e4faeef44d9aee..4fcf2a687d58af951adfc0dcf52ff7303d2b17f5 100644
--- a/tensorflow/contrib/py2tf/pyct/context.py
+++ b/tensorflow/contrib/py2tf/pyct/context.py
@@ -30,14 +30,16 @@ class EntityContext(object):
         (excluding parameters).
     arg_values: Dict[str->*], containing parameter values, if known.
     arg_types: Dict[str->*], containing parameter types, if known.
+    owner_type: The surrounding class type of the function, if present.
   """
 
   def __init__(self, namer, source_code, source_file, namespace, arg_values,
-               arg_types, recursive):
+               arg_types, owner_type, recursive):
     self.namer = namer
     self.source_code = source_code
     self.source_file = source_file
     self.namespace = namespace
     self.arg_values = {} if arg_values is None else arg_values
     self.arg_types = {} if arg_types is None else arg_types
+    self.owner_type = owner_type
     self.recursive = recursive
diff --git a/tensorflow/contrib/py2tf/pyct/inspect_utils.py b/tensorflow/contrib/py2tf/pyct/inspect_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..d19c6ed75e0f0651781d6e1ed80f7be11fb8a5a4
--- /dev/null
+++ b/tensorflow/contrib/py2tf/pyct/inspect_utils.py
@@ -0,0 +1,119 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Live entity inspection utilities.
+
+This module contains whatever inspect doesn't offer out of the box.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import itertools
+
+import six
+
+from tensorflow.python.util import tf_inspect
+
+
+def getnamespace(f):
+  """Returns the complete namespace of a function.
+
+  Namespace is defined here as the mapping of all non-local variables to values.
+  This includes the globals and the closure variables. Note that this captures
+  the entire globals collection of the function, and may contain extra symbols
+  that it does not actually use.
+
+  Args:
+    f: User defined function.
+  Returns:
+    A dict mapping symbol names to values.
+  """
+  namespace = dict(six.get_function_globals(f))
+  closure = six.get_function_closure(f)
+  freevars = six.get_function_code(f).co_freevars
+  if freevars and closure:
+    for name, cell in zip(freevars, closure):
+      namespace[name] = cell.cell_contents
+  return namespace
+
+
+def getmethodclass(m):
+  """Resolves a function's owner, e.g. a method's class.
+
+  Note that this returns the object that the function was retrieved from, not
+  necessarily the class where it was defined.
+
+  This function relies on Python stack frame support in the interpreter, and
+  has the same limitations as inspect.currentframe.
+
+  Limitations: this function only works correctly if the owner class is
+  visible in the caller's global or local variables.
+
+  Args:
+    m: A user defined function
+
+  Returns:
+    The class that this function was retrieved from, or None if the function
+    is not an object or class method, or the class that owns the object or
+    method is not visible to m.
+
+  Raises:
+    ValueError: if the class could not be resolved for any unexpected reason.
+  """
+
+  # Instance method and class methods: should be bound to a non-null "self".
+  # If self is a class, then it's a class method.
+  if hasattr(m, '__self__'):
+    if m.__self__:
+      if tf_inspect.isclass(m.__self__):
+        return m.__self__
+      return type(m.__self__)
+
+  # Class, static and unbound methods: search all defined classes in any
+  # namespace. This is an inefficient but more robust method.
+  owners = []
+  caller_frame = tf_inspect.currentframe().f_back
+  try:
+    # TODO(mdan): This doesn't consider cell variables.
+    # TODO(mdan): This won't work if the owner is hidden inside a container.
+    # Cell variables may be pulled using co_freevars and the closure.
+    for v in itertools.chain(caller_frame.f_locals.values(),
+                             caller_frame.f_globals.values()):
+      if hasattr(v, m.__name__):
+        candidate = getattr(v, m.__name__)
+        # Py2 methods may be bound or unbound, extract im_func to get the
+        # underlying function.
+        if hasattr(candidate, 'im_func'):
+          candidate = candidate.im_func
+        if hasattr(m, 'im_func'):
+          m = m.im_func
+        if candidate is m:
+          owners.append(v)
+  finally:
+    del caller_frame
+
+  if owners:
+    if len(owners) == 1:
+      return owners[0]
+
+    # If multiple owners are found, and are not subclasses, raise an error.
+    owner_types = tuple(o if tf_inspect.isclass(o) else type(o) for o in owners)
+    for o in owner_types:
+      if all(issubclass(o, t) for t in owner_types):
+        return o
+    raise ValueError('Found too many owners of %s: %s' % (m, owners))
+
+  return None
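
Note on `getnamespace` above: it copies the function's globals and then overlays the closure variables, pairing `co_freevars` with the closure cells by position. The following is a minimal sketch of the same technique without `six`, assuming Python 3 function attributes (`__globals__`, `__closure__`, `__code__`). The names `get_namespace` and `make_adder` are illustrative and not part of this change.

```python
def get_namespace(f):
  """Returns a copy of f's globals, overlaid with its closure variables."""
  # Copy the globals so that adding closure variables does not leak into
  # the module's real global namespace.
  namespace = dict(f.__globals__)
  closure = f.__closure__ or ()
  # co_freevars and __closure__ are parallel: the i-th name is stored in
  # the i-th cell.
  for name, cell in zip(f.__code__.co_freevars, closure):
    namespace[name] = cell.cell_contents
  return namespace


def make_adder(increment):
  def add(x):
    return x + increment
  return add


adder = make_adder(3)
ns = get_namespace(adder)
```

Here `ns['increment']` holds the closed-over value, while `adder.__globals__` is left unmodified. The copy is what the `test_getnamespace_hermetic` case in this change checks for.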
diff --git a/tensorflow/contrib/py2tf/pyct/inspect_utils_test.py b/tensorflow/contrib/py2tf/pyct/inspect_utils_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..5528ac851f74bd7b7dacdbe7b930945afa8c9783
--- /dev/null
+++ b/tensorflow/contrib/py2tf/pyct/inspect_utils_test.py
@@ -0,0 +1,230 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for inspect_utils module."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from functools import wraps
+
+import six
+
+from tensorflow.contrib.py2tf.pyct import inspect_utils
+from tensorflow.python.platform import test
+
+
+def decorator(f):
+  return f
+
+
+def function_decorator():
+  def dec(f):
+    return f
+  return dec
+
+
+def wrapping_decorator():
+  def dec(f):
+    def replacement(*_):
+      return None
+
+    @wraps(f)
+    def wrapper(*args, **kwargs):
+      return replacement(*args, **kwargs)
+    return wrapper
+  return dec
+
+
+class TestClass(object):
+
+  def member_function(self):
+    pass
+
+  @decorator
+  def decorated_member(self):
+    pass
+
+  @function_decorator()
+  def fn_decorated_member(self):
+    pass
+
+  @wrapping_decorator()
+  def wrap_decorated_member(self):
+    pass
+
+  @staticmethod
+  def static_method():
+    pass
+
+  @classmethod
+  def class_method(cls):
+    pass
+
+
+def free_function():
+  pass
+
+
+def factory():
+  return free_function
+
+
+def free_factory():
+  def local_function():
+    pass
+  return local_function
+
+
+class InspectUtilsTest(test.TestCase):
+
+  def test_getnamespace_globals(self):
+    ns = inspect_utils.getnamespace(factory)
+    self.assertEqual(ns['free_function'], free_function)
+
+  def test_getnamespace_hermetic(self):
+
+    # Intentionally hiding the global function to make sure we don't overwrite
+    # it in the global namespace.
+    free_function = object()  # pylint:disable=redefined-outer-name
+
+    def test_fn():
+      return free_function
+
+    ns = inspect_utils.getnamespace(test_fn)
+    globs = six.get_function_globals(test_fn)
+    self.assertTrue(ns['free_function'] is free_function)
+    self.assertFalse(globs['free_function'] is free_function)
+
+  def test_getnamespace_locals(self):
+
+    def called_fn():
+      return 0
+
+    closed_over_list = []
+    closed_over_primitive = 1
+
+    def local_fn():
+      closed_over_list.append(1)
+      local_var = 1
+      return called_fn() + local_var + closed_over_primitive
+
+    ns = inspect_utils.getnamespace(local_fn)
+    self.assertEqual(ns['called_fn'], called_fn)
+    self.assertEqual(ns['closed_over_list'], closed_over_list)
+    self.assertEqual(ns['closed_over_primitive'], closed_over_primitive)
+    self.assertTrue('local_var' not in ns)
+
+  def test_getmethodclass(self):
+
+    self.assertEqual(
+        inspect_utils.getmethodclass(free_function), None)
+    self.assertEqual(
+        inspect_utils.getmethodclass(free_factory()), None)
+
+    self.assertEqual(
+        inspect_utils.getmethodclass(TestClass.member_function),
+        TestClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(TestClass.decorated_member),
+        TestClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(TestClass.fn_decorated_member),
+        TestClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(TestClass.wrap_decorated_member),
+        TestClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(TestClass.static_method),
+        TestClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(TestClass.class_method),
+        TestClass)
+
+    test_obj = TestClass()
+    self.assertEqual(
+        inspect_utils.getmethodclass(test_obj.member_function),
+        TestClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(test_obj.decorated_member),
+        TestClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(test_obj.fn_decorated_member),
+        TestClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(test_obj.wrap_decorated_member),
+        TestClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(test_obj.static_method),
+        TestClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(test_obj.class_method),
+        TestClass)
+
+  def test_getmethodclass_locals(self):
+
+    def local_function():
+      pass
+
+    class LocalClass(object):
+
+      def member_function(self):
+        pass
+
+      @decorator
+      def decorated_member(self):
+        pass
+
+      @function_decorator()
+      def fn_decorated_member(self):
+        pass
+
+      @wrapping_decorator()
+      def wrap_decorated_member(self):
+        pass
+
+    self.assertEqual(
+        inspect_utils.getmethodclass(local_function), None)
+
+    self.assertEqual(
+        inspect_utils.getmethodclass(LocalClass.member_function),
+        LocalClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(LocalClass.decorated_member),
+        LocalClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(LocalClass.fn_decorated_member),
+        LocalClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(LocalClass.wrap_decorated_member),
+        LocalClass)
+
+    test_obj = LocalClass()
+    self.assertEqual(
+        inspect_utils.getmethodclass(test_obj.member_function),
+        LocalClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(test_obj.decorated_member),
+        LocalClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(test_obj.fn_decorated_member),
+        LocalClass)
+    self.assertEqual(
+        inspect_utils.getmethodclass(test_obj.wrap_decorated_member),
+        LocalClass)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/activity.py b/tensorflow/contrib/py2tf/pyct/static_analysis/activity.py
index 1c93e1603113d48176af7a97a0f37321e6f67586..02ea6fdeaf78152b6bc48983f79b36f43d4f665d 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/activity.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/activity.py
@@ -24,6 +24,7 @@ import gast
 
 from tensorflow.contrib.py2tf.pyct import anno
 from tensorflow.contrib.py2tf.pyct import transformer
+from tensorflow.contrib.py2tf.pyct.qual_names import QN
 from tensorflow.contrib.py2tf.pyct.static_analysis.annos import NodeAnno
 
 # TODO(mdan): Add support for PY3 (e.g. Param vs arg).
@@ -237,6 +238,18 @@ class ActivityAnalizer(transformer.Base):
       self.scope.merge_from(after_child)
     return parent
 
+  def visit_FunctionDef(self, node):
+    if self.scope:
+      qn = QN(node.name)
+      self.scope.mark_write(qn)
+    current_scope = self.scope
+    fndef_scope = Scope(current_scope, isolated=True)
+    self.scope = fndef_scope
+    self.generic_visit(node)
+    anno.setanno(node, NodeAnno.BODY_SCOPE, fndef_scope)
+    self.scope = current_scope
+    return node
+
   def visit_If(self, node):
     self.visit(node.test)
     node = self._process_parallel_blocks(node,
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/activity_test.py b/tensorflow/contrib/py2tf/pyct/static_analysis/activity_test.py
index e1eb954a5efef4d6a00ac492e7c85394d54e28c9..69f5f4fc582f159e46c8b8929a90ca95b724794d 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/activity_test.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/activity_test.py
@@ -108,6 +108,7 @@ class ActivityAnalizerTest(test.TestCase):
         namespace={},
         arg_values=None,
         arg_types=None,
+        owner_type=None,
         recursive=True)
     node = qual_names.resolve(node)
     node = activity.resolve(node, ctx)
@@ -239,6 +240,33 @@ class ActivityAnalizerTest(test.TestCase):
         anno.getanno(if_node, NodeAnno.ORELSE_SCOPE).parent, ('x', 'z', 'u'),
         ('x', 'y', 'z', 'u'), ('x', 'y', 'z', 'u'))
 
+  def test_functiondef(self):
+
+    def test_fn(a):
+
+      def f(x):
+        y = x * x
+        return y
+
+      b = a
+      for i in a:
+        c = b
+        b -= f(i)
+      return b, c
+
+    node = self._parse_and_analyze(test_fn)
+    fndef_node = node.body[0].body[0]
+
+    self.assertScopeIs(
+        anno.getanno(fndef_node,
+                     NodeAnno.BODY_SCOPE).parent, ('b', 'i', 'f', 'c', 'a'),
+        ('f', 'b', 'c', 'i'), ('f', 'a', 'b', 'c', 'i'))
+    self.assertScopeIs(
+        anno.getanno(fndef_node, NodeAnno.BODY_SCOPE), ('x', 'y'), ('y',), (
+            'x',
+            'y',
+        ))
+
   def test_call_with_composite_names(self):
 
     def foo(*_):
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/live_values.py b/tensorflow/contrib/py2tf/pyct/static_analysis/live_values.py
index 9c0a9a9e74eccb3d22840032e8f0c2b81e051e7e..0388be5d252389f2f3516c8b27828905d6475589 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/live_values.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/live_values.py
@@ -86,6 +86,7 @@ class LiveValueResolver(transformer.Base):
       if not hasattr(parent_object, node.attr):
         raise AttributeError('%s has no attribute %s' % (parent_object,
                                                          node.attr))
+      anno.setanno(node, 'parent_type', type(parent_object))
       anno.setanno(node, 'live_val', getattr(parent_object, node.attr))
       anno.setanno(node, 'fqn', anno.getanno(node.value, 'fqn') + (node.attr,))
     # TODO(mdan): Investigate the role built-in annotations can play here.
@@ -96,6 +97,7 @@ class LiveValueResolver(transformer.Base):
         # This would not hold for dynamic members like function attributes.
         # For the dynamic case, we simply leave the node without an annotation,
         # and let downstream consumers figure out what to do.
+        anno.setanno(node, 'parent_type', parent_type)
         anno.setanno(node, 'live_val', getattr(parent_type, node.attr))
         anno.setanno(node, 'fqn',
                      anno.getanno(node.value, 'type_fqn') + (node.attr,))
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/live_values_test.py b/tensorflow/contrib/py2tf/pyct/static_analysis/live_values_test.py
index 9f64689401e3594a77fbdd7b6f02880bd6e90492..c133a455b3dd328689102634c6076f366212ac25 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/live_values_test.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/live_values_test.py
@@ -46,6 +46,7 @@ class LiveValuesResolverTest(test.TestCase):
         namespace=namespace,
         arg_values=None,
         arg_types=arg_types,
+        owner_type=None,
         recursive=True)
     node = qual_names.resolve(node)
     node = activity.resolve(node, ctx)
@@ -102,6 +103,7 @@ class LiveValuesResolverTest(test.TestCase):
         arg_types={'self': (TestClass.__name__, TestClass)})
     func_node = node.body[0].body[0].value.func
     self.assertEquals(TestClass.member, anno.getanno(func_node, 'live_val'))
+    self.assertEquals(TestClass, anno.getanno(func_node, 'parent_type'))
     self.assertEquals(('TestClass', 'member'), anno.getanno(func_node, 'fqn'))
 
 
diff --git a/tensorflow/contrib/py2tf/pyct/static_analysis/type_info_test.py b/tensorflow/contrib/py2tf/pyct/static_analysis/type_info_test.py
index 3659f949db9910534870d8dd9e42fd4ee8297253..a3e78202c80e45552c038a6a1da763eb30aff52f 100644
--- a/tensorflow/contrib/py2tf/pyct/static_analysis/type_info_test.py
+++ b/tensorflow/contrib/py2tf/pyct/static_analysis/type_info_test.py
@@ -65,6 +65,7 @@ class TypeInfoResolverTest(test.TestCase):
         namespace=namespace,
         arg_values=None,
         arg_types=arg_types,
+        owner_type=None,
         recursive=True)
     node = qual_names.resolve(node)
     node = activity.resolve(node, ctx)
diff --git a/tensorflow/contrib/py2tf/pyct/templates.py b/tensorflow/contrib/py2tf/pyct/templates.py
index 6ee6c0c5ceb70d87779ee313670135cadc5214b5..7021e2ba93743deb5ba6fecfe88428600b9489db 100644
--- a/tensorflow/contrib/py2tf/pyct/templates.py
+++ b/tensorflow/contrib/py2tf/pyct/templates.py
@@ -79,6 +79,17 @@ class ReplaceTransformer(gast.NodeTransformer):
     else:
       raise ValueError('unexpected node type "%s"' % node)
 
+  def visit_Attribute(self, node):
+    node = self.generic_visit(node)
+    if node.attr not in self.replacements:
+      return node
+    repl = self.replacements[node.attr]
+    if not isinstance(repl, gast.Name):
+      raise ValueError(
+          'An attribute can only be replaced by a Name node. Found: %s' % repl)
+    node.attr = repl.id
+    return node
+
   def visit_Name(self, node):
     if node.id not in self.replacements:
       return node
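
Note on the `visit_Attribute` hunk above: the new method lets a template replace the attribute name in `a.foo`, and rejects replacements that are not plain names. A minimal sketch of the same idea follows, assuming the standard-library `ast` module in place of `gast`, and string replacements in place of `gast.Name` nodes. `AttributeRenamer` and `rename_attributes` are hypothetical names used only for illustration; they are not part of py2tf.

```python
import ast
import types


class AttributeRenamer(ast.NodeTransformer):
  """Renames attribute accesses using an {old_attr: new_attr} mapping."""

  def __init__(self, replacements):
    self.replacements = replacements

  def visit_Attribute(self, node):
    # Visit children first so chained accesses such as a.foo.foo are handled.
    node = self.generic_visit(node)
    if node.attr not in self.replacements:
      return node
    repl = self.replacements[node.attr]
    # Assumption: only identifier strings are valid replacements here; the
    # patch itself requires a gast.Name node instead.
    if not isinstance(repl, str) or not repl.isidentifier():
      raise ValueError(
          'An attribute can only be replaced by an identifier. Found: %r' %
          (repl,))
    node.attr = repl
    return node


def rename_attributes(source, **replacements):
  """Parses source and returns a module AST with attributes renamed."""
  tree = ast.parse(source)
  tree = AttributeRenamer(replacements).visit(tree)
  ast.fix_missing_locations(tree)
  return tree


TEMPLATE = 'def test_fn(a):\n  return a.foo\n'

# Rewrite a.foo into a.b, then compile and call the resulting function.
demo_tree = rename_attributes(TEMPLATE, foo='b')
demo_namespace = {}
exec(compile(demo_tree, '<template>', 'exec'), demo_namespace)
demo_result = demo_namespace['test_fn'](types.SimpleNamespace(b=3))
```

As in `test_replace_attribute`, a non-name replacement such as `foo=1` raises `ValueError` in this sketch.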
diff --git a/tensorflow/contrib/py2tf/pyct/templates_test.py b/tensorflow/contrib/py2tf/pyct/templates_test.py
index 8ccfde8573724741b0bbe4eacb3c54beb381ee7e..0d1c1c5d9ecf3fb9d7956f35bfce736389c0ec57 100644
--- a/tensorflow/contrib/py2tf/pyct/templates_test.py
+++ b/tensorflow/contrib/py2tf/pyct/templates_test.py
@@ -18,6 +18,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import imp
+
 import gast
 
 from tensorflow.contrib.py2tf.pyct import compiler
@@ -62,7 +64,7 @@ class TemplatesTest(test.TestCase):
     result, _ = compiler.ast_to_object(node)
     self.assertEquals(7, result.test_fn(2))
 
-  def test_code_block(self):
+  def test_replace_code_block(self):
     template = """
       def test_fn(a):
         block
@@ -79,6 +81,21 @@ class TemplatesTest(test.TestCase):
     result, _ = compiler.ast_to_object(node)
     self.assertEquals(3, result.test_fn(1))
 
+  def test_replace_attribute(self):
+    template = """
+      def test_fn(a):
+        return a.foo
+    """
+
+    node = templates.replace(template, foo='b')[0]
+    result, _ = compiler.ast_to_object(node)
+    mod = imp.new_module('test')
+    mod.b = 3
+    self.assertEquals(3, result.test_fn(mod))
+
+    with self.assertRaises(ValueError):
+      templates.replace(template, foo=1)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/contrib/py2tf/pyct/transformer.py b/tensorflow/contrib/py2tf/pyct/transformer.py
index 877d52af016af720424c8a56257fec9ab64611cb..57016bb4ce84776dfc8dfbe380322a03eb4b37b8 100644
--- a/tensorflow/contrib/py2tf/pyct/transformer.py
+++ b/tensorflow/contrib/py2tf/pyct/transformer.py
@@ -44,6 +44,12 @@ class Base(gast.NodeTransformer):
     self._col_offset = 0
     self.context = context
 
+  def debug_print(self, node):
+    """Helper method useful for debugging."""
+    if __debug__:
+      print(pretty_printer.fmt(node))
+    return node
+
   def visit(self, node):
     source_code = self.context.source_code
     source_file = self.context.source_file
diff --git a/tensorflow/contrib/py2tf/utils/BUILD b/tensorflow/contrib/py2tf/utils/BUILD
index c2fdd40707775783140390e4b5c0186c9c3e562e..63261d5043d818bea57435e9a9f22f058041a087 100644
--- a/tensorflow/contrib/py2tf/utils/BUILD
+++ b/tensorflow/contrib/py2tf/utils/BUILD
@@ -20,12 +20,13 @@ py_library(
     name = "utils",
     srcs = [
         "__init__.py",
+        "builtins.py",
         "context_managers.py",
         "misc.py",
         "multiple_dispatch.py",
-        "printing.py",
         "py_func.py",
         "tensor_list.py",
+        "testing.py",
         "type_check.py",
     ],
     srcs_version = "PY2AND3",
@@ -76,16 +77,6 @@ py_test(
     ],
 )
 
-py_test(
-    name = "printing_test",
-    srcs = ["printing_test.py"],
-    srcs_version = "PY2AND3",
-    deps = [
-        ":utils",
-        "//tensorflow/python:client_testlib",
-    ],
-)
-
 py_test(
     name = "type_check_test",
     srcs = ["type_check_test.py"],
diff --git a/tensorflow/contrib/py2tf/utils/__init__.py b/tensorflow/contrib/py2tf/utils/__init__.py
index 0a1b993fd366e1317e5f7e01fe849d86c93b8fc2..313e5c97cc113509169bfb5e7489469ebb81577a 100644
--- a/tensorflow/contrib/py2tf/utils/__init__.py
+++ b/tensorflow/contrib/py2tf/utils/__init__.py
@@ -18,10 +18,12 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.contrib.py2tf.utils.builtins import dynamic_builtin
+from tensorflow.contrib.py2tf.utils.builtins import dynamic_print
 from tensorflow.contrib.py2tf.utils.context_managers import control_dependency_on_returns
 from tensorflow.contrib.py2tf.utils.misc import alias_tensors
 from tensorflow.contrib.py2tf.utils.multiple_dispatch import run_cond
 from tensorflow.contrib.py2tf.utils.multiple_dispatch import run_while
-from tensorflow.contrib.py2tf.utils.printing import call_print
 from tensorflow.contrib.py2tf.utils.py_func import wrap_py_func
+from tensorflow.contrib.py2tf.utils.testing import fake_tf
 from tensorflow.contrib.py2tf.utils.type_check import is_tensor
diff --git a/tensorflow/contrib/py2tf/utils/printing.py b/tensorflow/contrib/py2tf/utils/builtins.py
similarity index 62%
rename from tensorflow/contrib/py2tf/utils/printing.py
rename to tensorflow/contrib/py2tf/utils/builtins.py
index 95a62bd80b5f4854e6a062df18d882f7bd495555..0a50b80b60101afaa9aa0f445079727e9708ac35 100644
--- a/tensorflow/contrib/py2tf/utils/printing.py
+++ b/tensorflow/contrib/py2tf/utils/builtins.py
@@ -12,14 +12,40 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""TensorFlow printing support utilities."""
+"""Builtin conversion utilities."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 from tensorflow.contrib.py2tf.utils import py_func
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import logging_ops
+from tensorflow.python.util import tf_inspect
+
+
+def dynamic_builtin(f, *args, **kwargs):
+  """Converts a builtin function call inline."""
+  if not tf_inspect.isbuiltin(f):
+    return f(*args, **kwargs)
+
+  if f is len:
+    return dynamic_len(*args, **kwargs)
+
+  raise NotImplementedError('The "%s" builtin is not yet supported.' % f)
+
+
+def dynamic_len(list_or_tensor):
+  """Implementation of len using dynamic dispatch."""
+  if tensor_util.is_tensor(list_or_tensor):
+    shape = list_or_tensor.shape
+    if not shape:
+      raise ValueError(
+          'len requires non-zero rank for tensor "%s"' % list_or_tensor)
+    return array_ops.shape(list_or_tensor)[0]
+
+  return len(list_or_tensor)
 
 
 def is_tf_print_compatible(value):
@@ -30,8 +56,8 @@ def is_tf_print_compatible(value):
   return False
 
 
-def call_print(*values):
-  """Compiled counterpart of the print builtin.
+def dynamic_print(*values):
+  """Implementation of print using dynamic dispatch.
 
   The function attempts to use tf.Print if all the values are compatible.
   Otherwise, it will fall back to py_func.
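
Note on `dynamic_builtin` and `dynamic_len` in the builtins.py diff above: non-builtin callables are called unchanged, `len` is routed to a tensor-aware implementation, and any other builtin raises `NotImplementedError`. The sketch below reproduces that dispatch shape without TensorFlow. `FakeTensor` is a hypothetical stand-in that carries only a static shape, and `inspect.isbuiltin` stands in for `tf_inspect.isbuiltin`. Unlike the patch, which returns `array_ops.shape(...)[0]` as a tensor, this sketch returns a plain integer.

```python
import inspect


class FakeTensor(object):
  """Hypothetical stand-in for a tensor; carries only a static shape."""

  def __init__(self, shape):
    self.shape = tuple(shape)


def dynamic_len(list_or_tensor):
  """len() that also accepts the stand-in tensor type."""
  if isinstance(list_or_tensor, FakeTensor):
    if not list_or_tensor.shape:
      # A rank-0 (scalar) value has no leading dimension.
      raise ValueError('len requires non-zero rank')
    return list_or_tensor.shape[0]
  return len(list_or_tensor)


def dynamic_builtin(f, *args, **kwargs):
  """Calls f, substituting a dynamic implementation for supported builtins."""
  if not inspect.isbuiltin(f):
    # Ordinary Python functions are passed through untouched.
    return f(*args, **kwargs)
  if f is len:
    return dynamic_len(*args, **kwargs)
  raise NotImplementedError('The "%s" builtin is not yet supported.' % f)
```

For example, `dynamic_builtin(len, [3] * 5)` follows the list path, while `dynamic_builtin(len, FakeTensor((2, 2)))` follows the tensor path and reads the leading dimension.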
diff --git a/tensorflow/contrib/py2tf/utils/printing_test.py b/tensorflow/contrib/py2tf/utils/builtins_test.py
similarity index 56%
rename from tensorflow/contrib/py2tf/utils/printing_test.py
rename to tensorflow/contrib/py2tf/utils/builtins_test.py
index 2070deb304d8df2433fb9a95ae36d48415578482..19a72c63ecc873c52abde18e481221fc782ad490 100644
--- a/tensorflow/contrib/py2tf/utils/printing_test.py
+++ b/tensorflow/contrib/py2tf/utils/builtins_test.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tests for printing module."""
+"""Tests for builtins module."""
 
 from __future__ import absolute_import
 from __future__ import division
@@ -22,28 +22,53 @@ import sys
 
 import six
 
-from tensorflow.contrib.py2tf.utils import printing
+from tensorflow.contrib.py2tf.utils import builtins
+from tensorflow.python.framework import constant_op
 from tensorflow.python.platform import test
 
 
-class ContextManagersTest(test.TestCase):
+class BuiltinsTest(test.TestCase):
 
-  def test_call_print_tf(self):
+  def test_dynamic_len_tf_scalar(self):
+    a = constant_op.constant(1)
+
+    with self.assertRaises(ValueError):
+      with self.test_session() as sess:
+        sess.run(builtins.dynamic_builtin(len, a))
+
+  def test_dynamic_len_tf_array(self):
+    a = constant_op.constant([1, 2, 3])
+
+    with self.test_session() as sess:
+      self.assertEqual(3, sess.run(builtins.dynamic_builtin(len, a)))
+
+  def test_dynamic_len_tf_matrix(self):
+    a = constant_op.constant([[1, 2], [3, 4]])
+
+    with self.test_session() as sess:
+      self.assertEqual(2, sess.run(builtins.dynamic_builtin(len, a)))
+
+  def test_dynamic_len_py_list(self):
+    a = [3] * 5
+
+    self.assertEqual(5, builtins.dynamic_builtin(len, a))
+
+  def test_dynamic_print_tf(self):
     try:
       out_capturer = six.StringIO()
       sys.stdout = out_capturer
       with self.test_session() as sess:
-        sess.run(printing.call_print('test message', 1))
+        sess.run(builtins.dynamic_print('test message', 1))
         self.assertEqual(out_capturer.getvalue(), 'test message 1\n')
     finally:
       sys.stdout = sys.__stdout__
 
-  def test_call_print_py_func(self):
+  def test_dynamic_print_complex(self):
     try:
       out_capturer = six.StringIO()
       sys.stdout = out_capturer
       with self.test_session() as sess:
-        sess.run(printing.call_print('test message', [1, 2]))
+        sess.run(builtins.dynamic_print('test message', [1, 2]))
         self.assertEqual(out_capturer.getvalue(), 'test message [1, 2]\n')
     finally:
       sys.stdout = sys.__stdout__
diff --git a/tensorflow/contrib/py2tf/utils/misc_test.py b/tensorflow/contrib/py2tf/utils/misc_test.py
index bfcb304c838df69e9e3961907362c7939c065117..8aedd4cd64798660cc07364c45487399986c9be6 100644
--- a/tensorflow/contrib/py2tf/utils/misc_test.py
+++ b/tensorflow/contrib/py2tf/utils/misc_test.py
@@ -18,29 +18,29 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib.py2tf.utils import misc
-from tensorflow.python.framework import constant_op
-from tensorflow.python.ops import variables
+from tensorflow.contrib.py2tf.utils.misc import alias_tensors
+from tensorflow.python.framework.constant_op import constant
+from tensorflow.python.ops.variables import Variable
 from tensorflow.python.platform import test
 
 
-class ContextManagersTest(test.TestCase):
+class MiscTest(test.TestCase):
 
   def test_alias_single_tensor(self):
-    a = constant_op.constant(1)
+    a = constant(1)
 
-    new_a = misc.alias_tensors(a)
+    new_a = alias_tensors(a)
     self.assertFalse(new_a is a)
     with self.test_session() as sess:
       self.assertEqual(1, sess.run(new_a))
 
   def test_alias_tensors(self):
-    a = constant_op.constant(1)
-    v = variables.Variable(2)
+    a = constant(1)
+    v = Variable(2)
     s = 'a'
     l = [1, 2, 3]
 
-    new_a, new_v, new_s, new_l = misc.alias_tensors(a, v, s, l)
+    new_a, new_v, new_s, new_l = alias_tensors(a, v, s, l)
 
     self.assertFalse(new_a is a)
     self.assertTrue(new_v is v)
diff --git a/tensorflow/contrib/bayesflow/python/ops/variable_utils.py b/tensorflow/contrib/py2tf/utils/testing.py
similarity index 59%
rename from tensorflow/contrib/bayesflow/python/ops/variable_utils.py
rename to tensorflow/contrib/py2tf/utils/testing.py
index eadf6f4d5fa1c776e2c71c66c4b64b8f5ac98359..cb4785d0dc0f4674b3560418daeb6733364b21e7 100644
--- a/tensorflow/contrib/bayesflow/python/ops/variable_utils.py
+++ b/tensorflow/contrib/py2tf/utils/testing.py
@@ -1,4 +1,4 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,18 +12,24 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Utility functions related to managing `tf.Variable`s."""
+"""Testing utilities."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-# go/tf-wildcard-import
-from tensorflow.contrib.bayesflow.python.ops.variable_utils_impl import *  # pylint: disable=wildcard-import,unused-wildcard-import,g-importing-member
-from tensorflow.python.util import all_util
+import imp
 
-_allowed_symbols = [
-    "externalize_variables_as_args",
-]
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import math_ops
 
-all_util.remove_undocumented(__name__, _allowed_symbols)
+
+def fake_tf():
+  """Creates a fake module that looks like TensorFlow, for testing."""
+  mod = imp.new_module('tensorflow')
+  mod_contents = dict()
+  mod_contents.update(math_ops.__dict__)
+  mod_contents.update(ops.__dict__)
+  mod_contents.update(mod.__dict__)
+  mod.__dict__.update(mod_contents)
+  return mod
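
Note on `fake_tf` in the testing.py diff above: it builds an empty module, merges the namespaces of `math_ops` and `ops` into it, and re-applies the new module's own attributes last so that `__name__` and similar fields are preserved. The sketch below shows the same merge with standard-library modules only. `fake_module` is a hypothetical helper, and `types.ModuleType` is used instead of `imp.new_module` because the `imp` module was removed in Python 3.12.

```python
import math
import string
import types


def fake_module(name, *sources):
  """Creates a module named `name` exposing the contents of `sources`."""
  mod = types.ModuleType(name)
  contents = {}
  for src in sources:
    # Later sources win when two modules define the same symbol.
    contents.update(src.__dict__)
  # Re-apply the fresh module's own attributes last so that __name__,
  # __doc__ and friends are not overwritten by those of the sources.
  contents.update(mod.__dict__)
  mod.__dict__.update(contents)
  return mod


# Assumption for the demo: math and string play the role of math_ops and ops.
fake = fake_module('fake_tf', math, string)
```

The resulting object can be passed wherever test code expects a module-like namespace, for instance `fake.sqrt(4.0)` resolves to `math.sqrt`.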
diff --git a/tensorflow/contrib/quantization/README.md b/tensorflow/contrib/quantization/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..359950aaf3d89c1f3e8fda21cbd27fb89217d918
--- /dev/null
+++ b/tensorflow/contrib/quantization/README.md
@@ -0,0 +1,7 @@
+The contrib/quantization package exposes a few TensorFlow quantization operations.
+
+If you are looking for quantized training rewrites that allow for training
+quantized models that work with
+[TensorFlow Lite](https://www.tensorflow.org/mobile/tflite/), you should look at
+the [contrib/quantize](https://www.tensorflow.org/api_docs/python/tf/contrib/quantize)
+package.
diff --git a/tensorflow/contrib/quantize/README.md b/tensorflow/contrib/quantize/README.md
index 40541729da5fd9d0ae75579e11f20999337de124..8b0e7bb68f5a11f5d1942f7cf048e96768da259e 100644
--- a/tensorflow/contrib/quantize/README.md
+++ b/tensorflow/contrib/quantize/README.md
@@ -1,9 +1,10 @@
+# Quantized Training Rewrites
+
 tf.contrib.quantize provides tools for transforming graphs to include ops to
 model quantization of weights, biases and activations during both training and
 inference. This is done using the
 [fake quantization op]
-(https://www.tensorflow.org/versions/r0.12/api_docs/python/array_ops/fake_quantization),
-which is described below:
+(https://www.tensorflow.org/versions/r0.12/api_docs/python/array_ops/fake_quantization).
 
 Recent literature has shown that fixed point networks provide comparable
 performance to floating point networks [1]. This is achieved by modeling the
@@ -14,56 +15,52 @@ updated at high precision as this is needed to ensure sufficient precision in
 accumulating tiny adjustments to the parameters. However, for the forward pass,
 the parameters and activations are quantized to the desired lower precision.
 
-![drawing](g3doc/drawings/Fake_Quantization.jpg)
-
-###Forward pass
-
-
-
-
-\begin{equation*}
-f_Q(x) = \Delta\text{ }round\left(\frac{sat\left(x\right)-x_{min}}{\Delta}\right)
-\end{equation*}
-
-
-where
-
-$$
-\begin{equation*}
-sat(x) =
-\left\{
-	\begin{array}{ll}
-		x_{min}  & \mbox{if } x \le x_{min} \\
-		x & \mbox{if } x_{min} \leq x \leq x_{max} \\
-    x_{max} & \mbox{if } x_{max} \le x
-	\end{array}
-\right.
-\end{equation*}
-$$
-
-
-where $$\Delta$$ is the Quantizer Step size, given by
-$$\Delta =\frac{x_{max} - x_{min} }{255} $$ and $$x_{min} $$ and $$x_{max}$$ are
-the minimum and maximum values of the variable under consideration. Note that
-the rounding performed is deterministic and corresponds to asymmetric rounding,
-which is supported in almost all hardware platforms.
-
-###Backward pass
-For the backward pass, we model the quantizer as a piecewise linear block, with
-derivatives that are non-zero only in the linear region.
-
-
-
-\begin{equation*}
-\frac{df_Q(x)}{dx}=1, x_{min} \leq x \leq x_{max},\text{ 0  elsewhere }
-\end{equation*}
-
-Therefore, the backward pass through the quantizer reduces to passing through
-the gradients as long as the inputs to the quantizer are in the linear region.
-Otherwise, the gradients are set to zero.
-
-Note that the quantizer is fully specified by the min and max values of the
-variables being quantized.
+## How to use the Rewrites
+
+tf.contrib.quantize provides two rewrites, one to train for quantization and
+one to create a [TensorFlow Lite](https://www.tensorflow.org/mobile/tflite/)
+compatible eval graph.
+
+```
+# Build forward pass of model.
+…
+loss = tf.losses.get_total_loss()
+
+# Call the training rewrite which rewrites the graph in-place with FakeQuantization nodes
+# and folds batchnorm for training.
+# It is often necessary to finetune a floating point model for quantization with this training tool.
+# When training from scratch, quant_delay can be used to activate quantization after
+# training to convergence with the float graph, effectively finetuning the model.
+tf.contrib.quantize.create_training_graph(quant_delay=2000000)
+
+# Call backward pass optimizer as usual.
+optimizer = tf.train.GradientDescentOptimizer(learning_rate)
+optimizer.minimize(loss)
+```
+
+Additionally, the rewritten eval graph is non-trivially different from the
+training graph due to the effects of quantization on batch normalization. Thus,
+we offer a separate rewrite for the eval_graph.
+
+```
+# Build eval model
+…
+logits = tf.nn.softmax_cross_entropy_with_logits(...)
+
+# Call the eval rewrite which rewrites the graph in-place with FakeQuantization nodes
+# and folds batchnorm for eval.
+tf.contrib.quantize.create_eval_graph()
+
+# Save the checkpoint and eval graph proto to disk for freezing and providing to TFLite.
+with open(eval_graph_file, 'w') as f:
+  f.write(str(g.as_graph_def()))
+saver = tf.train.Saver()
+saver.save(sess, checkpoint_name)
+```
+
+These rewrites are an active area of research and experimentation, so the
+rewrites and quantized training will likely not work across all models, though
+we hope to work towards generalizing these techniques.
 
 
 [1] P.Gysel, "HARDWARE-ORIENTED APPROXIMATION OF CONVOLUTIONAL
diff --git a/tensorflow/contrib/quantize/g3doc/drawings/Fake_Quantization.jpg b/tensorflow/contrib/quantize/g3doc/drawings/Fake_Quantization.jpg
deleted file mode 100644
index fdc7ae40cec757cc0a93d50eca6c8698a4697d07..0000000000000000000000000000000000000000
Binary files a/tensorflow/contrib/quantize/g3doc/drawings/Fake_Quantization.jpg and /dev/null differ
diff --git a/tensorflow/contrib/quantize/python/fold_batch_norms.py b/tensorflow/contrib/quantize/python/fold_batch_norms.py
index 75d9eb0e58d96e4bb2946684febd250e2e1a6b4a..1f0648bbb6cf4579739cb2d4fbcfb478aaa5836d 100644
--- a/tensorflow/contrib/quantize/python/fold_batch_norms.py
+++ b/tensorflow/contrib/quantize/python/fold_batch_norms.py
@@ -194,7 +194,7 @@ def _FindFusedBatchNorms(graph):
     layer_op = match_result.get_op(layer_pattern)
     layer_tensor = match_result.get_tensor(layer_pattern)
     bn_op = match_result.get_op(batch_norm_pattern)
-    batch_epsilon_tensor = bn_op.get_attr('epsilon')
+    batch_epsilon = bn_op.get_attr('epsilon')
 
     # In the MatMul case, the output of batch norm is reshaped back into a
     # 2D tensor, so the output_tensor is the output of the Reshape op.
@@ -207,6 +207,11 @@ def _FindFusedBatchNorms(graph):
         continue
       output_tensor = output_reshape_op.outputs[0]
 
+    # Ensure that the output tensor has consumers, otherwise this is a dangling
+    # node and not a match.
+    if not output_tensor.consumers():
+      continue
+
     input_tensor = match_result.get_tensor(input_pattern)
     weight_tensor = match_result.get_tensor(weight_pattern)
     gamma_tensor = match_result.get_tensor(gamma_pattern)
@@ -270,7 +275,7 @@ def _FindFusedBatchNorms(graph):
         moving_variance_tensor=moving_variance_tensor,
         bn_decay_mean_tensor=bn_decay_mean_tensor,
         bn_decay_var_tensor=bn_decay_var_tensor,
-        batch_epsilon_tensor=batch_epsilon_tensor)
+        batch_epsilon=batch_epsilon)
 
 
 def _ComputeBatchNormCorrections(context, match, freeze_batch_norm_delay,
@@ -313,9 +318,8 @@ def _ComputeBatchNormCorrections(context, match, freeze_batch_norm_delay,
   g = ops.get_default_graph()
   with g.name_scope(context + '/batch_norm_correction'):
     recip_sigma_mv = math_ops.rsqrt(
-        match.moving_variance_tensor + match.batch_epsilon_tensor)
-    recip_sigma = math_ops.rsqrt(
-        match.variance_tensor + match.batch_epsilon_tensor)
+        match.moving_variance_tensor + match.batch_epsilon)
+    recip_sigma = math_ops.rsqrt(match.variance_tensor + match.batch_epsilon)
     correction_scale = math_ops.divide(
         recip_sigma_mv, recip_sigma, name='scale_compute')
     correction_scale = array_ops.identity(
@@ -434,6 +438,9 @@ def _FoldUnfusedBatchNorms(graph, is_training, freeze_batch_norm_delay):
   for bn in common.BatchNormGroups(graph):
     has_scaling = _HasScaling(graph, input_to_ops_map, bn)
 
+    if not _IsValidUnfusedBatchNorm(graph, bn):
+      continue
+
     # The mangling code intimately depends on BatchNorm node's internals.
     original_op, folded_op = _CreateFoldedOp(
         graph,
@@ -462,6 +469,15 @@ def _FoldUnfusedBatchNorms(graph, is_training, freeze_batch_norm_delay):
       raise ValueError('Unexpected inputs to op: %s' % add_bypass.name)
 
 
+def _IsValidUnfusedBatchNorm(graph, context):
+  """Checks that the output of the unfused batch norm has consumers."""
+  add_shift = graph.get_operation_by_name(
+      context + '/BatchNorm/batchnorm/add_1')
+  # Ensure that the output tensor of batch norm has consumers, otherwise this
+  # is a dangling node and not a match.
+  return bool(add_shift.outputs[0].consumers())
+
+
 def _GetBatchNormParams(graph, context, has_scaling):
   """Extracts relevant tensors for folding batch norms.
 
@@ -478,7 +494,7 @@ def _GetBatchNormParams(graph, context, has_scaling):
   batch_variance_tensor = None
   moving_mean_tensor = None
   moving_variance_tensor = None
-  batch_epsilon_tensor = None
+  batch_epsilon = None
   bn_decay_mean_tensor = None
   bn_decay_var_tensor = None
 
@@ -509,7 +525,7 @@ def _GetBatchNormParams(graph, context, has_scaling):
     if op.name.endswith(op_suffix_moving_variance):
       moving_variance_tensor = graph.get_tensor_by_name(op.name + ':0')
     if op.name.endswith(op_suffix_epsilon):
-      batch_epsilon_tensor = graph.get_tensor_by_name(op.name + ':0')
+      batch_epsilon = graph.get_tensor_by_name(op.name + ':0')
     if op.name.endswith(op_suffix_bn_decay_mean):
       bn_decay_mean_tensor = graph.get_tensor_by_name(op.name + ':0')
     if op.name.endswith(op_suffix_bn_decay_var):
@@ -535,7 +551,7 @@ def _GetBatchNormParams(graph, context, has_scaling):
       moving_variance_tensor=moving_variance_tensor,
       bn_decay_mean_tensor=bn_decay_mean_tensor,
       bn_decay_var_tensor=bn_decay_var_tensor,
-      batch_epsilon_tensor=batch_epsilon_tensor)
+      batch_epsilon=batch_epsilon)
 
 
 def _CreateFoldedOp(graph, context, has_scaling, freeze_batch_norm_delay,
@@ -816,7 +832,7 @@ class _BatchNormMatch(object):
   def __init__(self, layer_op, bn_op, output_tensor, input_tensor,
                weight_tensor, gamma_tensor, beta_tensor, mean_tensor,
                variance_tensor, moving_mean_tensor, moving_variance_tensor,
-               bn_decay_mean_tensor, bn_decay_var_tensor, batch_epsilon_tensor):
+               bn_decay_mean_tensor, bn_decay_var_tensor, batch_epsilon):
     self._layer_op = layer_op
     self._bn_op = bn_op
     self._output_tensor = output_tensor
@@ -830,7 +846,7 @@ class _BatchNormMatch(object):
     self._moving_variance_tensor = moving_variance_tensor
     self._bn_decay_mean_tensor = bn_decay_mean_tensor
     self._bn_decay_var_tensor = bn_decay_var_tensor
-    self._batch_epsilon_tensor = batch_epsilon_tensor
+    self._batch_epsilon = batch_epsilon
 
   @property
   def layer_op(self):
@@ -877,8 +893,8 @@ class _BatchNormMatch(object):
     return self._moving_variance_tensor
 
   @property
-  def batch_epsilon_tensor(self):
-    return self._batch_epsilon_tensor
+  def batch_epsilon(self):
+    return self._batch_epsilon
 
   @property
   def bn_decay_mean_tensor(self):
diff --git a/tensorflow/contrib/quantize/python/quantize.py b/tensorflow/contrib/quantize/python/quantize.py
index 7a3f92f503a5d6f2b0fab2a499f8e8758809d0ed..5fd806d195dce671d079386ea4b6c89042e26cf6 100644
--- a/tensorflow/contrib/quantize/python/quantize.py
+++ b/tensorflow/contrib/quantize/python/quantize.py
@@ -207,6 +207,18 @@ def _FindLayersToQuantize(graph):
     yield _LayerMatch(layer_op, weight_tensor, activation_op, bypass_op,
                       bias_add_op)
 
+  # Match the final layer, where there will not be an activation and instead
+  # the output of the final BiasAdd must be quantized, so we treat it as the
+  # 'activation_op' in the _LayerMatch.
+  # TODO(suharshs): Figure out how to quantize this final layer across many
+  # models.
+  final_layer_matcher = graph_matcher.GraphMatcher(bias_add_pattern)
+  for match_result in final_layer_matcher.match_graph(graph):
+    layer_op = match_result.get_op(layer_pattern)
+    weight_tensor = match_result.get_tensor(weight_pattern)
+    activation_op = match_result.get_op(bias_add_pattern)
+    yield _LayerMatch(layer_op, weight_tensor, activation_op, None, None)
+
 
 class _LayerMatch(object):
   """Contains all information related to a matched Layer."""
diff --git a/tensorflow/contrib/quantize/python/quantize_graph_test.py b/tensorflow/contrib/quantize/python/quantize_graph_test.py
index 6b9289ef5f4b847172e1f093a1e4b5b2d3bdab57..b9d03c1bc059fe7bcce75978f503cbbf76090dbd 100644
--- a/tensorflow/contrib/quantize/python/quantize_graph_test.py
+++ b/tensorflow/contrib/quantize/python/quantize_graph_test.py
@@ -211,6 +211,19 @@ class QuantizeGraphTest(test_util.TensorFlowTestCase):
       self.assertFalse(any(s in op.name for s in update_names))
     self.assertTrue(quant_found)
 
+  def testIdempotent(self):
+    self._RunTestOverAllRewrites(self._TestIdempotent)
+
+  def _TestIdempotent(self, rewrite_fn):
+    with ops.Graph().as_default() as g:
+      self._ConvLayer()
+      rewrite_fn()
+      graph_def_before = str(g.as_graph_def())
+      # Ensuring that calling the rewrite again doesn't add more nodes.
+      rewrite_fn()
+      graph_def_after = str(g.as_graph_def())
+      self.assertEqual(graph_def_before, graph_def_after)
+
   def _ConvLayer(self):
     """Add a basic convolution layer to the default graph."""
     batch_size, height, width, depth = 5, 128, 128, 3
diff --git a/tensorflow/contrib/quantize/python/quantize_parameterized_test.py b/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
index 639a7454a92aebd7289c59498cebff82cc003f75..dd73f6c86048b9d75f2ad9808155007eed7079ec 100644
--- a/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
+++ b/tensorflow/contrib/quantize/python/quantize_parameterized_test.py
@@ -87,8 +87,8 @@ class QuantizeTest(test_util.TensorFlowTestCase):
       update_barrier = control_flow_ops.no_op(name='update_barrier')
       with ops.control_dependencies([update_barrier]):
         array_ops.identity(node, name='control_dependency')
-
       quantize.Quantize(graph, True, quant_delay=delay)
+
     quantization_node_name = 'FakeQuantWithMinMaxVars'
     weights_quant = graph.get_operation_by_name(scope + '/weights_quant/' +
                                                 quantization_node_name)
@@ -130,6 +130,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
     output_op_name = ('test/act_quant/delayed_quant/Switch_1'
                       if delay else 'control_dependency')
     self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+    self._TestIdempotent(graph)
 
   def testQuantize_Conv2dWithoutBatchNorm(self):
     self._RunWithoutBatchNormTestOverParameters(
@@ -163,7 +164,6 @@ class QuantizeTest(test_util.TensorFlowTestCase):
       update_barrier = control_flow_ops.no_op(name='update_barrier')
       with ops.control_dependencies([update_barrier]):
         array_ops.identity(node, name='control_dependency')
-
       quantize.Quantize(graph, True, quant_delay=delay)
 
     quantization_node_name = 'FakeQuantWithMinMaxVars'
@@ -205,6 +205,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
     output_op_name = ('test/act_quant/delayed_quant/Switch_1'
                       if delay else 'control_dependency')
     self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+    self._TestIdempotent(graph)
 
   def testQuantize_FCWithoutBatchNorm(self):
     self._RunWithoutBatchNormTestOverParameters(
@@ -239,7 +240,6 @@ class QuantizeTest(test_util.TensorFlowTestCase):
       update_barrier = control_flow_ops.no_op(name='update_barrier')
       with ops.control_dependencies([update_barrier]):
         array_ops.identity(node, name='control_dependency')
-
       quantize.Quantize(graph, True, quant_delay=delay)
 
     quantization_node_name = 'FakeQuantWithMinMaxVars'
@@ -282,6 +282,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
     output_op_name = ('test/act_quant/delayed_quant/Switch_1'
                       if delay else 'control_dependency')
     self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+    self._TestIdempotent(graph)
 
   def testQuantize_DepthwiseConv2dWithoutBatchNorm(self):
     self._RunWithoutBatchNormTestOverParameters(
@@ -364,7 +365,6 @@ class QuantizeTest(test_util.TensorFlowTestCase):
         array_ops.identity(node, name='control_dependency')
 
       fold_batch_norms.FoldBatchNorms(graph, is_training=True)
-
       quantize.Quantize(graph, True, quant_delay=delay)
 
     quantization_node_name = 'FakeQuantWithMinMaxVars'
@@ -404,6 +404,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
     output_op_name = ('test/act_quant/delayed_quant/Switch_1'
                       if delay else 'control_dependency')
     self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+    self._TestIdempotent(graph)
 
   def testQuantize_FCWithBatchNorm(self):
     self._RunBatchNormTestOverParameters(self._TestQuantize_FCWithBatchNorm)
@@ -487,6 +488,7 @@ class QuantizeTest(test_util.TensorFlowTestCase):
     output_op_name = ('test/act_quant/delayed_quant/Switch_1'
                       if delay else 'control_dependency')
     self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+    self._TestIdempotent(graph)
 
   def testQuantize_DepthwiseConv2dWithBatchNorm(self):
     self._RunBatchNormTestOverParameters(
@@ -535,8 +537,8 @@ class QuantizeTest(test_util.TensorFlowTestCase):
         array_ops.identity(node, name='control_dependency')
 
       fold_batch_norms.FoldBatchNorms(graph, is_training=True)
-
       quantize.Quantize(graph, True, quant_delay=delay)
+
     quantization_node_name = 'FakeQuantWithMinMaxVars'
     weights_quant = graph.get_operation_by_name(scope + '/weights_quant/' +
                                                 quantization_node_name)
@@ -574,6 +576,17 @@ class QuantizeTest(test_util.TensorFlowTestCase):
     output_op_name = ('test/act_quant/delayed_quant/Switch_1'
                       if delay else 'control_dependency')
     self._AssertOutputGoesToOps(act_quant, graph, [output_op_name])
+    self._TestIdempotent(graph)
+
+  def _TestIdempotent(self, graph):
+    # Ensure that calling the rewrite again doesn't change the graph.
+    graph_def_before = str(graph.as_graph_def())
+    with graph.as_default():
+      # Re-running both rewrites must not add more nodes.
+      fold_batch_norms.FoldBatchNorms(graph, is_training=True)
+      quantize.Quantize(graph, True)
+    graph_def_after = str(graph.as_graph_def())
+    self.assertEqual(graph_def_before, graph_def_after)
 
   def _BatchNormParams(self, fused=False):
     return {'center': True, 'scale': True, 'decay': 1.0 - 0.003, 'fused': fused}
diff --git a/tensorflow/contrib/quantize/python/quantize_test.py b/tensorflow/contrib/quantize/python/quantize_test.py
index bb7be0809421b64a019e73f00aac6c58524222e8..ef59475167137e203db2f6ca7f43c7b8f1938060 100644
--- a/tensorflow/contrib/quantize/python/quantize_test.py
+++ b/tensorflow/contrib/quantize/python/quantize_test.py
@@ -113,6 +113,28 @@ class QuantizeTest(test_util.TensorFlowTestCase):
                                             quantization_node_name)
     self.assertEqual(add_quant.type, quantization_node_name)
 
+  def testFinalLayerQuantized(self):
+    self._RunTestOverParameters(self._TestFinalLayerQuantized)
+
+  def _TestFinalLayerQuantized(self, is_training):
+    graph = ops.Graph()
+    with graph.as_default():
+      batch_size, height, width, depth = 5, 128, 128, 3
+      input1 = array_ops.zeros((batch_size, height, width, depth))
+      _ = conv2d(
+          input1,
+          32, [5, 5],
+          stride=2,
+          padding='SAME',
+          weights_initializer=self._WeightInit(0.09),
+          activation_fn=None,
+          scope='test')
+      # Ensure that a FakeQuant operation is in the outputs of the BiasAdd.
+      bias_add_op = graph.get_operation_by_name('test/BiasAdd')
+      quantize.Quantize(graph, is_training, weight_bits=8, activation_bits=8)
+      self.assertTrue('FakeQuantWithMinMaxVars' in
+                      [op.type for op in bias_add_op.outputs[0].consumers()])
+
   def _WeightInit(self, stddev):
     """Returns truncated normal variable initializer.
 
diff --git a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
index 0e62b315b61cb3ceeb5cfd33bf5102a71abef83b..d41fc0b3ac1cee4eacc88cb0f41df1f9ee59e7c3 100644
--- a/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
+++ b/tensorflow/contrib/rnn/python/kernel_tests/core_rnn_cell_test.py
@@ -187,6 +187,8 @@ class RNNCellTest(test.TestCase):
               ],
               state_is_tuple=False)
           self.assertEqual(cell.dtype, None)
+          self.assertEqual("cell-0", cell._checkpoint_dependencies[0].name)
+          self.assertEqual("cell-1", cell._checkpoint_dependencies[1].name)
           g, out_m = cell(x, m)
           # Layer infers the input type.
           self.assertEqual(cell.dtype, dtype.name)
diff --git a/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py b/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
index d6184d61095f727f9dcab56fe59e2601868c1624..554eb24e5260724a905b099091bf8aea461554cf 100644
--- a/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
+++ b/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py
@@ -724,7 +724,7 @@ def _mask_probs(probs, eos_token, finished):
       eos_token,
       vocab_size,
       dtype=probs.dtype,
-      on_value=0.,
+      on_value=ops.convert_to_tensor(0., dtype=probs.dtype),
       off_value=probs.dtype.min)
   finished_probs = array_ops.tile(
       array_ops.reshape(finished_row, [1, 1, -1]),
diff --git a/tensorflow/contrib/slim/python/slim/data/parallel_reader.py b/tensorflow/contrib/slim/python/slim/data/parallel_reader.py
index ad5e985487190e72b9eb2809da964f3d7b34ef94..b3343aef47d9f352c3bcbef4afbe8f9bf2560e6d 100644
--- a/tensorflow/contrib/slim/python/slim/data/parallel_reader.py
+++ b/tensorflow/contrib/slim/python/slim/data/parallel_reader.py
@@ -221,7 +221,7 @@ def parallel_read(data_sources,
         the data will be cycled through indefinitely.
     num_readers: a integer, number of Readers to create.
     reader_kwargs: an optional dict, of kwargs for the reader.
-    shuffle: boolean, wether should shuffle the files and the records by using
+    shuffle: boolean, whether to shuffle the files and the records by using
       RandomShuffleQueue as common_queue.
     dtypes:  A list of types.  The length of dtypes must equal the number
         of elements in each record. If it is None it will default to
diff --git a/tensorflow/contrib/tensor_forest/kernels/v4/grow_stats.h b/tensorflow/contrib/tensor_forest/kernels/v4/grow_stats.h
index 04e6b0a735320dd024e326a94ef910593a326245..dc3e9fe79d32a19930d500b62b520eddb4b41aa8 100644
--- a/tensorflow/contrib/tensor_forest/kernels/v4/grow_stats.h
+++ b/tensorflow/contrib/tensor_forest/kernels/v4/grow_stats.h
@@ -468,7 +468,7 @@ class FixedSizeSparseClassificationGrowStats : public ClassificationStats {
   void PackToProto(FertileSlot* slot) const override;
 
   void InitLeafClassStats(int best_split_index, LeafStat* left_stats,
-                          LeafStat* right_stats) const;
+                          LeafStat* right_stats) const override;
 
  protected:
   void ClassificationAddSplitStats() override {
diff --git a/tensorflow/contrib/tensorrt/BUILD b/tensorflow/contrib/tensorrt/BUILD
index cf67c27b70f1a8c761b71074d3eb5cd962a68488..c832c6f2e0cefe9e9c895eac8c4fbe4e0ef1c33d 100644
--- a/tensorflow/contrib/tensorrt/BUILD
+++ b/tensorflow/contrib/tensorrt/BUILD
@@ -47,7 +47,10 @@ tf_cuda_cc_test(
 
 tf_custom_op_library(
     name = "python/ops/_trt_engine_op.so",
-    srcs = ["ops/trt_engine_op.cc"],
+    srcs = [
+        "ops/trt_calib_op.cc",
+        "ops/trt_engine_op.cc",
+    ],
     deps = [
         ":trt_engine_op_kernel",
         ":trt_shape_function",
@@ -71,22 +74,33 @@ tf_cuda_library(
 
 cc_library(
     name = "trt_engine_op_kernel",
-    srcs = ["kernels/trt_engine_op.cc"],
-    hdrs = ["kernels/trt_engine_op.h"],
+    srcs = [
+        "kernels/trt_calib_op.cc",
+        "kernels/trt_engine_op.cc",
+    ],
+    hdrs = [
+        "kernels/trt_calib_op.h",
+        "kernels/trt_engine_op.h",
+    ],
     copts = tf_copts(),
     deps = [
         ":trt_logging",
+        ":trt_resources",
         "//tensorflow/core:gpu_headers_lib",
         "//tensorflow/core:lib_proto_parsing",
         "//tensorflow/core:stream_executor_headers_lib",
     ] + if_tensorrt([
         "@local_config_tensorrt//:nv_infer",
     ]) + tf_custom_op_library_additional_deps(),
-    alwayslink = 1,
+    # TODO(laigd)
+    alwayslink = 1,  # buildozer: disable=alwayslink-with-hdrs
 )
 
 tf_gen_op_libs(
-    op_lib_names = ["trt_engine_op"],
+    op_lib_names = [
+        "trt_engine_op",
+        "trt_calib_op",
+    ],
     deps = if_tensorrt([
         "@local_config_tensorrt//:nv_infer",
     ]),
@@ -106,7 +120,9 @@ tf_cuda_library(
 
 tf_gen_op_wrapper_py(
     name = "trt_engine_op",
+    gen_locally = True,
     deps = [
+        ":trt_calib_op_op_lib",
         ":trt_engine_op_op_lib",
         ":trt_logging",
         ":trt_shape_function",
@@ -170,6 +186,27 @@ tf_py_wrap_cc(
     ],
 )
 
+tf_cuda_library(
+    name = "trt_resources",
+    srcs = [
+        "resources/trt_int8_calibrator.cc",
+        "resources/trt_resource_manager.cc",
+    ],
+    hdrs = [
+        "resources/trt_int8_calibrator.h",
+        "resources/trt_resource_manager.h",
+        "resources/trt_resources.h",
+    ],
+    deps = [
+        ":trt_logging",
+        "//tensorflow/core:framework_headers_lib",
+        "//tensorflow/core:framework_lite",
+        "//tensorflow/core:lib_proto_parsing",
+    ] + if_tensorrt([
+        "@local_config_tensorrt//:nv_infer",
+    ]),
+)
+
 # Library for the node-level conversion portion of TensorRT operation creation
 tf_cuda_library(
     name = "trt_conversion",
@@ -184,6 +221,7 @@ tf_cuda_library(
     deps = [
         ":segment",
         ":trt_logging",
+        ":trt_resources",
         "//tensorflow/core/grappler:grappler_item",
         "//tensorflow/core/grappler:utils",
         "//tensorflow/core:framework",
diff --git a/tensorflow/contrib/tensorrt/convert/convert_graph.cc b/tensorflow/contrib/tensorrt/convert/convert_graph.cc
index 899448004f917b36b35fb871a66a9d857736a338..970f8104736d95d09ea3ffabb07f84d8591a8f9c 100644
--- a/tensorflow/contrib/tensorrt/convert/convert_graph.cc
+++ b/tensorflow/contrib/tensorrt/convert/convert_graph.cc
@@ -37,7 +37,7 @@ limitations under the License.
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/types.h"
-#include "tensorflow/core/protobuf/device_properties.pb.h"
+#include "tensorflow/core/protobuf/device_properties.pb.h"  // NOLINT
 
 #if GOOGLE_CUDA
 #if GOOGLE_TENSORRT
diff --git a/tensorflow/contrib/tensorrt/kernels/trt_calib_op.cc b/tensorflow/contrib/tensorrt/kernels/trt_calib_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..1dcb87e7683ad73b1f5f894b61a15a16d36cfcdf
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/kernels/trt_calib_op.cc
@@ -0,0 +1,129 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/tensorrt/kernels/trt_calib_op.h"
+#include "tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h"
+#include "tensorflow/contrib/tensorrt/resources/trt_resource_manager.h"
+#include "tensorflow/contrib/tensorrt/resources/trt_resources.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/tensor_shape.h"
+#include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/framework/types.h"
+
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
+#include "cuda_runtime_api.h"
+#include "tensorrt/include/NvInfer.h"
+
+namespace tensorflow {
+namespace tensorrt {
+
+TRTCalibOp::TRTCalibOp(OpKernelConstruction* context) : OpKernel(context) {
+  OP_REQUIRES_OK(context, context->GetAttr("segment_nodes", &segment_nodes_));
+  OP_REQUIRES_OK(context, context->GetAttr("input_names", &input_names_));
+  OP_REQUIRES_OK(context, context->GetAttr("resource_name", &resource_name_));
+}
+
+#define TYPECASE(dt, X, Y)                                                \
+  case dt: {                                                              \
+    return (void*)X->flat<tensorflow::EnumToDataType<dt>::Type>().data(); \
+  }
+
+void* GetTensorAddress(const Tensor* tensor_ptr) {
+  auto tensor_type = tensor_ptr->dtype();
+  switch (tensor_type) {
+    TYPECASE(tensorflow::DT_FLOAT, tensor_ptr, dest_ptr);
+    TYPECASE(tensorflow::DT_HALF, tensor_ptr, dest_ptr);
+    TYPECASE(tensorflow::DT_INT8, tensor_ptr, dest_ptr);
+    default: {
+      LOG(FATAL) << "Unsupported Data type "
+                 << tensorflow::DataTypeString(tensor_type);
+      return nullptr;
+    }
+  }
+}
+
+void TRTCalibOp::Compute(tensorflow::OpKernelContext* ctx) {
+  // TODO(aaroey): make sure ctx->resource_mgr() is used in future PR.
+  auto trt_rm = tensorflow::tensorrt::TRTResourceManager::instance();
+  auto res_mgr = trt_rm->getManager("TRTCalibOps");
+  tensorflow::tensorrt::TRTCalibrationResource* calib_res = nullptr;
+  auto status = res_mgr->Lookup(resource_name_, resource_name_, &calib_res);
+
+  if (!status.ok()) {
+    ctx->SetStatus(status);
+    return;
+  }
+  int num_inputs = ctx->num_inputs();
+  // On the first run, instantiate the calibrator.
+  if (calib_res->calibrator_ == nullptr) {
+    dev_tensors_.resize(num_inputs);
+    int batch_size = ctx->input(0).dim_size(0);
+    VLOG(1) << " Constructing calibrator";
+    for (int i = 0; i < num_inputs; i++) {
+      // allocate workspace on device for inputs
+      const tensorflow::Tensor& t = ctx->input(i);
+      OP_REQUIRES_OK(ctx,
+                     ctx->allocate_persistent(t.dtype(), t.shape(),
+                                              &dev_tensors_.at(i), nullptr));
+      const auto device_tensor = dev_tensors_.at(i).AccessTensor(ctx);
+      CHECK_EQ(t.TotalBytes(), device_tensor->TotalBytes());
+      void* device_address = GetTensorAddress(device_tensor);
+      device_buffers_.emplace(input_names_.at(i),
+                              std::pair<void*, size_t>(
+                                  device_address, device_tensor->TotalBytes()));
+    }
+
+    calib_res->calibrator_ =
+        new TRTInt8Calibrator(device_buffers_, batch_size, resource_name_);
+    string label(resource_name_);
+    calib_res->thr_ = new std::thread([calib_res, label]() {
+      VLOG(1) << "Starting calibration thread, Calibration Resource @ "
+              << calib_res;
+      calib_res->builder_->setInt8Calibrator(calib_res->calibrator_);
+      calib_res->builder_->setInt8Mode(true);
+      calib_res->engine_ = calib_res->builder_->buildCudaEngine(
+          *calib_res->network_);  // will loop until we terminate calibrator
+      VLOG(1) << "Calibration loop terminated " << label;
+    });
+    VLOG(1) << "initialized calibrator resource";
+  }  //  calibrator initialized
+
+  // Pass input data to calibrator
+  std::unordered_map<string, void*> input_data;
+  for (int i = 0; i < num_inputs; i++) {
+    const Tensor& t = ctx->input(i);
+    void* data_address = GetTensorAddress(&t);
+    const auto device_tensor = dev_tensors_.at(i).AccessTensor(ctx);
+    CHECK_EQ(t.TotalBytes(),
+             device_tensor->TotalBytes());  // use the tensor so FW keeps it
+    input_data.emplace(input_names_.at(i), data_address);
+    ctx->set_output(i, t);
+  }
+  VLOG(2) << "Filled map for sending";
+  calib_res->calibrator_->setBatch(input_data);
+  VLOG(2) << "Passed calibration data";
+  // TODO(aaroey): make sure we wait for the completion of calibration on the
+  // last batch in future PR.
+}
+
+#undef TYPECASE
+
+REGISTER_KERNEL_BUILDER(Name("TRTCalibOp").Device(DEVICE_GPU), TRTCalibOp);
+
+}  // namespace tensorrt
+}  // namespace tensorflow
+#endif
+#endif
diff --git a/tensorflow/contrib/tensorrt/kernels/trt_calib_op.h b/tensorflow/contrib/tensorrt/kernels/trt_calib_op.h
new file mode 100644
index 0000000000000000000000000000000000000000..23df9db32f077a080eaff7479fcbe90d6a504c42
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/kernels/trt_calib_op.h
@@ -0,0 +1,52 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_TENSORRT_KERNELS_TRT_CALIB_OP_H
+#define TENSORFLOW_CONTRIB_TENSORRT_KERNELS_TRT_CALIB_OP_H
+
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include <utility>
+#include <vector>
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor_shape.h"
+#include "tensorflow/core/platform/types.h"
+
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
+namespace tensorflow {
+namespace tensorrt {
+// TODO(sami): Convert this to async kernel!
+class TRTCalibOp : public OpKernel {
+ public:
+  explicit TRTCalibOp(OpKernelConstruction* context);
+
+  void Compute(OpKernelContext* context) override;
+
+ private:
+  string resource_name_;
+  std::vector<string> segment_nodes_;
+  std::vector<string> input_names_;
+  std::vector<tensorflow::TensorShape> shapes_;
+  std::unordered_map<string, std::pair<void*, size_t>> device_buffers_;
+  std::vector<tensorflow::PersistentTensor> dev_tensors_;
+};
+}  // namespace tensorrt
+}  // namespace tensorflow
+#endif
+#endif
+#endif  // TENSORFLOW_CONTRIB_TENSORRT_KERNELS_TRT_CALIB_OP_H
diff --git a/tensorflow/contrib/tensorrt/ops/trt_calib_op.cc b/tensorflow/contrib/tensorrt/ops/trt_calib_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4835e5065068ec7a59995eb7f6126b31aecf6704
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/ops/trt_calib_op.cc
@@ -0,0 +1,37 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/shape_inference.h"
+namespace tensorflow {
+
+REGISTER_OP("TRTCalibOp")
+    .Attr("segment_nodes: list(string)")         // names of the ops in segment
+    .Attr("segment_output_names: list(string)")  // names of the output ops in
+                                                 // segment
+    .Attr("input_names: list(string)")           // names of the inputs for
+                                                 // passing into tensorrt
+    .Attr("resource_name: string")
+    .Attr("InT: list({int8, float16, float32})")
+    .Input("in_tensor: InT")
+    .Output("out_tensor: InT")
+    .SetShapeFn([](tensorflow::shape_inference::InferenceContext* c) {
+      for (int i = 0; i < c->num_inputs(); i++) {
+        c->set_output(i, c->input(i));
+      }
+      return Status::OK();
+    });
+
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.cc b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..3d5cc76c4256bea70e75ea3dd9b1e87c951a9000
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.cc
@@ -0,0 +1,119 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h"
+
+#include <atomic>
+#include <chrono>
+#include <unordered_map>
+
+#include "tensorflow/core/platform/logging.h"
+
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
+#include "cuda_runtime_api.h"
+
+namespace tensorflow {
+namespace tensorrt {
+
+// set the batch size before constructing the thread to execute engine
+int TRTInt8Calibrator::getBatchSize() const { return batch_size_; }
+
+TRTInt8Calibrator::TRTInt8Calibrator(
+    const std::unordered_map<string, std::pair<void*, size_t>>& dev_buffers,
+    int batch_size, string engine_name)
+    : batch_size_(batch_size),
+      done_(false),
+      dev_buffers_(dev_buffers),
+      calib_running_(false),
+      engine_name_(engine_name) {}
+
+bool TRTInt8Calibrator::setBatch(
+    const std::unordered_map<string, void*>& data) {
+  // TODO(aaroey): make sure that in future PR:
+  // 1. the mutex_lock is outside of the loop
+  // 2. wait() is used instead of wait_for()
+  // 3. done_ is to be protected by the mutex
+  // 4. the first batch is not missed
+  if (done_) return false;
+  while (calib_running_.load(
+      std::memory_order_acquire)) {  // wait while calibration is running
+    tensorflow::mutex_lock l(cond_mtx_);
+    cond_.wait_for(l, std::chrono::milliseconds(50));
+    if (done_) return false;
+  }
+  VLOG(1) << "Set Batch Waiting finished";
+  for (const auto& it : data) {
+    auto devptr = dev_buffers_.find(it.first);
+    if (devptr == dev_buffers_.end()) {
+      LOG(FATAL) << "FATAL " << engine_name_ << " input name '" << it.first
+                 << "' does not match with the buffer names";
+    }
+    const auto& d = devptr->second;
+
+    // TODO(aaroey): we should not use sync copy on default stream. Make sure
+    // stream->ThenMemcpy() is used in future PRs.
+    auto status =
+        cudaMemcpy(d.first, it.second, d.second, cudaMemcpyDeviceToDevice);
+    if (status != cudaSuccess) {
+      LOG(FATAL) << "cudaMemcpy " << engine_name_ << " for '" << it.first
+                 << "' failed with " << status;
+    }
+  }
+  calib_running_.store(true, std::memory_order_release);  // release builder
+  cond_.notify_all();
+  return true;
+}
+
+bool TRTInt8Calibrator::getBatch(void** bindings, const char** names,
+                                 int num_bindings) {
+  calib_running_.store(false, std::memory_order_release);  // wait for new batch
+  cond_.notify_all();
+  while (!calib_running_.load(
+      std::memory_order_acquire)) {  // wait until new batch arrives
+    tensorflow::mutex_lock l(cond_mtx_);
+    cond_.wait_for(l, std::chrono::milliseconds(50));
+    if (done_) return false;
+  }
+  if (done_) {
+    return false;
+  }
+
+  for (int i = 0; i < num_bindings; i++) {
+    auto it = dev_buffers_.find(names[i]);
+    if (it == dev_buffers_.end()) {
+      LOG(FATAL) << "Calibration engine asked for unknown tensor name '"
+                 << names[i] << "' at position " << i;
+    }
+
+    bindings[i] = it->second.first;
+  }
+  return true;
+}
+
+const void* TRTInt8Calibrator::readCalibrationCache(std::size_t& length) {
+  return nullptr;
+}
+
+void TRTInt8Calibrator::writeCalibrationCache(const void* ptr,
+                                              std::size_t length) {}
+TRTInt8Calibrator::~TRTInt8Calibrator() {
+  VLOG(1) << "Destroying calibrator for " << engine_name_;
+}
+
+}  // namespace tensorrt
+}  // namespace tensorflow
+#endif
+#endif
diff --git a/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h
new file mode 100644
index 0000000000000000000000000000000000000000..8830f7efe75b42eb82cffe5b07ddd3832b36145c
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h
@@ -0,0 +1,65 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_INT8_CALIBRATOR_H_
+#define TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_INT8_CALIBRATOR_H_
+
+#include <atomic>
+#include <string>
+#include <unordered_map>
+#include <utility>
+#include "tensorflow/core/platform/mutex.h"
+
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
+#include "tensorrt/include/NvInfer.h"
+namespace tensorflow {
+namespace tensorrt {
+// This class provides a 1-element queue to match TF's push model to
+// TRT's pull model for calibration. When TRT implements a means for
+// push calibration, this class should be updated accordingly.
+
+struct TRTInt8Calibrator : public nvinfer1::IInt8EntropyCalibrator {
+ public:
+  TRTInt8Calibrator(
+      const std::unordered_map<string, std::pair<void*, size_t>>& dev_buffers,
+      int batch_size, string engine_name);
+  int getBatchSize() const override;
+  bool getBatch(void* bindings[], const char* names[],
+                int num_bindings) override;
+  bool setBatch(const std::unordered_map<string, void*>& data);
+  void setDone() { done_ = true; }
+  const void* readCalibrationCache(std::size_t& length) override;
+  void writeCalibrationCache(const void* ptr, std::size_t length) override;
+  ~TRTInt8Calibrator();
+
+ private:
+  const int batch_size_;
+  tensorflow::mutex cond_mtx_;           // mutex for condition_variable
+  tensorflow::condition_variable cond_;  // condition variable to implement
+                                         // producer-consumer queue for
+                                         // calibration
+  bool done_;
+  const std::unordered_map<string, std::pair<void*, size_t>>
+      dev_buffers_;  // map to keep tensorrt input buffers and sizes keyed with
+                     // buffer names
+  std::atomic_bool calib_running_;
+  string engine_name_;
+};
+}  // namespace tensorrt
+}  // namespace tensorflow
+#endif
+#endif
+#endif  // TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_INT8_CALIBRATOR_H_
diff --git a/tensorflow/contrib/tensorrt/resources/trt_resource_manager.cc b/tensorflow/contrib/tensorrt/resources/trt_resource_manager.cc
new file mode 100644
index 0000000000000000000000000000000000000000..e663eed4dd6704e2f41bde1dfabd411e86669ecd
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/resources/trt_resource_manager.cc
@@ -0,0 +1,39 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/contrib/tensorrt/resources/trt_resource_manager.h"
+#include "tensorflow/core/platform/logging.h"
+
+namespace tensorflow {
+namespace tensorrt {
+
+std::shared_ptr<tensorflow::ResourceMgr>
+tensorflow::tensorrt::TRTResourceManager::getManager(const string& op_name) {
+  // The mutex is held for lookup only. It is held longer only when a manager
+  // is created, which mostly happens during op creation and should be OK.
+  tensorflow::mutex_lock lock(map_mutex_);
+  auto s = managers_.find(op_name);
+  if (s == managers_.end()) {
+    auto it = managers_.emplace(
+        op_name, std::make_shared<tensorflow::ResourceMgr>(op_name));
+    VLOG(1) << "Returning a new manager " << op_name;
+    return it.first->second;
+  }
+  VLOG(1) << "Returning old manager " << op_name;
+  return s->second;
+}
+
+}  // namespace tensorrt
+}  // namespace tensorflow
diff --git a/tensorflow/contrib/tensorrt/resources/trt_resource_manager.h b/tensorflow/contrib/tensorrt/resources/trt_resource_manager.h
new file mode 100644
index 0000000000000000000000000000000000000000..5f8ad491d3c13e8911b0b95c3e95e19afe4d59c0
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/resources/trt_resource_manager.h
@@ -0,0 +1,49 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_RESOURCE_MANAGER_H_
+#define TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_RESOURCE_MANAGER_H_
+#include <memory>
+
+#include <string>
+#include <unordered_map>
+#include "tensorflow/core/framework/resource_mgr.h"
+#include "tensorflow/core/platform/mutex.h"
+
+namespace tensorflow {
+namespace tensorrt {
+
+class TRTResourceManager {
+  TRTResourceManager() = default;
+
+ public:
+  static std::shared_ptr<TRTResourceManager> instance() {
+    static std::shared_ptr<TRTResourceManager> instance_(
+        new TRTResourceManager);
+    return instance_;
+  }
+  // Returns the manager for the given op, creating one if it doesn't exist.
+  std::shared_ptr<tensorflow::ResourceMgr> getManager(const string& op_name);
+
+ private:
+  std::unordered_map<string, std::shared_ptr<tensorflow::ResourceMgr>>
+      managers_;
+  tensorflow::mutex map_mutex_;
+};
+
+}  // namespace tensorrt
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRT_RESOURCE_MANAGER_H_
diff --git a/tensorflow/contrib/tensorrt/resources/trt_resources.h b/tensorflow/contrib/tensorrt/resources/trt_resources.h
new file mode 100644
index 0000000000000000000000000000000000000000..3c85968ae7acf5c5fc567be6805a5d226b1094c7
--- /dev/null
+++ b/tensorflow/contrib/tensorrt/resources/trt_resources.h
@@ -0,0 +1,95 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRTRESOURCES_H_
+#define TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRTRESOURCES_H_
+
+#include <list>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>
+#include "tensorflow/contrib/tensorrt/log/trt_logger.h"
+#include "tensorflow/core/framework/resource_mgr.h"
+
+#if GOOGLE_CUDA
+#if GOOGLE_TENSORRT
+#include "tensorflow/contrib/tensorrt/resources/trt_int8_calibrator.h"
+#include "tensorrt/include/NvInfer.h"
+
+namespace tensorflow {
+namespace tensorrt {
+class TRTCalibrationResource : public tensorflow::ResourceBase {
+ public:
+  TRTCalibrationResource()
+      : calibrator_(nullptr),
+        builder_(nullptr),
+        network_(nullptr),
+        engine_(nullptr),
+        logger_(nullptr),
+        thr_(nullptr) {}
+  string DebugString() override {
+    std::stringstream oss;
+    oss << " Calibrator = " << std::hex << calibrator_ << std::dec << std::endl
+        << " Builder    = " << std::hex << builder_ << std::dec << std::endl
+        << " Network    = " << std::hex << network_ << std::dec << std::endl
+        << " Engine     = " << std::hex << engine_ << std::dec << std::endl
+        << " Logger     = " << std::hex << logger_ << std::dec << std::endl
+        << " Thread     = " << std::hex << thr_ << std::dec << std::endl;
+    return oss.str();
+  }
+  ~TRTCalibrationResource() {
+    VLOG(0) << "Destroying Calibration Resource " << std::endl << DebugString();
+  }
+  TRTInt8Calibrator* calibrator_;
+  nvinfer1::IBuilder* builder_;
+  nvinfer1::INetworkDefinition* network_;
+  nvinfer1::ICudaEngine* engine_;
+  tensorflow::tensorrt::Logger* logger_;
+  // TODO(sami): Use threadpool threads!
+  std::thread* thr_;
+};
+
+class TRTWeightStore : public tensorflow::ResourceBase {
+ public:
+  TRTWeightStore() {}
+  std::list<std::vector<uint8_t>> store_;
+  string DebugString() override {
+    std::stringstream oss;
+    size_t lenBytes = 0;
+    for (const auto& v : store_) {
+      lenBytes += v.size() * sizeof(uint8_t);
+    }
+    oss << " Number of entries     = " << store_.size() << std::endl
+        << " Total number of bytes = "
+        << store_.size() * sizeof(std::vector<uint8_t>) + lenBytes << std::endl;
+    return oss.str();
+  }
+  virtual ~TRTWeightStore() { VLOG(1) << "Destroying store" << DebugString(); }
+};
+
+class TRTEngineResource : public tensorflow::ResourceBase {
+ public:
+  TRTEngineResource() : runtime_(nullptr), ctx_(nullptr) {}
+  string DebugString() override { return string(""); }
+  nvinfer1::IRuntime* runtime_;
+  nvinfer1::IExecutionContext* ctx_;
+};
+
+}  // namespace tensorrt
+}  // namespace tensorflow
+#endif  // GOOGLE_TENSORRT
+#endif  // GOOGLE_CUDA
+#endif  // TENSORFLOW_CONTRIB_TENSORRT_RESOURCES_TRTRESOURCES_H_
diff --git a/tensorflow/contrib/tensorrt/test/test_tftrt.py b/tensorflow/contrib/tensorrt/test/test_tftrt.py
index 18dba94acb3724cb2b5a1c53227bcf08bf9f8fcc..c78f6f222457a875525e768eacc9a4ebf28ad504 100644
--- a/tensorflow/contrib/tensorrt/test/test_tftrt.py
+++ b/tensorflow/contrib/tensorrt/test/test_tftrt.py
@@ -37,7 +37,7 @@ from tensorflow.python.ops import nn_ops as nn_ops
 
 
 def get_simple_graph_def():
-  """Create a simple graph and return its graph_def"""
+  """Create a simple graph and return its graph_def."""
   g = ops.Graph()
   with g.as_default():
     a = aops.placeholder(
diff --git a/tensorflow/contrib/timeseries/examples/data/multivariate_periods.csv b/tensorflow/contrib/timeseries/examples/data/multivariate_periods.csv
index b49a0662c29b1d810f4be31ca1f318f0571f533e..9b15b4f0b26f11ac3281ca4206654872984628b6 100644
--- a/tensorflow/contrib/timeseries/examples/data/multivariate_periods.csv
+++ b/tensorflow/contrib/timeseries/examples/data/multivariate_periods.csv
@@ -1,100 +1,100 @@
-0,0.926906299771,1.99107237682,2.56546245685,3.07914768197,4.04839057867,1.,0.
-1,0.108010001864,1.41645361423,2.1686839775,2.94963962176,4.1263503303,1.,0.
-2,-0.800567600028,1.0172132907,1.96434754116,2.99885333086,4.04300485864,1.,0.
-3,0.0607042871898,0.719540073421,1.9765012584,2.89265588817,4.0951014426,1.,0.
-4,0.933712200629,0.28052120776,1.41018552514,2.69232603996,4.06481164223,1.,0.
-5,-0.171730652974,0.260054421028,1.48770816369,2.62199129293,4.44572807842,1.,0.
-6,-1.00180162933,0.333045158863,1.50006392277,2.88888309683,4.24755865606,1.,0.
-7,0.0580061875336,0.688929398826,1.56543458772,2.99840358953,4.52726873347,1.,0.
-8,0.764139447412,1.24704875327,1.77649279698,3.13578593851,4.63238922951,1.,0.
-9,-0.230331874785,1.47903998963,2.03547545751,3.20624030377,4.77980005228,1.,0.
-10,-1.03846045211,2.01133000781,2.31977503972,3.67951536251,5.09716775897,1.,0.
-11,0.188643592253,2.23285349038,2.68338482249,3.49817168611,5.24928239634,1.,0.
-12,0.91207302309,2.24244446841,2.71362604985,3.96332587625,5.37802271594,1.,0.
-13,-0.296588665881,2.02594634141,3.07733910479,3.99698324956,5.56365901394,1.,0.
-14,-0.959961476551,1.45078629833,3.18996420137,4.3763059609,5.65356015609,1.,0.
-15,0.46313530679,1.01141441548,3.4980215948,4.20224896882,5.88842247449,1.,0.
-16,0.929354125798,0.626635305936,3.70508262244,4.51791573544,5.73945973251,1.,0.
-17,-0.519110731957,0.269249223148,3.39866823332,4.46802003061,5.82768174382,1.,0.
-18,-0.924330981367,0.349602834684,3.21762413294,4.72803587499,5.94918925767,1.,0.
-19,0.253239387885,0.345158023497,3.11071425333,4.79311566935,5.9489259713,1.,0.
-20,0.637408390225,0.698996675371,3.25232492145,4.73814732384,5.9612010251,1.,0.
-21,-0.407396859412,1.17456342803,2.49526823723,4.59323415742,5.82501686811,1.,0.
-22,-0.967485452118,1.66655933642,2.47284606244,4.58316034754,5.88721406681,1.,0.
-23,0.474480867904,1.95018556323,2.0228950072,4.48651142819,5.8255943735,1.,0.
-24,1.04309652155,2.23519892356,1.91924131572,4.19094661783,5.87457348436,1.,0.
-25,-0.517861513772,2.12501967336,1.70266619979,4.05280882887,5.72160912899,1.,0.
-26,-0.945301585146,1.65464653549,1.81567174251,3.92309850635,5.58270493814,1.,0.
-27,0.501153868974,1.40600764889,1.53991387719,3.72853247942,5.60169001727,1.,0.
-28,0.972859524418,1.00344321868,1.5175642828,3.64092376655,5.10567722582,1.,0.
-29,-0.70553406135,0.465306263885,1.7038540803,3.33236870312,5.09182481555,1.,0.
-30,-0.946093634916,0.294539309453,1.88052827037,2.93011492669,4.97354922696,1.,0.
-31,0.47922123231,0.308465865031,2.03445883031,2.90772899045,4.86241793548,1.,0.
-32,0.754030014252,0.549752241167,2.46115815089,2.95063349534,4.71834614627,1.,0.
-33,-0.64875949826,0.894615488148,2.5922463381,2.81269864022,4.43480095104,1.,0.
-34,-0.757829951086,1.39123914261,2.69258079904,2.61834837315,4.36580046156,1.,0.
-35,0.565653301088,1.72360022693,2.97794913834,2.80403840334,4.27327248459,1.,0.
-36,0.867440092372,2.21100730052,3.38648090792,2.84057515729,4.12210169576,1.,0.
-37,-0.894567758095,2.17549105818,3.45532493329,2.90446025717,4.00251740584,1.,0.
-38,-0.715442356893,2.15105389965,3.52041791902,3.03650393392,4.12809249577,1.,0.
-39,0.80671703672,1.81504564517,3.60463324866,3.00747789871,3.98440762467,1.,0.
-40,0.527014790142,1.31803513865,3.43842186337,3.3332594663,4.03232406566,1.,0.
-41,-0.795936862129,0.847809114454,3.09875133548,3.52863155938,3.94883924909,1.,0.
-42,-0.610245806946,0.425530441018,2.92581949152,3.77238736123,4.27287245021,1.,0.
-43,0.611662279431,0.178432049837,2.48128214822,3.73212087883,4.17319013831,1.,0.
-44,0.650866553108,0.220341648392,2.41694642022,4.2609098519,4.27271645905,1.,0.
-45,-0.774156982023,0.632667602331,2.05474356052,4.32889204886,4.18029723271,1.,0.
-46,-0.714058448409,0.924562377599,1.75706135146,4.52492718422,4.3972678094,1.,0.
-47,0.889627293379,1.46207968841,1.78299357672,4.64466731095,4.56317887554,1.,0.
-48,0.520140662861,1.8996333843,1.41377633823,4.48899091177,4.78805049769,1.,0.
-49,-1.03816935616,2.08997002059,1.51218375351,4.84167764204,4.93026048606,1.,0.
-50,-0.40772951362,2.30878972136,1.44144415128,4.76854460997,5.01538444629,1.,0.
-51,0.792730684781,1.91367048509,1.58887384677,4.71739397335,5.25690012199,1.,0.
-52,0.371311881576,1.67565079528,1.81688563053,4.60353107555,5.44265822961,1.,0.
-53,-0.814398070371,1.13374634126,1.80328814859,4.72264252878,5.52674761122,1.,0.
-54,-0.469017949323,0.601244136627,2.29690896736,4.49859178859,5.54126153454,1.,0.
-55,0.871044371426,0.407597593794,2.7499112487,4.19060637761,5.57693767301,1.,0.
-56,0.523764933017,0.247705192709,3.09002071379,4.02095509006,5.80510362182,1.,0.
-57,-0.881326403531,0.31513103164,3.11358205718,3.96079100808,5.81000652365,1.,0.
-58,-0.357928025339,0.486163915865,3.17884556771,3.72634990659,5.85693642011,1.,0.
-59,0.853038779822,1.04218094475,3.45835384454,3.36703969978,5.9585988449,1.,0.
-60,0.435311516013,1.59715085283,3.63313338588,3.11276729421,5.93643818229,1.,0.
-61,-1.02703719138,1.92205832542,3.47606111735,3.06247155999,6.02106646259,1.,0.
-62,-0.246661325557,2.14653802542,3.29446326567,2.89936259181,5.67531541272,1.,0.
-63,1.02554736569,2.25943737733,3.07031591528,2.78176218013,5.78206328989,1.,0.
-64,0.337814475969,2.07589147224,2.80356226089,2.55888206331,5.7094075496,1.,0.
-65,-1.12023369929,1.25333011618,2.56497288445,2.77361359194,5.50799418376,1.,0.
-66,-0.178980246554,1.11937139901,2.51598681313,2.91438309151,5.47469577206,1.,0.
-67,0.97550951531,0.60553823137,2.11657741073,2.88081098981,5.37034999502,1.,0.
-68,0.136653357206,0.365828836075,1.97386033165,3.13217903204,5.07254490219,1.,0.
-69,-1.05607596951,0.153152115069,1.52110743825,3.01308794192,5.08902539125,1.,0.
-70,-0.13095280331,0.337113974483,1.52703079853,3.16687131599,4.86649398514,1.,0.
-71,1.07081057754,0.714247566736,1.53761382634,3.45151989484,4.75892309166,1.,0.
-72,0.0153410376082,1.24631231847,1.61690939161,3.85481994498,4.35683752832,1.,0.
-73,-0.912801257303,1.60791309476,1.8729264524,4.03037260012,4.36072588913,1.,0.
-74,-0.0894895640338,2.02535207407,1.93484909619,4.09557485132,4.35327025188,1.,0.
-75,0.978646999652,2.20085086625,2.09003440427,4.27542353033,4.1805058388,1.,0.
-76,-0.113312642876,2.2444100761,2.50789248839,4.4151861502,4.03267168136,1.,0.
-77,-1.00215099149,1.84305628445,2.61691237246,4.45425147595,3.81203553766,1.,0.
-78,-0.0183234614205,1.49573923116,2.99308471214,4.71134960112,4.0273804959,1.,0.
-79,1.0823738177,1.12211589848,3.27079386925,4.94288270502,4.01851068083,1.,0.
-80,0.124370187893,0.616474412808,3.4284236674,4.76942168327,3.9749536483,1.,0.
-81,-0.929423379352,0.290977090976,3.34131726136,4.78590392707,4.10190661656,1.,0.
-82,0.23766302648,0.155302052254,3.49779513794,4.64605656795,4.15571321107,1.,0.
-83,1.03531486192,0.359702776204,3.4880725919,4.48167586667,4.21134561991,1.,0.
-84,-0.261234571382,0.713877760378,3.42756426614,4.426443869,4.25208300527,1.,0.
-85,-1.03572442277,1.25001113691,2.96908341113,4.25500915322,4.25723010649,1.,0.
-86,0.380034261243,1.70543355622,2.73605932518,4.16703432307,4.63700400788,1.,0.
-87,1.03734873488,1.97544410562,2.55586572141,3.84976673263,4.55282864289,1.,0.
-88,-0.177344253372,2.22614526325,2.09565864891,3.77378097953,4.82577400298,1.,0.
-89,-0.976821526892,2.18385079177,1.78522284118,3.67768223554,5.06302440873,1.,0.
-90,0.264820472091,1.86981946157,1.50048403865,3.43619796921,5.05651761669,1.,0.
-91,1.05642344868,1.47568646076,1.51347671977,3.20898518885,5.50149047462,1.,0.
-92,-0.311607433358,1.04226467636,1.52089650905,3.02291865417,5.4889046232,1.,0.
-93,-0.724285777937,0.553052311957,1.48573560173,2.7365973598,5.72549174225,1.,0.
-94,0.519859192905,0.226520626591,1.61543723167,2.84102086852,5.69330622288,1.,0.
-95,1.0323195039,0.260873217055,1.81913034804,2.83951143848,5.90325028086,1.,0.
-96,-0.53285682538,0.387695521405,1.70935609313,2.57977050631,5.79579213161,1.,0.
-97,-0.975127997215,0.920948771589,2.51292643636,2.71004616612,5.87016469227,1.,0.
-98,0.540246804099,1.36445470181,2.61949412896,2.98482553485,6.02447664937,1.,0.
-99,0.987764008058,1.85581989607,2.84685706149,2.94760204892,6.0212151724,1.,0.
+0,0.926906299771,1.99107237682,2.56546245685,3.07914768197,4.04839057867,1.,0.,strkeya
+1,0.108010001864,1.41645361423,2.1686839775,2.94963962176,4.1263503303,1.,0.,strkeyb
+2,-0.800567600028,1.0172132907,1.96434754116,2.99885333086,4.04300485864,1.,0.,strkey
+3,0.0607042871898,0.719540073421,1.9765012584,2.89265588817,4.0951014426,1.,0.,strkey
+4,0.933712200629,0.28052120776,1.41018552514,2.69232603996,4.06481164223,1.,0.,strkey
+5,-0.171730652974,0.260054421028,1.48770816369,2.62199129293,4.44572807842,1.,0.,strkey
+6,-1.00180162933,0.333045158863,1.50006392277,2.88888309683,4.24755865606,1.,0.,strkey
+7,0.0580061875336,0.688929398826,1.56543458772,2.99840358953,4.52726873347,1.,0.,strkey
+8,0.764139447412,1.24704875327,1.77649279698,3.13578593851,4.63238922951,1.,0.,strkey
+9,-0.230331874785,1.47903998963,2.03547545751,3.20624030377,4.77980005228,1.,0.,strkey
+10,-1.03846045211,2.01133000781,2.31977503972,3.67951536251,5.09716775897,1.,0.,strkeyc
+11,0.188643592253,2.23285349038,2.68338482249,3.49817168611,5.24928239634,1.,0.,strkey
+12,0.91207302309,2.24244446841,2.71362604985,3.96332587625,5.37802271594,1.,0.,strkey
+13,-0.296588665881,2.02594634141,3.07733910479,3.99698324956,5.56365901394,1.,0.,strkey
+14,-0.959961476551,1.45078629833,3.18996420137,4.3763059609,5.65356015609,1.,0.,strkey
+15,0.46313530679,1.01141441548,3.4980215948,4.20224896882,5.88842247449,1.,0.,strkey
+16,0.929354125798,0.626635305936,3.70508262244,4.51791573544,5.73945973251,1.,0.,strkey
+17,-0.519110731957,0.269249223148,3.39866823332,4.46802003061,5.82768174382,1.,0.,strkey
+18,-0.924330981367,0.349602834684,3.21762413294,4.72803587499,5.94918925767,1.,0.,strkey
+19,0.253239387885,0.345158023497,3.11071425333,4.79311566935,5.9489259713,1.,0.,strkey
+20,0.637408390225,0.698996675371,3.25232492145,4.73814732384,5.9612010251,1.,0.,strkey
+21,-0.407396859412,1.17456342803,2.49526823723,4.59323415742,5.82501686811,1.,0.,strkey
+22,-0.967485452118,1.66655933642,2.47284606244,4.58316034754,5.88721406681,1.,0.,strkey
+23,0.474480867904,1.95018556323,2.0228950072,4.48651142819,5.8255943735,1.,0.,strkey
+24,1.04309652155,2.23519892356,1.91924131572,4.19094661783,5.87457348436,1.,0.,strkey
+25,-0.517861513772,2.12501967336,1.70266619979,4.05280882887,5.72160912899,1.,0.,strkey
+26,-0.945301585146,1.65464653549,1.81567174251,3.92309850635,5.58270493814,1.,0.,strkey
+27,0.501153868974,1.40600764889,1.53991387719,3.72853247942,5.60169001727,1.,0.,strkey
+28,0.972859524418,1.00344321868,1.5175642828,3.64092376655,5.10567722582,1.,0.,strkey
+29,-0.70553406135,0.465306263885,1.7038540803,3.33236870312,5.09182481555,1.,0.,strkey
+30,-0.946093634916,0.294539309453,1.88052827037,2.93011492669,4.97354922696,1.,0.,strkey
+31,0.47922123231,0.308465865031,2.03445883031,2.90772899045,4.86241793548,1.,0.,strkey
+32,0.754030014252,0.549752241167,2.46115815089,2.95063349534,4.71834614627,1.,0.,strkey
+33,-0.64875949826,0.894615488148,2.5922463381,2.81269864022,4.43480095104,1.,0.,strkey
+34,-0.757829951086,1.39123914261,2.69258079904,2.61834837315,4.36580046156,1.,0.,strkey
+35,0.565653301088,1.72360022693,2.97794913834,2.80403840334,4.27327248459,1.,0.,strkey
+36,0.867440092372,2.21100730052,3.38648090792,2.84057515729,4.12210169576,1.,0.,strkey
+37,-0.894567758095,2.17549105818,3.45532493329,2.90446025717,4.00251740584,1.,0.,strkeyd
+38,-0.715442356893,2.15105389965,3.52041791902,3.03650393392,4.12809249577,1.,0.,strkey
+39,0.80671703672,1.81504564517,3.60463324866,3.00747789871,3.98440762467,1.,0.,strkey
+40,0.527014790142,1.31803513865,3.43842186337,3.3332594663,4.03232406566,1.,0.,strkey
+41,-0.795936862129,0.847809114454,3.09875133548,3.52863155938,3.94883924909,1.,0.,strkey
+42,-0.610245806946,0.425530441018,2.92581949152,3.77238736123,4.27287245021,1.,0.,strkey
+43,0.611662279431,0.178432049837,2.48128214822,3.73212087883,4.17319013831,1.,0.,strkey
+44,0.650866553108,0.220341648392,2.41694642022,4.2609098519,4.27271645905,1.,0.,strkey
+45,-0.774156982023,0.632667602331,2.05474356052,4.32889204886,4.18029723271,1.,0.,strkey
+46,-0.714058448409,0.924562377599,1.75706135146,4.52492718422,4.3972678094,1.,0.,strkey
+47,0.889627293379,1.46207968841,1.78299357672,4.64466731095,4.56317887554,1.,0.,strkey
+48,0.520140662861,1.8996333843,1.41377633823,4.48899091177,4.78805049769,1.,0.,strkey
+49,-1.03816935616,2.08997002059,1.51218375351,4.84167764204,4.93026048606,1.,0.,strkey
+50,-0.40772951362,2.30878972136,1.44144415128,4.76854460997,5.01538444629,1.,0.,strkey
+51,0.792730684781,1.91367048509,1.58887384677,4.71739397335,5.25690012199,1.,0.,strkey
+52,0.371311881576,1.67565079528,1.81688563053,4.60353107555,5.44265822961,1.,0.,strkey
+53,-0.814398070371,1.13374634126,1.80328814859,4.72264252878,5.52674761122,1.,0.,strkey
+54,-0.469017949323,0.601244136627,2.29690896736,4.49859178859,5.54126153454,1.,0.,strkey
+55,0.871044371426,0.407597593794,2.7499112487,4.19060637761,5.57693767301,1.,0.,strkey
+56,0.523764933017,0.247705192709,3.09002071379,4.02095509006,5.80510362182,1.,0.,strkey
+57,-0.881326403531,0.31513103164,3.11358205718,3.96079100808,5.81000652365,1.,0.,strkey
+58,-0.357928025339,0.486163915865,3.17884556771,3.72634990659,5.85693642011,1.,0.,strkey
+59,0.853038779822,1.04218094475,3.45835384454,3.36703969978,5.9585988449,1.,0.,strkey
+60,0.435311516013,1.59715085283,3.63313338588,3.11276729421,5.93643818229,1.,0.,strkey
+61,-1.02703719138,1.92205832542,3.47606111735,3.06247155999,6.02106646259,1.,0.,strkey
+62,-0.246661325557,2.14653802542,3.29446326567,2.89936259181,5.67531541272,1.,0.,strkey
+63,1.02554736569,2.25943737733,3.07031591528,2.78176218013,5.78206328989,1.,0.,strkey
+64,0.337814475969,2.07589147224,2.80356226089,2.55888206331,5.7094075496,1.,0.,strkey
+65,-1.12023369929,1.25333011618,2.56497288445,2.77361359194,5.50799418376,1.,0.,strkey
+66,-0.178980246554,1.11937139901,2.51598681313,2.91438309151,5.47469577206,1.,0.,strkey
+67,0.97550951531,0.60553823137,2.11657741073,2.88081098981,5.37034999502,1.,0.,strkey
+68,0.136653357206,0.365828836075,1.97386033165,3.13217903204,5.07254490219,1.,0.,strkey
+69,-1.05607596951,0.153152115069,1.52110743825,3.01308794192,5.08902539125,1.,0.,strkey
+70,-0.13095280331,0.337113974483,1.52703079853,3.16687131599,4.86649398514,1.,0.,strkey
+71,1.07081057754,0.714247566736,1.53761382634,3.45151989484,4.75892309166,1.,0.,strkey
+72,0.0153410376082,1.24631231847,1.61690939161,3.85481994498,4.35683752832,1.,0.,strkey
+73,-0.912801257303,1.60791309476,1.8729264524,4.03037260012,4.36072588913,1.,0.,strkey
+74,-0.0894895640338,2.02535207407,1.93484909619,4.09557485132,4.35327025188,1.,0.,strkey
+75,0.978646999652,2.20085086625,2.09003440427,4.27542353033,4.1805058388,1.,0.,strkey
+76,-0.113312642876,2.2444100761,2.50789248839,4.4151861502,4.03267168136,1.,0.,strkey
+77,-1.00215099149,1.84305628445,2.61691237246,4.45425147595,3.81203553766,1.,0.,strkey
+78,-0.0183234614205,1.49573923116,2.99308471214,4.71134960112,4.0273804959,1.,0.,strkey
+79,1.0823738177,1.12211589848,3.27079386925,4.94288270502,4.01851068083,1.,0.,strkey
+80,0.124370187893,0.616474412808,3.4284236674,4.76942168327,3.9749536483,1.,0.,strkey
+81,-0.929423379352,0.290977090976,3.34131726136,4.78590392707,4.10190661656,1.,0.,strkey
+82,0.23766302648,0.155302052254,3.49779513794,4.64605656795,4.15571321107,1.,0.,strkey
+83,1.03531486192,0.359702776204,3.4880725919,4.48167586667,4.21134561991,1.,0.,strkey
+84,-0.261234571382,0.713877760378,3.42756426614,4.426443869,4.25208300527,1.,0.,strkey
+85,-1.03572442277,1.25001113691,2.96908341113,4.25500915322,4.25723010649,1.,0.,strkey
+86,0.380034261243,1.70543355622,2.73605932518,4.16703432307,4.63700400788,1.,0.,strkey
+87,1.03734873488,1.97544410562,2.55586572141,3.84976673263,4.55282864289,1.,0.,strkey
+88,-0.177344253372,2.22614526325,2.09565864891,3.77378097953,4.82577400298,1.,0.,strkey
+89,-0.976821526892,2.18385079177,1.78522284118,3.67768223554,5.06302440873,1.,0.,strkey
+90,0.264820472091,1.86981946157,1.50048403865,3.43619796921,5.05651761669,1.,0.,strkey
+91,1.05642344868,1.47568646076,1.51347671977,3.20898518885,5.50149047462,1.,0.,strkey
+92,-0.311607433358,1.04226467636,1.52089650905,3.02291865417,5.4889046232,1.,0.,strkey
+93,-0.724285777937,0.553052311957,1.48573560173,2.7365973598,5.72549174225,1.,0.,strkey
+94,0.519859192905,0.226520626591,1.61543723167,2.84102086852,5.69330622288,1.,0.,strkey
+95,1.0323195039,0.260873217055,1.81913034804,2.83951143848,5.90325028086,1.,0.,strkey
+96,-0.53285682538,0.387695521405,1.70935609313,2.57977050631,5.79579213161,1.,0.,strkey
+97,-0.975127997215,0.920948771589,2.51292643636,2.71004616612,5.87016469227,1.,0.,strkey
+98,0.540246804099,1.36445470181,2.61949412896,2.98482553485,6.02447664937,1.,0.,strkey
+99,0.987764008058,1.85581989607,2.84685706149,2.94760204892,6.0212151724,1.,0.,strkey
diff --git a/tensorflow/contrib/timeseries/examples/known_anomaly.py b/tensorflow/contrib/timeseries/examples/known_anomaly.py
index 7659dd308a7ee1b70d6688b85e4f6157ddee0540..c08c0b0acb917f527d7efa91874d6405b9220083 100644
--- a/tensorflow/contrib/timeseries/examples/known_anomaly.py
+++ b/tensorflow/contrib/timeseries/examples/known_anomaly.py
@@ -46,12 +46,12 @@ def train_and_evaluate_exogenous(csv_file_name=_DATA_FILE, train_steps=300):
 
   # Indicate the format of our exogenous feature, in this case a string
   # representing a boolean value.
-  string_feature = tf.contrib.layers.sparse_column_with_keys(
-      column_name="is_changepoint", keys=["no", "yes"])
+  string_feature = tf.feature_column.categorical_column_with_vocabulary_list(
+      key="is_changepoint", vocabulary_list=["no", "yes"])
   # Specify the way this feature is presented to the model, here using a one-hot
   # encoding.
-  one_hot_feature = tf.contrib.layers.one_hot_column(
-      sparse_id_column=string_feature)
+  one_hot_feature = tf.feature_column.indicator_column(
+      categorical_column=string_feature)
 
   estimator = tf.contrib.timeseries.StructuralEnsembleRegressor(
       periodicities=12,
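Note on the known_anomaly.py change above: the hunk swaps a sparse column with keys plus a one-hot column for a vocabulary-list categorical column wrapped in an indicator column. The following is a minimal pure-Python sketch of what that combination computes conceptually. It is not TensorFlow's implementation; `vocabulary_ids` and `indicator` are hypothetical helper names, and mapping unknown strings to -1 (and from there to an all-zero row) is an assumption made for this sketch.

```python
def vocabulary_ids(values, vocabulary_list):
    """Map each string to its index in vocabulary_list, or -1 if absent."""
    index = {key: i for i, key in enumerate(vocabulary_list)}
    return [index.get(value, -1) for value in values]


def indicator(ids, depth):
    """One-hot encode integer ids; out-of-range ids give an all-zero row."""
    return [[1.0 if i == column else 0.0 for column in range(depth)]
            for i in ids]


# Same vocabulary as the "is_changepoint" feature in the hunk above.
vocabulary = ["no", "yes"]
ids = vocabulary_ids(["no", "yes", "maybe"], vocabulary)
one_hot = indicator(ids, depth=len(vocabulary))
print(ids)
print(one_hot)
```

The position of a key in the vocabulary list fixes which one-hot slot it occupies, so reordering `["no", "yes"]` would change the encoding the model sees.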
diff --git a/tensorflow/contrib/timeseries/examples/lstm.py b/tensorflow/contrib/timeseries/examples/lstm.py
index f37cafcc502dc9415db0829b9b067b862f87dca7..2eee878196bb64b523c491ca808ca8d6ff5dd36c 100644
--- a/tensorflow/contrib/timeseries/examples/lstm.py
+++ b/tensorflow/contrib/timeseries/examples/lstm.py
@@ -59,10 +59,10 @@ class _LSTMModel(ts_model.SequentialTimeSeriesModel):
       num_units: The number of units in the model's LSTMCell.
       num_features: The dimensionality of the time series (features per
         timestep).
-      exogenous_feature_columns: A list of tf.contrib.layers.FeatureColumn
-          objects representing features which are inputs to the model but are
-          not predicted by it. These must then be present for training,
-          evaluation, and prediction.
+      exogenous_feature_columns: A list of `tf.feature_column`s representing
+          features which are inputs to the model but are not predicted by
+          it. These must then be present for training, evaluation, and
+          prediction.
       dtype: The floating point data type to use.
     """
     super(_LSTMModel, self).__init__(
@@ -189,12 +189,16 @@ def train_and_predict(
     export_directory=None):
   """Train and predict using a custom time series model."""
   # Construct an Estimator from our LSTM model.
+  categorical_column = tf.feature_column.categorical_column_with_hash_bucket(
+      key="categorical_exogenous_feature", hash_bucket_size=16)
   exogenous_feature_columns = [
       # Exogenous features are not part of the loss, but can inform
       # predictions. In this example the features have no extra information, but
       # are included as an API example.
-      tf.contrib.layers.real_valued_column(
-          "2d_exogenous_feature", dimension=2)]
+      tf.feature_column.numeric_column(
+          "2d_exogenous_feature", shape=(2,)),
+      tf.feature_column.embedding_column(
+          categorical_column=categorical_column, dimension=10)]
   estimator = ts_estimators.TimeSeriesRegressor(
       model=_LSTMModel(num_features=5, num_units=128,
                        exogenous_feature_columns=exogenous_feature_columns),
@@ -205,7 +209,11 @@ def train_and_predict(
       csv_file_name,
       column_names=((tf.contrib.timeseries.TrainEvalFeatures.TIMES,)
                     + (tf.contrib.timeseries.TrainEvalFeatures.VALUES,) * 5
-                    + ("2d_exogenous_feature",) * 2))
+                    + ("2d_exogenous_feature",) * 2
+                    + ("categorical_exogenous_feature",)),
+      # Data types other than for `times` need to be specified if they aren't
+      # float32. In this case one of our exogenous features has string dtype.
+      column_dtypes=((tf.int64,) + (tf.float32,) * 7 + (tf.string,)))
   train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
       reader, batch_size=4, window_size=32)
   estimator.train(input_fn=train_input_fn, steps=training_steps)
@@ -215,7 +223,9 @@ def train_and_predict(
   predict_exogenous_features = {
       "2d_exogenous_feature": numpy.concatenate(
           [numpy.ones([1, 100, 1]), numpy.zeros([1, 100, 1])],
-          axis=-1)}
+          axis=-1),
+      "categorical_exogenous_feature": numpy.array(
+          ["strkey"] * 100)[None, :, None]}
   (predictions,) = tuple(estimator.predict(
       input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
           evaluation, steps=100,
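Note on the lstm.py change above: `column_names` and `column_dtypes` are built as parallel tuples, so both must describe the same nine CSV columns (one time index, five values, two numeric exogenous columns, one string column). The sketch below illustrates that positional pairing on the first row of the updated multivariate_periods.csv. It assumes plain strings as stand-ins for `tf.int64`, `tf.float32` and `tf.string`, placeholder key names `"times"` and `"values"`, and a hypothetical `parse_row` helper; it does not reproduce how the timeseries CSV reader actually parses input.

```python
import csv
import io

# Hypothetical stand-ins for tf.int64, tf.float32 and tf.string.
INT64, FLOAT32, STRING = "int64", "float32", "string"
_CASTS = {INT64: int, FLOAT32: float, STRING: str}

# Placeholder key names; the diff uses TrainEvalFeatures.TIMES and VALUES.
column_names = (("times",) + ("values",) * 5
                + ("2d_exogenous_feature",) * 2
                + ("categorical_exogenous_feature",))
column_dtypes = (INT64,) + (FLOAT32,) * 7 + (STRING,)


def parse_row(row, names, dtypes):
    """Cast each cell by position and group repeated names into lists."""
    if not len(row) == len(names) == len(dtypes):
        raise ValueError("row, names and dtypes must have the same length")
    features = {}
    for cell, name, dtype in zip(row, names, dtypes):
        features.setdefault(name, []).append(_CASTS[dtype](cell))
    return features


# First data row of the updated CSV in this diff.
line = ("0,0.926906299771,1.99107237682,2.56546245685,3.07914768197,"
        "4.04839057867,1.,0.,strkeya")
row = next(csv.reader(io.StringIO(line)))
features = parse_row(row, column_names, column_dtypes)
print(features)
```

Repeating a name in `column_names` groups adjacent columns into one feature, which is why the five value columns and the two exogenous columns each end up as a single multi-element entry.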
diff --git a/tensorflow/contrib/timeseries/python/timeseries/BUILD b/tensorflow/contrib/timeseries/python/timeseries/BUILD
index fff972c1f3277ad5d83673a202a50d1e6f7df210..ed3ed4c0e1731df62e9197aa7471fd6a31e9858e 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/BUILD
+++ b/tensorflow/contrib/timeseries/python/timeseries/BUILD
@@ -140,11 +140,13 @@ py_library(
         "//tensorflow/python:framework_ops",
         "//tensorflow/python:math_ops",
         "//tensorflow/python:state_ops",
+        "//tensorflow/python:summary",
         "//tensorflow/python:util",
         "//tensorflow/python:variable_scope",
         "//tensorflow/python/estimator:estimator_py",
         "//tensorflow/python/estimator:export",
         "//tensorflow/python/estimator:head",
+        "//tensorflow/python/estimator:metric_keys",
     ],
 )
 
diff --git a/tensorflow/contrib/timeseries/python/timeseries/estimators.py b/tensorflow/contrib/timeseries/python/timeseries/estimators.py
index f8355f366fe8e191ab570fd271bbe4a8bf71c73d..8d13343e82340dae11b0be54e3bc3152060dca36 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/estimators.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/estimators.py
@@ -18,8 +18,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.contrib.layers.python.layers import feature_column
-
 from tensorflow.contrib.timeseries.python.timeseries import ar_model
 from tensorflow.contrib.timeseries.python.timeseries import feature_keys
 from tensorflow.contrib.timeseries.python.timeseries import head as ts_head_lib
@@ -31,10 +29,12 @@ from tensorflow.contrib.timeseries.python.timeseries.state_space_models.filterin
 
 from tensorflow.python.estimator import estimator_lib
 from tensorflow.python.estimator.export import export_lib
+from tensorflow.python.feature_column import feature_column
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import parsing_ops
 from tensorflow.python.training import training as train
 
 
@@ -117,22 +117,29 @@ class TimeSeriesRegressor(estimator_lib.Estimator):
                   dtype=self._model.dtype),
               shape=(default_batch_size, default_series_length,
                      self._model.num_features)))
-      with ops.Graph().as_default():
-        # Default placeholders have only an unknown batch dimension. Make them
-        # in a separate graph, then splice in the series length to the shapes
-        # and re-create them in the outer graph.
-        exogenous_feature_shapes = {
-            key: (value.get_shape(), value.dtype) for key, value
-            in feature_column.make_place_holder_tensors_for_base_features(
-                self._model.exogenous_feature_columns).items()}
-      for feature_key, (batch_only_feature_shape, value_dtype) in (
-          exogenous_feature_shapes.items()):
-        batch_only_feature_shape = batch_only_feature_shape.with_rank_at_least(
-            1).as_list()
-        feature_shape = ([default_batch_size, default_series_length]
-                         + batch_only_feature_shape[1:])
-        placeholders[feature_key] = array_ops.placeholder(
-            dtype=value_dtype, name=feature_key, shape=feature_shape)
+      if self._model.exogenous_feature_columns:
+        with ops.Graph().as_default():
+          # Default placeholders have only an unknown batch dimension. Make them
+          # in a separate graph, then splice in the series length to the shapes
+          # and re-create them in the outer graph.
+          parsed_features = (
+              feature_column.make_parse_example_spec(
+                  self._model.exogenous_feature_columns))
+          placeholder_features = parsing_ops.parse_example(
+              serialized=array_ops.placeholder(
+                  shape=[None], dtype=dtypes.string),
+              features=parsed_features)
+          exogenous_feature_shapes = {
+              key: (value.get_shape(), value.dtype) for key, value
+              in placeholder_features.items()}
+        for feature_key, (batch_only_feature_shape, value_dtype) in (
+            exogenous_feature_shapes.items()):
+          batch_only_feature_shape = (
+              batch_only_feature_shape.with_rank_at_least(1).as_list())
+          feature_shape = ([default_batch_size, default_series_length]
+                           + batch_only_feature_shape[1:])
+          placeholders[feature_key] = array_ops.placeholder(
+              dtype=value_dtype, name=feature_key, shape=feature_shape)
       # Models may not know the shape of their state without creating some
       # variables/ops. Avoid polluting the default graph by making a new one. We
       # use only static metadata from the returned Tensors.
@@ -333,11 +340,11 @@ class StructuralEnsembleRegressor(StateSpaceRegressor):
           determine the model size. Learning autoregressive coefficients
           typically requires more steps and a smaller step size than other
           components.
-      exogenous_feature_columns: A list of tf.contrib.layers.FeatureColumn
-          objects (for example tf.contrib.layers.embedding_column) corresponding
-          to exogenous features which provide extra information to the model but
-          are not part of the series to be predicted. Passed to
-          tf.contrib.layers.input_from_feature_columns.
+      exogenous_feature_columns: A list of `tf.feature_column`s (for example
+          `tf.feature_column.embedding_column`) corresponding to exogenous
+          features which provide extra information to the model but are not part
+          of the series to be predicted. Passed to
+          `tf.feature_column.input_layer`.
       exogenous_update_condition: A function taking two Tensor arguments,
           `times` (shape [batch size]) and `features` (a dictionary mapping
           exogenous feature keys to Tensors with shapes [batch size, ...]), and
diff --git a/tensorflow/contrib/timeseries/python/timeseries/head.py b/tensorflow/contrib/timeseries/python/timeseries/head.py
index f0330bfbbd6e8067e5d085376acdf2e6bcaccb6a..1d96145e59ce80c2add6528da6e4e7ec9a1361f5 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/head.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/head.py
@@ -26,6 +26,7 @@ from tensorflow.contrib.timeseries.python.timeseries import feature_keys
 
 from tensorflow.python.estimator import estimator_lib
 from tensorflow.python.estimator.canned import head as head_lib
+from tensorflow.python.estimator.canned import metric_keys
 from tensorflow.python.estimator.export import export_lib
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
@@ -35,6 +36,7 @@ from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.util import nest
+from tensorflow.python.summary import summary
 
 
 def time_series_regression_head(model,
@@ -71,9 +73,32 @@ class _TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acc
     self.input_statistics_generator = input_statistics_generator
     self._name = name
 
+  @property
+  def name(self):
+    return self._name
+
+  # TODO(terrytangyuan): consolidate `model_outputs` and `_Head.LossSpec`
+  # once `_Head.create_loss` becomes extendable
+  def create_loss(self, features, mode, logits=None, labels=None):
+    """See `_Head`."""
+    model_outputs = self.state_manager.define_loss(
+        self.model, features, mode)
+    summary.scalar(
+        head_lib._summary_key(self._name, metric_keys.MetricKeys.LOSS),
+        model_outputs.loss)
+    return model_outputs
+
+  @property
+  def logits_dimension(self):
+    """See `_Head`."""
+    return 1
+
   def _train_ops(self, features):
     """Add training ops to the graph."""
-    with variable_scope.variable_scope("model"):
+    with variable_scope.variable_scope(
+        "model",
+        # Use ResourceVariables to avoid race conditions.
+        use_resource=True):
       model_outputs = self.state_manager.define_loss(
           self.model, features, estimator_lib.ModeKeys.TRAIN)
 
@@ -88,26 +113,9 @@ class _TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acc
         mode=estimator_lib.ModeKeys.TRAIN,
         train_op=train_op)
 
-  # TODO(terrytangyuan): suffix summary and metrics keys by `"/" + name`
-  @property
-  def name(self):
-    return self._name
-
-  # TODO(terrytangyuan): unused for now. Need to decouple
-  # `state_manager.define_loss` to satisfy the extendable return signature of
-  # `_Head.create_loss`.
-  def create_loss(self, features, mode, logits, labels):
-    """See `_Head`."""
-    return None
-
-  # TODO(terrytangyuan): check label dimension
-  @property
-  def logits_dimension(self):
-    return None
-
   def _evaluate_ops(self, features):
     """Add ops for evaluation (aka filtering) to the graph."""
-    with variable_scope.variable_scope("model"):
+    with variable_scope.variable_scope("model", use_resource=True):
       model_outputs = self.state_manager.define_loss(
           self.model, features, estimator_lib.ModeKeys.EVAL)
     metrics = {}
@@ -128,7 +136,7 @@ class _TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acc
 
   def _predict_ops(self, features):
     """Add ops for prediction to the graph."""
-    with variable_scope.variable_scope("model"):
+    with variable_scope.variable_scope("model", use_resource=True):
       prediction = self.model.predict(features=features)
     prediction[feature_keys.PredictionResults.TIMES] = features[
         feature_keys.PredictionFeatures.TIMES]
@@ -137,12 +145,11 @@ class _TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acc
 
   def _serving_ops(self, features):
     """Add ops for serving to the graph."""
-    with variable_scope.variable_scope("model"):
+    with variable_scope.variable_scope("model", use_resource=True):
       prediction_outputs = self.model.predict(features=features)
     with variable_scope.variable_scope("model", reuse=True):
-      filtering_outputs = self.state_manager.define_loss(
-          self.model, features, estimator_lib.ModeKeys.EVAL)
-
+      filtering_outputs = self.create_loss(
+          features, estimator_lib.ModeKeys.EVAL)
     return estimator_lib.EstimatorSpec(
         mode=estimator_lib.ModeKeys.PREDICT,
         export_outputs={
@@ -191,7 +198,7 @@ class _TimeSeriesRegressionHead(head_lib._Head):  # pylint:disable=protected-acc
 
   def create_estimator_spec(self, features, mode, labels=None):
     """Performs basic error checking and returns an EstimatorSpec."""
-    with ops.name_scope("head"):
+    with ops.name_scope(self._name, "head"):
       if labels:
         raise ValueError(
             "The model received a `labels` dictionary, which is "
diff --git a/tensorflow/contrib/timeseries/python/timeseries/input_pipeline.py b/tensorflow/contrib/timeseries/python/timeseries/input_pipeline.py
index d4ee59036624cffb216709e096981d362670e416..04225333b9377447f46d32663df76aece97a51e7 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/input_pipeline.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/input_pipeline.py
@@ -500,6 +500,41 @@ class CSVReader(ReaderBaseTimeSeriesParser):
     return features
 
 
+class TFExampleReader(ReaderBaseTimeSeriesParser):
+  """Reads and parses `tf.Example`s from a TFRecords file."""
+
+  def __init__(self,
+               filenames,
+               features):
+    """Configure `tf.Example` parsing.
+
+    Args:
+      filenames: A filename or list of filenames to read the time series
+          from. Each record must be a serialized `tf.Example` proto.
+      features: A dictionary mapping from feature keys to `tf.FixedLenFeature`
+          objects. Must include `TrainEvalFeatures.TIMES` (scalar integer) and
+          `TrainEvalFeatures.VALUES` (floating point vector) features.
+    Raises:
+      ValueError: If required times/values features are not present.
+    """
+    if feature_keys.TrainEvalFeatures.TIMES not in features:
+      raise ValueError("'{}' is a required column.".format(
+          feature_keys.TrainEvalFeatures.TIMES))
+    if feature_keys.TrainEvalFeatures.VALUES not in features:
+      raise ValueError("'{}' is a required column.".format(
+          feature_keys.TrainEvalFeatures.VALUES))
+    self._features = features
+    super(TFExampleReader, self).__init__(filenames=filenames)
+
+  def _get_reader(self):
+    return io_ops.TFRecordReader()
+
+  def _process_records(self, examples):
+    """Parse `tf.Example`s into `Tensors`."""
+    return parsing_ops.parse_example(
+        serialized=examples, features=self._features)
+
+
 class TimeSeriesInputFn(object):
   """Base for classes which create batches of windows from a time series."""
 
diff --git a/tensorflow/contrib/timeseries/python/timeseries/input_pipeline_test.py b/tensorflow/contrib/timeseries/python/timeseries/input_pipeline_test.py
index ed78a835a4d451e9e7d18bb833d8ebed6c05a195..703537abf0fe3985aaf0434cc633cb410dd6bd4c 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/input_pipeline_test.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/input_pipeline_test.py
@@ -27,7 +27,11 @@ from tensorflow.contrib.timeseries.python.timeseries import input_pipeline
 from tensorflow.contrib.timeseries.python.timeseries import test_utils
 from tensorflow.contrib.timeseries.python.timeseries.feature_keys import TrainEvalFeatures
 
+from tensorflow.core.example import example_pb2
+from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
+from tensorflow.python.lib.io import tf_record
+from tensorflow.python.ops import parsing_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 from tensorflow.python.training import coordinator as coordinator_lib
@@ -52,6 +56,21 @@ def _make_csv_time_series(num_features, num_samples, test_tmpdir):
   return filename
 
 
+def _make_tfexample_series(num_features, num_samples, test_tmpdir):
+  _, data_file = tempfile.mkstemp(dir=test_tmpdir)
+  with tf_record.TFRecordWriter(data_file) as writer:
+    for i in range(num_samples):
+      example = example_pb2.Example()
+      times = example.features.feature[TrainEvalFeatures.TIMES]
+      times.int64_list.value.append(i)
+      values = example.features.feature[TrainEvalFeatures.VALUES]
+      values.float_list.value.extend(
+          [float(i) * 2. + feature_number
+           for feature_number in range(num_features)])
+      writer.write(example.SerializeToString())
+  return data_file
+
+
 def _make_numpy_time_series(num_features, num_samples):
   times = numpy.arange(num_samples)
   values = times[:, None] * 2. + numpy.arange(num_features)[None, :]
@@ -107,6 +126,19 @@ class RandomWindowInputFnTests(test.TestCase):
     time_series_reader = input_pipeline.CSVReader([filename])
     self._test_out_of_order(time_series_reader, discard_out_of_order=False)
 
+  def test_tfexample_sort_out_of_order(self):
+    filename = _make_tfexample_series(
+        num_features=1, num_samples=50,
+        test_tmpdir=self.get_temp_dir())
+    time_series_reader = input_pipeline.TFExampleReader(
+        [filename],
+        features={
+            TrainEvalFeatures.TIMES: parsing_ops.FixedLenFeature(
+                shape=[], dtype=dtypes.int64),
+            TrainEvalFeatures.VALUES: parsing_ops.FixedLenFeature(
+                shape=[1], dtype=dtypes.float32)})
+    self._test_out_of_order(time_series_reader, discard_out_of_order=False)
+
   def test_numpy_sort_out_of_order(self):
     data = _make_numpy_time_series(num_features=1, num_samples=50)
     time_series_reader = input_pipeline.NumpyReader(data)
@@ -183,6 +215,20 @@ class RandomWindowInputFnTests(test.TestCase):
     self._test_multivariate(time_series_reader=time_series_reader,
                             num_features=2)
 
+  def test_tfexample_multivariate(self):
+    filename = _make_tfexample_series(
+        num_features=2, num_samples=50,
+        test_tmpdir=self.get_temp_dir())
+    time_series_reader = input_pipeline.TFExampleReader(
+        [filename],
+        features={
+            TrainEvalFeatures.TIMES: parsing_ops.FixedLenFeature(
+                shape=[], dtype=dtypes.int64),
+            TrainEvalFeatures.VALUES: parsing_ops.FixedLenFeature(
+                shape=[2], dtype=dtypes.float32)})
+    self._test_multivariate(time_series_reader=time_series_reader,
+                            num_features=2)
+
   def test_numpy_multivariate(self):
     data = _make_numpy_time_series(num_features=3, num_samples=50)
     time_series_reader = input_pipeline.NumpyReader(data)
@@ -248,6 +294,20 @@ class WholeDatasetInputFnTests(test.TestCase):
       self._whole_dataset_input_fn_test_template(
           time_series_reader=time_series_reader, num_features=1, num_samples=50)
 
+  def test_tfexample(self):
+    filename = _make_tfexample_series(
+        num_features=4, num_samples=100,
+        test_tmpdir=self.get_temp_dir())
+    time_series_reader = input_pipeline.TFExampleReader(
+        [filename],
+        features={
+            TrainEvalFeatures.TIMES: parsing_ops.FixedLenFeature(
+                shape=[], dtype=dtypes.int64),
+            TrainEvalFeatures.VALUES: parsing_ops.FixedLenFeature(
+                shape=[4], dtype=dtypes.float32)})
+    self._whole_dataset_input_fn_test_template(
+        time_series_reader=time_series_reader, num_features=4, num_samples=100)
+
   def test_numpy(self):
     data = _make_numpy_time_series(num_features=4, num_samples=100)
     time_series_reader = input_pipeline.NumpyReader(data)
diff --git a/tensorflow/contrib/timeseries/python/timeseries/model.py b/tensorflow/contrib/timeseries/python/timeseries/model.py
index bac7d1ebf59b28d4688a3d1a69ecdc1fc12248e0..7644764a7459db3951fe9a2790389713dd412a8f 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/model.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/model.py
@@ -21,18 +21,17 @@ from __future__ import print_function
 import abc
 import collections
 
-from tensorflow.contrib import layers
-from tensorflow.contrib.layers import feature_column
-
 from tensorflow.contrib.timeseries.python.timeseries import math_utils
 from tensorflow.contrib.timeseries.python.timeseries.feature_keys import PredictionFeatures
 from tensorflow.contrib.timeseries.python.timeseries.feature_keys import TrainEvalFeatures
 
+from tensorflow.python.feature_column import feature_column
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import parsing_ops
 from tensorflow.python.ops import tensor_array_ops
 from tensorflow.python.ops import variable_scope
 
@@ -66,11 +65,11 @@ class TimeSeriesModel(object):
 
     Args:
       num_features: Number of features for the time series
-      exogenous_feature_columns: A list of tf.contrib.layers.FeatureColumn
-          objects (for example tf.contrib.layers.embedding_column) corresponding
-          to exogenous features which provide extra information to the model but
-          are not part of the series to be predicted. Passed to
-          tf.contrib.layers.input_from_feature_columns.
+      exogenous_feature_columns: A list of `tf.feature_column`s (for example
+          `tf.feature_column.embedding_column`) corresponding to exogenous
+          features which provide extra information to the model but are not
+          part of the series to be predicted. Passed to
+          `tf.feature_column.input_layer`.
       dtype: The floating point datatype to use.
     """
     if exogenous_feature_columns:
@@ -86,7 +85,7 @@ class TimeSeriesModel(object):
 
   @property
   def exogenous_feature_columns(self):
-    """`FeatureColumn` objects for features which are not predicted."""
+    """`tf.feature_column`s for features which are not predicted."""
     return self._exogenous_feature_columns
 
   # TODO(allenl): Move more of the generic machinery for generating and
@@ -265,11 +264,14 @@ class TimeSeriesModel(object):
     if not self._exogenous_feature_columns:
       return (0,)
     with ops.Graph().as_default():
-      placeholder_features = (
-          feature_column.make_place_holder_tensors_for_base_features(
+      parsed_features = (
+          feature_column.make_parse_example_spec(
               self._exogenous_feature_columns))
-      embedded = layers.input_from_feature_columns(
-          columns_to_tensors=placeholder_features,
+      placeholder_features = parsing_ops.parse_example(
+          serialized=array_ops.placeholder(shape=[None], dtype=dtypes.string),
+          features=parsed_features)
+      embedded = feature_column.input_layer(
+          features=placeholder_features,
           feature_columns=self._exogenous_feature_columns)
       return embedded.get_shape().as_list()[1:]
 
@@ -308,13 +310,13 @@ class TimeSeriesModel(object):
         # Avoid shape warnings when embedding "scalar" exogenous features (those
         # with only batch and window dimensions); input_from_feature_columns
         # expects input ranks to match the embedded rank.
-        if tensor.get_shape().ndims == 1:
+        if tensor.get_shape().ndims == 1 and tensor.dtype != dtypes.string:
           exogenous_features_single_batch_dimension[name] = tensor[:, None]
         else:
           exogenous_features_single_batch_dimension[name] = tensor
       embedded_exogenous_features_single_batch_dimension = (
-          layers.input_from_feature_columns(
-              columns_to_tensors=exogenous_features_single_batch_dimension,
+          feature_column.input_layer(
+              features=exogenous_features_single_batch_dimension,
               feature_columns=self._exogenous_feature_columns,
               trainable=True))
       exogenous_regressors = array_ops.reshape(
@@ -381,8 +383,8 @@ class SequentialTimeSeriesModel(TimeSeriesModel):
           may use _scale_back_data or _scale_back_variance to return predictions
           to the input scale.
       dtype: The floating point datatype to use.
-      exogenous_feature_columns: A list of tf.contrib.layers.FeatureColumn
-          objects. See `TimeSeriesModel`.
+      exogenous_feature_columns: A list of `tf.feature_column` objects. See
+          `TimeSeriesModel`.
       exogenous_update_condition: A function taking two Tensor arguments `times`
           (shape [batch size]) and `features` (a dictionary mapping exogenous
           feature keys to Tensors with shapes [batch size, ...]) and returning a
diff --git a/tensorflow/contrib/timeseries/python/timeseries/state_space_models/state_space_model.py b/tensorflow/contrib/timeseries/python/timeseries/state_space_models/state_space_model.py
index 6257002647ed53bbde3ace11a6b45e4e2cdeb57d..951c6546d5fed77e0cfa98a4e774b804639d7dad 100644
--- a/tensorflow/contrib/timeseries/python/timeseries/state_space_models/state_space_model.py
+++ b/tensorflow/contrib/timeseries/python/timeseries/state_space_models/state_space_model.py
@@ -112,11 +112,11 @@ class StateSpaceModelConfiguration(
       exogenous_noise_decreases: If True, exogenous regressors can "set" model
           state, decreasing uncertainty. If both this parameter and
           exogenous_noise_increases are False, exogenous regressors are ignored.
-      exogenous_feature_columns: A list of tf.contrib.layers.FeatureColumn
-          objects (for example tf.contrib.layers.embedding_column) corresponding
-          to exogenous features which provide extra information to the model but
-          are not part of the series to be predicted. Passed to
-          tf.contrib.layers.input_from_feature_columns.
+      exogenous_feature_columns: A list of `tf.feature_column`s (for example
+          `tf.feature_column.embedding_column`) corresponding to exogenous
+          features which provide extra information to the model but are not part
+          of the series to be predicted. Passed to
+          `tf.feature_column.input_layer`.
       exogenous_update_condition: A function taking two Tensor arguments `times`
           (shape [batch size]) and `features` (a dictionary mapping exogenous
           feature keys to Tensors with shapes [batch size, ...]) and returning a
diff --git a/tensorflow/contrib/tpu/BUILD b/tensorflow/contrib/tpu/BUILD
index c48e84ddfaac8ac9c07e061847315eab3fd72152..095b4821f10b32ff742711caa155e60beb624852 100644
--- a/tensorflow/contrib/tpu/BUILD
+++ b/tensorflow/contrib/tpu/BUILD
@@ -163,6 +163,7 @@ py_library(
     ],
     srcs_version = "PY2AND3",
     deps = [
+        ":datasets",
         ":profiler",
         ":tpu_py",
         "//tensorflow/contrib/tpu/proto:topology_proto_py",
@@ -181,6 +182,33 @@ py_library(
     ],
 )
 
+py_library(
+    name = "datasets",
+    srcs = [
+        "python/tpu/datasets.py",
+    ],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow/contrib/data/python/ops:transformation_ops",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:function",
+        "//tensorflow/python:functional_ops",
+        "//tensorflow/python/data/ops:dataset_ops",
+        "//tensorflow/python/data/ops:iterator_ops",
+        "//tensorflow/python/data/ops:readers",
+    ],
+)
+
+tf_py_test(
+    name = "datasets_test",
+    srcs = ["python/tpu/datasets_test.py"],
+    additional_deps = [
+        "//tensorflow/python:client_testlib",
+        ":datasets",
+    ],
+    grpc_enabled = True,
+)
+
 tf_py_test(
     name = "tpu_test",
     size = "small",
diff --git a/tensorflow/contrib/tpu/profiler/tf_op_stats.proto b/tensorflow/contrib/tpu/profiler/tf_op_stats.proto
index 2094294baad63ae73712c8648b588accd4551ef8..e5c798aa2f463a1a2d7cb041ba9b51569958f4fd 100644
--- a/tensorflow/contrib/tpu/profiler/tf_op_stats.proto
+++ b/tensorflow/contrib/tpu/profiler/tf_op_stats.proto
@@ -77,6 +77,8 @@ message StepInfoResult {
   // The infeed duration in picoseconds.
   // Can turn into a map if we want a variable number of ops.
   optional uint64 infeed_duration_ps = 3;
+  // The start time of this step in picoseconds.
+  optional uint64 begin_ps = 4;
 }
 
 // Result proto for a sequence of steps.
@@ -155,6 +157,54 @@ message RunEnvironmentResult {
   repeated HostDependentJobInfoResult host_dependent_job_info = 6;
 }
 
+// The types of host operations that are tracked.
+enum HostOp {
+  // Invalid host op.
+  kINVALIDHostOp = 0;
+  // Each host op type has two parts:
+  // (1) the stage where the op happens and (2) the op name.
+  // stage = Input Data Producer, op = Get Next Batch.
+  kInputDataProducerGetNextBatch = 1;
+  // stage = Input Data Producer, op = Session Run.
+  kInputDataProducerSessionRun = 2;
+  // stage = Input Data Producer, op = Forward Batch.
+  kInputDataProducerForwardBatch = 3;
+  // stage = Infeed Thread, op = Get Next Batch.
+  kInfeedThreadGetNextBatch = 4;
+  // stage = Infeed Thread, op = Session Run.
+  kInfeedThreadSessionRun = 5;
+  // stage = Infeed Thread, op = Forward Batch.
+  kInfeedThreadForwardBatch = 6;
+  // stage = Outfeed Thread, op = Get Next Batch.
+  kOutfeedThreadGetNextBatch = 7;
+  // stage = Outfeed Thread, op = Session Run.
+  kOutfeedThreadSessionRun = 8;
+  // stage = Outfeed Thread, op = Forward Batch.
+  kOutfeedThreadForwardBatch = 9;
+}
+
+// Result proto for the host ops per TPU step.
+message HostOpsPerTpuStep {
+  // Whether the data in this message is valid.
+  optional bool valid = 1 [default = false];
+  // The current TPU step number.
+  optional uint32 tpu_step_num = 2;
+  // The beginning time of the current TPU step on the device in picoseconds.
+  optional uint64 tpu_step_begin_ps = 3;
+  // The ending time of the current TPU step on the device in picoseconds.
+  optional uint64 tpu_step_end_ps = 4;
+  // For each possible host operation, maps to the difference between the TPU
+  // step number that the host op targets and the current TPU step number.
+  // The key is HostOp, value is the step difference.
+  map<int32, int32> step_diffs = 5;
+}
+
+// Result proto for the host ops for all TPU steps.
+message HostOpsResult {
+  // A sequence of HostOpsPerTpuStep (one for each TPU step)
+  repeated HostOpsPerTpuStep host_op_sequence = 1;
+}
+
 // Result proto for TfStatsHelper.
 message TfOpStats {
   // The result for the TF-metric database.
@@ -171,4 +221,6 @@ message TfOpStats {
   optional double matrix_unit_utilization_percent = 6;
   // The run environment of this profiling session.
   optional RunEnvironmentResult run_environment = 7;
+  // The result for the host operations.
+  optional HostOpsResult host_ops = 8;
 }
diff --git a/tensorflow/contrib/tpu/python/ops/tpu_ops.py b/tensorflow/contrib/tpu/python/ops/tpu_ops.py
index 97876216793e0e6b20b7c072cac4f575b8fd48be..14c63a79763300dcfe8d6c8e09b90f8e9c772358 100644
--- a/tensorflow/contrib/tpu/python/ops/tpu_ops.py
+++ b/tensorflow/contrib/tpu/python/ops/tpu_ops.py
@@ -47,7 +47,7 @@ if platform.system() != "Windows":
   # types are supported.
 
   _SUPPORTED_INFEED_DTYPES = set([
-      dtypes.bool, dtypes.int32, dtypes.bfloat16, dtypes.float32,
+      dtypes.bool, dtypes.int32, dtypes.int64, dtypes.bfloat16, dtypes.float32,
       dtypes.complex64
   ])
 
diff --git a/tensorflow/contrib/tpu/python/tpu/datasets.py b/tensorflow/contrib/tpu/python/tpu/datasets.py
new file mode 100644
index 0000000000000000000000000000000000000000..71a3a9254000022b5c6d198b4ba3fd0ccc3bccb9
--- /dev/null
+++ b/tensorflow/contrib/tpu/python/tpu/datasets.py
@@ -0,0 +1,192 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ======================================
+"""Library of Cloud TPU helper functions for data loading."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.contrib.data.python.ops import batching
+from tensorflow.contrib.data.python.ops import interleave_ops
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.data.ops import iterator_ops
+from tensorflow.python.data.ops import readers
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import function
+from tensorflow.python.framework import ops
+from tensorflow.python.ops import functional_ops
+
+
+def _TextLineDataset(filename):
+  buffer_size = 8 * 1024 * 1024  # 8 MiB per file
+  dataset = readers.TextLineDataset(filename, buffer_size=buffer_size)
+  return dataset
+
+
+def _TFRecordDataset(filename):
+  buffer_size = 8 * 1024 * 1024  # 8 MiB per file
+  dataset = readers.TFRecordDataset(filename, buffer_size=buffer_size)
+  return dataset
+
+
+_FILETYPE_MAP = {
+    'tfrecord': _TFRecordDataset,
+    'textline': _TextLineDataset,
+    'text': _TextLineDataset,
+}
+
+
+def StreamingFilesDataset(files,
+                          filetype=None,
+                          file_reader_job=None,
+                          worker_job=None,
+                          num_epochs=None,
+                          filename_shuffle_buffer_size=None,
+                          num_parallel_reads=None,
+                          batch_transfer_size=None,
+                          sloppy=None):
+  """StreamingFilesDataset constructs a dataset to stream from workers (GCE VM).
+
+  Because Cloud TPUs are allocated over the network, a Cloud TPU cannot read
+  files local to your GCE VM. In order to train using files stored on your local
+  VM (e.g. on local SSD for extreme performance), use the StreamingFilesDataset
+  helper to generate a dataset to feed your Cloud TPU with files from your GCE
+  VM.
+
+  The resulting dataset may raise an OutOfRangeError if the fileglob expansion
+  finds no files.
+
+  Note: StreamingFilesDataset assumes that the session is using a
+  TPUClusterResolver and therefore has a worker and a coordinator job. File
+  loading will be done on the coordinator job.
+
+  Args:
+    files: A string glob to match files, or a `tf.data.Dataset` generating file
+      names.
+    filetype: A string ('tfrecord', 'textline', or 'text') or a single-argument
+      TensorFlow function that, when given a filename, returns a dataset.
+    file_reader_job: An optional string that corresponds to the job that should
+      perform the file reads.
+    worker_job: An optional string that corresponds to the job that should
+      process the tensors (i.e. your GPU or TPU worker).
+    num_epochs: The number of epochs through the training set that should be
+      generated. By default, it will repeat infinitely.
+    filename_shuffle_buffer_size: An optional integer whose value controls the
+      shuffling of the file names. If you would like to read from the files in
+      the same order, set to 0 or False.
+    num_parallel_reads: An optional integer controlling the number of files to
+      read from concurrently. (Set to 1 for no parallelism.)
+    batch_transfer_size: An optional integer controlling the batching used to
+      amortize the remote function invocation overhead. Set to a very large
+      number to increase throughput. Set to a very small number to reduce memory
+      consumption. Set to False to skip batching.
+    sloppy: (Optional.) If `True`, read input data as fast as possible, without
+      maintaining a deterministic order. Defaults to `False`.
+  Returns:
+    A `tf.data.Dataset` with an infinite stream of elements generated by a
+    parallel interleaving of the set of files matched (or generated) by `files`.
+    Its elements have the type produced by the dataset given by `filetype`.
+
+  Raises:
+    ValueError: if any argument is not of the expected type.
+  """
+  if filetype is None:
+    filetype = 'tfrecord'
+
+  if isinstance(filetype, str):
+    if filetype not in _FILETYPE_MAP:
+      raise ValueError('Unexpected filetype: %s' % filetype)
+    reader_fn = _FILETYPE_MAP[filetype]
+  elif callable(filetype):
+    reader_fn = filetype
+  else:
+    raise ValueError('filetype should be a string or a callable')
+
+  file_reader_job = file_reader_job or 'coordinator'
+
+  worker_job = worker_job or 'tpu_worker'
+
+  if filename_shuffle_buffer_size is None:
+    filename_shuffle_buffer_size = 4096
+
+  num_parallel_reads = num_parallel_reads or 8
+
+  if batch_transfer_size is None:
+    batch_transfer_size = 1024
+
+  if sloppy is None:
+    sloppy = False
+
+  with ops.device('/job:%s' % file_reader_job):
+    if isinstance(files, str):
+      source_dataset = dataset_ops.Dataset.list_files(files)
+    elif isinstance(files, dataset_ops.Dataset):
+      source_dataset = files
+    else:
+      raise ValueError('files was not a string or a dataset: %s' % files)
+
+    if filename_shuffle_buffer_size:
+      source_dataset = source_dataset.shuffle(
+          buffer_size=filename_shuffle_buffer_size)
+
+    # NOTE: We perform the `repeat` on the source dataset, because the output
+    # dataset does not currently have enough information to recreate an iterator
+    # over the source dataset when it reaches the end.
+    source_dataset = source_dataset.repeat(num_epochs)
+
+    source_dataset = source_dataset.apply(
+        interleave_ops.parallel_interleave(
+            reader_fn, cycle_length=num_parallel_reads, sloppy=sloppy))
+
+    if batch_transfer_size:
+      # Note: we can safely call batch_and_drop_remainder because we have an
+      # infinite stream of TFRecords.
+      source_dataset = source_dataset.apply(
+          batching.batch_and_drop_remainder(batch_transfer_size))
+
+    source_dataset = source_dataset.prefetch(1)
+
+    source_iterator = source_dataset.make_one_shot_iterator()
+    source_handle = source_iterator.string_handle()
+
+  @function.Defun(dtypes.string)
+  def LoadingFunc(h):
+    remote_iterator = iterator_ops.Iterator.from_string_handle(
+        h, source_dataset.output_types, source_dataset.output_shapes)
+    return remote_iterator.get_next()
+
+  def MapFn(unused_input):
+    return functional_ops.remote_call(
+        args=[source_handle],
+        Tout=[dtypes.string],
+        f=LoadingFunc,
+        target='/job:%s/replica:0/task:0/cpu:0' % file_reader_job)
+
+  with ops.device('/job:%s' % worker_job):
+    # TODO(saeta,mrry): Switch to using _GeneratorDataset.
+
+    # identity = lambda x: x
+    # dummy = constant_op.constant(0)
+    # output_dataset = dataset_ops._GeneratorDataset(dummy, identity, MapFn,
+    #                                                identity)
+
+    output_dataset = dataset_ops.Dataset.range(2).repeat().map(MapFn)
+    output_dataset = output_dataset.prefetch(1)
+
+    if batch_transfer_size:
+      # Undo the batching used during the transfer.
+      output_dataset = output_dataset.apply(batching.unbatch()).prefetch(1)
+
+  return output_dataset
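The batch-then-unbatch step used for the transfer above can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not the `tf.data` implementation; the generator names merely mirror `batch_and_drop_remainder` and `unbatch`, and plain iterables are assumed to stand in for datasets.

```python
import itertools


def batch_and_drop_remainder(elements, batch_size):
  """Yields full batches only; a trailing partial batch is dropped."""
  iterator = iter(elements)
  while True:
    batch = list(itertools.islice(iterator, batch_size))
    if len(batch) < batch_size:
      return  # Partial (or empty) batch: drop it and stop.
    yield batch


def unbatch(batches):
  """Undoes the batching by yielding the elements one at a time."""
  for batch in batches:
    for element in batch:
      yield element


# On an infinite stream no element is ever lost, which is why dropping the
# remainder is harmless in that case.
stream = unbatch(batch_and_drop_remainder(itertools.count(), 4))
first_ten = list(itertools.islice(stream, 10))
```

On a finite input the remainder really is dropped: ten elements with a batch size of four yield only the first eight.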
diff --git a/tensorflow/contrib/tpu/python/tpu/datasets_test.py b/tensorflow/contrib/tpu/python/tpu/datasets_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..0173aac4f7119fb2f945ded718ea4c80a4e6c1d3
--- /dev/null
+++ b/tensorflow/contrib/tpu/python/tpu/datasets_test.py
@@ -0,0 +1,181 @@
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""TPU datasets tests."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+from tensorflow.contrib.tpu.python.tpu import datasets
+from tensorflow.core.protobuf import cluster_pb2
+from tensorflow.core.protobuf import config_pb2
+from tensorflow.python.client import session
+from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.data.ops import readers
+from tensorflow.python.lib.io import python_io
+from tensorflow.python.platform import test
+from tensorflow.python.training import server_lib
+from tensorflow.python.util import compat
+
+_NUM_FILES = 10
+_NUM_ENTRIES = 200
+
+
+class DatasetsTest(test.TestCase):
+
+  def setUp(self):
+    super(DatasetsTest, self).setUp()
+    self._coord = server_lib.Server.create_local_server()
+    self._worker = server_lib.Server.create_local_server()
+
+    self._cluster_def = cluster_pb2.ClusterDef()
+    worker_job = self._cluster_def.job.add()
+    worker_job.name = 'tpu_worker'
+    worker_job.tasks[0] = self._worker.target[len('grpc://'):]
+    coord_job = self._cluster_def.job.add()
+    coord_job.name = 'coordinator'
+    coord_job.tasks[0] = self._coord.target[len('grpc://'):]
+
+    session_config = config_pb2.ConfigProto(cluster_def=self._cluster_def)
+
+    self._sess = session.Session(self._worker.target, config=session_config)
+
+  def testTextLineDataset(self):
+    all_contents = []
+    for i in range(_NUM_FILES):
+      filename = os.path.join(self.get_temp_dir(), 'text_line.%d.txt' % i)
+      contents = []
+      for j in range(_NUM_ENTRIES):
+        contents.append(compat.as_bytes('%d: %d' % (i, j)))
+      with open(filename, 'wb') as f:
+        f.write(b'\n'.join(contents))
+      all_contents.extend(contents)
+
+    dataset = datasets.StreamingFilesDataset(
+        os.path.join(self.get_temp_dir(), 'text_line.*.txt'), filetype='text')
+
+    iterator = dataset.make_initializable_iterator()
+    self._sess.run(iterator.initializer)
+    get_next = iterator.get_next()
+
+    retrieved_values = []
+    for _ in range(2 * len(all_contents)):
+      retrieved_values.append(compat.as_bytes(self._sess.run(get_next)))
+
+    self.assertEqual(set(all_contents), set(retrieved_values))
+
+  def testTFRecordDataset(self):
+    all_contents = []
+    for i in range(_NUM_FILES):
+      filename = os.path.join(self.get_temp_dir(), 'tf_record.%d' % i)
+      writer = python_io.TFRecordWriter(filename)
+      for j in range(_NUM_ENTRIES):
+        record = compat.as_bytes('Record %d of file %d' % (j, i))
+        writer.write(record)
+        all_contents.append(record)
+      writer.close()
+
+    dataset = datasets.StreamingFilesDataset(
+        os.path.join(self.get_temp_dir(), 'tf_record*'), filetype='tfrecord')
+
+    iterator = dataset.make_initializable_iterator()
+    self._sess.run(iterator.initializer)
+    get_next = iterator.get_next()
+
+    retrieved_values = []
+    for _ in range(2 * len(all_contents)):
+      retrieved_values.append(compat.as_bytes(self._sess.run(get_next)))
+
+    self.assertEqual(set(all_contents), set(retrieved_values))
+
+  def testTFRecordDatasetFromDataset(self):
+    filenames = []
+    all_contents = []
+    for i in range(_NUM_FILES):
+      filename = os.path.join(self.get_temp_dir(), 'tf_record.%d' % i)
+      filenames.append(filename)
+      writer = python_io.TFRecordWriter(filename)
+      for j in range(_NUM_ENTRIES):
+        record = compat.as_bytes('Record %d of file %d' % (j, i))
+        writer.write(record)
+        all_contents.append(record)
+      writer.close()
+
+    filenames = dataset_ops.Dataset.from_tensor_slices(filenames)
+
+    dataset = datasets.StreamingFilesDataset(filenames, filetype='tfrecord')
+
+    iterator = dataset.make_initializable_iterator()
+    self._sess.run(iterator.initializer)
+    get_next = iterator.get_next()
+
+    retrieved_values = []
+    for _ in range(2 * len(all_contents)):
+      retrieved_values.append(compat.as_bytes(self._sess.run(get_next)))
+
+    self.assertEqual(set(all_contents), set(retrieved_values))
+
+  def testArbitraryReaderFunc(self):
+
+    def MakeRecord(i, j):
+      return compat.as_bytes('%04d-%04d' % (i, j))
+
+    record_bytes = len(MakeRecord(10, 200))
+
+    all_contents = []
+    for i in range(_NUM_FILES):
+      filename = os.path.join(self.get_temp_dir(), 'fixed_length.%d' % i)
+      with open(filename, 'wb') as f:
+        for j in range(_NUM_ENTRIES):
+          record = MakeRecord(i, j)
+          f.write(record)
+          all_contents.append(record)
+
+    def FixedLengthFile(filename):
+      return readers.FixedLengthRecordDataset(filename, record_bytes)
+
+    dataset = datasets.StreamingFilesDataset(
+        os.path.join(self.get_temp_dir(), 'fixed_length*'),
+        filetype=FixedLengthFile)
+
+    iterator = dataset.make_initializable_iterator()
+    self._sess.run(iterator.initializer)
+    get_next = iterator.get_next()
+
+    retrieved_values = []
+    for _ in range(2 * len(all_contents)):
+      retrieved_values.append(compat.as_bytes(self._sess.run(get_next)))
+
+    self.assertEqual(set(all_contents), set(retrieved_values))
+
+  def testUnexpectedFiletypeString(self):
+    with self.assertRaises(ValueError):
+      datasets.StreamingFilesDataset(
+          os.path.join(self.get_temp_dir(), '*'), filetype='foo')
+
+  def testUnexpectedFiletypeType(self):
+    with self.assertRaises(ValueError):
+      datasets.StreamingFilesDataset(
+          os.path.join(self.get_temp_dir(), '*'), filetype=3)
+
+  def testUnexpectedFilesType(self):
+    with self.assertRaises(ValueError):
+      datasets.StreamingFilesDataset(123, filetype='tfrecord')
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_config.py b/tensorflow/contrib/tpu/python/tpu/tpu_config.py
index 644070218214643923b9ca3ee138615ec568e8b5..7ceb4069cf011d88b6fb4586d7e80acbacf9aebe 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_config.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_config.py
@@ -26,6 +26,7 @@ import os
 import numpy as np
 
 from tensorflow.contrib.tpu.python.tpu import util as util_lib
+from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.estimator import run_config as run_config_lib
 from tensorflow.python.platform import tf_logging as logging
 
@@ -140,6 +141,7 @@ class RunConfig(run_config_lib.RunConfig):
                tpu_config=None,
                evaluation_master=None,
                master=None,
+               cluster=None,
                **kwargs):
     """Constructs a RunConfig.
 
@@ -148,15 +150,26 @@ class RunConfig(run_config_lib.RunConfig):
       evaluation_master: a string. The address of the master to use for eval.
         Defaults to master if not set.
       master: a string. The address of the master to use for training.
+      cluster: a ClusterResolver. Mutually exclusive with master.
       **kwargs: keyword config parameters.
+
+    Raises:
+      ValueError: if both master and cluster are set, or if cluster is set and
+        the provided session_config already has a cluster_def.
     """
     super(RunConfig, self).__init__(**kwargs)
     self._tpu_config = tpu_config or TPUConfig()
+    self._cluster = cluster
 
     # If user sets master and/or evaluation_master explicilty, including empty
     # string '', take it. Otherwise, take the values set by parent class.
     if master is not None:
+      if cluster is not None:
+        raise ValueError('Both master and cluster are set.')
       self._master = master
+    else:
+      if cluster:
+        self._master = cluster.master()
 
     if evaluation_master is not None:
       self._evaluation_master = evaluation_master
@@ -170,6 +183,20 @@ class RunConfig(run_config_lib.RunConfig):
       # evaluation_master to master, unless user overwrites it.
       self._evaluation_master = self._master
 
+    # Set the ClusterSpec to use
+    if cluster:
+      self._cluster_spec = cluster.cluster_spec()
+
+      # Merge the cluster_def into the ConfigProto.
+      if self._session_config is None:  # pylint: disable=access-member-before-definition
+        self._session_config = config_pb2.ConfigProto(allow_soft_placement=True)
+      if self._session_config.HasField('cluster_def'):
+        raise ValueError(
+            'You cannot provide a ClusterResolver and '
+            'session_config.cluster_def.')
+      self._session_config.cluster_def.CopyFrom(
+          self._cluster_spec.as_cluster_def())
+
   @property
   def evaluation_master(self):
     return self._evaluation_master
@@ -182,6 +209,10 @@ class RunConfig(run_config_lib.RunConfig):
   def tpu_config(self):
     return self._tpu_config
 
+  @property
+  def cluster(self):
+    return self._cluster
+
   def replace(self, **kwargs):
     if 'tpu_config' not in kwargs:
       return super(RunConfig, self).replace(**kwargs)
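The precedence between `master` and `cluster` in the constructor above can be sketched in isolation. This is a simplified illustration under stated assumptions: `StaticResolver` is a hypothetical stand-in that models only the `master()` method, and `resolve_master` is not part of `RunConfig`.

```python
class StaticResolver(object):
  """Hypothetical stand-in for a ClusterResolver; only master() is modeled."""

  def __init__(self, master_address):
    self._master_address = master_address

  def master(self):
    return self._master_address


def resolve_master(master=None, cluster=None, parent_default=''):
  """Returns the training master using the precedence shown in the diff."""
  if master is not None:
    # An explicit master, even the empty string, is taken as-is, but it
    # cannot be combined with a cluster resolver.
    if cluster is not None:
      raise ValueError('Both master and cluster are set.')
    return master
  if cluster:
    return cluster.master()
  # Neither was given: keep the value chosen by the parent class.
  return parent_default
```

Note that an empty-string `master` still counts as set, so passing it together with a resolver raises.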
diff --git a/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py b/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
index ff53fe4f5d0e219f56d77d3476640bb023c7535a..1b2eda1caa0fa2779834d65b5a49121d9cc0af56 100644
--- a/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
+++ b/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py
@@ -1763,6 +1763,9 @@ class TPUEstimator(estimator_lib.Estimator):
     if 'config' in input_fn_args:
       kwargs['config'] = config
 
+    if 'mode' in input_fn_args:
+      kwargs['mode'] = mode
+
     with self._ctx.with_mode(mode) as ctx:
       # Setting the batch size in params first. This helps user to have same
       # input_fn for use_tpu=True/False.
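The pattern in this hunk, forwarding `mode` only when the user's `input_fn` declares it, can be sketched with the standard library. This is an illustration for Python 3 using `inspect.signature`; `call_input_fn` is a hypothetical helper and does not reproduce how `TPUEstimator` computes `input_fn_args`.

```python
import inspect


def call_input_fn(input_fn, config=None, mode=None):
  """Calls input_fn, passing only the optional arguments it declares."""
  declared = inspect.signature(input_fn).parameters
  kwargs = {}
  if 'config' in declared:
    kwargs['config'] = config
  if 'mode' in declared:
    kwargs['mode'] = mode
  return input_fn(**kwargs)


def input_fn_with_mode(mode):
  return 'input for %s' % mode


def input_fn_plain():
  return 'input without arguments'
```

Both functions can be passed to `call_input_fn` with the same arguments; the one without a `mode` parameter simply never sees it.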
diff --git a/tensorflow/contrib/training/python/training/hparam.py b/tensorflow/contrib/training/python/training/hparam.py
index fdfd27d6a414933b0bec824bae512c45dac24d3c..95e051e3b5bb9f8075e66891a45c64a27bca68d1 100644
--- a/tensorflow/contrib/training/python/training/hparam.py
+++ b/tensorflow/contrib/training/python/training/hparam.py
@@ -358,6 +358,8 @@ class HParams(object):
   ```
   """
 
+  _HAS_DYNAMIC_ATTRIBUTES = True  # Required for pytype checks.
+
   def __init__(self, hparam_def=None, model_structure=None, **kwargs):
     """Create an instance of `HParams` from keyword arguments.
 
diff --git a/tensorflow/contrib/verbs/README.md b/tensorflow/contrib/verbs/README.md
index 58fed4e5cb4c24b0f21dfe9b99cf4c665d2591c7..4b6104a8b4d542b1d8a9cb3e48eeed4950d791cd 100644
--- a/tensorflow/contrib/verbs/README.md
+++ b/tensorflow/contrib/verbs/README.md
@@ -93,7 +93,7 @@ When the receiver receives the RDMA write, it will locate the relevant **RdmaTen
 
 1. When the sender receives a tensor request, the source tensor may or may not be ready yet. The situation is handled through a process of tag matching:
 	* If the request arrives before the tensor is ready, then a callback is put in a local table, and will be invoked once the tensor arrives.
-	* If the tensor is ready before the request arives, than the tensor is put in a local table. When the request arrives, it will invoke the callback immediately.
+	* If the tensor is ready before the request arrives, then the tensor is put in a local table. When the request arrives, it will invoke the callback immediately.
    In code it is done by calling **RecvLocalAsync()**, which receives the tensor's key, step-id, and the callback.
 2. When the callback is invoked, the relevant tensor is removed from the tag matching table. In the case where we need to send the tensor's meta-data, the **RdmaTensorResponse** will store a copy of the tensor until the re-request arrives.
 3. The sending of protocol messages (**RDMA_MESSAGE_TENSOR_REQUEST**, **RDMA_MESSAGE_META_DATA_RESPONSE** and **RDMA_MESSAGE_TENSOR_RE_REQUEST**) is done by the class **RdmaMessageBuffer**. All messages are sent using RDMA writes from/to fixed messages buffers. This implies that we cannot send on a specific channel more than one message at a time. In order to synchronize the messages, the **RdmaMessageBuffer** holds the a local and remote buffer statuses which can be either busy or idle. When a write is issued, both statuses will be changed to busy. When the write-complete event is received, the local status is changed to idle. When the write is received on the remote side, the remote side will parse the message, and return an ACK back to the sending side on which the sending side will update the remote status to idle. When both the local and remote statuses are idle, the next message can be sent.
diff --git a/tensorflow/contrib/verbs/patch_notes_verbs_with_0_copies.md b/tensorflow/contrib/verbs/patch_notes_verbs_with_0_copies.md
index 956b8f2147cf8154b6f1ade006d7bff194864c9b..da6fdd48e19e9d1503d1537926b1c464a0e77589 100644
--- a/tensorflow/contrib/verbs/patch_notes_verbs_with_0_copies.md
+++ b/tensorflow/contrib/verbs/patch_notes_verbs_with_0_copies.md
@@ -64,7 +64,7 @@ The protocol messages themselves will remain mostly unchanged at the first stage
 	* type - The message type.
 	* request_index - Request index.
 	* is_dead/data_type/tensor_shape/tensor_bytes - The up-to-date meta-data.
-* **RDMA_MESSAGE_BUFFER_RESPONSE** - (receiver ==> sender) Tensor re-requset after meta-data update and reallocation of result/proxy tensors.
+* **RDMA_MESSAGE_BUFFER_RESPONSE** - (receiver ==> sender) Tensor re-request after meta-data update and reallocation of result/proxy tensors.
 	* type - The message type.
 	* name (name_size) - Name of the requested tensor.
 	* step_id - Step ID.
diff --git a/tensorflow/core/BUILD b/tensorflow/core/BUILD
index 30ac270109dbbb77cd6a400d1feaa1ac116456c1..3a436ff6804622d5cb841fb74b1de1a3f1eb234f 100644
--- a/tensorflow/core/BUILD
+++ b/tensorflow/core/BUILD
@@ -339,6 +339,7 @@ cc_library(
         "lib/strings/strcat.h",
         "lib/strings/stringprintf.h",
         "platform/abi.h",
+        "platform/context.h",
         "platform/cpu_feature_guard.h",
         "platform/cpu_info.h",
         "platform/dynamic_annotations.h",
@@ -480,6 +481,7 @@ tf_cuda_library(
         "framework/type_index.h",
         "framework/type_traits.h",
         "framework/types.h",
+        "framework/visitable_allocator.h",
         "public/version.h",
         "util/activation_mode.h",
         "util/bcast.h",
@@ -988,22 +990,15 @@ filegroup(
 
 # Core sources for Android builds.
 filegroup(
-    name = "mobile_srcs",
+    name = "mobile_srcs_no_runtime",
     srcs = [
         ":proto_text_srcs_all",
-        "//tensorflow/core/kernels:android_srcs",
         "//tensorflow/core/platform/default/build_config:android_srcs",
-        "//tensorflow/core/util/ctc:android_srcs",
-        "//tensorflow/core/util/tensor_bundle:android_srcs",
     ] + glob(
         [
             "client/**/*.cc",
-            "common_runtime/**/*.h",
-            "common_runtime/**/*.cc",
             "framework/**/*.h",
             "framework/**/*.cc",
-            "graph/**/*.h",
-            "graph/**/*.cc",
             "lib/**/*.h",
             "lib/**/*.cc",
             "platform/**/*.h",
@@ -1019,7 +1014,6 @@ filegroup(
             "**/*main.cc",
             "debug/**/*",
             "framework/op_gen_*",
-            "graph/dot.*",
             "lib/jpeg/**/*",
             "lib/png/**/*",
             "lib/gif/**/*",
@@ -1036,6 +1030,10 @@ filegroup(
             "platform/stream_executor.*",
             "platform/windows/**/*",
             "user_ops/**/*.cu.cc",
+            "util/ctc/*.h",
+            "util/ctc/*.cc",
+            "util/tensor_bundle/*.h",
+            "util/tensor_bundle/*.cc",
             "common_runtime/gpu/**/*",
             "common_runtime/gpu_device_factory.*",
         ],
@@ -1043,6 +1041,41 @@ filegroup(
     visibility = ["//visibility:public"],
 )
 
+filegroup(
+    name = "mobile_srcs_only_runtime",
+    srcs = [
+        "//tensorflow/core/kernels:android_srcs",
+        "//tensorflow/core/util/ctc:android_srcs",
+        "//tensorflow/core/util/tensor_bundle:android_srcs",
+    ] + glob(
+        [
+            "common_runtime/**/*.h",
+            "common_runtime/**/*.cc",
+            "graph/**/*.h",
+            "graph/**/*.cc",
+        ],
+        exclude = [
+            "**/*test.*",
+            "**/*testutil*",
+            "**/*testlib*",
+            "**/*main.cc",
+            "common_runtime/gpu/**/*",
+            "common_runtime/gpu_device_factory.*",
+            "graph/dot.*",
+        ],
+    ),
+    visibility = ["//visibility:public"],
+)
+
+filegroup(
+    name = "mobile_srcs",
+    srcs = [
+        ":mobile_srcs_no_runtime",
+        ":mobile_srcs_only_runtime",
+    ],
+    visibility = ["//visibility:public"],
+)
+
 # Native library support for Android applications.  Does not contain
 # operators, use :android_tensorflow_lib if you want full operator
 # support.
@@ -1781,6 +1814,7 @@ FRAMEWORK_INTERNAL_PUBLIC_HEADERS = [
     "framework/tracking_allocator.h",  # only needed for tests
     "framework/unique_tensor_references.h",
     "framework/variant.h",
+    "framework/visitable_allocator.h",
     "platform/variant_coding.h",
     "util/command_line_flags.h",
     "util/env_var.h",
@@ -1886,7 +1920,7 @@ tf_cuda_library(
     ) + if_mkl(
         [
             "//third_party/mkl:intel_binary_blob",
-            "@mkl_dnn//:mkl_dnn",
+            "@mkl_dnn",
         ],
     ),
     alwayslink = 1,
@@ -2076,7 +2110,6 @@ CORE_CPU_LIB_HEADERS = CORE_CPU_BASE_HDRS + [
     "common_runtime/stats_publisher_interface.h",
     "common_runtime/step_stats_collector.h",
     "common_runtime/threadpool_device.h",
-    "common_runtime/visitable_allocator.h",
     "graph/gradients.h",
     "graph/quantize_training.h",
 ] + if_mkl(["graph/mkl_graph_util.h"])
@@ -2102,6 +2135,7 @@ tf_cuda_library(
         "common_runtime/graph_runner.cc",
         "common_runtime/local_device.cc",
         "common_runtime/memory_types.cc",
+        "common_runtime/mkl_cpu_allocator.cc",
         "common_runtime/optimization_registry.cc",
         "common_runtime/parallel_concat_optimizer.cc",
         "common_runtime/placer.cc",
@@ -2141,6 +2175,7 @@ tf_cuda_library(
     ] + if_mkl(
         [
             "//third_party/mkl:intel_binary_blob",
+            "@mkl_dnn",
         ],
     ),
     alwayslink = 1,
@@ -2185,14 +2220,12 @@ tf_cuda_library(
     ] + if_mkl(
         [
             "//third_party/mkl:intel_binary_blob",
-            "@mkl_dnn//:mkl_dnn",
+            "@mkl_dnn",
         ],
     ) + tf_additional_core_deps() + if_static([":core_cpu_impl"]),
     alwayslink = 1,
 )
 
-# This library is deprecated and no longer publicly available.
-# Do not add more uses of it.
 cc_library(
     name = "regexp_internal",
     hdrs = [
@@ -3484,6 +3517,7 @@ tf_cc_tests(
         "ops/parsing_ops_test.cc",
         "ops/random_ops_test.cc",
         "ops/set_ops_test.cc",
+        "ops/shape_function_test.cc",
         "ops/sparse_ops_test.cc",
         "ops/spectral_ops_test.cc",
         "ops/state_ops_test.cc",
@@ -3642,6 +3676,18 @@ filegroup(
     visibility = ["//tensorflow:__subpackages__"],
 )
 
+alias(
+    name = "android_srcs_no_runtime",
+    actual = ":mobile_srcs_no_runtime",
+    visibility = ["//visibility:public"],
+)
+
+alias(
+    name = "android_srcs_only_runtime",
+    actual = ":mobile_srcs_only_runtime",
+    visibility = ["//visibility:public"],
+)
+
 alias(
     name = "android_srcs",
     actual = ":mobile_srcs",
diff --git a/tensorflow/core/api_def/base_api/api_def_ConsumeMutexLock.pbtxt b/tensorflow/core/api_def/base_api/api_def_ConsumeMutexLock.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b9db8274dea5d904dbbc687927673e0c7f7fa649
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_ConsumeMutexLock.pbtxt
@@ -0,0 +1,19 @@
+op {
+  graph_op_name: "ConsumeMutexLock"
+  in_arg {
+    name: "mutex_lock"
+    description: <<END
+A tensor returned by `MutexLock`.
+END
+  }
+  summary: "This op consumes a lock created by `MutexLock`."
+  description: <<END
+This op exists to consume a tensor created by `MutexLock` (other than
+direct control dependencies).  It should be the only op that consumes the tensor,
+and will raise an error if it is not.  Its only purpose is to keep the
+mutex lock tensor alive until it is consumed by this op.
+
+**NOTE**: This operation must run on the same device as its input.  This may
+be enforced via the `colocate_with` mechanism.
+END
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_CriticalSectionOp.pbtxt b/tensorflow/core/api_def/base_api/api_def_CriticalSectionOp.pbtxt
deleted file mode 100644
index 5027fa861e7d8914b1e8ae06cd1ffa2ed06b6ad2..0000000000000000000000000000000000000000
--- a/tensorflow/core/api_def/base_api/api_def_CriticalSectionOp.pbtxt
+++ /dev/null
@@ -1,16 +0,0 @@
-op {
-  graph_op_name: "CriticalSectionOp"
-  attr {
-    name: "container"
-    description: <<END
-the container this critical section is placed in.
-END
-  }
-  attr {
-    name: "shared_name"
-    description: <<END
-the name by which this critical section is referred to.
-END
-  }
-  summary: "Creates a handle to a CriticalSection resource."
-}
diff --git a/tensorflow/core/api_def/base_api/api_def_ExecuteInCriticalSection.pbtxt b/tensorflow/core/api_def/base_api/api_def_ExecuteInCriticalSection.pbtxt
deleted file mode 100644
index cd5fc84a74faa209262da0402c546bcc3b4256fe..0000000000000000000000000000000000000000
--- a/tensorflow/core/api_def/base_api/api_def_ExecuteInCriticalSection.pbtxt
+++ /dev/null
@@ -1,49 +0,0 @@
-op {
-  graph_op_name: "ExecuteInCriticalSection"
-  in_arg {
-    name: "critical_section"
-    description: <<END
-The handle of the `critical_section`.
-END
-  }
-  in_arg {
-    name: "arguments"
-    description: <<END
-Arguments for `f`, including any captured inputs appended at the end.
-END
-  }
-  out_arg {
-    name: "outputs"
-    description: <<END
-The outputs of `f`.
-END
-  }
-  attr {
-    name: "f"
-    description: <<END
-The `Function` to execute.
-END
-  }
-  summary: "Executes function `f` within critical section `critical_section`."
-  description: <<END
-While `f` is running in `critical_section`, no other functions which wish to
-use this critical section may run.
-
-Often the use case is that two executions of the same graph, in parallel,
-wish to run `f`; and we wish to ensure that only one of them executes
-at a time.  This is especially important if `f` modifies one or more
-variables at a time.
-
-It is also useful if two separate functions must share a resource, but we
-wish to ensure the usage is exclusive.
-
-The signature of `f` is expected to be:
-
-```
-  outputs <- F(arguments)
-```
-Typically, but this is not required, `arguments` contain resources.  The
-primary purpose of this op is to limit access to these resources to one
-execution of `F` at a time.
-END
-}
diff --git a/tensorflow/core/api_def/base_api/api_def_GeneratorDataset.pbtxt b/tensorflow/core/api_def/base_api/api_def_GeneratorDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4f1cf3e6867a06df1f39774bc389fbe35a994ab4
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_GeneratorDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "GeneratorDataset"
+  summary: "Creates a dataset that invokes a function to generate elements."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_MutexLock.pbtxt b/tensorflow/core/api_def/base_api/api_def_MutexLock.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..cd3eb434868a43602158fa263c9215eee7d25124
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_MutexLock.pbtxt
@@ -0,0 +1,58 @@
+op {
+  graph_op_name: "MutexLock"
+  in_arg {
+    name: "mutex"
+    description: <<END
+The mutex resource to lock.
+END
+  }
+  out_arg {
+    name: "mutex_lock"
+    description: <<END
+A tensor that keeps a shared pointer to a lock on the mutex;
+when the Tensor is destroyed, the use count on the shared pointer is decreased
+by 1.  When it reaches 0, the lock is released.
+END
+  }
+  summary: "Locks a mutex resource.  The output is the lock.  So long as the lock tensor"
+  description: <<END
+is alive, any other request to use `MutexLock` with this mutex will wait.
+
+This is particularly useful for creating a critical section when used in
+conjunction with `MutexLockIdentity`:
+
+```python
+
+mutex = mutex_v2(
+  shared_name=handle_name, container=container, name=name)
+
+def execute_in_critical_section(fn, *args, **kwargs):
+  lock = gen_resource_variable_ops.mutex_lock(mutex)
+
+  with ops.control_dependencies([lock]):
+    r = fn(*args, **kwargs)
+
+  with ops.control_dependencies(nest.flatten(r)):
+    with ops.colocate_with(mutex):
+      ensure_lock_exists = mutex_lock_identity(lock)
+
+    # Make sure that if any element of r is accessed, all of
+    # them are executed together.
+    r = nest.map_structure(tf.identity, r)
+
+  with ops.control_dependencies([ensure_lock_exists]):
+    return nest.map_structure(tf.identity, r)
+```
+
+While `fn` is running in the critical section, no other functions which wish to
+use this critical section may run.
+
+Often the use case is that two executions of the same graph, in parallel,
+wish to run `fn`; and we wish to ensure that only one of them executes
+at a time.  This is especially important if `fn` modifies one or more
+variables at a time.
+
+It is also useful if two separate functions must share a resource, but we
+wish to ensure the usage is exclusive.
+END
+}
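The mutual-exclusion guarantee described above has a familiar analogue outside the graph. The sketch below uses Python's `threading.Lock` rather than the `MutexV2` and `MutexLock` ops, so it only illustrates the intended semantics; `CriticalSection` and `run_workers` are hypothetical names.

```python
import threading


class CriticalSection(object):
  """Runs callables one at a time, analogous to holding the mutex lock."""

  def __init__(self):
    self._mutex = threading.Lock()

  def execute(self, fn, *args, **kwargs):
    # The lock is held until fn returns, much as the lock tensor is kept
    # alive until the results of fn have been produced.
    with self._mutex:
      return fn(*args, **kwargs)


def run_workers(num_threads=4, iterations=1000):
  """Increments a shared counter from several threads under the lock."""
  section = CriticalSection()
  state = {'value': 0}

  def increment():
    # A read-modify-write that is only safe under mutual exclusion.
    state['value'] = state['value'] + 1

  def worker():
    for _ in range(iterations):
      section.execute(increment)

  threads = [threading.Thread(target=worker) for _ in range(num_threads)]
  for thread in threads:
    thread.start()
  for thread in threads:
    thread.join()
  return state['value']
```

Because every increment runs under the lock, the final count equals `num_threads * iterations`.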
diff --git a/tensorflow/core/api_def/base_api/api_def_MutexV2.pbtxt b/tensorflow/core/api_def/base_api/api_def_MutexV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..22295ecbd847c7b918ecb337a75492c1164a1e5d
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_MutexV2.pbtxt
@@ -0,0 +1,24 @@
+op {
+  graph_op_name: "MutexV2"
+  out_arg {
+    name: "resource"
+    description: <<END
+The mutex resource.
+END
+  }
+  attr {
+    name: "container"
+    description: <<END
+If non-empty, this variable is placed in the given container.
+Otherwise, a default container is used.
+END
+  }
+  attr {
+    name: "shared_name"
+    description: <<END
+If non-empty, this variable is named in the given bucket
+with this shared_name. Otherwise, the node name is used instead.
+END
+  }
+  summary: "Creates a Mutex resource that can be locked by `MutexLock`."
+}
diff --git a/tensorflow/core/api_def/base_api/api_def_RegexReplace.pbtxt b/tensorflow/core/api_def/base_api/api_def_RegexReplace.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..70ad5219267fcc84368f072a6f5a122b6cc11a89
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_RegexReplace.pbtxt
@@ -0,0 +1,25 @@
+op {
+  graph_op_name: "RegexReplace"
+  in_arg {
+    name: "input"
+    description: "The text to be processed."
+  }
+  in_arg {
+    name: "pattern"
+    description: "The regular expression to match the input."
+  }
+  in_arg {
+    name: "rewrite"
+    description: "The rewrite to be applied to the matched expression."
+  }
+  out_arg {
+    name: "output"
+    description: "The text after applying pattern and rewrite."
+  }
+  attr {
+    name: "replace_global"
+    description: "If True, the replacement is global, otherwise the replacement\nis done only on the first match."
+  }
+  summary: "Replaces the match of pattern in input with rewrite."
+  description: "It follows the re2 syntax (https://github.com/google/re2/wiki/Syntax)"
+}
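The effect of `replace_global` can be approximated with Python's `re` module. This is a sketch of the semantics only: Python's regular expression syntax is not identical to re2, and `regex_replace` is a hypothetical helper, not the TensorFlow op.

```python
import re


def regex_replace(text, pattern, rewrite, replace_global=True):
  """Replaces matches of pattern in text with rewrite."""
  # re.sub treats count=0 as "replace every match".
  count = 0 if replace_global else 1
  return re.sub(pattern, rewrite, text, count=count)


all_replaced = regex_replace('a1b22c', r'\d+', '#')
first_replaced = regex_replace('a1b22c', r'\d+', '#', replace_global=False)
```

With `replace_global=False` only the first run of digits is rewritten and the rest of the input is left untouched.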
diff --git a/tensorflow/core/api_def/base_api/api_def_Timestamp.pbtxt b/tensorflow/core/api_def/base_api/api_def_Timestamp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..bf2d07bcf58a51ae5dc7f52006f4a911430bf421
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_Timestamp.pbtxt
@@ -0,0 +1,10 @@
+op {
+  graph_op_name: "Timestamp"
+  summary: "Provides the time since epoch in seconds."
+  description: <<END
+Returns the timestamp as a `float64` for seconds since the Unix epoch.
+
+Note: the timestamp is computed when the op is executed, not when it is added
+to the graph.
+END
+}
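The note above distinguishes adding the op to the graph from executing it. A plain-Python analogy, assuming a callable stands in for the op, is that building the callable does not read the clock; only calling it does.

```python
import time


def make_timestamp_op():
  """Returns a callable; the clock is read only when it is called."""
  return time.time


op = make_timestamp_op()  # "Graph construction": no time is recorded yet.
first = op()              # "Execution": seconds since the Unix epoch.
time.sleep(0.01)
second = op()             # A later execution yields a later value.
```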
diff --git a/tensorflow/core/api_def/base_api/api_def_UniqueDataset.pbtxt b/tensorflow/core/api_def/base_api/api_def_UniqueDataset.pbtxt
deleted file mode 100644
index 009256916908c412fdebd0775387a7f7f4d30a25..0000000000000000000000000000000000000000
--- a/tensorflow/core/api_def/base_api/api_def_UniqueDataset.pbtxt
+++ /dev/null
@@ -1,4 +0,0 @@
-op {
-  graph_op_name: "UniqueDataset"
-  summary: "Creates a dataset that contains the unique elements of `input_dataset`."
-}
diff --git a/tensorflow/core/api_def/base_api/api_def_UniqueWithCountsV2.pbtxt b/tensorflow/core/api_def/base_api/api_def_UniqueWithCountsV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e21f56ba5b926826a79f10d154bb7b0c2253af2f
--- /dev/null
+++ b/tensorflow/core/api_def/base_api/api_def_UniqueWithCountsV2.pbtxt
@@ -0,0 +1,85 @@
+op {
+  graph_op_name: "UniqueWithCountsV2"
+  in_arg {
+    name: "x"
+    description: <<END
+A `Tensor`.
+END
+  }
+  in_arg {
+    name: "axis"
+    description: <<END
+A `Tensor` of type `int32` (default: None). The axis of the Tensor to
+find the unique elements.
+END
+  }
+  out_arg {
+    name: "y"
+    description: <<END
+A `Tensor`. Unique elements along the `axis` of `Tensor` x.
+END
+  }
+  out_arg {
+    name: "idx"
+    description: <<END
+A 1-D Tensor that contains the index of each
+value of x in the output y.
+END
+  }
+  out_arg {
+    name: "count"
+    description: <<END
+A 1-D Tensor. The count of each value of x in the output y.
+END
+  }
+  summary: "Finds unique elements along an axis of a tensor."
+  description: <<END
+This operation returns a tensor `y` containing the unique elements
+along the `axis` of a tensor. The returned unique elements are sorted
+in the same order as they occur along `axis` in `x`.
+This operation also returns a tensor `idx` with one entry per element of
+`x` along the `axis` dimension, and a tensor `count` with one entry per
+element of `y`. `idx` contains the index of each value of `x` in the unique
+output `y`, and `count` contains the number of times each value of `y` occurs.
+In other words, for a `1-D` tensor `x` with `axis = None`:
+
+`y[idx[i]] = x[i] for i in [0, 1,...,size(x) - 1]`
+
+For example:
+
+```
+# tensor 'x' is [1, 1, 2, 4, 4, 4, 7, 8, 8]
+y, idx, count = unique_with_counts(x)
+y ==> [1, 2, 4, 7, 8]
+idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
+count ==> [2, 1, 3, 1, 2]
+```
+
+For a `2-D` tensor `x` with `axis = 0`:
+
+```
+# tensor 'x' is [[1, 0, 0],
+#                [1, 0, 0],
+#                [2, 0, 0]]
+y, idx, count = unique_with_counts(x, axis=0)
+y ==> [[1, 0, 0],
+       [2, 0, 0]]
+idx ==> [0, 0, 1]
+count ==> [2, 1]
+```
+
+For a `2-D` tensor `x` with `axis = 1`:
+
+```
+# tensor 'x' is [[1, 0, 0],
+#                [1, 0, 0],
+#                [2, 0, 0]]
+y, idx, count = unique_with_counts(x, axis=1)
+y ==> [[1, 0],
+       [1, 0],
+       [2, 0]]
+idx ==> [0, 1, 1]
+count ==> [1, 2]
+```
+END
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Abort.pbtxt b/tensorflow/core/api_def/python_api/api_def_Abort.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3f95aaf12c65383b1425fd4063a79afff63480a6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Abort.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Abort"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AccumulatorApplyGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_AccumulatorApplyGradient.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1e76d6dadcde5083dba8e2ef78740256fd45dc63
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AccumulatorApplyGradient.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AccumulatorApplyGradient"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AccumulatorNumAccumulated.pbtxt b/tensorflow/core/api_def/python_api/api_def_AccumulatorNumAccumulated.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..fbe971ab2e221bd01e991f9c80e1d527736e59bf
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AccumulatorNumAccumulated.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AccumulatorNumAccumulated"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AccumulatorSetGlobalStep.pbtxt b/tensorflow/core/api_def/python_api/api_def_AccumulatorSetGlobalStep.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0047b25af6a52d02b5b4f1e87fa16fce56a90a29
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AccumulatorSetGlobalStep.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AccumulatorSetGlobalStep"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AccumulatorTakeGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_AccumulatorTakeGradient.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..860fbe124506c7db95e3f1603b9f3878a2d4b84b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AccumulatorTakeGradient.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AccumulatorTakeGradient"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AdjustContrast.pbtxt b/tensorflow/core/api_def/python_api/api_def_AdjustContrast.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0311ad92b7e40c67969bf14193a8b2f98659558a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AdjustContrast.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AdjustContrast"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AdjustHue.pbtxt b/tensorflow/core/api_def/python_api/api_def_AdjustHue.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b4411677118b002bd751a492aadf30b6fb0f4ac8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AdjustHue.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AdjustHue"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AdjustSaturation.pbtxt b/tensorflow/core/api_def/python_api/api_def_AdjustSaturation.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..893219e17a70c5d4fdd24b46986b6ed33303448c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AdjustSaturation.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AdjustSaturation"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Angle.pbtxt b/tensorflow/core/api_def/python_api/api_def_Angle.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..771e861fd17171ae886fabcf47218923ab5451b6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Angle.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Angle"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyAdadelta.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyAdadelta.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d8776b19f1a28dca7f9f067154d51cdb02599d79
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyAdadelta.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyAdadelta"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7e659c1bb30ab2f80a9bb010c55a4426b12f9d5b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyAdagradDA.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyAdagradDA.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d647c5eb0a23346f25407630d24d25321e3282a3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyAdagradDA.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyAdagradDA"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyAdam.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyAdam.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..66d9095c8fd0eb729f3b3c3ca5938ebe045723d7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyAdam.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyAdam"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyAddSign.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyAddSign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b7fe1aa6542e874c071bb3d069d50a37a4779754
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyAddSign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyAddSign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyCenteredRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyCenteredRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..56003c5e6fda60fb43c4a5784e2ccdf564ff1ed8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyCenteredRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyCenteredRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyFtrl.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyFtrl.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..680b3ef480f54da4e13331cc47589810ccecb54e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyFtrl.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyFtrl"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyFtrlV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyFtrlV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5ab3bb6efd2bd483a5223acbb9a16b0b9ab3d001
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyFtrlV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyFtrlV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..467bf7db558000d7a15bf83b33011690a34a107a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyMomentum.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyMomentum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7c3f0fef95f55c18170343d6f1cc081612dd68ee
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyMomentum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyMomentum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyPowerSign.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyPowerSign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f376b1dc6e531113d580406df26a5b30210b07ad
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyPowerSign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyPowerSign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyProximalAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyProximalAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0c6e2a4bb1ede274433bd297c7f39b2c58186923
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyProximalAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyProximalAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyProximalGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyProximalGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..90c1655fe913504f55b910d826e834e70a7c4acc
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyProximalGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyProximalGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApplyRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApplyRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..18cce1915a5eb25f68a227968c0819977d02467f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApplyRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApplyRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ApproximateEqual.pbtxt b/tensorflow/core/api_def/python_api/api_def_ApproximateEqual.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..707f6716f9604994f5d92651ecacc7608f10742d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ApproximateEqual.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ApproximateEqual"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AssignAddVariableOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_AssignAddVariableOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e30ec092e68bf9204b34e6d06b3b4c1cfbeab02b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AssignAddVariableOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AssignAddVariableOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AssignSubVariableOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_AssignSubVariableOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..81290a56ec1cd89b15a875c24ec31a20faa11bb8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AssignSubVariableOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AssignSubVariableOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AssignVariableOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_AssignVariableOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3ffa4a11c41f2cf38b9251a0fd030ea0fc511058
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AssignVariableOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "AssignVariableOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_AvgPool3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_AvgPool3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..cc16523a1567e8d7f2d0146c1c44d9ef11b6c6d5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_AvgPool3D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "AvgPool3D"
+  endpoint {
+    name: "nn.avg_pool3d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BatchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_BatchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4289c1daf96583943b8dfad84aeca3351657bee4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BatchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BatchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BatchMatrixBandPart.pbtxt b/tensorflow/core/api_def/python_api/api_def_BatchMatrixBandPart.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0a699e20506e47177722a57249f53cb1b80cf1b2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BatchMatrixBandPart.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BatchMatrixBandPart"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiag.pbtxt b/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiag.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..40be51ecccd3d5f6ceb3a1dc245e925e666d8ac5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiag.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BatchMatrixDiag"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiagPart.pbtxt b/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiagPart.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1ef78fa5ece9a6ff1147b70d4660769954e67301
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BatchMatrixDiagPart.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BatchMatrixDiagPart"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BatchMatrixSetDiag.pbtxt b/tensorflow/core/api_def/python_api/api_def_BatchMatrixSetDiag.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..644c1270a2385871d9dc4f429e39de4c0382c27a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BatchMatrixSetDiag.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BatchMatrixSetDiag"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BiasAddGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_BiasAddGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9226c6791c82fcddb1ed0d54db155d20ff44d18e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BiasAddGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BiasAddGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Bincount.pbtxt b/tensorflow/core/api_def/python_api/api_def_Bincount.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..551b51db26251cbf15b38ac3e48b5024fae0ec72
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Bincount.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Bincount"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_BytesProducedStatsDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_BytesProducedStatsDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..fcf541f9036baaef1590f06da0d7471b0558b4c7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_BytesProducedStatsDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "BytesProducedStatsDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CacheDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_CacheDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2bbb4ff9e3b08d0dd11c7444e5d00feb514e81c0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CacheDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CacheDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Cast.pbtxt b/tensorflow/core/api_def/python_api/api_def_Cast.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..428aa62c462a2361c519243a8b8a6bdb4f42cb9d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Cast.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Cast"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CholeskyGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_CholeskyGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3538afb2a7108763b1986615f8c738e0c3113c96
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CholeskyGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CholeskyGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CompareAndBitpack.pbtxt b/tensorflow/core/api_def/python_api/api_def_CompareAndBitpack.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..493a7e48665954b647e99a5fe9d06a7a42755494
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CompareAndBitpack.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CompareAndBitpack"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ConcatenateDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ConcatenateDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c005a4da0f866c1d1106effabbaa22f1abecf422
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ConcatenateDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ConcatenateDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ConditionalAccumulator.pbtxt b/tensorflow/core/api_def/python_api/api_def_ConditionalAccumulator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a4663e8eb3a20b6d809242ec9a44376247b8b3ca
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ConditionalAccumulator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ConditionalAccumulator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ConsumeMutexLock.pbtxt b/tensorflow/core/api_def/python_api/api_def_ConsumeMutexLock.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9559947490e6fbcb88ab8a359e8416fd11a8b165
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ConsumeMutexLock.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ConsumeMutexLock"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ControlTrigger.pbtxt b/tensorflow/core/api_def/python_api/api_def_ControlTrigger.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..33941493af7bf038503dad2379821d0c36929cf8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ControlTrigger.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ControlTrigger"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2ae75d6da222d84245bb2a912942522eb52047bc
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv2D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Conv2D"
+  endpoint {
+    name: "nn.conv2d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropFilter.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropFilter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6f21d8c8802f9a18c9357dbe68d3c65407bff923
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropFilter.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Conv2DBackpropFilter"
+  endpoint {
+    name: "nn.conv2d_backprop_filter"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropInput.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropInput.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ea976799cbc73bc9164a15e781a051f03e14275b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv2DBackpropInput.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Conv2DBackpropInput"
+  endpoint {
+    name: "nn.conv2d_backprop_input"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ba8d178263c94574c0aaac8f1f24fb1424a50275
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv3D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Conv3D"
+  endpoint {
+    name: "nn.conv3d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilter.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..634545f427c906edd94297f9e5291be4021462ad
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Conv3DBackpropFilter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilterV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilterV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1da8ee3a25f36a0b44f6458a351854190fe7830f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropFilterV2.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Conv3DBackpropFilterV2"
+  endpoint {
+    name: "nn.conv3d_backprop_filter_v2"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInput.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInput.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e2b0a0d19f4a31b05771133c52446078a7e938c8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInput.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Conv3DBackpropInput"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInputV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInputV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4e5c4f74fe90c148e11be98a2e343a41511d1d1d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Conv3DBackpropInputV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Conv3DBackpropInputV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradBoxes.pbtxt b/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradBoxes.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ac4449419314a1fe09e9a2b17e815a741b960b1d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradBoxes.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CropAndResizeGradBoxes"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradImage.pbtxt b/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradImage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..eecd0536f29bc189705c7a7311a79eb5ffff02dc
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_CropAndResizeGradImage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "CropAndResizeGradImage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Cumprod.pbtxt b/tensorflow/core/api_def/python_api/api_def_Cumprod.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8f5e2f061bd281d22e91b0922899eda3a641d68f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Cumprod.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Cumprod"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Cumsum.pbtxt b/tensorflow/core/api_def/python_api/api_def_Cumsum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..715f26fcac2bb03b729d58f5c5f7cfe6802660fd
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Cumsum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Cumsum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DataFormatDimMap.pbtxt b/tensorflow/core/api_def/python_api/api_def_DataFormatDimMap.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..82a39cfc5981f14edfe39cee363abf169f89245e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DataFormatDimMap.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DataFormatDimMap"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DataFormatVecPermute.pbtxt b/tensorflow/core/api_def/python_api/api_def_DataFormatVecPermute.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9ec292df8f670cfbae6488545979354d751e5d41
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DataFormatVecPermute.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DataFormatVecPermute"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DatasetToSingleElement.pbtxt b/tensorflow/core/api_def/python_api/api_def_DatasetToSingleElement.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e3d34cc15be752b466aa03f6805cd687698f74fa
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DatasetToSingleElement.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DatasetToSingleElement"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DecodeCompressed.pbtxt b/tensorflow/core/api_def/python_api/api_def_DecodeCompressed.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f0b7539918617e866acdf4d4d88279e1aeeb7a14
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DecodeCompressed.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DecodeCompressed"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DenseToDenseSetOperation.pbtxt b/tensorflow/core/api_def/python_api/api_def_DenseToDenseSetOperation.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1c47ec09c5ee16d37ac57c211c2409cd8f8c6970
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DenseToDenseSetOperation.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DenseToDenseSetOperation"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DenseToSparseBatchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_DenseToSparseBatchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0a8e068afb744ce8b472111d19cf743d39ac44ef
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DenseToSparseBatchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DenseToSparseBatchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DenseToSparseSetOperation.pbtxt b/tensorflow/core/api_def/python_api/api_def_DenseToSparseSetOperation.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a30757df4d0326159c180c4be14309f9150fff00
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DenseToSparseSetOperation.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DenseToSparseSetOperation"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DepthToSpace.pbtxt b/tensorflow/core/api_def/python_api/api_def_DepthToSpace.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..fd0766b36556d86b2fc99f8b2ac19480832546e1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DepthToSpace.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DepthToSpace"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DeserializeIterator.pbtxt b/tensorflow/core/api_def/python_api/api_def_DeserializeIterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..170d37be4e268c2829a5fa01fcaa48be082c2e0e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DeserializeIterator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DeserializeIterator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_DestroyResourceOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_DestroyResourceOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b9dde0080a8c3533a1b1837ddd4aaeb05e7a180e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_DestroyResourceOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "DestroyResourceOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Dilation2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_Dilation2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6d73ecf1bb06895017b2d2ac2a16c702681eb217
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Dilation2D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "Dilation2D"
+  endpoint {
+    name: "nn.dilation2d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropFilter.pbtxt b/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropFilter.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..feb9f083db691c55832997509ff6455a6584f486
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropFilter.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Dilation2DBackpropFilter"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropInput.pbtxt b/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropInput.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9a6b09f5cc653ba456bfb9fb66757c48963503e4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Dilation2DBackpropInput.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Dilation2DBackpropInput"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_EnqueueInQueueDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_EnqueueInQueueDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..051cf14c0ec2b32779be8b9c297b93abd1bc1318
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_EnqueueInQueueDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "EnqueueInQueueDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FFT2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_FFT2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9ed1341dfe2d0c4f57e0fa3c2d14378bce452be3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FFT2D.pbtxt
@@ -0,0 +1,9 @@
+op {
+  graph_op_name: "FFT2D"
+  endpoint {
+    name: "spectral.fft2d"
+  }
+  endpoint {
+    name: "fft2d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FFT3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_FFT3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5a4e1d6adf9b9c2bf68c6375de6aebfdfcf5bfb3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FFT3D.pbtxt
@@ -0,0 +1,9 @@
+op {
+  graph_op_name: "FFT3D"
+  endpoint {
+    name: "spectral.fft3d"
+  }
+  endpoint {
+    name: "fft3d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FilterDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_FilterDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6f91b842181c769d0a2f921f1d7566c4d8522541
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FilterDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FilterDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FixedLengthRecordDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_FixedLengthRecordDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d0703471d38c94a8c37da6f0a65ebd165c23a820
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FixedLengthRecordDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FixedLengthRecordDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FlatMapDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_FlatMapDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9de61ac263cd82a0893aa2e27b9d7532490ca441
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FlatMapDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FlatMapDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..56409f32d8d58b923980f78b3662f196e7954e14
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FusedBatchNormGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGradV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGradV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f5a4200b76c884c0f24335df1716f85b0666b589
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FusedBatchNormGradV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FusedBatchNormGradV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FusedPadConv2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_FusedPadConv2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..03b5fdd5a11844af209c86d9ef8e362c4d286ea6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FusedPadConv2D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FusedPadConv2D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_FusedResizeAndPadConv2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_FusedResizeAndPadConv2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..52165d9b4d991d1636cdc08d5cb2f9efe2f7754f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_FusedResizeAndPadConv2D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "FusedResizeAndPadConv2D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Gather.pbtxt b/tensorflow/core/api_def/python_api/api_def_Gather.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5f956930e0f5bc9a9160974ee4c4a177102942fa
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Gather.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Gather"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_GatherV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_GatherV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..029bc59b51cb5463579dbf867e3a1927cb3577f7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_GatherV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "GatherV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_GeneratorDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_GeneratorDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9dcfa0f7d210012aa5c2d43349239a953ea3739e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_GeneratorDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "GeneratorDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_GroupByWindowDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_GroupByWindowDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8d40208e613e6b7ee1522c2990afea1345cc5de1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_GroupByWindowDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "GroupByWindowDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IFFT2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_IFFT2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d6b36a314b8d8a197651ee3c68b1376a9bbed669
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IFFT2D.pbtxt
@@ -0,0 +1,9 @@
+op {
+  graph_op_name: "IFFT2D"
+  endpoint {
+    name: "spectral.ifft2d"
+  }
+  endpoint {
+    name: "ifft2d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IFFT3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_IFFT3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6def5b36da17766c5342703fcefe2b377028f330
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IFFT3D.pbtxt
@@ -0,0 +1,9 @@
+op {
+  graph_op_name: "IFFT3D"
+  endpoint {
+    name: "spectral.ifft3d"
+  }
+  endpoint {
+    name: "ifft3d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IRFFT.pbtxt b/tensorflow/core/api_def/python_api/api_def_IRFFT.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8fa74a4317fe635cb10ca226f5516834370275c2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IRFFT.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IRFFT"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IRFFT2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_IRFFT2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2021cad63911d2bafb159e1a1f2f11ed2a1d372e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IRFFT2D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IRFFT2D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IRFFT3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_IRFFT3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5d1eab6003ece3b1eed22200743a28de185d1299
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IRFFT3D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IRFFT3D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Imag.pbtxt b/tensorflow/core/api_def/python_api/api_def_Imag.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5632fd4365f718ccf079e1c75b962b011c0253f6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Imag.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Imag"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ImmutableConst.pbtxt b/tensorflow/core/api_def/python_api/api_def_ImmutableConst.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..997013914b89fb489eaa3c8f96f001b093aa23e0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ImmutableConst.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ImmutableConst"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_InterleaveDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_InterleaveDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ef1b06b19cc6a0c62f6e9f451aceed8aeabed553
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_InterleaveDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "InterleaveDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Inv.pbtxt b/tensorflow/core/api_def/python_api/api_def_Inv.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ed58a276f69e46bbf3d14fbd4b921ad2f0d7a2df
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Inv.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Inv"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IsVariableInitialized.pbtxt b/tensorflow/core/api_def/python_api/api_def_IsVariableInitialized.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6a7b0789090c96fa2db968edbc885258aecf34d7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IsVariableInitialized.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IsVariableInitialized"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Iterator.pbtxt b/tensorflow/core/api_def/python_api/api_def_Iterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a021db1534834a5c248e750b9e56e334a20d3949
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Iterator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Iterator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IteratorFromStringHandle.pbtxt b/tensorflow/core/api_def/python_api/api_def_IteratorFromStringHandle.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f9efe2d1446330aa78405329b017ce0c81d3a20c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IteratorFromStringHandle.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorFromStringHandle"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IteratorGetNext.pbtxt b/tensorflow/core/api_def/python_api/api_def_IteratorGetNext.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f7066484ceaa2c0dce7a9ccba8c71838e79e85c0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IteratorGetNext.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorGetNext"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IteratorGetNextSync.pbtxt b/tensorflow/core/api_def/python_api/api_def_IteratorGetNextSync.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d94edbc71de2295e4a83bda4a0616cbb6c3ebe41
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IteratorGetNextSync.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorGetNextSync"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IteratorSetStatsAggregator.pbtxt b/tensorflow/core/api_def/python_api/api_def_IteratorSetStatsAggregator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..db51ae3873c7354dc7ce932b99b42edd12066757
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IteratorSetStatsAggregator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorSetStatsAggregator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_IteratorToStringHandle.pbtxt b/tensorflow/core/api_def/python_api/api_def_IteratorToStringHandle.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8a4251f76bd078903cbdf4b2d8419815dab2742e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_IteratorToStringHandle.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "IteratorToStringHandle"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_LatencyStatsDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_LatencyStatsDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..94bf6106ad8459767d31a345a17483b255dfc02b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_LatencyStatsDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "LatencyStatsDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_LoopCond.pbtxt b/tensorflow/core/api_def/python_api/api_def_LoopCond.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4cfa295b2a33446e3646fc1d000ecefd78d64291
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_LoopCond.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "LoopCond"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MakeIterator.pbtxt b/tensorflow/core/api_def/python_api/api_def_MakeIterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..acc3342c9ba98d1e5022d99a17fe51c9f4af0ce6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MakeIterator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MakeIterator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapAndBatchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapAndBatchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..cffd2910fb404bc7f75e55e42b9ebba1635db134
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapAndBatchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapAndBatchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapClear.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapClear.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..67c1c3e2dd3191d9e37ea40e6d8cf00e5f888550
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapClear.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapClear"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0b1d2f2c730ff8b8b928fcd97c4fe3bdc704e470
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapIncompleteSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapIncompleteSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..db7921e13b97eebf09260f985b311175bf5b67a4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapIncompleteSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapIncompleteSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapPeek.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapPeek.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..85fab1722948b9b5e0d2e74794bd98b9dd7de37e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapPeek.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapPeek"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8b6ed1a0cf460ad9050af7b3bea7a2ef9bd5c1e4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapStage.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapStage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3ae70d5d5791ec58642fff759c53d56338670540
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapStage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapStage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapUnstage.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapUnstage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e5f92e37db41b2528961f1dde322e3a1938539b3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapUnstage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapUnstage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MapUnstageNoKey.pbtxt b/tensorflow/core/api_def/python_api/api_def_MapUnstageNoKey.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2c2a25db2139db94c3a541188fa17021a2492738
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MapUnstageNoKey.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MapUnstageNoKey"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MaxPool3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_MaxPool3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e8576c9ff2e0729235d9bca70c369536dacaa08e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MaxPool3D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "MaxPool3D"
+  endpoint {
+    name: "nn.max_pool3d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MaxPoolGradGradV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_MaxPoolGradGradV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..534cc90e41ea33fda876a907bc1dfe7eae1bcc15
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MaxPoolGradGradV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MaxPoolGradGradV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MaxPoolGradV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_MaxPoolGradV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e79f839686a425a5648f569d05a2cce60d46edcb
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MaxPoolGradV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MaxPoolGradV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MergeV2Checkpoints.pbtxt b/tensorflow/core/api_def/python_api/api_def_MergeV2Checkpoints.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ca9f74e0c19081446fdaa2d13413d2817e00f402
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MergeV2Checkpoints.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MergeV2Checkpoints"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Multinomial.pbtxt b/tensorflow/core/api_def/python_api/api_def_Multinomial.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9b654335806407602938e43850d2165e3c952032
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Multinomial.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Multinomial"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MutexLock.pbtxt b/tensorflow/core/api_def/python_api/api_def_MutexLock.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..74e6e1035771484adbfdabd1720c260a39e5f519
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MutexLock.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MutexLock"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_MutexV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_MutexV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..013f42d8550cac92aad2539f766deb3e97abaeaf
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_MutexV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "MutexV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_NextIteration.pbtxt b/tensorflow/core/api_def/python_api/api_def_NextIteration.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..28ac301e4169fa4302124a3e554cae6f8f1e13db
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_NextIteration.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "NextIteration"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_NthElement.pbtxt b/tensorflow/core/api_def/python_api/api_def_NthElement.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ec838585103345748ef5332b032af7a522393fb0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_NthElement.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "NthElement"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OneShotIterator.pbtxt b/tensorflow/core/api_def/python_api/api_def_OneShotIterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ee9d777b4e4c7104ea919bfe3fa6e48aa0928b20
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OneShotIterator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OneShotIterator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OnesLike.pbtxt b/tensorflow/core/api_def/python_api/api_def_OnesLike.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c058e5b1ab19790b8aa4049412f937282bd14abb
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OnesLike.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OnesLike"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapClear.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapClear.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b8276b964a58855e3ab92d026ebb0fc00e67f2e7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapClear.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapClear"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapIncompleteSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapIncompleteSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1ba6c5b2fc7f461d05cf944a1152d249de0217f0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapIncompleteSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapIncompleteSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapPeek.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapPeek.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8f0c7afd465358c35ea8c1fd3b33eb4d4a76ef87
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapPeek.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapPeek"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2e155726da6bd697ef422c53d96cec086df511b3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapStage.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapStage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6222c1fc4c998174b65e861ef1aeb4375d58c05e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapStage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapStage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstage.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5cca8d9f93d065271a71ea23bf953e73a1cd6e58
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapUnstage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstageNoKey.pbtxt b/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstageNoKey.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d67b95b65b7e00475a1f8f422e3df529b3747ea0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_OrderedMapUnstageNoKey.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "OrderedMapUnstageNoKey"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PaddedBatchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_PaddedBatchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c6223b3132ed0d6878995d3c5e657275fac0cc4f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PaddedBatchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PaddedBatchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ParallelDynamicStitch.pbtxt b/tensorflow/core/api_def/python_api/api_def_ParallelDynamicStitch.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a36ad273646a97aaadbe74718800e5fb1fc27dae
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ParallelDynamicStitch.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ParallelDynamicStitch"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ParallelInterleaveDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ParallelInterleaveDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..93cd5719feb613cd3de2e422e23cc3d690bdef08
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ParallelInterleaveDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ParallelInterleaveDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ParallelMapDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ParallelMapDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..09d200dd24c828af85d1505bb17086dbfa688ee8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ParallelMapDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ParallelMapDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ParseSingleExample.pbtxt b/tensorflow/core/api_def/python_api/api_def_ParseSingleExample.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4193bdd091e015eca8cca85034255d36ba27a67b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ParseSingleExample.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ParseSingleExample"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PlaceholderV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_PlaceholderV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a30360d2de4e36f47f3c7564db5ec9ca045034b9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PlaceholderV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PlaceholderV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PopulationCount.pbtxt b/tensorflow/core/api_def/python_api/api_def_PopulationCount.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d35550236a317c6581b6e8b91f8843b5cc24977f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PopulationCount.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PopulationCount"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PrefetchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_PrefetchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ec4e214eb5e082c8f732cbef9db69524c48d80a4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PrefetchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PrefetchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PrependFromQueueAndPaddedBatchDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_PrependFromQueueAndPaddedBatchDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..228c4047d2e0b7ddfec1d8cd4fad478aa6c4c1a7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PrependFromQueueAndPaddedBatchDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PrependFromQueueAndPaddedBatchDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_PreventGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_PreventGradient.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9565f5632b9ffdbaf1879dc1c18092838143d06b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_PreventGradient.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "PreventGradient"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantize.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d2468f1b243f318dcc3a8fb45524c6b548f378fa
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizeAndDequantize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..15e181be20948128a7f970f024e6cc8dfe28c96c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizeAndDequantizeV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV3.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV3.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f1edc6f5faecfbedc0b9b873484b160551b0f2d2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizeAndDequantizeV3.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizeAndDequantizeV3"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizeDownAndShrinkRange.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizeDownAndShrinkRange.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9a2a86d25dad916208e9a666b5ffaa15f1513c4a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizeDownAndShrinkRange.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizeDownAndShrinkRange"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizeV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizeV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..40673234ed02fa49601b83ce4f587b9051295315
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizeV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizeV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b952d6eccbb30df85582848f7f7e7869eea367a8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedBatchNormWithGlobalNormalization.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedBatchNormWithGlobalNormalization.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e009ada5535993ab5c6eefe8e0b8858e04735824
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedBatchNormWithGlobalNormalization.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedBatchNormWithGlobalNormalization"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedBiasAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedBiasAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3432962e593d6777a62723d25bffd22b8001cc68
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedBiasAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedBiasAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedConv2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedConv2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2409d12abeff922cca92f9ae609764a27f651356
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedConv2D.pbtxt
@@ -0,0 +1,6 @@
+op {
+  graph_op_name: "QuantizedConv2D"
+  endpoint {
+    name: "nn.quantized_conv2d"
+  }
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedInstanceNorm.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedInstanceNorm.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..47a4931a05ab0f9f8b746103667c64f2cc27fbae
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedInstanceNorm.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedInstanceNorm"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedMatMul.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedMatMul.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3ca9d2ae0774cda244db4843e86372cbe40e2ecb
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedMatMul.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedMatMul"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedMul.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedMul.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c026fba194c0a0fa6208799248b7269d09a5623b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedMul.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedMul"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedRelu.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedRelu.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e5da4f25f0e51b1b73b9bb96c9b5b18c2ee54d60
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedRelu.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedRelu"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedRelu6.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedRelu6.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ef1e64831270955219d409c80f865a16713cbcc3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedRelu6.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedRelu6"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedReshape.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedReshape.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7e6d9ed718386f77c5f28ed164803e2d7f148eaf
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedReshape.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedReshape"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QuantizedResizeBilinear.pbtxt b/tensorflow/core/api_def/python_api/api_def_QuantizedResizeBilinear.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a8da4128c260644db022183683e2dc362d82d39e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QuantizedResizeBilinear.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QuantizedResizeBilinear"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QueueIsClosed.pbtxt b/tensorflow/core/api_def/python_api/api_def_QueueIsClosed.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f1d2ef63f1a8849befe42341e15c1630f730ec04
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QueueIsClosed.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QueueIsClosed"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_QueueIsClosedV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_QueueIsClosedV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..07cf1a7497a40ff435f40eaaa31d22e8785bd20c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_QueueIsClosedV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "QueueIsClosedV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RFFT.pbtxt b/tensorflow/core/api_def/python_api/api_def_RFFT.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e9719255aebe6c665f9178c6b652230dd4542d13
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RFFT.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RFFT"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RFFT2D.pbtxt b/tensorflow/core/api_def/python_api/api_def_RFFT2D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1336a64408e8135284d9cafd6ca057572950bdf5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RFFT2D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RFFT2D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RFFT3D.pbtxt b/tensorflow/core/api_def/python_api/api_def_RFFT3D.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..978b5814ff652246cca7630f9a2df22985bbb28e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RFFT3D.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RFFT3D"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RandomDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_RandomDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a5f6f8c6f1db344c480e2bd452362d977dc15000
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RandomDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RandomDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RandomPoissonV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_RandomPoissonV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8cc217c50ea74af0413a804c8e2b726b3e5f1a91
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RandomPoissonV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RandomPoissonV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RangeDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_RangeDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4cd8296b2233ac58c12e6573d2194f7d976d9137
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RangeDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RangeDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ReadVariableOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ReadVariableOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e250b78effcd998b3d26804522858b2386df9b46
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ReadVariableOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ReadVariableOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Real.pbtxt b/tensorflow/core/api_def/python_api/api_def_Real.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..52a9089f4a75018cb1a6a551aecef7b1795e9f4f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Real.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Real"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RecordInput.pbtxt b/tensorflow/core/api_def/python_api/api_def_RecordInput.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..29f798050e7c868a13a13b3c123ecbc2c5f70de1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RecordInput.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RecordInput"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ReduceJoin.pbtxt b/tensorflow/core/api_def/python_api/api_def_ReduceJoin.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0fde5942abee41797db084e1b34b8202532db1a5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ReduceJoin.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ReduceJoin"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RefNextIteration.pbtxt b/tensorflow/core/api_def/python_api/api_def_RefNextIteration.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f9dfcf5e97e9fd4f0676cdb59503947c1a1972f9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RefNextIteration.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RefNextIteration"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RefSelect.pbtxt b/tensorflow/core/api_def/python_api/api_def_RefSelect.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8f9909aa86a861e5b1bfb95aa96e3fbd925f0c4a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RefSelect.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RefSelect"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RefSwitch.pbtxt b/tensorflow/core/api_def/python_api/api_def_RefSwitch.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..68b0f4a694aa1f059ca85b5218b40c99e1d21d28
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RefSwitch.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RefSwitch"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RemoteCall.pbtxt b/tensorflow/core/api_def/python_api/api_def_RemoteCall.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..fc069d857d0b40bda75dedc4d881359419ce8b6b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RemoteCall.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RemoteCall"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RepeatDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_RepeatDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..be301da8386af0fbd98c9b02d2cfc0fe79178990
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RepeatDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RepeatDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RequantizationRange.pbtxt b/tensorflow/core/api_def/python_api/api_def_RequantizationRange.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e327595a3898c69e3b060a821345cb8d863b4587
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RequantizationRange.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RequantizationRange"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Requantize.pbtxt b/tensorflow/core/api_def/python_api/api_def_Requantize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f26f0611bae9544aab74f014690db9ceb4606241
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Requantize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Requantize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdadelta.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdadelta.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e0413a67a3c949e8d34311b26acd81a251100ca4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdadelta.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyAdadelta"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..52b8ba0b0e4db79a65fc47cc14a66f7469db6328
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagradDA.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagradDA.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..edfc0a733f84542601ce95a3bad1b99db629c2a1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdagradDA.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyAdagradDA"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdam.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdam.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ca2713b533804e8ecc9ed76798744c0b07bb5e24
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAdam.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyAdam"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyAddSign.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAddSign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..50dd6439536ae38c5de377d5636262f5cdb906a6
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyAddSign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyAddSign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyCenteredRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyCenteredRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..20592e38c812aeb45cd03bae300b5d6667aee7dd
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyCenteredRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyCenteredRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrl.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrl.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..72b49e09d6217ce9fdc110f91f4e10fe86124e09
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrl.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyFtrl"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrlV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrlV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..af1d24c344d81f94216c0517011b387ab93965eb
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyFtrlV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyFtrlV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..75d6afd426a0bfcd324542e6fdae70bf7f4b53b9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyMomentum.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyMomentum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3e499cf72e254ed5d0e6e73da8f88a9de4392605
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyMomentum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyMomentum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyPowerSign.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyPowerSign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..b23ad0d061a42a054e9116a25792f3d73a40caf1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyPowerSign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyPowerSign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..6ad124c59005285d2c9a9f894d53147cf0823c86
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyProximalAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d684a5dd6720333e334b8afe462224437e32e248
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyProximalGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyProximalGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceApplyRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceApplyRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c4c20e1382f6759993511e7eb4fd846c63575611
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceApplyRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceApplyRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceCountUpTo.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceCountUpTo.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..87376b74475d3191dd0d2be3a80c8da57087a88b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceCountUpTo.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceCountUpTo"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceGather.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceGather.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..714ba4a7ca9a1f05dc34cefe9e2430ebdc6f284a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceGather.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceGather"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceScatterAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceScatterAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4d4601cafdeb8af4821887ac0d354b1a4a7844b9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceScatterAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceScatterAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceScatterNdUpdate.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceScatterNdUpdate.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..54c66708aeb1263b01ed90b792b1331e909416ec
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceScatterNdUpdate.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceScatterNdUpdate"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceScatterUpdate.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceScatterUpdate.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..30f885bee0cf27801b56918e82cdef8c644afe1b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceScatterUpdate.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceScatterUpdate"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdadelta.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdadelta.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a7e4dad13878112512fc7b915cb4fec9bd47b76a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdadelta.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyAdadelta"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..1388da789c08d54bf61ace44791280586d0ec6aa
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagradDA.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagradDA.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c5beaa4f580fb22cd23a7bcefe9009f4168c36b4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyAdagradDA.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyAdagradDA"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyCenteredRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyCenteredRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f3de3d93df3d658d02663537d6cc4c404a315d27
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyCenteredRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyCenteredRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrl.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrl.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f83833d3511cf3f42c0257cd61677684da86b35f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrl.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyFtrl"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrlV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrlV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..71adbb0bcd6f63e2de2ecf63d7c8d56265b29ae2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyFtrlV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyFtrlV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyMomentum.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyMomentum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..28a19caaccfabc58444e5963ff1a1c6446e67255
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyMomentum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyMomentum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e8cda7f4edd1217baca8ed84b7c9ae96a22e3b6f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyProximalAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5fa1ade6696ba464a3da44b05f35c67e8ada4fc8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyProximalGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyProximalGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..86cc9a41ae9db89aa61b2225aeb15c42461dda45
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceSparseApplyRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceSparseApplyRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ResourceStridedSliceAssign.pbtxt b/tensorflow/core/api_def/python_api/api_def_ResourceStridedSliceAssign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ef6e19fea0d35ad6410f1001d3a683581bd545ea
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ResourceStridedSliceAssign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ResourceStridedSliceAssign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_RestoreV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_RestoreV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..34d07239a1a18e85e2534db6607a89c12cc670a1
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_RestoreV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "RestoreV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ReverseSequence.pbtxt b/tensorflow/core/api_def/python_api/api_def_ReverseSequence.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f3fc2578dfd2af59077f611bce137f98f6af38ee
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ReverseSequence.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ReverseSequence"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Roll.pbtxt b/tensorflow/core/api_def/python_api/api_def_Roll.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9cc919f36fef69b28fc20873f1a635f5dd1644cf
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Roll.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Roll"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SaveV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_SaveV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..617897ee44e8351bd95d9f44ca2b660894617b88
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SaveV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SaveV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ScanDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ScanDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e71b655c22fbcbf1524433fc65a392e4d80c5c43
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ScanDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ScanDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ScatterNdNonAliasingAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_ScatterNdNonAliasingAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ecf71cd6257b5630566cb6fb92110f6f738f91f4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ScatterNdNonAliasingAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ScatterNdNonAliasingAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SerializeIterator.pbtxt b/tensorflow/core/api_def/python_api/api_def_SerializeIterator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..07d2f200fee0dee55cb813389f672a914a10e0f2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SerializeIterator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SerializeIterator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SetSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_SetSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ee9c71036bdcf6c3d64d468cfe5a4793e522335d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SetSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SetSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Shape.pbtxt b/tensorflow/core/api_def/python_api/api_def_Shape.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..bd7b5ad36c6e1d7f3292cbd4ca13a1242bf09e8c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Shape.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Shape"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ShuffleAndRepeatDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ShuffleAndRepeatDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7b0d2994f0711f440fb6623aa2322c86bd3859f8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ShuffleAndRepeatDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ShuffleAndRepeatDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ShuffleDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ShuffleDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8f0be9197adeb23b2d5047c5d69916df0e2c1eda
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ShuffleDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ShuffleDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Size.pbtxt b/tensorflow/core/api_def/python_api/api_def_Size.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7f76173a5d870910edead637d3493e75ba651b67
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Size.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Size"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SkipDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_SkipDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..96a551c5b6669a8d019e3c705507aba768ab9d21
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SkipDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SkipDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SpaceToDepth.pbtxt b/tensorflow/core/api_def/python_api/api_def_SpaceToDepth.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d56a7384eb9c75f2f90420a1a742733b364e770c
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SpaceToDepth.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SpaceToDepth"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorApplyGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorApplyGradient.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5e158c9ca0ca4620cb18c7e98f969598180df7c0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorApplyGradient.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseAccumulatorApplyGradient"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorTakeGradient.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorTakeGradient.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5326f23def4637a178e8af1aff972f4ad1d982c7
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseAccumulatorTakeGradient.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseAccumulatorTakeGradient"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyAdadelta.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdadelta.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d30a8676e03172b852a1a3c6d50f77722ed25625
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdadelta.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyAdadelta"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..cb5ddef2128bbaa3239a279df614a4e3512dcf41
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagradDA.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagradDA.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c3b87b09536ee5d36f5d8b1c83b025e5857d13ab
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyAdagradDA.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyAdagradDA"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyCenteredRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyCenteredRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..db4732873845424e21a14506b84723268a963eea
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyCenteredRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyCenteredRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrl.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrl.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..14e37b8ba209374d6478cc047229e55314edcd81
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrl.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyFtrl"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrlV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrlV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0d307af9b497739d76d1d875751ef16323a1d56b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyFtrlV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyFtrlV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyMomentum.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyMomentum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ed34c0485d4b597dbe366ad69503be1393e079e8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyMomentum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyMomentum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalAdagrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalAdagrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..ff2d3b673141a571b6b1d816bf29fb5ba9880232
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalAdagrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyProximalAdagrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalGradientDescent.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalGradientDescent.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f342a611bb2aed3e4c9eae07a4802ed59bae76e8
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyProximalGradientDescent.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyProximalGradientDescent"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseApplyRMSProp.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseApplyRMSProp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7f337d50e5b0a995a2a3765782bf229d967ee9b4
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseApplyRMSProp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseApplyRMSProp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseConditionalAccumulator.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseConditionalAccumulator.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..bad4120795ef1fc6411a3bcacc209efe2e7b6841
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseConditionalAccumulator.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseConditionalAccumulator"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseAdd.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseAdd.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..c5e7c9851f15b6748e48eadd26b26925fbe2ed94
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseAdd.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseDenseCwiseAdd"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseDiv.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseDiv.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f72031cf684b4515bbf580203f2ea4e5714eab58
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseDiv.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseDenseCwiseDiv"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseMul.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseMul.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a87004ee5f9c9d445acb1d49138639d91da7b44f
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseDenseCwiseMul.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseDenseCwiseMul"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentMean.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMean.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f12c2e207368d4c585137bfeac523e2fada0ed50
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMean.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentMean"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..771083cd513c4d3deba21aaa2abaa090362c5684
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentMeanGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanWithNumSegments.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanWithNumSegments.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..fcb029535c36c6be0e7e08f36ac7b39a5e126df0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentMeanWithNumSegments.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentMeanWithNumSegments"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtN.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtN.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7daaa81482ff7f9e84ee7ba8a3768d0b19ef38e0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtN.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentSqrtN"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0682a597bb038abebab4198a44079486b60eb799
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentSqrtNGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNWithNumSegments.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNWithNumSegments.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7311a093df9e593dbebac764cd28e55c556ae6da
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSqrtNWithNumSegments.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentSqrtNWithNumSegments"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentSum.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e7028efce268d3e6e80328908c3a9e2dfc3d4343
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentSum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSegmentSumWithNumSegments.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSumWithNumSegments.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..81c2b8554e8f62e5ffffbdf410389776bbcf9035
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSegmentSumWithNumSegments.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSegmentSumWithNumSegments"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSparseMaximum.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSparseMaximum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0dbadc01edc539658e3066a9d720c71bc50ecae9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSparseMaximum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSparseMaximum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseSparseMinimum.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseSparseMinimum.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..0e3ffcbddf3bccdae0e28cc15527566ddf2ff03a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseSparseMinimum.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseSparseMinimum"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseTensorSliceDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseTensorSliceDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..19c0c7f199dfd24d24a56c3766733f9e55957c12
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseTensorSliceDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseTensorSliceDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SparseToSparseSetOperation.pbtxt b/tensorflow/core/api_def/python_api/api_def_SparseToSparseSetOperation.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..735ee18e149aca56cd82c9bb2b3fd8d3870a3188
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SparseToSparseSetOperation.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SparseToSparseSetOperation"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_SqlDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_SqlDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2ab4c3e441dd51f50a2796ef9d6fa0d21b727ffa
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_SqlDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "SqlDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Stage.pbtxt b/tensorflow/core/api_def/python_api/api_def_Stage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..66de5901bc9b604d92693e2affb75ea8555bfc4e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Stage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Stage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StageClear.pbtxt b/tensorflow/core/api_def/python_api/api_def_StageClear.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f54a1c1c0428753d93cc19abbd4fbc961d8eb988
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StageClear.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StageClear"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StagePeek.pbtxt b/tensorflow/core/api_def/python_api/api_def_StagePeek.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..710394d30d34cece63c148f41b6a18a3e4d99b7b
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StagePeek.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StagePeek"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StageSize.pbtxt b/tensorflow/core/api_def/python_api/api_def_StageSize.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..472032ac42af197e32a437a112f6c39704193ad0
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StageSize.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StageSize"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StatsAggregatorHandle.pbtxt b/tensorflow/core/api_def/python_api/api_def_StatsAggregatorHandle.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..f7bed36602f40602313157c20677acbbf592d7be
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StatsAggregatorHandle.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StatsAggregatorHandle"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StatsAggregatorSummary.pbtxt b/tensorflow/core/api_def/python_api/api_def_StatsAggregatorSummary.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..8b1bab2440f1934f1fd0194b76b7907fb0fb142d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StatsAggregatorSummary.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StatsAggregatorSummary"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StridedSlice.pbtxt b/tensorflow/core/api_def/python_api/api_def_StridedSlice.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..a55fa9887797ed9fa6900f9e77f9d1fa70de5aa2
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StridedSlice.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StridedSlice"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StridedSliceAssign.pbtxt b/tensorflow/core/api_def/python_api/api_def_StridedSliceAssign.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..bcf1df228e879fbde73fed7d0e955f67ea494663
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StridedSliceAssign.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StridedSliceAssign"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_StridedSliceGrad.pbtxt b/tensorflow/core/api_def/python_api/api_def_StridedSliceGrad.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..05d7d57511e8dd485d40a5168ca0866c4b6c481a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_StridedSliceGrad.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "StridedSliceGrad"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_TFRecordDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_TFRecordDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..3c270ada3c219b03715e0cd651a4b56fe5ebc227
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_TFRecordDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "TFRecordDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_TakeDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_TakeDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..711b335dc1926d32071637b3c986727c339736a3
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_TakeDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "TakeDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_TensorDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_TensorDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..5bc3920c56360f2348805db1db79ab2b630f379d
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_TensorDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "TensorDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_TensorSliceDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_TensorSliceDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..89ad016483fa392a302915d588d32201237c717a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_TensorSliceDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "TensorSliceDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_TextLineDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_TextLineDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..08d785191b6a4bddce2ac43fd4c0188b4d74548e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_TextLineDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "TextLineDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Transpose.pbtxt b/tensorflow/core/api_def/python_api/api_def_Transpose.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..e22b6a040e4011f87b1c945ffd7df050bcbdea76
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Transpose.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Transpose"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_UniqueWithCounts.pbtxt b/tensorflow/core/api_def/python_api/api_def_UniqueWithCounts.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..71b35eaab5f4a251ebebf9ddb7baf2ecd0a12401
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_UniqueWithCounts.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "UniqueWithCounts"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_UniqueWithCountsV2.pbtxt b/tensorflow/core/api_def/python_api/api_def_UniqueWithCountsV2.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..7876e55cf3e2c24e19507cefb01f9f61abd0a2bc
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_UniqueWithCountsV2.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "UniqueWithCountsV2"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Unstage.pbtxt b/tensorflow/core/api_def/python_api/api_def_Unstage.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..65eb756b870d4a6b8d767de3876d9353a192c1d5
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Unstage.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Unstage"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_VarHandleOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_VarHandleOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..2c93a6db93cf62aab345bb9044e4acddd01da7d9
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_VarHandleOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "VarHandleOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_VarIsInitializedOp.pbtxt b/tensorflow/core/api_def/python_api/api_def_VarIsInitializedOp.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..de5d9850acb1a3adbc59e554aedd819d870c5442
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_VarIsInitializedOp.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "VarIsInitializedOp"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_VariableShape.pbtxt b/tensorflow/core/api_def/python_api/api_def_VariableShape.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..9b317152ddf925b3f0b5b24c95bcb44bed6b718a
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_VariableShape.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "VariableShape"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_Where.pbtxt b/tensorflow/core/api_def/python_api/api_def_Where.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..d4dd25a20655a036c0fba33c14133389a532bf8e
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_Where.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "Where"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/api_def/python_api/api_def_ZipDataset.pbtxt b/tensorflow/core/api_def/python_api/api_def_ZipDataset.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..dd1459521ff70fc4b3adce7fbb1251b45106b439
--- /dev/null
+++ b/tensorflow/core/api_def/python_api/api_def_ZipDataset.pbtxt
@@ -0,0 +1,4 @@
+op {
+  graph_op_name: "ZipDataset"
+  visibility: HIDDEN
+}
diff --git a/tensorflow/core/common_runtime/bfc_allocator.h b/tensorflow/core/common_runtime/bfc_allocator.h
index b8e773503c7a2f8024e8a6f58247ad343a762f71..e34945dd48a1e54e4ae82dd7ea9959f39a97f2c2 100644
--- a/tensorflow/core/common_runtime/bfc_allocator.h
+++ b/tensorflow/core/common_runtime/bfc_allocator.h
@@ -23,7 +23,7 @@ limitations under the License.
 #include <vector>
 
 #include "tensorflow/core/common_runtime/allocator_retry.h"
-#include "tensorflow/core/common_runtime/visitable_allocator.h"
+#include "tensorflow/core/framework/visitable_allocator.h"
 #include "tensorflow/core/lib/gtl/stl_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/macros.h"
diff --git a/tensorflow/core/common_runtime/function.cc b/tensorflow/core/common_runtime/function.cc
index b941819838a7b155d8c8f54985bd6ae8bc15ce9d..3e937ceb640554be3a2578decdb336d0e58c197f 100644
--- a/tensorflow/core/common_runtime/function.cc
+++ b/tensorflow/core/common_runtime/function.cc
@@ -42,11 +42,8 @@ limitations under the License.
 namespace tensorflow {
 
 // A few string constant used throughout this module.
-//
-// TODO(zhifengc): Dedup some of these constants into
-// framework/function.h
-static constexpr const char* const kArgOp = "_Arg";
-static constexpr const char* const kRetOp = "_Retval";
+static constexpr const char* const kArgOp = FunctionLibraryDefinition::kArgOp;
+static constexpr const char* const kRetOp = FunctionLibraryDefinition::kRetOp;
 static constexpr const char* const kGradientOp =
     FunctionLibraryDefinition::kGradientOp;
 static constexpr const char* const kNodeLabel = "Func";
@@ -177,6 +174,7 @@ class FunctionLibraryRuntimeImpl : public FunctionLibraryRuntime {
   }
 
   Device* device() override { return device_; }
+  const DeviceMgr* device_mgr() const override { return device_mgr_; }
   Env* env() override { return env_; }
   int graph_def_version() override { return graph_def_version_; }
 
@@ -1580,9 +1578,6 @@ Status FunctionDefToBodyHelper(
 
   // Call BuildControlFlowInfo to validate that this function body has
   // well-formed control flow.
-  // NOTE(skyewm): this is usually done in Partition(), but we don't partition
-  // function bodies. This should be removed if function bodies ever go through
-  // the Partition() path.
   std::vector<ControlFlowInfo> dummy;
   TF_RETURN_IF_ERROR(BuildControlFlowInfo(graph.get(), &dummy));
 
diff --git a/tensorflow/core/common_runtime/function_testlib.cc b/tensorflow/core/common_runtime/function_testlib.cc
index 87c2476b04af7300d7138d59b3261496eb38c482..87733ed2dbe931c6bb64fd065d2691072d4eced0 100644
--- a/tensorflow/core/common_runtime/function_testlib.cc
+++ b/tensorflow/core/common_runtime/function_testlib.cc
@@ -15,6 +15,7 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/function_testlib.h"
 
 #include "tensorflow/core/common_runtime/device.h"
+#include "tensorflow/core/framework/common_shape_fns.h"
 #include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/node_def_builder.h"
 #include "tensorflow/core/framework/op_kernel.h"
@@ -39,7 +40,9 @@ class FindDeviceOpKernel : public OpKernel {
 
 REGISTER_KERNEL_BUILDER(Name("FindDeviceOp").Device(tensorflow::DEVICE_CPU),
                         FindDeviceOpKernel);
-REGISTER_OP("FindDeviceOp").Output("device_name: string");
+REGISTER_OP("FindDeviceOp")
+    .Output("device_name: string")
+    .SetShapeFn(shape_inference::UnknownShape);
 
 FunctionDef FindDevice() {
   return FDH::Define(
diff --git a/tensorflow/core/common_runtime/gpu/gpu_cudamalloc_allocator.h b/tensorflow/core/common_runtime/gpu/gpu_cudamalloc_allocator.h
index 208697361d2dfc4f3b8290ea511d15c9bd86857b..0a586344ccf2228a23059d68e7aa2d7a8f4eadba 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_cudamalloc_allocator.h
+++ b/tensorflow/core/common_runtime/gpu/gpu_cudamalloc_allocator.h
@@ -19,7 +19,7 @@ limitations under the License.
 #include <memory>
 
 #include "tensorflow/core/common_runtime/gpu/gpu_id.h"
-#include "tensorflow/core/common_runtime/visitable_allocator.h"
+#include "tensorflow/core/framework/visitable_allocator.h"
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/stream_executor.h"
 #include "tensorflow/core/platform/types.h"
diff --git a/tensorflow/core/common_runtime/gpu/gpu_debug_allocator.h b/tensorflow/core/common_runtime/gpu/gpu_debug_allocator.h
index adce3a84368ced958002443721016778cb6df028..0db08dc9759c9306ebd99b4acf4967128ef04895 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_debug_allocator.h
+++ b/tensorflow/core/common_runtime/gpu/gpu_debug_allocator.h
@@ -21,7 +21,7 @@ limitations under the License.
 #include <unordered_map>
 
 #include "tensorflow/core/common_runtime/gpu/gpu_id.h"
-#include "tensorflow/core/common_runtime/visitable_allocator.h"
+#include "tensorflow/core/framework/visitable_allocator.h"
 #include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/stream_executor.h"
 #include "tensorflow/core/platform/types.h"
diff --git a/tensorflow/core/common_runtime/gpu/gpu_device.cc b/tensorflow/core/common_runtime/gpu/gpu_device.cc
index 15ff15fd5ab28605c4ab0904e62305edc3815adb..8357cc5a7201b3b590c6965648eed72116167459 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_device.cc
+++ b/tensorflow/core/common_runtime/gpu/gpu_device.cc
@@ -1013,21 +1013,34 @@ Status BaseGPUDeviceFactory::CreateGPUDevice(const SessionOptions& options,
   GpuIdUtil::CheckValidTfGpuId(tf_gpu_id);
   CudaGpuId cuda_gpu_id = GpuIdManager::TfToCudaGpuId(tf_gpu_id);
   int numa_node = dev_locality.numa_node();
-  Bytes allocated_bytes = static_cast<Bytes>(memory_limit);
 
   gpu::StreamExecutor* se =
       GpuIdUtil::ExecutorForCudaGpuId(cuda_gpu_id).ValueOrDie();
   const gpu::DeviceDescription& desc = se->GetDeviceDescription();
-  LOG(INFO) << "Creating TensorFlow device (" << device_name << " with "
-            << (memory_limit >> 20) << " MB memory) -> physical GPU ("
-            << GetShortDeviceDescription(cuda_gpu_id, desc) << ")";
   ProcessState* process_state = ProcessState::singleton();
+  Allocator* gpu_allocator = process_state->GetGPUAllocator(
+      options.config.gpu_options(), tf_gpu_id, memory_limit);
+  if (gpu_allocator == nullptr) {
+    return errors::Internal("Failed to get memory allocator for TF GPU ",
+                            tf_gpu_id.value(), " with ", memory_limit,
+                            " bytes of memory.");
+  }
+  AllocatorStats stats;
+  gpu_allocator->GetStats(&stats);
+  // 'memory_limit' is the requested memory size. If an allocator for this
+  // tf_gpu_id was already created, it is reused rather than recreated (a TF
+  // GPU device is a shared resource), so the limit actually in effect,
+  // 'stats.bytes_limit', may differ from 'memory_limit'. Such a mismatch
+  // should be treated as an error.
+  //
+  // TODO(laigd): report error if memory_limit doesn't match stats.bytes_limit.
   BaseGPUDevice* gpu_device = CreateGPUDevice(
-      options, device_name, allocated_bytes, dev_locality, tf_gpu_id,
-      GetShortDeviceDescription(cuda_gpu_id, desc),
-      process_state->GetGPUAllocator(options.config.gpu_options(), tf_gpu_id,
-                                     memory_limit),
+      options, device_name, static_cast<Bytes>(stats.bytes_limit), dev_locality,
+      tf_gpu_id, GetShortDeviceDescription(cuda_gpu_id, desc), gpu_allocator,
       process_state->GetCPUAllocator(numa_node));
+  LOG(INFO) << "Created TensorFlow device (" << device_name << " with "
+            << (stats.bytes_limit >> 20) << " MB memory) -> physical GPU ("
+            << GetShortDeviceDescription(cuda_gpu_id, desc) << ")";
   TF_RETURN_IF_ERROR(gpu_device->Init(options));
   devices->push_back(gpu_device);
 
diff --git a/tensorflow/core/common_runtime/gpu/gpu_device.h b/tensorflow/core/common_runtime/gpu/gpu_device.h
index c88daa8ff87589a3fc48f4c7693d073d6adf9a5a..d817c7dd1f3af5656e48c3b2a0420270a7938447 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_device.h
+++ b/tensorflow/core/common_runtime/gpu/gpu_device.h
@@ -68,7 +68,7 @@ class BaseGPUDevice : public LocalDevice {
       const TensorReferenceVector& tensor_refs) override;
 
   Status FillContextMap(const Graph* graph,
-                        DeviceContextMap* device_context_map);
+                        DeviceContextMap* device_context_map) override;
 
   void Compute(OpKernel* op_kernel, OpKernelContext* context) override;
 
diff --git a/tensorflow/core/common_runtime/gpu/gpu_device_test.cc b/tensorflow/core/common_runtime/gpu/gpu_device_test.cc
index b56823204afe8ee52e0ea376b1a79d91d6932fa0..f3935f6ba26c49a9967d0848bfb6d965c73d2fab 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_device_test.cc
+++ b/tensorflow/core/common_runtime/gpu/gpu_device_test.cc
@@ -18,42 +18,48 @@ limitations under the License.
 #include "tensorflow/core/common_runtime/gpu/gpu_device.h"
 
 #include "tensorflow/core/common_runtime/gpu/gpu_init.h"
+#include "tensorflow/core/common_runtime/gpu/process_state.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/test.h"
 
 namespace tensorflow {
-namespace {
 const char* kDeviceNamePrefix = "/job:localhost/replica:0/task:0";
 
-static SessionOptions MakeSessionOptions(
-    const string& visible_device_list = "",
-    double per_process_gpu_memory_fraction = 0, int gpu_device_count = 1,
-    const std::vector<std::vector<float>>& memory_limit_mb = {}) {
-  SessionOptions options;
-  ConfigProto* config = &options.config;
-  (*config->mutable_device_count())["GPU"] = gpu_device_count;
-  GPUOptions* gpu_options = config->mutable_gpu_options();
-  gpu_options->set_visible_device_list(visible_device_list);
-  gpu_options->set_per_process_gpu_memory_fraction(
-      per_process_gpu_memory_fraction);
-  for (const auto& v : memory_limit_mb) {
-    auto virtual_devices =
-        gpu_options->mutable_experimental()->add_virtual_devices();
-    for (float mb : v) {
-      virtual_devices->add_memory_limit_mb(mb);
+class GPUDeviceTest : public ::testing::Test {
+ public:
+  void TearDown() override { ProcessState::singleton()->TestOnlyReset(); }
+
+ protected:
+  static SessionOptions MakeSessionOptions(
+      const string& visible_device_list = "",
+      double per_process_gpu_memory_fraction = 0, int gpu_device_count = 1,
+      const std::vector<std::vector<float>>& memory_limit_mb = {}) {
+    SessionOptions options;
+    ConfigProto* config = &options.config;
+    (*config->mutable_device_count())["GPU"] = gpu_device_count;
+    GPUOptions* gpu_options = config->mutable_gpu_options();
+    gpu_options->set_visible_device_list(visible_device_list);
+    gpu_options->set_per_process_gpu_memory_fraction(
+        per_process_gpu_memory_fraction);
+    for (const auto& v : memory_limit_mb) {
+      auto virtual_devices =
+          gpu_options->mutable_experimental()->add_virtual_devices();
+      for (float mb : v) {
+        virtual_devices->add_memory_limit_mb(mb);
+      }
     }
+    return options;
   }
-  return options;
-}
 
-static bool StartsWith(const string& lhs, const string& rhs) {
-  if (rhs.length() > lhs.length()) return false;
-  return lhs.substr(0, rhs.length()) == rhs;
-}
+  static bool StartsWith(const string& lhs, const string& rhs) {
+    if (rhs.length() > lhs.length()) return false;
+    return lhs.substr(0, rhs.length()) == rhs;
+  }
+};
 
-TEST(GPUDeviceTest, FailedToParseVisibleDeviceList) {
+TEST_F(GPUDeviceTest, FailedToParseVisibleDeviceList) {
   SessionOptions opts = MakeSessionOptions("0,abc");
   std::vector<tensorflow::Device*> devices;
   Status status = DeviceFactory::GetFactory("GPU")->CreateDevices(
@@ -63,7 +69,7 @@ TEST(GPUDeviceTest, FailedToParseVisibleDeviceList) {
       << status;
 }
 
-TEST(GPUDeviceTest, InvalidGpuId) {
+TEST_F(GPUDeviceTest, InvalidGpuId) {
   SessionOptions opts = MakeSessionOptions("100");
   std::vector<tensorflow::Device*> devices;
   Status status = DeviceFactory::GetFactory("GPU")->CreateDevices(
@@ -74,7 +80,7 @@ TEST(GPUDeviceTest, InvalidGpuId) {
       << status;
 }
 
-TEST(GPUDeviceTest, DuplicateEntryInVisibleDeviceList) {
+TEST_F(GPUDeviceTest, DuplicateEntryInVisibleDeviceList) {
   SessionOptions opts = MakeSessionOptions("0,0");
   std::vector<tensorflow::Device*> devices;
   Status status = DeviceFactory::GetFactory("GPU")->CreateDevices(
@@ -85,7 +91,7 @@ TEST(GPUDeviceTest, DuplicateEntryInVisibleDeviceList) {
       << status;
 }
 
-TEST(GPUDeviceTest, VirtualDeviceConfigConflictsWithMemoryFractionSettings) {
+TEST_F(GPUDeviceTest, VirtualDeviceConfigConflictsWithMemoryFractionSettings) {
   SessionOptions opts = MakeSessionOptions("0", 0.1, 1, {{}});
   std::vector<tensorflow::Device*> devices;
   Status status = DeviceFactory::GetFactory("GPU")->CreateDevices(
@@ -96,7 +102,7 @@ TEST(GPUDeviceTest, VirtualDeviceConfigConflictsWithMemoryFractionSettings) {
       << status;
 }
 
-TEST(GPUDeviceTest, GpuDeviceCountTooSmall) {
+TEST_F(GPUDeviceTest, GpuDeviceCountTooSmall) {
   // device_count is 0, but with one entry in visible_device_list and one
   // (empty) VirtualDevices messages.
   SessionOptions opts = MakeSessionOptions("0", 0, 0, {{}});
@@ -109,7 +115,7 @@ TEST(GPUDeviceTest, GpuDeviceCountTooSmall) {
       << status;
 }
 
-TEST(GPUDeviceTest, NotEnoughGpuInVisibleDeviceList) {
+TEST_F(GPUDeviceTest, NotEnoughGpuInVisibleDeviceList) {
   // Single entry in visible_device_list with two (empty) VirtualDevices
   // messages.
   SessionOptions opts = MakeSessionOptions("0", 0, 8, {{}, {}});
@@ -122,7 +128,7 @@ TEST(GPUDeviceTest, NotEnoughGpuInVisibleDeviceList) {
       << status;
 }
 
-TEST(GPUDeviceTest, VirtualDeviceConfigConflictsWithVisibleDeviceList) {
+TEST_F(GPUDeviceTest, VirtualDeviceConfigConflictsWithVisibleDeviceList) {
   // This test requires at least two visible GPU hardware.
   if (GPUMachineManager()->VisibleDeviceCount() < 2) return;
   // Three entries in visible_device_list with two (empty) VirtualDevices
@@ -139,7 +145,7 @@ TEST(GPUDeviceTest, VirtualDeviceConfigConflictsWithVisibleDeviceList) {
       << status;
 }
 
-TEST(GPUDeviceTest, EmptyVirtualDeviceConfig) {
+TEST_F(GPUDeviceTest, EmptyVirtualDeviceConfig) {
   // It'll create single virtual device when the virtual device config is empty.
   SessionOptions opts = MakeSessionOptions("0");
   std::vector<tensorflow::Device*> devices;
@@ -150,7 +156,7 @@ TEST(GPUDeviceTest, EmptyVirtualDeviceConfig) {
   for (auto d : devices) delete d;
 }
 
-TEST(GPUDeviceTest, SingleVirtualDeviceWithNoMemoryLimit) {
+TEST_F(GPUDeviceTest, SingleVirtualDeviceWithNoMemoryLimit) {
   // It'll create single virtual device for the gpu in question when
   // memory_limit_mb is unset.
   SessionOptions opts = MakeSessionOptions("0", 0, 1, {{}});
@@ -162,7 +168,7 @@ TEST(GPUDeviceTest, SingleVirtualDeviceWithNoMemoryLimit) {
   for (auto d : devices) delete d;
 }
 
-TEST(GPUDeviceTest, SingleVirtualDeviceWithMemoryLimit) {
+TEST_F(GPUDeviceTest, SingleVirtualDeviceWithMemoryLimit) {
   SessionOptions opts = MakeSessionOptions("0", 0, 1, {{123}});
   std::vector<tensorflow::Device*> devices;
   TF_CHECK_OK(DeviceFactory::GetFactory("GPU")->CreateDevices(
@@ -172,7 +178,7 @@ TEST(GPUDeviceTest, SingleVirtualDeviceWithMemoryLimit) {
   for (auto d : devices) delete d;
 }
 
-TEST(GPUDeviceTest, MultipleVirtualDevices) {
+TEST_F(GPUDeviceTest, MultipleVirtualDevices) {
   SessionOptions opts = MakeSessionOptions("0", 0, 1, {{123, 456}});
   std::vector<tensorflow::Device*> devices;
   TF_CHECK_OK(DeviceFactory::GetFactory("GPU")->CreateDevices(
@@ -195,7 +201,6 @@ TEST(GPUDeviceTest, MultipleVirtualDevices) {
   for (auto d : devices) delete d;
 }
 
-}  // namespace
 }  // namespace tensorflow
 
 #endif
diff --git a/tensorflow/core/common_runtime/gpu/gpu_id_manager.cc b/tensorflow/core/common_runtime/gpu/gpu_id_manager.cc
index 207afdca75642b14c1617c8abae4fd5e9916f020..7dfff3269cf91582adf783dcd15dd55d1c4e1451 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_id_manager.cc
+++ b/tensorflow/core/common_runtime/gpu/gpu_id_manager.cc
@@ -18,7 +18,10 @@ limitations under the License.
 #include <unordered_map>
 
 #include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/mutex.h"
 
 namespace tensorflow {
@@ -27,8 +30,8 @@ namespace {
 class TfToCudaGpuIdMap {
  public:
   static TfToCudaGpuIdMap* singleton() {
-    static auto* manager = new TfToCudaGpuIdMap;
-    return manager;
+    static auto* id_map = new TfToCudaGpuIdMap;
+    return id_map;
   }
 
   void InsertOrDie(TfGpuId tf_gpu_id, CudaGpuId cuda_gpu_id)
@@ -47,18 +50,41 @@ class TfToCudaGpuIdMap {
     }
   }
 
-  int32 FindOrDie(TfGpuId tf_gpu_id) const LOCKS_EXCLUDED(mu_) {
+  CudaGpuId FindOrDie(TfGpuId tf_gpu_id) const LOCKS_EXCLUDED(mu_) {
     mutex_lock lock(mu_);
+    return FindOrDieLocked(tf_gpu_id);
+  }
+
+  bool Find(TfGpuId tf_gpu_id, CudaGpuId* cuda_gpu_id) const
+      LOCKS_EXCLUDED(mu_) {
+    mutex_lock lock(mu_);
+    if (id_map_.count(tf_gpu_id.value()) == 0) return false;
+    *cuda_gpu_id = FindOrDieLocked(tf_gpu_id);
+    return true;
+  }
+
+ private:
+  TfToCudaGpuIdMap() = default;
+
+  CudaGpuId FindOrDieLocked(TfGpuId tf_gpu_id) const
+      EXCLUSIVE_LOCKS_REQUIRED(mu_) {
     auto result = id_map_.find(tf_gpu_id.value());
     CHECK(result != id_map_.end())
         << "Could not find the mapping for TfGpuId: " << tf_gpu_id;
-    return result->second;
+    return CudaGpuId(result->second);
+  }
+
+  void TestOnlyReset() LOCKS_EXCLUDED(mu_) {
+    mutex_lock lock(mu_);
+    id_map_.clear();
   }
 
- private:
   using IdMapType = std::unordered_map<int32, int32>;
   mutable mutex mu_;
   IdMapType id_map_ GUARDED_BY(mu_);
+
+  friend class ::tensorflow::GpuIdManager;
+  TF_DISALLOW_COPY_AND_ASSIGN(TfToCudaGpuIdMap);
 };
 }  // namespace
 
@@ -67,8 +93,20 @@ void GpuIdManager::InsertTfCudaGpuIdPair(TfGpuId tf_gpu_id,
   TfToCudaGpuIdMap::singleton()->InsertOrDie(tf_gpu_id, cuda_gpu_id);
 }
 
+Status GpuIdManager::TfToCudaGpuId(TfGpuId tf_gpu_id, CudaGpuId* cuda_gpu_id) {
+  if (TfToCudaGpuIdMap::singleton()->Find(tf_gpu_id, cuda_gpu_id)) {
+    return Status::OK();
+  }
+  return errors::NotFound("TF GPU device with id ", tf_gpu_id.value(),
+                          " was not registered");
+}
+
 CudaGpuId GpuIdManager::TfToCudaGpuId(TfGpuId tf_gpu_id) {
-  return CudaGpuId(TfToCudaGpuIdMap::singleton()->FindOrDie(tf_gpu_id));
+  return TfToCudaGpuIdMap::singleton()->FindOrDie(tf_gpu_id);
+}
+
+void GpuIdManager::TestOnlyReset() {
+  TfToCudaGpuIdMap::singleton()->TestOnlyReset();
 }
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/gpu/gpu_id_manager.h b/tensorflow/core/common_runtime/gpu/gpu_id_manager.h
index 33925d8c36f44a9d2c7abc8f2801f3f203bcb982..2b54cc184ca508b94e2a715642cdb13fe8a4c3e1 100644
--- a/tensorflow/core/common_runtime/gpu/gpu_id_manager.h
+++ b/tensorflow/core/common_runtime/gpu/gpu_id_manager.h
@@ -17,15 +17,25 @@ limitations under the License.
 #define TENSORFLOW_CORE_COMMON_RUNTIME_GPU_GPU_ID_MANAGER_H_
 
 #include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/lib/core/status.h"
 
 namespace tensorflow {
 
-// Class that manages the translation between Tensorflow GPU ids and CUDA GPU
-// ids.
+// Class that maintains a map from TfGpuId to CudaGpuId, and manages the
+// translation between them.
 class GpuIdManager {
  public:
+  // Adds a mapping from tf_gpu_id to cuda_gpu_id.
   static void InsertTfCudaGpuIdPair(TfGpuId tf_gpu_id, CudaGpuId cuda_gpu_id);
+
+  // Looks up the cuda_gpu_id for tf_gpu_id. Returns OK if found, else NotFound.
+  static Status TfToCudaGpuId(TfGpuId tf_gpu_id, CudaGpuId* cuda_gpu_id);
+  // Similar to the above version, but returns the result directly and
+  // CHECK-fails if no mapping is found.
   static CudaGpuId TfToCudaGpuId(TfGpuId tf_gpu_id);
+
+  // Clears the map. Used in unit tests only.
+  static void TestOnlyReset();
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/gpu/pool_allocator.h b/tensorflow/core/common_runtime/gpu/pool_allocator.h
index 91ce830df8521e7fe8284dd3c52d1bbf667891cd..38d669ea07c91bc1a892ecf925b3141f2ca506dd 100644
--- a/tensorflow/core/common_runtime/gpu/pool_allocator.h
+++ b/tensorflow/core/common_runtime/gpu/pool_allocator.h
@@ -24,7 +24,7 @@ limitations under the License.
 #include <map>
 #include <memory>
 #include <vector>
-#include "tensorflow/core/common_runtime/visitable_allocator.h"
+#include "tensorflow/core/framework/visitable_allocator.h"
 #include "tensorflow/core/lib/core/bits.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/macros.h"
diff --git a/tensorflow/core/common_runtime/gpu/process_state.cc b/tensorflow/core/common_runtime/gpu/process_state.cc
index 61013bd1acd254b6e927a8d41accaeda424d6ebc..866a03d04632c649fae278c4ab311e22ebf8dc31 100644
--- a/tensorflow/core/common_runtime/gpu/process_state.cc
+++ b/tensorflow/core/common_runtime/gpu/process_state.cc
@@ -29,6 +29,7 @@ limitations under the License.
 #include "tensorflow/core/framework/allocator.h"
 #include "tensorflow/core/framework/log_memory.h"
 #include "tensorflow/core/framework/tracking_allocator.h"
+#include "tensorflow/core/lib/gtl/stl_util.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
 #include "tensorflow/core/platform/mutex.h"
@@ -318,4 +319,17 @@ void ProcessState::AddGPUAllocVisitor(int bus_id, AllocVisitor visitor) {
 #endif  // GOOGLE_CUDA
 }
 
+void ProcessState::TestOnlyReset() {
+  mutex_lock lock(mu_);
+  gpu_device_enabled_ = false;
+  gpu_visitors_.clear();
+  mem_desc_map_.clear();
+  gtl::STLDeleteElements(&cpu_allocators_);
+  gtl::STLDeleteElements(&gpu_allocators_);
+  gtl::STLDeleteElements(&cuda_host_allocators_);
+  gtl::STLDeleteElements(&cpu_al_);
+  gtl::STLDeleteElements(&gpu_al_);
+  gtl::STLDeleteElements(&cuda_al_);
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/common_runtime/gpu/process_state.h b/tensorflow/core/common_runtime/gpu/process_state.h
index f6e234967306476542cec3038ea2e271cca2dc8c..bc2c4182d72334e26d387397e564dbf02cfa3ae4 100644
--- a/tensorflow/core/common_runtime/gpu/process_state.h
+++ b/tensorflow/core/common_runtime/gpu/process_state.h
@@ -114,6 +114,10 @@ class ProcessState {
  protected:
   ProcessState();
 
+  // Helper method for unit tests to reset the ProcessState singleton by
+  // cleaning up everything. Never use in production.
+  virtual void TestOnlyReset();
+
   static ProcessState* instance_;
   bool gpu_device_enabled_;
 
@@ -132,6 +136,8 @@ class ProcessState {
   std::vector<Allocator*> cpu_al_ GUARDED_BY(mu_);
   std::vector<Allocator*> gpu_al_ GUARDED_BY(mu_);
   std::vector<Allocator*> cuda_al_ GUARDED_BY(mu_);
+
+  friend class GPUDeviceTest;
 };
 
 namespace internal {
diff --git a/tensorflow/core/common_runtime/graph_execution_state.cc b/tensorflow/core/common_runtime/graph_execution_state.cc
index 33a5d60eb7ec4de829d3c0784f909ef42cf994d1..785ec3d2276d3e90dae6e8eb657687e5180873f4 100644
--- a/tensorflow/core/common_runtime/graph_execution_state.cc
+++ b/tensorflow/core/common_runtime/graph_execution_state.cc
@@ -73,6 +73,10 @@ GraphExecutionState::~GraphExecutionState() {
 /* static */ Status GraphExecutionState::MakeForBaseGraph(
     GraphDef* graph_def, const GraphExecutionStateOptions& options,
     std::unique_ptr<GraphExecutionState>* out_state) {
+#ifndef __ANDROID__
+  VLOG(1) << "Graph proto is " << graph_def->DebugString();
+#endif  // __ANDROID__
+
   std::unique_ptr<GraphExecutionState> ret(
       new GraphExecutionState(graph_def, options));
 
diff --git a/tensorflow/core/common_runtime/mkl_cpu_allocator.cc b/tensorflow/core/common_runtime/mkl_cpu_allocator.cc
new file mode 100644
index 0000000000000000000000000000000000000000..43a909466ed4b6fe6ea32b1ad72a1154390288ac
--- /dev/null
+++ b/tensorflow/core/common_runtime/mkl_cpu_allocator.cc
@@ -0,0 +1,27 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifdef INTEL_MKL
+
+#include "tensorflow/core/common_runtime/mkl_cpu_allocator.h"
+
+namespace tensorflow {
+
+constexpr const char* MklCPUAllocator::kMaxLimitStr;
+constexpr const size_t MklCPUAllocator::kDefaultMaxLimit;
+
+}  // namespace tensorflow
+
+#endif  // INTEL_MKL
diff --git a/tensorflow/core/common_runtime/mkl_cpu_allocator.h b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
index 71e0de9724680cfcc012ae04782b90b867e0095b..55c8411ad017dd8a2e64309bc426d96852a2a696 100644
--- a/tensorflow/core/common_runtime/mkl_cpu_allocator.h
+++ b/tensorflow/core/common_runtime/mkl_cpu_allocator.h
@@ -24,7 +24,7 @@ limitations under the License.
 #include <cstdlib>
 #include <string>
 #include "tensorflow/core/common_runtime/bfc_allocator.h"
-#include "tensorflow/core/common_runtime/visitable_allocator.h"
+#include "tensorflow/core/framework/visitable_allocator.h"
 #include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/str_util.h"
 #include "tensorflow/core/platform/mem.h"
@@ -53,7 +53,7 @@ class MklCPUAllocator : public VisitableAllocator {
   static constexpr const char* kMaxLimitStr = "TF_MKL_ALLOC_MAX_BYTES";
 
   /// Default upper limit on allocator size - 64GB
-  static const size_t kDefaultMaxLimit = 64LL << 30;
+  static constexpr size_t kDefaultMaxLimit = 64LL << 30;
 
   MklCPUAllocator() { TF_CHECK_OK(Initialize()); }
 
@@ -158,7 +158,7 @@ class MklCPUAllocator : public VisitableAllocator {
   static constexpr const char* kName = "mklcpu";
 
   /// The alignment that we need for the allocations
-  static const size_t kAlignment = 64;
+  static constexpr const size_t kAlignment = 64;
 
   VisitableAllocator* allocator_;  // owned by this class
 };
diff --git a/tensorflow/core/common_runtime/placer.cc b/tensorflow/core/common_runtime/placer.cc
index a913f2075181a3896015579d79093395d67101ff..e128b9257f2369e25c911f9a9e1d08475706d561 100644
--- a/tensorflow/core/common_runtime/placer.cc
+++ b/tensorflow/core/common_runtime/placer.cc
@@ -464,6 +464,7 @@ class ColocationGraph {
     // the user can see why an unsatisfiable placement occurred.
 
     std::unordered_map<string, string> type_to_devices;
+    std::vector<const Node*> colocation_nodes;
     int num_nodes_found = 0;
 
     for (const Node* node : graph_->nodes()) {
@@ -475,6 +476,7 @@ class ColocationGraph {
         continue;
       }
       ++num_nodes_found;
+      colocation_nodes.push_back(node);
       const string& op_type = node->type_string();
       string devices_registered;
       for (const auto& device_type : members_[id].supported_device_types) {
@@ -488,6 +490,13 @@ class ColocationGraph {
     for (const auto& td : type_to_devices) {
       strings::StrAppend(&text, "\n", td.first, ": ", td.second);
     }
+    strings::StrAppend(&text,
+                       "\n\nColocation members and user-requested devices:");
+    for (const Node* node : colocation_nodes) {
+      strings::StrAppend(&text, "\n  ", node->name(), " (", node->type_string(),
+                         ") ", node->requested_device());
+    }
+    strings::StrAppend(&text, "\n");
 
     if (num_nodes_found <= 1) {
       text.clear();
diff --git a/tensorflow/core/common_runtime/shape_refiner.cc b/tensorflow/core/common_runtime/shape_refiner.cc
index 45cdab98e0642a3fbfee3dfa415696b98251600a..2acaa31d32de40148bd88021eb0613f0fb8522ff 100644
--- a/tensorflow/core/common_runtime/shape_refiner.cc
+++ b/tensorflow/core/common_runtime/shape_refiner.cc
@@ -211,14 +211,14 @@ Status ShapeRefiner::AddNode(const Node* node) {
   // For each 'input' of this node, fetch the corresponding shape
   // from 'input's InferenceContext, and store into a vector
   // indexed by 'node's input.
-  std::vector<Node*> input_nodes(node->num_inputs());
+  std::vector<const Node*> input_nodes(node->num_inputs());
   std::vector<ShapeHandle> input_shapes(node->num_inputs());
   std::vector<std::unique_ptr<std::vector<ShapeAndType>>>
       input_handle_shapes_and_types(node->num_inputs());
   for (const Edge* e : node->in_edges()) {
     if (e->IsControlEdge()) continue;
 
-    Node* input = e->src();
+    const Node* input = e->src();
     auto it = node_to_context_.find(input);
     if (it == node_to_context_.end()) {
       return errors::FailedPrecondition(
diff --git a/tensorflow/core/common_runtime/step_stats_collector.cc b/tensorflow/core/common_runtime/step_stats_collector.cc
index cb900db10af98496cfdfafa5a38296bfdc4e996b..f21536d586edcca2cec9257579db9ca616f36a6c 100644
--- a/tensorflow/core/common_runtime/step_stats_collector.cc
+++ b/tensorflow/core/common_runtime/step_stats_collector.cc
@@ -226,13 +226,14 @@ void StepStatsCollector::BuildCostModel(
       if (node) {
         for (int i = 0; i < stats.output_size(); ++i) {
           const auto& output = stats.output(i);
-          cm->RecordMaxMemorySize(node, i,
+          int output_slot = output.slot();
+          cm->RecordMaxMemorySize(node, output_slot,
                                   Bytes(output.tensor_description()
                                             .allocation_description()
                                             .allocated_bytes()),
-                                  stats.output(i).tensor_description().shape(),
-                                  node->output_types()[i]);
-          cm->RecordAllocationId(node, i,
+                                  output.tensor_description().shape(),
+                                  node->output_types()[output_slot]);
+          cm->RecordAllocationId(node, output_slot,
                                  output.tensor_description()
                                      .allocation_description()
                                      .allocation_id());
diff --git a/tensorflow/core/debug/BUILD b/tensorflow/core/debug/BUILD
index 40cb8353cdccb4307f09b537ff7016e3dca5a8da..f6fe9edb022dce29286190e9948f385b933c5a07 100644
--- a/tensorflow/core/debug/BUILD
+++ b/tensorflow/core/debug/BUILD
@@ -298,6 +298,9 @@ tf_cc_test(
     size = "small",
     srcs = ["debug_grpc_io_utils_test.cc"],
     linkstatic = tf_kernel_tests_linkstatic(),
+    tags = [
+        "no_oss",  # b/73962011
+    ],
     deps = [
         ":debug_graph_utils",
         ":debug_grpc_testlib",
diff --git a/tensorflow/core/distributed_runtime/BUILD b/tensorflow/core/distributed_runtime/BUILD
index 9e152aa0823b67fceb7f103cc6e090f00870f88a..434626bd2da57ce4c4895017c0bb0abef58c6f44 100644
--- a/tensorflow/core/distributed_runtime/BUILD
+++ b/tensorflow/core/distributed_runtime/BUILD
@@ -595,6 +595,7 @@ tf_cc_test(
     srcs = ["recent_request_ids_test.cc"],
     deps = [
         ":recent_request_ids",
+        ":request_id",
         "//tensorflow/core:lib",
         "//tensorflow/core:test",
         "//tensorflow/core:test_main",
diff --git a/tensorflow/core/distributed_runtime/cluster_function_library_runtime.cc b/tensorflow/core/distributed_runtime/cluster_function_library_runtime.cc
index 3a8d5912369525253904bd700dfdc6e3eb26e0ae..0c5c4d59edc8c73d6bcac3ce0f9ec0b77495fb58 100644
--- a/tensorflow/core/distributed_runtime/cluster_function_library_runtime.cc
+++ b/tensorflow/core/distributed_runtime/cluster_function_library_runtime.cc
@@ -175,32 +175,33 @@ void ClusterFunctionLibraryRuntime::Run(
     return;
   }
 
-  RunGraphRequest req;
-  req.set_session_handle(worker_session_->session_name);
-  req.set_graph_handle(function_data->graph_handle);
+  RunGraphRequest* req = new RunGraphRequest;
+  req->set_session_handle(worker_session_->session_name);
+  req->set_graph_handle(function_data->graph_handle);
   // Borrowed from master_session.cc
   const uint64 step_id = (random::New64() & ((1uLL << 56) - 1)) | (1uLL << 56);
-  req.set_step_id(step_id);
+  req->set_step_id(step_id);
   int i = 0;
   for (const auto& send_key : function_data->send_keys) {
-    NamedTensorProto* send = req.add_send();
+    NamedTensorProto* send = req->add_send();
     send->set_name(send_key);
     args[i].AsProtoTensorContent(send->mutable_tensor());
     i++;
   }
   const std::vector<string>& recv_keys = function_data->recv_keys;
   for (const auto& recv_key : recv_keys) {
-    req.add_recv_key(recv_key);
+    req->add_recv_key(recv_key);
   }
 
   RunGraphResponse* resp = new RunGraphResponse();
   CallOptions* call_options = new CallOptions();
   wi->RunGraphAsync(
-      call_options, &req, resp,
-      [call_options, resp, rets, recv_keys, done](const Status& status) {
+      call_options, req, resp,
+      [call_options, req, resp, rets, recv_keys, done](const Status& status) {
         if (!status.ok()) {
           done(status);
           delete call_options;
+          delete req;
           delete resp;
           return;
         }
@@ -212,25 +213,28 @@ void ClusterFunctionLibraryRuntime::Run(
         for (const auto& recv_key : recv_keys) {
           TensorProto* tp = mapped_recvs[recv_key];
           if (tp == nullptr) {
+            done(errors::Internal("Could not find key: ", recv_key));
             delete call_options;
+            delete req;
             delete resp;
-            done(errors::Internal("Could not find key: ", recv_key));
             return;
           }
           Tensor t;
           if (t.FromProto(*tp)) {
             rets->push_back(t);
           } else {
-            delete call_options;
-            delete resp;
             done(errors::Internal("Could not convert tensor proto: ",
                                   tp->DebugString()));
+            delete call_options;
+            delete req;
+            delete resp;
             return;
           }
         }
+        done(status);
         delete call_options;
+        delete req;
         delete resp;
-        done(status);
       });
 }
 
diff --git a/tensorflow/core/distributed_runtime/recent_request_ids.cc b/tensorflow/core/distributed_runtime/recent_request_ids.cc
index c30879406c6924aa85ad4bf8279b278eaf5d29fd..4f6866c5d154ba023b0923af67fe00a7a69b459d 100644
--- a/tensorflow/core/distributed_runtime/recent_request_ids.cc
+++ b/tensorflow/core/distributed_runtime/recent_request_ids.cc
@@ -15,6 +15,8 @@ limitations under the License.
 
 #include "tensorflow/core/distributed_runtime/recent_request_ids.h"
 
+#include <utility>
+
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/logging.h"
@@ -29,12 +31,14 @@ RecentRequestIds::RecentRequestIds(int num_tracked_request_ids)
 Status RecentRequestIds::TrackUnique(int64 request_id,
                                      const string& method_name,
                                      const protobuf::Message& request) {
-  mutex_lock l(mu_);
   if (request_id == 0) {
     // For backwards compatibility, allow all requests with request_id 0.
     return Status::OK();
   }
-  if (set_.count(request_id) > 0) {
+
+  mutex_lock l(mu_);
+  const bool inserted = set_.insert(request_id).second;
+  if (!inserted) {
     // Note: RecentRequestIds is not strict LRU because we don't update
     // request_id's age in the circular_buffer_ if it's tracked again. Strict
     // LRU is not useful here because returning this error will close the
@@ -49,7 +53,6 @@ Status RecentRequestIds::TrackUnique(int64 request_id,
   // when the buffer is not yet full.
   set_.erase(circular_buffer_[next_index_]);
   circular_buffer_[next_index_] = request_id;
-  set_.insert(request_id);
   next_index_ = (next_index_ + 1) % circular_buffer_.size();
   return Status::OK();
 }
diff --git a/tensorflow/core/distributed_runtime/recent_request_ids.h b/tensorflow/core/distributed_runtime/recent_request_ids.h
index e8e45331dd5a26e2230bb92e8ce73888d3f28505..11cf937c94659d85e3dc88350f20e107a27fab62 100644
--- a/tensorflow/core/distributed_runtime/recent_request_ids.h
+++ b/tensorflow/core/distributed_runtime/recent_request_ids.h
@@ -16,11 +16,13 @@ limitations under the License.
 #ifndef TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RECENT_REQUEST_IDS_H_
 #define TENSORFLOW_CORE_DISTRIBUTED_RUNTIME_RECENT_REQUEST_IDS_H_
 
+#include <string>
+#include <unordered_set>
 #include <vector>
 
 #include "tensorflow/core/lib/core/status.h"
-#include "tensorflow/core/lib/gtl/flatset.h"
 #include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/platform/thread_annotations.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/protobuf/worker.pb.h"
@@ -64,7 +66,7 @@ class RecentRequestIds {
   // request_id.
   int next_index_ GUARDED_BY(mu_) = 0;
   std::vector<int64> circular_buffer_ GUARDED_BY(mu_);
-  gtl::FlatSet<int64> set_ GUARDED_BY(mu_);
+  std::unordered_set<int64> set_ GUARDED_BY(mu_);
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/distributed_runtime/recent_request_ids_test.cc b/tensorflow/core/distributed_runtime/recent_request_ids_test.cc
index 9a0facf5404bb4e6d0d57f55bcd1f2a4f4f99dba..8910a50e9cda691984d712ebfc5aea1d4f904d3f 100644
--- a/tensorflow/core/distributed_runtime/recent_request_ids_test.cc
+++ b/tensorflow/core/distributed_runtime/recent_request_ids_test.cc
@@ -17,8 +17,10 @@ limitations under the License.
 
 #include <algorithm>
 
+#include "tensorflow/core/distributed_runtime/request_id.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
 #include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/platform/test_benchmark.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/protobuf/worker.pb.h"
 
@@ -93,4 +95,15 @@ TEST(RecentRequestIds, Ordered3) { TestOrdered(3); }
 TEST(RecentRequestIds, Ordered4) { TestOrdered(4); }
 TEST(RecentRequestIds, Ordered5) { TestOrdered(5); }
 
+void BM_TrackUnique(int iters) {
+  RecentRequestIds recent_request_ids(100000);
+  RecvTensorRequest request;
+  for (int i = 0; i < iters; ++i) {
+    TF_CHECK_OK(recent_request_ids.TrackUnique(GetUniqueRequestId(),
+                                               "BM_TrackUnique", request));
+  }
+}
+
+BENCHMARK(BM_TrackUnique);
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
index c4ac92d809627e7134b5d4ae694f9978cd5390b4..a6f4be3eaf69f40199e64c43dff443e886aa5aa1 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc
@@ -106,7 +106,8 @@ GrpcServer::~GrpcServer() {
 Status GrpcServer::Init(
     ServiceInitFunction service_func,
     const RendezvousMgrCreationFunction& rendezvous_mgr_func,
-    const WorkerCreationFunction& worker_func) {
+    const WorkerCreationFunction& worker_func,
+    const StatsPublisherFactory& stats_factory) {
   mutex_lock l(mu_);
   CHECK_EQ(state_, NEW);
   master_env_.env = env_;
@@ -218,7 +219,7 @@ Status GrpcServer::Init(
   master_env_.ops = OpRegistry::Global();
   master_env_.worker_cache = worker_cache;
   master_env_.master_session_factory =
-      [config](
+      [config, stats_factory](
           SessionOptions options, const MasterEnv* env,
           std::unique_ptr<std::vector<std::unique_ptr<Device>>> remote_devs,
           std::unique_ptr<WorkerCacheInterface> worker_cache,
@@ -226,7 +227,7 @@ Status GrpcServer::Init(
         options.config.MergeFrom(config);
         return new MasterSession(options, env, std::move(remote_devs),
                                  std::move(worker_cache), std::move(device_set),
-                                 CreateNoOpStatsPublisher);
+                                 stats_factory);
       };
   master_env_.worker_cache_factory =
       [this](const WorkerCacheFactoryOptions& options,
@@ -241,6 +242,14 @@ Status GrpcServer::Init(
   return Status::OK();
 }
 
+Status GrpcServer::Init(
+    ServiceInitFunction service_func,
+    const RendezvousMgrCreationFunction& rendezvous_mgr_func,
+    const WorkerCreationFunction& worker_func) {
+  return Init(std::move(service_func), rendezvous_mgr_func, worker_func,
+              CreateNoOpStatsPublisher);
+}
+
 Status GrpcServer::Init(
     ServiceInitFunction service_func,
     const RendezvousMgrCreationFunction& rendezvous_mgr_func) {
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h
index 8b12ac1461d6b1fa3098197aa7697031a5d3075b..7c2f06f618a85c901ce7a7902cb8b1bc4e57be40 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_server_lib.h
@@ -22,6 +22,7 @@ limitations under the License.
 #include "grpc++/security/credentials.h"
 
 #include "tensorflow/core/common_runtime/process_util.h"
+#include "tensorflow/core/common_runtime/stats_publisher_interface.h"
 #include "tensorflow/core/distributed_runtime/master_env.h"
 #include "tensorflow/core/distributed_runtime/rpc/async_service_interface.h"
 #include "tensorflow/core/distributed_runtime/rpc/grpc_channel.h"
@@ -68,6 +69,11 @@ class GrpcServer : public ServerInterface {
   const string target() const override;
 
  protected:
+  Status Init(ServiceInitFunction service_func,
+              const RendezvousMgrCreationFunction& rendezvous_mgr_func,
+              const WorkerCreationFunction& worker_func,
+              const StatsPublisherFactory& stats_factory);
+
   Status Init(ServiceInitFunction service_func,
               const RendezvousMgrCreationFunction& rendezvous_mgr_func,
               const WorkerCreationFunction& worker_func);
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_worker_cache.cc b/tensorflow/core/distributed_runtime/rpc/grpc_worker_cache.cc
index 2ed07e3669a3badd82b8ef27f45bac2b712c8978..bb14e0197b7b0ea44c4a75528f4919045574f4c5 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_worker_cache.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_worker_cache.cc
@@ -34,7 +34,7 @@ namespace {
 class GrpcWorkerCache : public WorkerCachePartial {
  public:
   // TODO(ncteisen): consider adding a config var or flag for this
-  static constexpr const size_t kGrpcWorkerCacheThreadCount = 2;
+  static constexpr const size_t kGrpcWorkerCacheThreadCount = 8;
 
   explicit GrpcWorkerCache(GrpcChannelCache* channel_cache,
                            WorkerInterface* local_worker,
diff --git a/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc b/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc
index 1beb198732ad40ed9e21f66c665ff82a231eebb6..b20e744a97160a17cd1621b38475a7c9c4f81d8f 100644
--- a/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc
+++ b/tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc
@@ -52,7 +52,7 @@ namespace {
 
 class GrpcWorkerService : public AsyncServiceInterface {
   // TODO(ncteisen): consider adding a config var or flag for this
-  static constexpr const size_t kGrpcWorkerServiceThreadCount = 2;
+  static constexpr const size_t kGrpcWorkerServiceThreadCount = 8;
 
  public:
   GrpcWorkerService(GrpcWorker* worker, ::grpc::ServerBuilder* builder)
diff --git a/tensorflow/core/distributed_runtime/scheduler.cc b/tensorflow/core/distributed_runtime/scheduler.cc
index 9dae5b3b926fab14c2b36955436d3956baa29fdd..84036361971b73f9fb7fe990833d5018f6321e27 100644
--- a/tensorflow/core/distributed_runtime/scheduler.cc
+++ b/tensorflow/core/distributed_runtime/scheduler.cc
@@ -80,7 +80,7 @@ Microseconds SlackAnalysis::ComputeAsap(std::vector<Microseconds>* asap_times) {
   std::vector<int> pending_count(graph_->num_node_ids());
   InitializePending(graph_, &pending_count);
 
-  std::deque<Node*> queue;
+  std::deque<const Node*> queue;
   Node* srcNode = graph_->source_node();
   queue.push_back(srcNode);
   (*asap_times)[srcNode->id()] = 0;
@@ -92,7 +92,7 @@ Microseconds SlackAnalysis::ComputeAsap(std::vector<Microseconds>* asap_times) {
     for (const Edge* out_edge : curr->out_edges()) {
       // The time needed for 'out' to get its input from 'curr'.
       Microseconds copy_time(0);
-      Node* out = out_edge->dst();
+      const Node* out = out_edge->dst();
       if (!out_edge->IsControlEdge() &&
           curr->assigned_device_name() != out->assigned_device_name()) {
         // Add an arbitrary 10microsecs for each copy.
@@ -137,7 +137,7 @@ Microseconds SlackAnalysis::ComputeAlap(std::vector<Microseconds>* alap_times) {
     }
   }
 
-  std::deque<Node*> queue;
+  std::deque<const Node*> queue;
   Node* sinkNode = graph_->sink_node();
   queue.push_back(sinkNode);
   (*alap_times)[sinkNode->id()] = 0;
@@ -148,7 +148,7 @@ Microseconds SlackAnalysis::ComputeAlap(std::vector<Microseconds>* alap_times) {
     for (const Edge* in_edge : curr->in_edges()) {
       // The time needed for 'curr' to get its input from 'src'.
       Microseconds copy_time(0);
-      Node* src = in_edge->src();
+      const Node* src = in_edge->src();
       if (!in_edge->IsControlEdge() &&
           src->assigned_device_name() != curr->assigned_device_name()) {
         // TODO(yuanbyu): Use the real cost model
@@ -236,7 +236,7 @@ Microseconds GreedyScheduler::ComputeSchedule(
 
       for (const Edge* out_edge : event.node->out_edges()) {
         Microseconds copy_time(0);
-        Node* out = out_edge->dst();
+        const Node* out = out_edge->dst();
         if (!out_edge->IsControlEdge() &&
             event.node->assigned_device_name() != out->assigned_device_name()) {
           // TODO(yuanbyu): Use below with the real cost model.
@@ -277,11 +277,11 @@ Microseconds GreedyScheduler::ComputeSchedule(
   return max_completion;
 }
 
-Node* GreedyScheduler::GetNodeWithHighestPriority(
-    const std::vector<Node*>& nodes) {
-  Node* curr_node = nullptr;
+const Node* GreedyScheduler::GetNodeWithHighestPriority(
+    const std::vector<const Node*>& nodes) {
+  const Node* curr_node = nullptr;
   int64 curr_priority = kint64max;
-  for (Node* n : nodes) {
+  for (const Node* n : nodes) {
     if ((*priority_)[n->id()] < curr_priority) {
       curr_node = n;
       curr_priority = (*priority_)[n->id()];
diff --git a/tensorflow/core/distributed_runtime/scheduler.h b/tensorflow/core/distributed_runtime/scheduler.h
index ef87b9834dba50cf628a8c29c70b0266661d6227..bf9d0d1bec33284a44f69412477edb4a0963e8a1 100644
--- a/tensorflow/core/distributed_runtime/scheduler.h
+++ b/tensorflow/core/distributed_runtime/scheduler.h
@@ -57,11 +57,11 @@ class GreedyScheduler {
   struct Sim {
     int degree_parallelism;
     int num_running;
-    std::vector<Node*> ready_nodes;
+    std::vector<const Node*> ready_nodes;
   };
 
   struct Event {
-    Node* node;
+    const Node* node;
     Microseconds time;
     bool is_completion;
 
@@ -79,7 +79,7 @@ class GreedyScheduler {
 
  private:
   // Returns the ready node with the highest priority for a sim.
-  Node* GetNodeWithHighestPriority(const std::vector<Node*>& nodes);
+  const Node* GetNodeWithHighestPriority(const std::vector<const Node*>& nodes);
 
   const DeviceSet* devices_;
   const CostModel* cost_model_;
diff --git a/tensorflow/core/framework/allocator.cc b/tensorflow/core/framework/allocator.cc
index 94bf34afa49f586e1bb61c1654865a5abc9abe19..a382b8be95f143898a8f52f887b9396f3823372b 100644
--- a/tensorflow/core/framework/allocator.cc
+++ b/tensorflow/core/framework/allocator.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/core/framework/allocator.h"
+#include "tensorflow/core/framework/visitable_allocator.h"
 
 #include "tensorflow/core/framework/allocator_registry.h"
 #include "tensorflow/core/framework/log_memory.h"
@@ -68,15 +68,19 @@ void EnableCPUAllocatorFullStats(bool enable) {
   cpu_allocator_collect_full_stats = enable;
 }
 
-class CPUAllocator : public Allocator {
+class CPUAllocator : public VisitableAllocator {
  public:
-  CPUAllocator() {}
+  CPUAllocator() : allocation_begun_(false) {}
 
   ~CPUAllocator() override {}
 
   string Name() override { return "cpu"; }
 
   void* AllocateRaw(size_t alignment, size_t num_bytes) override {
+    if (!allocation_begun_) {
+      allocation_begun_ = true;
+    }
+
     void* p = port::AlignedMalloc(num_bytes, alignment);
     if (cpu_allocator_collect_stats) {
       const std::size_t alloc_size = port::MallocExtension_GetAllocatedSize(p);
@@ -88,16 +92,38 @@ class CPUAllocator : public Allocator {
       stats_.max_alloc_size =
           std::max<int64>(stats_.max_alloc_size, alloc_size);
     }
+
+    // visit each Visitor in alloc_visitors_
+    if (p != nullptr) {
+      for (const Visitor& v : alloc_visitors_) {
+        v(p, num_bytes);
+      }
+    }
+
     return p;
   }
 
   void DeallocateRaw(void* ptr) override {
+    std::size_t alloc_size;
+    bool init_alloc_size = false;
     if (cpu_allocator_collect_stats) {
-      const std::size_t alloc_size =
-          port::MallocExtension_GetAllocatedSize(ptr);
+      alloc_size = port::MallocExtension_GetAllocatedSize(ptr);
+      init_alloc_size = true;
       mutex_lock l(mu_);
       stats_.bytes_in_use -= alloc_size;
     }
+
+    // visit each Visitor in free_visitors_
+    if (ptr != nullptr) {
+      if (!init_alloc_size) {
+        alloc_size = port::MallocExtension_GetAllocatedSize(ptr);
+        init_alloc_size = true;
+      }
+      for (const Visitor& v : free_visitors_) {
+        v(ptr, alloc_size);
+      }
+    }
+
     port::AlignedFree(ptr);
   }
 
@@ -117,10 +143,36 @@ class CPUAllocator : public Allocator {
     return port::MallocExtension_GetAllocatedSize(ptr);
   }
 
+  // REQUIRES: can only add visitors before the first Allocate call
+
+  void AddAllocVisitor(Visitor visitor) override {
+    mutex_lock lock(visitor_mutex_);
+    CHECK(!allocation_begun_)
+        << "AddAllocVisitor may not be called after allocation has begun.";
+    alloc_visitors_.push_back(visitor);
+  }
+
+  void AddFreeVisitor(Visitor visitor) override {
+    mutex_lock lock(visitor_mutex_);
+    CHECK(!allocation_begun_)
+        << "AddFreeVisitor may not be called after allocation has begun.";
+    free_visitors_.push_back(visitor);
+  }
+
  private:
   mutex mu_;
   AllocatorStats stats_ GUARDED_BY(mu_);
 
+  // visitor_mutex_ protects write access to alloc_visitors_ and free_visitors_.
+  // While write access is mutually exclusive, reads may happen concurrently.
+  // This is okay because we may only append to alloc_visitors_ and
+  // free_visitors_ before first allocation, and subsequently we only read these
+  // vectors.
+  mutex visitor_mutex_;
+  std::vector<Visitor> alloc_visitors_;
+  std::vector<Visitor> free_visitors_;
+  std::atomic<bool> allocation_begun_;
+
   TF_DISALLOW_COPY_AND_ASSIGN(CPUAllocator);
 };
 
diff --git a/tensorflow/core/framework/device_base.h b/tensorflow/core/framework/device_base.h
index 1838a8ad02d2bd5522ce3162fea53e3f5afc0309..fb6d5c69e135c0263845cf71b93ac53bb2a359ed 100644
--- a/tensorflow/core/framework/device_base.h
+++ b/tensorflow/core/framework/device_base.h
@@ -128,6 +128,8 @@ class DeviceBase {
   // using a single stream.)
   // "event_mgr" is used to delay deallocation of temporary GPU buffers.
   // TODO(pbar) Work out how to move this out of DeviceBase.
+  // GpuDeviceInfo name is an unfortunate legacy; it is used not only by GPUs
+  // but also by TPU devices (to provide a default device context).
   struct GpuDeviceInfo {
     // Make sure all the defaults are NULL, so we can spot missing assignments.
     perftools::gputools::Stream* stream = nullptr;
@@ -230,6 +232,7 @@ class DeviceBase {
  private:
   Env* const env_;
   CpuWorkerThreads* cpu_worker_threads_ = nullptr;
+  // Set by GPUs as well as by TPU devices.
   GpuDeviceInfo* gpu_device_info_ = nullptr;
   thread::ThreadPool* device_thread_pool_ = nullptr;
   Eigen::ThreadPoolDevice* eigen_cpu_device_ = nullptr;
diff --git a/tensorflow/core/framework/function.cc b/tensorflow/core/framework/function.cc
index eae8e6c3c10c4b49081aed0e253d9a6f382f562b..3e7b89d4ebc91df42ee81c1c9fe67c68e755f736 100644
--- a/tensorflow/core/framework/function.cc
+++ b/tensorflow/core/framework/function.cc
@@ -168,7 +168,7 @@ class FunctionInstantiationHelper {
         strings::StrAppend(&name, "_", i);
       }
       NodeDef* gnode = AddNode(name);
-      gnode->set_op("_Arg");
+      gnode->set_op(FunctionLibraryDefinition::kArgOp);
       AddAttr("T", dtypes[i], gnode);
       AddAttr("index", arg_index, gnode);
       result_.arg_types.push_back(dtypes[i]);
@@ -328,7 +328,7 @@ class FunctionInstantiationHelper {
         strings::StrAppend(&name, "_", i);
       }
       NodeDef* gnode = AddNode(name);
-      gnode->set_op("_Retval");
+      gnode->set_op(FunctionLibraryDefinition::kRetOp);
       AddInput(nodes_.size() - 1, item->nid, item->idx + i);
       AddAttr("T", dtypes[i], gnode);
       AddAttr("index", (*ret_index)++, gnode);
@@ -558,9 +558,9 @@ string Print(gtl::ArraySlice<const NodeDef*> nodes) {
   std::vector<const NodeDef*> ret;
   std::vector<const NodeDef*> body;
   for (const NodeDef* n : nodes) {
-    if (n->op() == "_Arg") {
+    if (n->op() == FunctionLibraryDefinition::kArgOp) {
       arg.push_back(n);
-    } else if (n->op() == "_Retval") {
+    } else if (n->op() == FunctionLibraryDefinition::kRetOp) {
       ret.push_back(n);
     } else {
       body.push_back(n);
diff --git a/tensorflow/core/framework/function.h b/tensorflow/core/framework/function.h
index e27001133bbb5056abf1a3e1f5b9d69c8e01bc56..e00399f97de42ca6c683202fdec9142310fa6e2d 100644
--- a/tensorflow/core/framework/function.h
+++ b/tensorflow/core/framework/function.h
@@ -344,6 +344,11 @@ class FunctionLibraryDefinition : public OpRegistryInterface {
   Status LookUp(const string& op_type_name,
                 const OpRegistrationData** op_reg_data) const override;
 
+  // Ops created for function arguments bear the name given by `kArgOp`; those
+  // created for return values bear the name given by `kRetOp`.
+  static constexpr const char* const kArgOp = "_Arg";
+  static constexpr const char* const kRetOp = "_Retval";
+
   static constexpr const char* const kGradientOp = "SymbolicGradient";
   static constexpr const char* const kFuncAttr = "f";
 
@@ -404,6 +409,8 @@ struct FunctionBody;
 
 // Forward declare. Defined in common_runtime/device.h
 class Device;
+// Forward declare. Defined in common_runtime/device_mgr.h
+class DeviceMgr;
 
 class FunctionLibraryRuntime {
  public:
@@ -518,6 +525,9 @@ class FunctionLibraryRuntime {
   // Returns the device on which the function executes.
   virtual Device* device() = 0;
 
+  // Get the DeviceMgr from which the device was obtained.
+  virtual const DeviceMgr* device_mgr() const = 0;
+
   // Returns the function library definition that backs this runtime.
   // NOTE(mrry): The returned library definition is the default function library
   // for this runtime. The runtime may instantiate functions from separate
diff --git a/tensorflow/core/framework/op.cc b/tensorflow/core/framework/op.cc
index fadb60d744217daa0c569601c437146a70f9b4d5..fc5467b3c86934908c3f1261c79659c6a0469350 100644
--- a/tensorflow/core/framework/op.cc
+++ b/tensorflow/core/framework/op.cc
@@ -110,6 +110,15 @@ void OpRegistry::GetRegisteredOps(std::vector<OpDef>* op_defs) {
   }
 }
 
+void OpRegistry::GetOpRegistrationData(
+    std::vector<OpRegistrationData>* op_data) {
+  mutex_lock lock(mu_);
+  MustCallDeferred();
+  for (const auto& p : registry_) {
+    op_data->push_back(*p.second);
+  }
+}
+
 Status OpRegistry::SetWatcher(const Watcher& watcher) {
   mutex_lock lock(mu_);
   if (watcher_ && watcher) {
diff --git a/tensorflow/core/framework/op.h b/tensorflow/core/framework/op.h
index f7f1ed2a886548c39fa38239d65aa2a73564c3c4..3ccca4090d9804050c484d64a62826665b94d4d2 100644
--- a/tensorflow/core/framework/op.h
+++ b/tensorflow/core/framework/op.h
@@ -89,6 +89,9 @@ class OpRegistry : public OpRegistryInterface {
   // Get all registered ops.
   void GetRegisteredOps(std::vector<OpDef>* op_defs);
 
+  // Get all `OpRegistrationData`s.
+  void GetOpRegistrationData(std::vector<OpRegistrationData>* op_data);
+
   // Watcher, a function object.
   // The watcher, if set by SetWatcher(), is called every time an op is
   // registered via the Register function. The watcher is passed the Status
diff --git a/tensorflow/core/common_runtime/visitable_allocator.h b/tensorflow/core/framework/visitable_allocator.h
similarity index 94%
rename from tensorflow/core/common_runtime/visitable_allocator.h
rename to tensorflow/core/framework/visitable_allocator.h
index 8edf922d11ee1662b78771bfdc7c38e0144aee19..ed41b05531acaa1be803ac533854efe6160691b4 100644
--- a/tensorflow/core/common_runtime/visitable_allocator.h
+++ b/tensorflow/core/framework/visitable_allocator.h
@@ -13,8 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#ifndef TENSORFLOW_COMMON_RUNTIME_VISITABLE_ALLOCATOR_H_
-#define TENSORFLOW_COMMON_RUNTIME_VISITABLE_ALLOCATOR_H_
+#ifndef TENSORFLOW_CORE_FRAMEWORK_VISITABLE_ALLOCATOR_H_
+#define TENSORFLOW_CORE_FRAMEWORK_VISITABLE_ALLOCATOR_H_
 
 #include <functional>
 #include "tensorflow/core/framework/allocator.h"
@@ -76,4 +76,4 @@ class TrackingVisitableAllocator : public TrackingAllocator,
   VisitableAllocator* allocator_;
 };
 }  // namespace tensorflow
-#endif  // TENSORFLOW_COMMON_RUNTIME_VISITABLE_ALLOCATOR_H_
+#endif  // TENSORFLOW_CORE_FRAMEWORK_VISITABLE_ALLOCATOR_H_
diff --git a/tensorflow/core/graph/control_flow.cc b/tensorflow/core/graph/control_flow.cc
index db6683d1e74512e37a40773b7642cf33eb888782..30ff19cd7eae794e0e9875ca0825b647b44b02af 100644
--- a/tensorflow/core/graph/control_flow.cc
+++ b/tensorflow/core/graph/control_flow.cc
@@ -24,23 +24,24 @@ limitations under the License.
 
 namespace tensorflow {
 
-Status BuildControlFlowInfo(Graph* g, std::vector<ControlFlowInfo>* info) {
+Status BuildControlFlowInfo(const Graph* g,
+                            std::vector<ControlFlowInfo>* info) {
   info->clear();
   info->resize(g->num_node_ids());
 
   std::vector<const Node*> parent_nodes;
   parent_nodes.resize(g->num_node_ids());
 
-  Node* src_node = g->source_node();
+  const Node* src_node = g->source_node();
   ControlFlowInfo& src_info = (*info)[src_node->id()];
   src_info.frame = src_node;
   src_info.parent_frame = src_node;
 
   string frame_name;
-  std::deque<Node*> ready;
+  std::deque<const Node*> ready;
   ready.push_back(src_node);
   while (!ready.empty()) {
-    Node* curr_node = ready.front();
+    const Node* curr_node = ready.front();
     ready.pop_front();
     const ControlFlowInfo& curr_info = (*info)[curr_node->id()];
     const Node* frame = curr_info.frame;
@@ -56,7 +57,7 @@ Status BuildControlFlowInfo(Graph* g, std::vector<ControlFlowInfo>* info) {
     }
 
     for (const Edge* out_edge : curr_node->out_edges()) {
-      Node* out = out_edge->dst();
+      const Node* out = out_edge->dst();
       int out_id = out->id();
       ControlFlowInfo* out_info = &(*info)[out_id];
       const Node* out_parent = out_info->parent_frame;
diff --git a/tensorflow/core/graph/control_flow.h b/tensorflow/core/graph/control_flow.h
index 372044f538f9428e1979ba80bbb18a9742fc014e..79e2be0d4b9db6dd70d339ee07faf25c85376386 100644
--- a/tensorflow/core/graph/control_flow.h
+++ b/tensorflow/core/graph/control_flow.h
@@ -30,14 +30,14 @@ struct ControlFlowInfo {
   string frame_name;                   // frame name of a node
 };
 
-// Assign to each node the name of the frame and the level it belongs to.
-// We check the well-formedness of the graph: All inputs to a node must
-// come from the same frame and have the same "static" iteration level.
-// `info` is cleared and populated by this function.
-// NOTE(yuanbyu): For now, we require all sends/recvs have iteration level
-// 0. This essentially means there can't be multiple serial Nexts in
-// an iteration, which all sane front-ends should satisfy.
-Status BuildControlFlowInfo(Graph* g, std::vector<ControlFlowInfo>* info);
+// Clear and populate `info` with each node's frame and the level it belongs to.
+// We check the well-formedness of the graph: All inputs to a node must come
+// from the same frame and have the same "static" iteration level.
+//
+// NOTE(yuanbyu): For now, we require all sends/recvs have iteration level 0.
+// This essentially means there can't be multiple serial Nexts in an iteration,
+// which all sane front-ends should satisfy.
+Status BuildControlFlowInfo(const Graph* g, std::vector<ControlFlowInfo>* info);
 
 }  // namespace tensorflow
 
diff --git a/tensorflow/core/graph/costmodel.cc b/tensorflow/core/graph/costmodel.cc
index 4f3a6ec38cb88213c7127df41823bc16e9834d09..1df45d9b893fdb2807c5e6ab63dd4a8577d7feb6 100644
--- a/tensorflow/core/graph/costmodel.cc
+++ b/tensorflow/core/graph/costmodel.cc
@@ -427,7 +427,7 @@ static void AssignSizes(const Graph& g, CostModel* cost_model) {
     if (e->IsControlEdge()) {
       continue;
     }
-    Node* src = e->src();
+    const Node* src = e->src();
 
     // TODO(josh11b): Get an estimate from the Op
     Bytes size(1);
diff --git a/tensorflow/core/graph/graph.cc b/tensorflow/core/graph/graph.cc
index 9b56216f1f97a9598dd7ae8b70786e32bb7e0f4b..a7af5e2312af716ef25cb35c8f247d6feccb6d9c 100644
--- a/tensorflow/core/graph/graph.cc
+++ b/tensorflow/core/graph/graph.cc
@@ -339,7 +339,7 @@ Node* Graph::AddNode(const NodeDef& node_def, Status* status) {
   return node;
 }
 
-Node* Graph::CopyNode(Node* node) {
+Node* Graph::CopyNode(const Node* node) {
   DCHECK(!node->IsSource());
   DCHECK(!node->IsSink());
   Node* copy = AllocateNode(node->props_, node);
diff --git a/tensorflow/core/graph/graph.h b/tensorflow/core/graph/graph.h
index 9d96cd4654bbf1fd65c5135d6a8bdc271c6e9443..cbd58b051afde592731ddf2b2ed61854cdfac49e 100644
--- a/tensorflow/core/graph/graph.h
+++ b/tensorflow/core/graph/graph.h
@@ -422,7 +422,7 @@ class Graph {
   // Copies *node, which may belong to another graph, to a new node,
   // which is returned.  Does not copy any edges.  *this owns the
   // returned instance.
-  Node* CopyNode(Node* node);
+  Node* CopyNode(const Node* node);
 
   // Removes a node from this graph, including all edges from or to it.
   // *node should not be accessed after calling this function.
diff --git a/tensorflow/core/graph/graph_constructor.cc b/tensorflow/core/graph/graph_constructor.cc
index 0629ff32d00cf7fad00c39f07810aa4a9d57f14f..627309078ac51a25fe2924935c191ec1c4d2a534 100644
--- a/tensorflow/core/graph/graph_constructor.cc
+++ b/tensorflow/core/graph/graph_constructor.cc
@@ -1271,7 +1271,7 @@ void CopyGraph(const Graph& src, Graph* dest) {
   dest->set_versions(src.versions());
 
   // Copy the nodes
-  std::unordered_map<Node*, Node*>
+  std::unordered_map<const Node*, Node*>
       node_map;  // "Node in src" -> "Node in *dest"
   node_map[src.source_node()] = dest->source_node();
   node_map[src.sink_node()] = dest->sink_node();
diff --git a/tensorflow/core/graph/graph_partition.cc b/tensorflow/core/graph/graph_partition.cc
index add80eda23d7887fb06902c0b123c03db8f4cccf..17a174101b2be479bea834a407544b3a74dc08cf 100644
--- a/tensorflow/core/graph/graph_partition.cc
+++ b/tensorflow/core/graph/graph_partition.cc
@@ -123,8 +123,8 @@ bool NeedSameDeviceSendRecv(const Edge* edge, const GraphInfo& info) {
     return false;
   }
 
-  Node* src = edge->src();
-  Node* dst = edge->dst();
+  const Node* src = edge->src();
+  const Node* dst = edge->dst();
   if (src->assigned_device_name() == dst->assigned_device_name()) {
     int src_port = edge->src_output();
     int dst_port = edge->dst_input();
@@ -141,7 +141,7 @@ bool NeedSameDeviceSendRecv(const Edge* edge, const GraphInfo& info) {
 
 // Return true iff (dst, dst_input) is specified on host memory.
 bool IsDstInputOnHost(const Edge* edge, const GraphInfo& info) {
-  Node* dst = edge->dst();
+  const Node* dst = edge->dst();
   int dst_port = edge->dst_input();
   if (info.device_types[dst->id()] != DEVICE_CPU) {
     if (edge->IsControlEdge()) return false;
diff --git a/tensorflow/core/graph/mkl_graph_util.h b/tensorflow/core/graph/mkl_graph_util.h
index 1b99d54e8e33fd5155913a78ee833343bf92b905..5f51d6083b1ae17d8c4dee2434f4b57de5f18d06 100644
--- a/tensorflow/core/graph/mkl_graph_util.h
+++ b/tensorflow/core/graph/mkl_graph_util.h
@@ -90,7 +90,7 @@ inline string GetMklOpName(const string& name) {
 // @input: name of the op
 // @input: T datatype to be used for checking op
 // @return: true if opname is registered as Mkl op; false otherwise
-static inline bool IsMklOp(const std::string& op_name, DataType T) {
+static inline bool IsMklOp(const string& op_name, DataType T) {
   string kernel = KernelsRegisteredForOp(op_name);
   bool result =
       kernel.find(kMklOpLabelPattern) != string::npos && (T == DT_FLOAT);
@@ -104,7 +104,7 @@ static inline bool IsMklOp(const std::string& op_name, DataType T) {
 // @input: T datatype to be used for checking op
 // @return: true if opname is registered as element-wise Mkl op;
 // false otherwise
-static inline bool IsMklElementWiseOp(const std::string& op_name, DataType T) {
+static inline bool IsMklElementWiseOp(const string& op_name, DataType T) {
   if (!IsMklOp(op_name, T)) {
     return false;
   }
diff --git a/tensorflow/core/graph/mkl_layout_pass.cc b/tensorflow/core/graph/mkl_layout_pass.cc
index 7d3be152991351533a6185ea088503032f720b47..02038c5d77d0107213096a42edc6e6f6955008be 100644
--- a/tensorflow/core/graph/mkl_layout_pass.cc
+++ b/tensorflow/core/graph/mkl_layout_pass.cc
@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+// TODO(intel): Improve error handling in this file; instead of CHECK failing
+// all over the place, we should log an error and execute the original graph.
 #ifdef INTEL_MKL
 
 #include <algorithm>
@@ -1030,8 +1032,7 @@ void MklLayoutRewritePass::GetDummyMklTensorNode(std::unique_ptr<Graph>* g,
   TensorProto proto;
   proto.set_dtype(dt);
   uint8 zero[8] = {0, 0, 0, 0, 0, 0, 0, 0};
-  proto.set_tensor_content(const_cast<const void*>(static_cast<void*>(&zero)),
-                           8);
+  proto.set_tensor_content(string(reinterpret_cast<const char*>(zero), 8));
   TensorShape dummy_shape({8});
   dummy_shape.AsProto(proto.mutable_tensor_shape());
   TF_CHECK_OK(NodeBuilder((*g)->NewName("DMT"), "Const")
@@ -1144,7 +1145,8 @@ int MklLayoutRewritePass::SetUpContiguousInputs(
     // For that let's first find filter node that is 2nd input (slot 1)
     // of BackpropInput.
     Node* filter_node = nullptr;
-    old_node->input_node(kConv2DBackpropInputFilterInputSlotIdx, &filter_node);
+    TF_CHECK_OK(old_node->input_node(kConv2DBackpropInputFilterInputSlotIdx,
+                                     &filter_node));
     CHECK_NOTNULL(filter_node);
 
     // Now check which nodes receive from filter_node. Filter feeds as
@@ -1323,8 +1325,7 @@ void MklLayoutRewritePass::GetDummyWorkspaceTensorNode(
   TensorProto proto;
   proto.set_dtype(dt);
   float zero[1] = {0};
-  proto.set_tensor_content(const_cast<const void*>(static_cast<void*>(&zero)),
-                           4);
+  proto.set_tensor_content(string(reinterpret_cast<char*>(&zero), 4));
   TensorShape dummy_shape({1});
   dummy_shape.AsProto(proto.mutable_tensor_shape());
   TF_CHECK_OK(NodeBuilder((*g)->NewName("DMT"), "Const")
@@ -1829,7 +1830,7 @@ Status MklLayoutRewritePass::MergeNode(std::unique_ptr<Graph>* g, Node* succ,
 
     // Create node.
     Node* new_node;
-    nb.Finalize(&**g, &new_node);
+    TF_CHECK_OK(nb.Finalize(&**g, &new_node));
     CHECK_NOTNULL(new_node);
 
     // Set the Mkl layer label for this op.
diff --git a/tensorflow/core/graph/node_builder.cc b/tensorflow/core/graph/node_builder.cc
index 138952dcb33e7b1e57cf013147581d20f509e85d..114962c0e4f2969fe539d5b168aaf62d577a7024 100644
--- a/tensorflow/core/graph/node_builder.cc
+++ b/tensorflow/core/graph/node_builder.cc
@@ -88,7 +88,7 @@ NodeBuilder& NodeBuilder::ControlInput(Node* src_node) {
 NodeBuilder& NodeBuilder::ControlInputs(gtl::ArraySlice<Node*> src_nodes) {
   control_inputs_.insert(control_inputs_.end(), src_nodes.begin(),
                          src_nodes.end());
-  for (Node* src_node : src_nodes) {
+  for (const Node* src_node : src_nodes) {
     def_builder_.ControlInput(src_node->name());
   }
   return *this;
@@ -127,7 +127,7 @@ Status NodeBuilder::Finalize(Graph* graph, Node** created_node) const {
   return Status::OK();
 }
 
-void NodeBuilder::AddIndexError(Node* node, int i) {
+void NodeBuilder::AddIndexError(const Node* node, int i) {
   if (node == nullptr) {
     errors_.emplace_back(
         strings::StrCat("Attempt to add nullptr Node to node with type ",
@@ -140,7 +140,7 @@ void NodeBuilder::AddIndexError(Node* node, int i) {
   }
 }
 
-bool NodeBuilder::GetOutputType(Node* node, int i, DataType* dt) {
+bool NodeBuilder::GetOutputType(const Node* node, int i, DataType* dt) {
   bool error;
   *dt = SafeGetOutput(node, i, &error);
   if (error) AddIndexError(node, i);
diff --git a/tensorflow/core/graph/node_builder.h b/tensorflow/core/graph/node_builder.h
index 86647a49c12085b6850a0e6d2622ef1bb58c513d..f6b7b5674b032cd2b19d69765e7c3b6b6613b3bd 100644
--- a/tensorflow/core/graph/node_builder.h
+++ b/tensorflow/core/graph/node_builder.h
@@ -120,7 +120,7 @@ class NodeBuilder {
   const OpDef& op_def() const { return def_builder_.op_def(); }
 
  private:
-  static DataType SafeGetOutput(Node* node, int i, bool* error) {
+  static DataType SafeGetOutput(const Node* node, int i, bool* error) {
     if (node != nullptr && i >= 0 && i < node->num_outputs()) {
       *error = false;
       return node->output_type(i);
@@ -131,11 +131,11 @@ class NodeBuilder {
   }
 
   // If SafeGetOutput indicates a range error, add it to errors_.
-  void AddIndexError(Node* node, int i);
+  void AddIndexError(const Node* node, int i);
 
   // Set *dt and returns true if i is in range. Combines
   // SafeGetOutput() and AddIndexError().
-  bool GetOutputType(Node* node, int i, DataType* dt);
+  bool GetOutputType(const Node* node, int i, DataType* dt);
 
   NodeDefBuilder def_builder_;
   std::vector<NodeOut> inputs_;
diff --git a/tensorflow/core/graph/optimizer_cse.cc b/tensorflow/core/graph/optimizer_cse.cc
index 6b452a1d5dca0a636264a3483e4ee9d027fd2e06..4073255db3f7cbcd697f3cb2781e04b3b01634c1 100644
--- a/tensorflow/core/graph/optimizer_cse.cc
+++ b/tensorflow/core/graph/optimizer_cse.cc
@@ -65,8 +65,8 @@ class OptimizerCSE {
 };
 
 static void FillInputs(const Node* n,
-                       gtl::InlinedVector<Node*, 4>* control_edges,
-                       gtl::InlinedVector<std::pair<Node*, int>, 4>* in) {
+                       gtl::InlinedVector<const Node*, 4>* control_edges,
+                       gtl::InlinedVector<std::pair<const Node*, int>, 4>* in) {
   DCHECK_EQ(in->size(), n->num_inputs());
   control_edges->clear();
   for (const Edge* e : n->in_edges()) {
@@ -96,8 +96,8 @@ size_t OptimizerCSE::NodeHash(const Node* n) {
 
   const int N_in = n->num_inputs();
   strings::StrAppend(&str_to_hash, N_in);
-  gtl::InlinedVector<Node*, 4> control_edges;
-  gtl::InlinedVector<std::pair<Node*, int>, 4> in(N_in);
+  gtl::InlinedVector<const Node*, 4> control_edges;
+  gtl::InlinedVector<std::pair<const Node*, int>, 4> in(N_in);
   FillInputs(n, &control_edges, &in);
   for (const auto& edge : in) {
     strings::StrAppend(&str_to_hash, edge.first->id(), edge.second);
@@ -147,10 +147,10 @@ bool OptimizerCSE::Equivalent(const Node* a, const Node* b,
   // Compare input sources
   if (a->num_inputs() != b->num_inputs()) return false;
   const int N_in = a->num_inputs();
-  gtl::InlinedVector<Node*, 4> a_control_edges;
-  gtl::InlinedVector<Node*, 4> b_control_edges;
-  gtl::InlinedVector<std::pair<Node*, int>, 4> a_in(N_in);
-  gtl::InlinedVector<std::pair<Node*, int>, 4> b_in(N_in);
+  gtl::InlinedVector<const Node*, 4> a_control_edges;
+  gtl::InlinedVector<const Node*, 4> b_control_edges;
+  gtl::InlinedVector<std::pair<const Node*, int>, 4> a_in(N_in);
+  gtl::InlinedVector<std::pair<const Node*, int>, 4> b_in(N_in);
   FillInputs(a, &a_control_edges, &a_in);
   FillInputs(b, &b_control_edges, &b_in);
   if (a_in != b_in) return false;
diff --git a/tensorflow/core/graph/testlib.cc b/tensorflow/core/graph/testlib.cc
index 0d88d1ff723b94783693559926c51c6726a2341b..67b252cb6c576b84de7f823ace2a1c7750d0c35b 100644
--- a/tensorflow/core/graph/testlib.cc
+++ b/tensorflow/core/graph/testlib.cc
@@ -16,6 +16,7 @@ limitations under the License.
 #include "tensorflow/core/graph/testlib.h"
 
 #include <vector>
+#include "tensorflow/core/framework/common_shape_fns.h"
 #include "tensorflow/core/framework/graph.pb.h"
 #include "tensorflow/core/framework/node_def_builder.h"
 #include "tensorflow/core/framework/node_def_util.h"
@@ -50,7 +51,8 @@ REGISTER_KERNEL_BUILDER(
 REGISTER_OP("HostConst")
     .Output("output: dtype")
     .Attr("value: tensor")
-    .Attr("dtype: type");
+    .Attr("dtype: type")
+    .SetShapeFn(shape_inference::UnknownShape);
 
 namespace test {
 namespace graph {
diff --git a/tensorflow/core/grappler/clusters/BUILD b/tensorflow/core/grappler/clusters/BUILD
index b8f8e13c9a6830658e2b53388e1f91fbc8a22eab..b653f902e857ce804f797a016ebde551bf3b6695 100644
--- a/tensorflow/core/grappler/clusters/BUILD
+++ b/tensorflow/core/grappler/clusters/BUILD
@@ -1,7 +1,12 @@
 licenses(["notice"])  # Apache 2.0
 
+load("//tensorflow:tensorflow.bzl", "if_cuda")
 load("//tensorflow:tensorflow.bzl", "tf_cc_test")
 load("//tensorflow:tensorflow.bzl", "tf_cuda_library")
+load(
+    "//tensorflow/core:platform/default/build_config_root.bzl",
+    "tf_cuda_tests_tags",
+)
 
 filegroup(
     name = "all_files",
@@ -26,13 +31,12 @@ config_setting(
 tf_cuda_library(
     name = "utils",
     srcs = ["utils.cc"],
-    hdrs = [
-        "utils.h",
-    ],
+    hdrs = ["utils.h"],
     visibility = ["//visibility:public"],
     deps = [
         "//third_party/eigen3",
         "//tensorflow/core:framework",
+        "//tensorflow/core:gpu_id",
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
     ] + select({
@@ -41,6 +45,21 @@ tf_cuda_library(
     }),
 )
 
+tf_cc_test(
+    name = "utils_test",
+    srcs = ["utils_test.cc"],
+    linkstatic = if_cuda(1, 0),
+    tags = tf_cuda_tests_tags(),
+    deps = [
+        ":utils",
+        "//tensorflow/core:gpu_id",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
 cc_library(
     name = "cluster",
     srcs = ["cluster.cc"],
@@ -104,6 +123,7 @@ cc_library(
         "//tensorflow/core:core_cpu_lib",
         "//tensorflow/core:direct_session",
         "//tensorflow/core:framework",
+        "//tensorflow/core:gpu_id",
         "//tensorflow/core:lib",
         "//tensorflow/core/grappler:utils",
         "//tensorflow/core/kernels:ops_util",
diff --git a/tensorflow/core/grappler/clusters/single_machine.cc b/tensorflow/core/grappler/clusters/single_machine.cc
index cc7f418d49816d64ffc51704d2f127a441815d7b..8e236c9ee80f30f7aa5c00f32fd137a718215cf3 100644
--- a/tensorflow/core/grappler/clusters/single_machine.cc
+++ b/tensorflow/core/grappler/clusters/single_machine.cc
@@ -21,6 +21,8 @@ limitations under the License.
 #include "tensorflow/cc/training/queue_runner.h"
 #include "tensorflow/core/common_runtime/device.h"
 #include "tensorflow/core/common_runtime/device_mgr.h"
+#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/common_runtime/gpu/gpu_id_manager.h"
 #include "tensorflow/core/grappler/clusters/utils.h"
 #include "tensorflow/core/grappler/utils.h"
 #include "tensorflow/core/kernels/ops_util.h"
@@ -80,13 +82,24 @@ Status SingleMachine::Provision() {
 
   std::vector<DeviceAttributes> devices;
   TF_RETURN_IF_ERROR(session_->ListDevices(&devices));
-  int gpu_id = 0;
   for (const auto& dev : devices) {
     DeviceProperties attr;
     if (dev.device_type() == "CPU") {
       attr = GetLocalCPUInfo();
     } else if (dev.device_type() == "GPU") {
-      attr = GetLocalGPUInfo(gpu_id++);
+      DeviceNameUtils::ParsedName parsed;
+      if (!DeviceNameUtils::ParseFullName(dev.name(), &parsed)) {
+        return errors::InvalidArgument(
+            strings::StrCat("Not able to parse GPU device name: ", dev.name()));
+      }
+      TfGpuId tf_gpu_id(parsed.id);
+      CudaGpuId cuda_gpu_id;
+      Status s = GpuIdManager::TfToCudaGpuId(tf_gpu_id, &cuda_gpu_id);
+      if (!s.ok()) {
+        return errors::Unavailable("Unknown TF GPU device with id ",
+                                   tf_gpu_id.value(), ": ", s.ToString());
+      }
+      attr = GetLocalGPUInfo(cuda_gpu_id);
     } else if (dev.device_type().find("XLA") == string::npos) {
       // Filter out the fake XLA devices to avoid double counting the actual
       // hardware resources that are available.
diff --git a/tensorflow/core/grappler/clusters/utils.cc b/tensorflow/core/grappler/clusters/utils.cc
index aacd2ccb72df07ac6b31c9bd5b96deca499038e4..b54b34959a53b56022a449ca286ff0ba823f2aa5 100644
--- a/tensorflow/core/grappler/clusters/utils.cc
+++ b/tensorflow/core/grappler/clusters/utils.cc
@@ -27,6 +27,9 @@ limitations under the License.
 #include "include/libxsmm.h"
 #endif
 
+#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/common_runtime/gpu/gpu_id_manager.h"
+#include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/platform/cpu_info.h"
@@ -51,7 +54,7 @@ DeviceProperties GetLocalCPUInfo() {
 
   int64 free_mem = port::AvailableRam();
   if (free_mem < INT64_MAX) {
-    device.set_memory_size(free_mem);
+    device.set_memory_size(free_mem * 1024);
   }
 
   (*device.mutable_environment())["cpu_instruction_set"] =
@@ -66,36 +69,40 @@ DeviceProperties GetLocalCPUInfo() {
   return device;
 }
 
-DeviceProperties GetLocalGPUInfo(int gpu_id) {
+DeviceProperties GetLocalGPUInfo(CudaGpuId cuda_gpu_id) {
   DeviceProperties device;
   device.set_type("GPU");
 
 #if GOOGLE_CUDA
   cudaDeviceProp properties;
-  cudaError_t error = cudaGetDeviceProperties(&properties, gpu_id);
-  if (error == cudaSuccess) {
-    device.set_vendor("NVidia");
-    device.set_model(properties.name);
-    device.set_frequency(properties.clockRate * 1e-3);
-    device.set_num_cores(properties.multiProcessorCount);
-    device.set_num_registers(properties.regsPerMultiprocessor);
-    // For compute capability less than 5, l1 cache size is configurable to
-    // either 16 KB or 48 KB. We use the initial configuration 16 KB here. For
-    // compute capability larger or equal to 5, l1 cache (unified with texture
-    // cache) size is 24 KB. This number may need to be updated for future
-    // compute capabilities.
-    device.set_l1_cache_size((properties.major < 5) ? 16 * 1024 : 24 * 1024);
-    device.set_l2_cache_size(properties.l2CacheSize);
-    device.set_l3_cache_size(0);
-    device.set_shared_memory_size_per_multiprocessor(
-        properties.sharedMemPerMultiprocessor);
-    device.set_memory_size(properties.totalGlobalMem);
-    // 8 is the number of bits per byte. 2 is accounted for
-    // double data rate (DDR).
-    device.set_bandwidth(properties.memoryBusWidth / 8 *
-                         properties.memoryClockRate * 2);
+  cudaError_t error = cudaGetDeviceProperties(&properties, cuda_gpu_id.value());
+  if (error != cudaSuccess) {
+    device.set_type("UNKNOWN");
+    LOG(ERROR) << "Failed to get device properties, error code: " << error;
+    return device;
   }
 
+  device.set_vendor("NVIDIA");
+  device.set_model(properties.name);
+  device.set_frequency(properties.clockRate * 1e-3);
+  device.set_num_cores(properties.multiProcessorCount);
+  device.set_num_registers(properties.regsPerMultiprocessor);
+  // For compute capability less than 5, l1 cache size is configurable to
+  // either 16 KB or 48 KB. We use the initial configuration 16 KB here. For
+  // compute capability larger or equal to 5, l1 cache (unified with texture
+  // cache) size is 24 KB. This number may need to be updated for future
+  // compute capabilities.
+  device.set_l1_cache_size((properties.major < 5) ? 16 * 1024 : 24 * 1024);
+  device.set_l2_cache_size(properties.l2CacheSize);
+  device.set_l3_cache_size(0);
+  device.set_shared_memory_size_per_multiprocessor(
+      properties.sharedMemPerMultiprocessor);
+  device.set_memory_size(properties.totalGlobalMem);
+  // 8 is the number of bits per byte. 2 accounts for
+  // double data rate (DDR).
+  device.set_bandwidth(properties.memoryBusWidth / 8 *
+                       properties.memoryClockRate * 2);
+
   (*device.mutable_environment())["architecture"] =
       strings::StrCat(properties.major, ".", properties.minor);
   (*device.mutable_environment())["cuda"] = strings::StrCat(CUDA_VERSION);
@@ -106,18 +113,26 @@ DeviceProperties GetLocalGPUInfo(int gpu_id) {
 }
 
 DeviceProperties GetDeviceInfo(const DeviceNameUtils::ParsedName& device) {
+  DeviceProperties unknown;
+  unknown.set_type("UNKNOWN");
+
   if (device.type == "CPU") {
     return GetLocalCPUInfo();
   } else if (device.type == "GPU") {
     if (device.has_id) {
-      return GetLocalGPUInfo(device.id);
+      TfGpuId tf_gpu_id(device.id);
+      CudaGpuId cuda_gpu_id;
+      Status s = GpuIdManager::TfToCudaGpuId(tf_gpu_id, &cuda_gpu_id);
+      if (!s.ok()) {
+        LOG(ERROR) << s;
+        return unknown;
+      }
+      return GetLocalGPUInfo(cuda_gpu_id);
     } else {
-      return GetLocalGPUInfo(0);
+      return GetLocalGPUInfo(CudaGpuId(0));
     }
   }
-  DeviceProperties result;
-  result.set_type("UNKNOWN");
-  return result;
+  return unknown;
 }
 
 }  // end namespace grappler
diff --git a/tensorflow/core/grappler/clusters/utils.h b/tensorflow/core/grappler/clusters/utils.h
index 191942040a1fdd276bb50f799ce314389c2cb0fe..df8e7dca44ad637aed8a6a2e87fc6e20bdf62606 100644
--- a/tensorflow/core/grappler/clusters/utils.h
+++ b/tensorflow/core/grappler/clusters/utils.h
@@ -16,6 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_GRAPPLER_CLUSTERS_UTILS_H_
 #define TENSORFLOW_GRAPPLER_CLUSTERS_UTILS_H_
 
+#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
 #include "tensorflow/core/protobuf/device_properties.pb.h"
 #include "tensorflow/core/util/device_name_utils.h"
 
@@ -27,7 +28,7 @@ DeviceProperties GetLocalCPUInfo();
 
 // Returns the DeviceProperties for the specified GPU attached to the server on
 // which grappler is running.
-DeviceProperties GetLocalGPUInfo(int gpu_id);
+DeviceProperties GetLocalGPUInfo(CudaGpuId cuda_gpu_id);
 
 // Returns the DeviceProperties of the specified device
 DeviceProperties GetDeviceInfo(const DeviceNameUtils::ParsedName& device);
diff --git a/tensorflow/core/grappler/clusters/utils_test.cc b/tensorflow/core/grappler/clusters/utils_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..74218adbac4eda3a7a780933b8116cfd2b7a1b18
--- /dev/null
+++ b/tensorflow/core/grappler/clusters/utils_test.cc
@@ -0,0 +1,100 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/clusters/utils.h"
+
+#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/common_runtime/gpu/gpu_id_manager.h"
+#include "tensorflow/core/platform/logging.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/protobuf/device_properties.pb.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+TEST(UtilsTest, GetLocalGPUInfo) {
+  GpuIdManager::TestOnlyReset();
+#if GOOGLE_CUDA
+  LOG(INFO) << "CUDA is enabled.";
+  DeviceProperties properties;
+
+  // Invalid CUDA GPU ID.
+  properties = GetLocalGPUInfo(CudaGpuId(100));
+  EXPECT_EQ("UNKNOWN", properties.type());
+
+  // Succeed when a valid CUDA GPU id was inserted.
+  properties = GetLocalGPUInfo(CudaGpuId(0));
+  EXPECT_EQ("GPU", properties.type());
+  EXPECT_EQ("NVIDIA", properties.vendor());
+#else
+  LOG(INFO) << "CUDA is not enabled.";
+  DeviceProperties properties;
+
+  properties = GetLocalGPUInfo(CudaGpuId(0));
+  EXPECT_EQ("GPU", properties.type());
+
+  properties = GetLocalGPUInfo(CudaGpuId(100));
+  EXPECT_EQ("GPU", properties.type());
+#endif
+}
+
+TEST(UtilsTest, GetDeviceInfo) {
+  GpuIdManager::TestOnlyReset();
+  DeviceNameUtils::ParsedName device;
+  DeviceProperties properties;
+
+  // Invalid type.
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("UNKNOWN", properties.type());
+
+  // CPU info.
+  device.type = "CPU";
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("CPU", properties.type());
+
+  // No TF GPU id provided.
+  device.type = "GPU";
+  device.has_id = false;
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("GPU", properties.type());
+#if GOOGLE_CUDA
+  EXPECT_EQ("NVIDIA", properties.vendor());
+#endif
+
+  // TF to CUDA GPU id mapping entry doesn't exist.
+  device.has_id = true;
+  device.id = 0;
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("UNKNOWN", properties.type());
+
+#if GOOGLE_CUDA
+  // Invalid CUDA GPU id.
+  GpuIdManager::InsertTfCudaGpuIdPair(TfGpuId(0), CudaGpuId(100));
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("UNKNOWN", properties.type());
+
+  // Valid CUDA GPU id.
+  GpuIdManager::InsertTfCudaGpuIdPair(TfGpuId(1), CudaGpuId(0));
+  device.id = 1;
+  properties = GetDeviceInfo(device);
+  EXPECT_EQ("GPU", properties.type());
+  EXPECT_EQ("NVIDIA", properties.vendor());
+#endif
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/costs/BUILD b/tensorflow/core/grappler/costs/BUILD
index 0fe01e9c9e094ebfa7fd1e6200d775ef61775184..5336df1f51dbb5dd5f48857a088ece1b1a04dbb5 100644
--- a/tensorflow/core/grappler/costs/BUILD
+++ b/tensorflow/core/grappler/costs/BUILD
@@ -142,6 +142,7 @@ tf_cuda_library(
         "//third_party/eigen3",
         "//tensorflow/core:framework",
         "//tensorflow/core:graph",
+        "//tensorflow/core:gpu_id",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_proto_parsing",
         "//tensorflow/core:protos_all_cc",
diff --git a/tensorflow/core/grappler/costs/op_level_cost_estimator.cc b/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
index a57cfdd9891b1d654092f9b896af248fa40eb88f..29ef317e46f13bd64847fd898fcb2eb9fee67f1c 100644
--- a/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
+++ b/tensorflow/core/grappler/costs/op_level_cost_estimator.cc
@@ -245,6 +245,8 @@ OpLevelCostEstimator::OpLevelCostEstimator() {
       {"Add", Eigen::internal::functor_traits<
                   Eigen::internal::scalar_sum_op<float>>::Cost},
       {"ApproximateEqual", 1},
+      {"BiasAdd", Eigen::internal::functor_traits<
+                      Eigen::internal::scalar_sum_op<float>>::Cost},
       {"Div", Eigen::internal::functor_traits<
                   Eigen::internal::scalar_quotient_op<float>>::Cost},
       {"Equal", 1},
@@ -718,6 +720,56 @@ int64 OpLevelCostEstimator::CountBatchMatMulOperations(
   return ops;
 }
 
+bool GetTensorShapeProtoFromTensorProto(const TensorProto& tensor_proto,
+                                        TensorShapeProto* tensor_shape_proto) {
+  tensor_shape_proto->Clear();
+  // First convert TensorProto into Tensor class so that it correctly parses
+  // data values within TensorProto (whether it's in int_val, int64_val,
+  // tensor_content, or anything else).
+  Tensor tensor(tensor_proto.dtype());
+  if (!tensor.FromProto(tensor_proto)) {
+    LOG(WARNING) << "GetTensorShapeProtoFromTensorProto() -- "
+                 << "failed to parse TensorProto: "
+                 << tensor_proto.DebugString();
+    return false;
+  }
+  if (tensor.dims() != 1) {
+    LOG(WARNING) << "GetTensorShapeProtoFromTensorProto() -- "
+                 << "tensor is not 1D: " << tensor.dims();
+    return false;
+  }
+  // Then, convert it back to TensorProto using AsProtoField, which makes sure
+  // the data is in int_val, int64_val, or such repeated data fields, not in
+  // tensor_content.
+  TensorProto temp_tensor;
+  tensor.AsProtoField(&temp_tensor);
+
+#define TENSOR_VALUES_TO_TENSOR_SHAPE_PROTO(type)        \
+  do {                                                   \
+    for (const auto& value : temp_tensor.type##_val()) { \
+      tensor_shape_proto->add_dim()->set_size(value);    \
+    }                                                    \
+  } while (0)
+
+  if (tensor.dtype() == DT_INT32 || tensor.dtype() == DT_INT16 ||
+      tensor.dtype() == DT_INT8 || tensor.dtype() == DT_UINT8) {
+    TENSOR_VALUES_TO_TENSOR_SHAPE_PROTO(int);
+  } else if (tensor.dtype() == DT_INT64) {
+    TENSOR_VALUES_TO_TENSOR_SHAPE_PROTO(int64);
+  } else if (tensor.dtype() == DT_UINT32) {
+    TENSOR_VALUES_TO_TENSOR_SHAPE_PROTO(uint32);
+  } else if (tensor.dtype() == DT_UINT64) {
+    TENSOR_VALUES_TO_TENSOR_SHAPE_PROTO(uint64);
+  } else {
+    LOG(WARNING) << "GetTensorShapeProtoFromTensorProto() -- "
+                 << "Unsupported dtype: " << tensor.dtype();
+    return false;
+  }
+#undef TENSOR_VALUES_TO_TENSOR_SHAPE_PROTO
+
+  return true;
+}
+
 // TODO(cliffy): Dedup this method and CountConv2DBackpropFilterOperations.
 int64 OpLevelCostEstimator::CountConv2DBackpropInputOperations(
     const OpInfo& op_features, ConvolutionDimensions* returned_conv_dims,
@@ -732,20 +784,16 @@ int64 OpLevelCostEstimator::CountConv2DBackpropInputOperations(
   }
 
   TensorShapeProto input_shape;
+  bool shape_found = false;
   if (op_features.inputs(0).has_value()) {
     const TensorProto& value = op_features.inputs(0).value();
-    if (value.int64_val_size() > 0) {
-      for (int i = 0; i < value.int64_val_size(); ++i) {
-        input_shape.add_dim()->set_size(value.int64_val(i));
-      }
-    } else {
-      for (int i = 0; i < value.int_val_size(); ++i) {
-        input_shape.add_dim()->set_size(value.int_val(i));
-      }
-    }
-  } else if (op_features.outputs_size() == 1) {
+    shape_found = GetTensorShapeProtoFromTensorProto(value, &input_shape);
+  }
+  if (!shape_found && op_features.outputs_size() == 1) {
     input_shape = op_features.outputs(0).shape();
-  } else {
+    shape_found = true;
+  }
+  if (!shape_found) {
     // Set the minimum filter size that's feasible.
     for (int i = 0; i < 4; ++i) {
       input_shape.add_dim()->set_size(1);
@@ -778,20 +826,16 @@ int64 OpLevelCostEstimator::CountConv2DBackpropFilterOperations(
   DCHECK_EQ(kConv2dBackpropFilter, op_features.op());
 
   TensorShapeProto filter_shape;
+  bool shape_found = false;
   if (op_features.inputs_size() >= 2 && op_features.inputs(1).has_value()) {
     const TensorProto& value = op_features.inputs(1).value();
-    if (value.int64_val_size() > 0) {
-      for (int i = 0; i < value.int64_val_size(); ++i) {
-        filter_shape.add_dim()->set_size(value.int64_val(i));
-      }
-    } else {
-      for (int i = 0; i < value.int_val_size(); ++i) {
-        filter_shape.add_dim()->set_size(value.int_val(i));
-      }
-    }
-  } else if (op_features.outputs_size() == 1) {
+    shape_found = GetTensorShapeProtoFromTensorProto(value, &filter_shape);
+  }
+  if (!shape_found && op_features.outputs_size() == 1) {
     filter_shape = op_features.outputs(0).shape();
-  } else {
+    shape_found = true;
+  }
+  if (!shape_found) {
     // Set the minimum filter size that's feasible.
     for (int i = 0; i < 4; ++i) {
       filter_shape.add_dim()->set_size(1);
diff --git a/tensorflow/core/grappler/costs/op_level_cost_estimator.h b/tensorflow/core/grappler/costs/op_level_cost_estimator.h
index a292e5e97fe52383648d74b08bb7a384b6278446..7bb530fe31a9f70d168ae16783fac7d487e5f12d 100644
--- a/tensorflow/core/grappler/costs/op_level_cost_estimator.h
+++ b/tensorflow/core/grappler/costs/op_level_cost_estimator.h
@@ -28,6 +28,9 @@ limitations under the License.
 namespace tensorflow {
 namespace grappler {
 
+bool GetTensorShapeProtoFromTensorProto(const TensorProto& tensor_proto,
+                                        TensorShapeProto* tensor_shape_proto);
+
 class OpLevelCostEstimator {
  public:
   OpLevelCostEstimator();
diff --git a/tensorflow/core/grappler/costs/op_level_cost_estimator_test.cc b/tensorflow/core/grappler/costs/op_level_cost_estimator_test.cc
index 60fc783472d2b6a1d50eb52e912da1fccbe8cf08..4790b9bab2c7d67e7a29d45aaf9f964c470c63df 100644
--- a/tensorflow/core/grappler/costs/op_level_cost_estimator_test.cc
+++ b/tensorflow/core/grappler/costs/op_level_cost_estimator_test.cc
@@ -14,6 +14,8 @@ limitations under the License.
 ==============================================================================*/
 
 #include "tensorflow/core/grappler/costs/op_level_cost_estimator.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/framework/tensor_shape.h"
 #include "tensorflow/core/framework/tensor_shape.pb.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/platform/test.h"
@@ -97,47 +99,81 @@ OpContext DescribeBatchMatMul(const std::vector<int>& dims_a,
 // Wrangles the minimum number of proto fields to set up a 4D Tensor for cost
 // estimation purposes.
 void DescribeTensor4D(int dim0, int dim1, int dim2, int dim3,
-                      OpInfo* op_features) {
-  auto input = op_features->add_inputs();
-  auto shape = input->mutable_shape();
+                      OpInfo::TensorProperties* tensor) {
+  auto shape = tensor->mutable_shape();
   shape->add_dim()->set_size(dim0);
   shape->add_dim()->set_size(dim1);
   shape->add_dim()->set_size(dim2);
   shape->add_dim()->set_size(dim3);
-  input->set_dtype(DT_FLOAT);
+  tensor->set_dtype(DT_FLOAT);
 }
 
-// Returns an OpInfo for Conv2D with the minimum set of fields set up.
+// DescribeConvolution constructs an OpContext for a Conv2D applied to an input
+// tensor with shape (batch, ix, iy, iz1) and a kernel tensor with shape
+// (kx, ky, iz2, oz).
 OpContext DescribeConvolution(int batch, int ix, int iy, int iz1, int iz2,
                               int kx, int ky, int oz) {
   OpContext op_context;
   SetCpuDevice(&op_context.op_info);
   op_context.op_info.set_op("Conv2D");
 
-  DescribeTensor4D(batch, ix, iy, iz1, &op_context.op_info);
-  DescribeTensor4D(kx, ky, iz2, oz, &op_context.op_info);
+  DescribeTensor4D(batch, ix, iy, iz1, op_context.op_info.add_inputs());
+  DescribeTensor4D(kx, ky, iz2, oz, op_context.op_info.add_inputs());
+
   return op_context;
 }
 
-OpContext DescribeOp(const string& op, int size1, int size2) {
+// DescribeUnaryOp constructs an OpContext for the given operation applied to
+// a 4-tensor with shape (size1, 1, 1, 1).
+OpContext DescribeUnaryOp(const string& op, int size1) {
   OpContext op_context;
   SetCpuDevice(&op_context.op_info);
   op_context.op_info.set_op(op);
 
-  DescribeTensor4D(size1, 1, 1, 1, &op_context.op_info);
-  DescribeTensor4D(2 * size1, size2, 1, 1, &op_context.op_info);
+  DescribeTensor4D(size1, 1, 1, 1, op_context.op_info.add_inputs());
+  DescribeTensor4D(size1, 1, 1, 1, op_context.op_info.add_outputs());
+
+  return op_context;
+}
 
-  auto output = op_context.op_info.add_outputs();
-  auto shape = output->mutable_shape();
-  shape->add_dim()->set_size(2 * size1);
-  shape->add_dim()->set_size(size2);
-  shape->add_dim()->set_size(1);
-  shape->add_dim()->set_size(1);
-  output->set_dtype(DT_FLOAT);
+// DescribeBinaryOp constructs an OpContext for the given operation applied to
+// a 4-tensor with dimensions (size1, 1, 1, 1) and a 4-tensor with dimensions
+// (2 * size1, size2, 1, 1).
+//
+// The choice of dimension here is arbitrary, and is used strictly to test the
+// cost model for applying elementwise operations to tensors with unequal
+// dimension values.
+OpContext DescribeBinaryOp(const string& op, int size1, int size2) {
+  OpContext op_context;
+  SetCpuDevice(&op_context.op_info);
+  op_context.op_info.set_op(op);
+
+  DescribeTensor4D(size1, 1, 1, 1, op_context.op_info.add_inputs());
+  DescribeTensor4D(2 * size1, size2, 1, 1, op_context.op_info.add_inputs());
+  DescribeTensor4D(2 * size1, size2, 1, 1, op_context.op_info.add_outputs());
+
+  return op_context;
+}
 
+// DescribeBiasAdd constructs an OpContext for a BiasAdd applied to a 4-tensor
+// with dimensions (1, 1, size2, size1) and a bias with dimension (size1),
+// according to the constraint that the bias must be 1D with size equal to that
+// of the last dimension of the input value.
+OpContext DescribeBiasAdd(int size1, int size2) {
+  OpContext op_context;
   SetCpuDevice(&op_context.op_info);
+  op_context.op_info.set_op("BiasAdd");
+
+  DescribeTensor4D(1, 1, size2, size1, op_context.op_info.add_inputs());
+  DescribeTensor4D(1, 1, size2, size1, op_context.op_info.add_outputs());
+
+  auto bias = op_context.op_info.add_inputs();
+  bias->mutable_shape()->add_dim()->set_size(size1);
+  bias->set_dtype(DT_FLOAT);
+
   return op_context;
 }
+
 }  // namespace
 
 class OpLevelCostEstimatorTest : public ::testing::Test {
@@ -164,8 +200,24 @@ class OpLevelCostEstimatorTest : public ::testing::Test {
   OpLevelCostEstimator estimator_;
 };
 
+TEST_F(OpLevelCostEstimatorTest, BiasAddExecutionTime) {
+  auto cost = PredictCosts(DescribeBiasAdd(1000, 10));
+  EXPECT_EQ(Costs::Duration(8400), cost.memory_time);
+  EXPECT_EQ(Costs::Duration(1000), cost.compute_time);
+  EXPECT_EQ(Costs::Duration(9400), cost.execution_time);
+  EXPECT_FALSE(cost.inaccurate);
+}
+
+TEST_F(OpLevelCostEstimatorTest, Conv2DExecutionTime) {
+  auto cost = PredictCosts(DescribeConvolution(16, 19, 19, 48, 48, 5, 5, 256));
+  EXPECT_EQ(Costs::Duration(233780), cost.memory_time);
+  EXPECT_EQ(Costs::Duration(354877440), cost.compute_time);
+  EXPECT_EQ(Costs::Duration(355111220), cost.execution_time);
+  EXPECT_FALSE(cost.inaccurate);
+}
+
 TEST_F(OpLevelCostEstimatorTest, DummyExecutionTime) {
-  auto cost = PredictCosts(DescribeOp("Dummy", 1000, 1));
+  auto cost = PredictCosts(DescribeBinaryOp("Dummy", 1000, 1));
   EXPECT_EQ(Costs::Duration(2000), cost.memory_time);
   EXPECT_EQ(Costs::Duration(0), cost.compute_time);
   EXPECT_EQ(Costs::Duration(2000), cost.execution_time);
@@ -174,7 +226,7 @@ TEST_F(OpLevelCostEstimatorTest, DummyExecutionTime) {
 
 TEST_F(OpLevelCostEstimatorTest, ExecutionTimeSumOrMax) {
   SetComputeMemoryOverlap(true);
-  auto cost = PredictCosts(DescribeOp("Dummy", 1000, 1));
+  auto cost = PredictCosts(DescribeBinaryOp("Dummy", 1000, 1));
   EXPECT_EQ(Costs::Duration(2000), cost.memory_time);
   EXPECT_EQ(Costs::Duration(0), cost.compute_time);
   EXPECT_EQ(Costs::Duration(2000), cost.execution_time);  // max(2000, 200)
@@ -183,7 +235,7 @@ TEST_F(OpLevelCostEstimatorTest, ExecutionTimeSumOrMax) {
 }
 
 TEST_F(OpLevelCostEstimatorTest, MulExecutionTime) {
-  auto cost = PredictCosts(DescribeOp("Mul", 1000, 1));
+  auto cost = PredictCosts(DescribeBinaryOp("Mul", 1000, 1));
   EXPECT_EQ(Costs::Duration(2000), cost.memory_time);
   EXPECT_EQ(Costs::Duration(200), cost.compute_time);
   EXPECT_EQ(Costs::Duration(2200), cost.execution_time);
@@ -191,7 +243,7 @@ TEST_F(OpLevelCostEstimatorTest, MulExecutionTime) {
 }
 
 TEST_F(OpLevelCostEstimatorTest, MulBroadcastExecutionTime) {
-  auto cost = PredictCosts(DescribeOp("Mul", 1000, 2));
+  auto cost = PredictCosts(DescribeBinaryOp("Mul", 1000, 2));
   EXPECT_EQ(Costs::Duration(3600), cost.memory_time);
   EXPECT_EQ(Costs::Duration(400), cost.compute_time);
   EXPECT_EQ(Costs::Duration(4000), cost.execution_time);
@@ -199,13 +251,21 @@ TEST_F(OpLevelCostEstimatorTest, MulBroadcastExecutionTime) {
 }
 
 TEST_F(OpLevelCostEstimatorTest, ModExecutionTime) {
-  auto cost = PredictCosts(DescribeOp("Mod", 1000, 1));
+  auto cost = PredictCosts(DescribeBinaryOp("Mod", 1000, 1));
   EXPECT_EQ(Costs::Duration(2000), cost.memory_time);
   EXPECT_EQ(Costs::Duration(1600), cost.compute_time);
   EXPECT_EQ(Costs::Duration(3600), cost.execution_time);
   EXPECT_FALSE(cost.inaccurate);
 }
 
+TEST_F(OpLevelCostEstimatorTest, ReluExecutionTime) {
+  auto cost = PredictCosts(DescribeUnaryOp("Relu", 1000));
+  EXPECT_EQ(Costs::Duration(800), cost.memory_time);
+  EXPECT_EQ(Costs::Duration(100), cost.compute_time);
+  EXPECT_EQ(Costs::Duration(900), cost.execution_time);
+  EXPECT_FALSE(cost.inaccurate);
+}
+
 TEST_F(OpLevelCostEstimatorTest, UnknownOrPartialShape) {
   EXPECT_FALSE(PredictCosts(DescribeMatMul(2, 4, 7, 7)).inaccurate);
   EXPECT_TRUE(PredictCosts(DescribeMatMul(-1, 4, 7, 7)).inaccurate);
@@ -247,5 +307,108 @@ TEST_F(OpLevelCostEstimatorTest, BatchMatMul) {
   EXPECT_NE(matmul_inaccurate, batch_matmul_inaccurate);
 }
 
+// Helper functions for testing GetTensorShapeProtoFromTensorProto().
+void GetTensorProto(const DataType dtype, const std::vector<int64>& shape,
+                    const std::vector<int64> values, const bool tensor_content,
+                    TensorProto* tensor_proto) {
+  tensor_proto->Clear();
+  TensorProto temp_tensor_proto;
+  temp_tensor_proto.set_dtype(dtype);
+  for (const auto& x : shape) {
+    temp_tensor_proto.mutable_tensor_shape()->add_dim()->set_size(x);
+  }
+  for (const auto& x : values) {
+    if (dtype == DT_INT64) {
+      temp_tensor_proto.add_int64_val(x);
+    } else if (dtype == DT_INT32 || dtype == DT_INT16 || dtype == DT_INT8 ||
+               dtype == DT_UINT8) {
+      temp_tensor_proto.add_int_val(x);
+    } else if (dtype == DT_UINT32) {
+      temp_tensor_proto.add_uint32_val(x);
+    } else if (dtype == DT_UINT64) {
+      temp_tensor_proto.add_uint64_val(x);
+    } else {
+      CHECK(false) << "Unsupported dtype: " << dtype;
+    }
+  }
+  Tensor tensor(dtype);
+  CHECK(tensor.FromProto(temp_tensor_proto));
+  if (tensor_content) {
+    tensor.AsProtoTensorContent(tensor_proto);
+  } else {
+    tensor.AsProtoField(tensor_proto);
+  }
+}
+
+void ExpectTensorShape(const std::vector<int64>& expected,
+                       const TensorShapeProto& tensor_shape_proto) {
+  TensorShape tensor_shape_expected(expected);
+  TensorShape tensor_shape(tensor_shape_proto);
+
+  LOG(INFO) << "Expected: " << tensor_shape_expected.DebugString();
+  LOG(INFO) << "TensorShape: " << tensor_shape.DebugString();
+  EXPECT_TRUE(tensor_shape_expected == tensor_shape);
+}
+
+TEST_F(OpLevelCostEstimatorTest, GetTensorShapeProtoFromTensorProto) {
+  TensorProto tensor_proto;
+  TensorShapeProto tensor_shape_proto;
+
+  // Dimension larger than max value; should fail while converting to Tensor
+  // class.
+  tensor_proto.mutable_tensor_shape()->add_dim()->set_size(255);
+  EXPECT_FALSE(
+      GetTensorShapeProtoFromTensorProto(tensor_proto, &tensor_shape_proto));
+
+  tensor_proto.Clear();
+  // Expect only 1D shape.
+  tensor_proto.mutable_tensor_shape()->add_dim()->set_size(1);
+  tensor_proto.mutable_tensor_shape()->add_dim()->set_size(2);
+  EXPECT_FALSE(
+      GetTensorShapeProtoFromTensorProto(tensor_proto, &tensor_shape_proto));
+
+  // Expect to handle only integer data types.
+  GetTensorProto(DT_FLOAT, {}, {}, /*tensor_content=*/false, &tensor_proto);
+  EXPECT_FALSE(
+      GetTensorShapeProtoFromTensorProto(tensor_proto, &tensor_shape_proto));
+
+  // Check GetTensorShapeProtoFromTensorProto() returns correct values.
+  {
+    std::vector<int64> shape_expected = {10, 20, 30, 40};
+    GetTensorProto(DT_INT32, {4}, shape_expected, /*tensor_content=*/false,
+                   &tensor_proto);
+    EXPECT_TRUE(
+        GetTensorShapeProtoFromTensorProto(tensor_proto, &tensor_shape_proto));
+    ExpectTensorShape(shape_expected, tensor_shape_proto);
+  }
+
+  {
+    std::vector<int64> shape_expected = {40, 20, 90, 40};
+    GetTensorProto(DT_INT64, {4}, shape_expected, /*tensor_content=*/false,
+                   &tensor_proto);
+    EXPECT_TRUE(
+        GetTensorShapeProtoFromTensorProto(tensor_proto, &tensor_shape_proto));
+    ExpectTensorShape(shape_expected, tensor_shape_proto);
+  }
+
+  {
+    std::vector<int64> shape_expected = {10, 20, 30, 40};
+    GetTensorProto(DT_INT32, {4}, shape_expected, /*tensor_content=*/true,
+                   &tensor_proto);
+    EXPECT_TRUE(
+        GetTensorShapeProtoFromTensorProto(tensor_proto, &tensor_shape_proto));
+    ExpectTensorShape(shape_expected, tensor_shape_proto);
+  }
+
+  {
+    std::vector<int64> shape_expected = {40, 20, 90, 40};
+    GetTensorProto(DT_INT64, {4}, shape_expected, /*tensor_content=*/true,
+                   &tensor_proto);
+    EXPECT_TRUE(
+        GetTensorShapeProtoFromTensorProto(tensor_proto, &tensor_shape_proto));
+    ExpectTensorShape(shape_expected, tensor_shape_proto);
+  }
+}
+
 }  // end namespace grappler
 }  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/costs/utils.cc b/tensorflow/core/grappler/costs/utils.cc
index 602f69f12ea9d24ebd94da73a2a76d1992f3bfb1..076945d5c626b9609448e339fcbd96de3e9d137f 100644
--- a/tensorflow/core/grappler/costs/utils.cc
+++ b/tensorflow/core/grappler/costs/utils.cc
@@ -26,6 +26,8 @@ limitations under the License.
 #include "cuda/include/cudnn.h"
 #endif
 
+#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
+#include "tensorflow/core/common_runtime/gpu/gpu_id_manager.h"
 #include "tensorflow/core/framework/allocation_description.pb.h"
 #include "tensorflow/core/framework/attr_value.pb.h"
 #include "tensorflow/core/framework/op.h"
@@ -200,17 +202,25 @@ std::vector<OpInfo::TensorProperties> FindInputFeatures(
 }
 
 DeviceProperties GetDeviceInfo(const string& device_str) {
+  DeviceProperties unknown;
+  unknown.set_type("UNKNOWN");
+
   DeviceNameUtils::ParsedName parsed;
   if (DeviceNameUtils::ParseFullName(device_str, &parsed)) {
     if (parsed.type == "GPU") {
-      return GetLocalGPUInfo(parsed.id);
+      TfGpuId tf_gpu_id(parsed.id);
+      CudaGpuId cuda_gpu_id;
+      Status s = GpuIdManager::TfToCudaGpuId(tf_gpu_id, &cuda_gpu_id);
+      if (!s.ok()) {
+        LOG(ERROR) << s;
+        return unknown;
+      }
+      return GetLocalGPUInfo(cuda_gpu_id);
     } else if (parsed.type == "CPU") {
       return GetLocalCPUInfo();
     }
   }
-  DeviceProperties device;
-  device.set_type("UNKNOWN");
-  return device;
+  return unknown;
 }
 
 DeviceProperties GetDeviceInfo(const CostGraphDef::Node& node) {
diff --git a/tensorflow/core/grappler/costs/virtual_scheduler.cc b/tensorflow/core/grappler/costs/virtual_scheduler.cc
index 14b4ed7507f6237ea6255f46e060aa3d0f60b34d..3ac3ae0f8f835226bbc3ec5d6cec6cb890a6998f 100644
--- a/tensorflow/core/grappler/costs/virtual_scheduler.cc
+++ b/tensorflow/core/grappler/costs/virtual_scheduler.cc
@@ -325,7 +325,7 @@ Status VirtualScheduler::Init() {
 
   // Get the nodes that would run to output fetch_nodes.
   bool ill_formed = false;
-  std::vector<const NodeDef*> nodes =
+  const std::vector<const NodeDef*> fetch_fanin_nodes =
       ComputeTransitiveFanin(graph, fetch_nodes, &ill_formed);
   if (ill_formed) {
     return errors::InvalidArgument(
@@ -339,7 +339,7 @@ Status VirtualScheduler::Init() {
   // exactly the same as those executed for real. One possible discrepancy could
   // be the control flow nodes, where tf only executes one path.
   std::unordered_map<string, const NodeDef*> name_to_node;
-  for (const auto& node : nodes) {
+  for (const auto& node : fetch_fanin_nodes) {
     name_to_node[node->name()] = node;
   }
 
@@ -360,14 +360,22 @@ Status VirtualScheduler::Init() {
 
   // Build node_map; for each node, create its NodeState and connect its inputs
   // and outputs.
-  for (const auto* curr_node : nodes) {
+  for (const auto* curr_node : fetch_fanin_nodes) {
     auto& curr_node_state = GetNodeStateOrCreateIt(curr_node);
     const string curr_node_device = DeviceName(curr_node);
     std::vector<string> inputs;
     if (IsRecv(*curr_node)) {
       const auto& attr = curr_node->attr();
-      const NodeDef* send = name_to_send[attr.at("tensor_name").s()];
-      inputs = {send->name()};
+      if (attr.count("tensor_name")) {
+        const auto& send_node_name = attr.at("tensor_name").s();
+        auto it = name_to_send.find(send_node_name);
+        // If there is a _Send associated with the curr_node (_Recv), add it as
+        // input.
+        if (it != name_to_send.end()) {
+          const NodeDef* send = it->second;
+          inputs = {send->name()};
+        }
+      }
     } else {
       for (const string& input : curr_node->input()) {
         inputs.push_back(input);
@@ -426,9 +434,11 @@ Status VirtualScheduler::Init() {
         feed_nodes.find(curr_node->name()) != feed_nodes.end();
 
     // Default case: node without inputs are ready at time 0.
-    const bool has_no_inputs = curr_node->input().empty();
+    // Note that we check the inputs vector, which may differ from
+    // curr_node->input(); e.g., we add Send as input to Recv.
+    const bool has_no_inputs = inputs.empty();
 
-    if (!IsRecv(*curr_node) && (given_as_feed || has_no_inputs)) {
+    if (given_as_feed || has_no_inputs) {
       curr_node_state.time_ready = Costs::Duration();
       ready_nodes_->AddNode(curr_node);
       VLOG(3) << "Added ready node: " << curr_node->name();
@@ -451,9 +461,11 @@ Status VirtualScheduler::Init() {
   }
 
   if (!feed_nodes.empty()) {
-    return errors::InvalidArgument(
-        strings::StrCat("Some feed nodes were not found in the graph: ",
-                        str_util::Join(feed_nodes, ",")));
+    // This isn't always a bug: when the caller hasn't specified the exact list
+    // of feed and fetch nodes, by default we consider all placeholders as feed
+    // nodes, but some of them may not be needed for the default fetch node.
+    VLOG(1) << "Some feed nodes were not consumed by the fetch fanin: "
+            << str_util::Join(feed_nodes, ",");
   }
   initialized_ = true;
   return Status::OK();
diff --git a/tensorflow/core/grappler/costs/virtual_scheduler_test.cc b/tensorflow/core/grappler/costs/virtual_scheduler_test.cc
index 53dcb497a6453dfa70c1215352e74e96796ebeb7..f9154e42f984c8dd8e774b83750b41a48087d7bb 100644
--- a/tensorflow/core/grappler/costs/virtual_scheduler_test.cc
+++ b/tensorflow/core/grappler/costs/virtual_scheduler_test.cc
@@ -205,6 +205,25 @@ class VirtualSchedulerTest : public ::testing::Test {
     dependency_["out"] = {"x", "y", "z", "w"};
   }
 
+  // Graph with some placeholder feed nodes that are not in the fetch fan-in.
+  void CreateGrapplerItemWithUnnecessaryPlaceholderNodes() {
+    Scope s = Scope::NewRootScope().WithDevice(kCPU0);
+    auto unnecessary = ops::Placeholder(s.WithOpName("unnecessary"), DT_FLOAT);
+    auto x = ops::Placeholder(s.WithOpName("x"), DT_FLOAT);
+
+    GraphDef def;
+    TF_CHECK_OK(s.ToGraphDef(&def));
+
+    grappler_item_.reset(new GrapplerItem);
+    grappler_item_->id = "test_extra_placeholders";
+    grappler_item_->graph = def;
+    grappler_item_->fetch = {"x"};
+
+    // Grappler Item Builder puts all placeholder nodes into the feed
+    // list by default.
+    grappler_item_->feed = {{"x", Tensor()}, {"unnecessary", Tensor()}};
+  }
+
   // NoOp that takes 7 NoOps as control dependency.
   void CreateGrapplerItemWithControlDependency() {
     Scope s = Scope::NewRootScope().WithDevice(kCPU0);
@@ -394,6 +413,63 @@ versions {
     grappler_item_->fetch = {"Recv"};
   }
 
+  void CreateGrapplerItemWithRecvWithoutSend() {
+    const string gdef_ascii = R"EOF(
+node {
+  name: "Recv"
+  op: "_Recv"
+  device: "/job:localhost/replica:0/task:0/device:CPU:0"
+  attr {
+    key: "client_terminated"
+    value {
+      b: false
+    }
+  }
+  attr {
+    key: "recv_device"
+    value {
+      s: "/job:localhost/replica:0/task:0/device:CPU:0"
+    }
+  }
+  attr {
+    key: "send_device"
+    value {
+      s: "/job:localhost/replica:0/task:0/device:CPU:0"
+    }
+  }
+  attr {
+    key: "send_device_incarnation"
+    value {
+      i: 0
+    }
+  }
+  attr {
+    key: "tensor_name"
+    value {
+      s: "test"
+    }
+  }
+  attr {
+    key: "tensor_type"
+    value {
+      type: DT_FLOAT
+    }
+  }
+}
+library {
+}
+versions {
+  producer: 24
+}
+    )EOF";
+
+    grappler_item_.reset(new GrapplerItem);
+    CHECK(protobuf::TextFormat::ParseFromString(gdef_ascii,
+                                                &grappler_item_->graph));
+    grappler_item_->id = "test_graph";
+    grappler_item_->fetch = {"Recv"};
+  }
+
   // A simple while loop
   void CreateGrapplerItemWithLoop() {
     // Test graph produced in python using:
@@ -1700,6 +1776,16 @@ TEST_F(VirtualSchedulerTest, MemoryUsage) {
                               cpu_state.mem_usage_snapshot_at_peak);
 }
 
+TEST_F(VirtualSchedulerTest, UnnecessaryFeedNodes) {
+  CreateGrapplerItemWithUnnecessaryPlaceholderNodes();
+  InitScheduler();
+
+  // Test that the scheduler can run graphs with extra unnecessary feed nodes.
+  auto ops_executed = RunScheduler("");
+  ASSERT_EQ(1, ops_executed.size());
+  ASSERT_EQ(ops_executed.count("x"), 1);
+}
+
 TEST_F(VirtualSchedulerTest, ControlDependency) {
   // Init.
   CreateGrapplerItemWithControlDependency();
@@ -2015,5 +2101,17 @@ TEST_F(VirtualSchedulerTest, GraphWithSendRecvDifferentDevice) {
             0);
   EXPECT_GT(ops_executed.count("Recv"), 0);
 }
+
+TEST_F(VirtualSchedulerTest, GraphWithOnlyRecv) {
+  // Init.
+  CreateGrapplerItemWithRecvWithoutSend();
+  InitScheduler();
+
+  // Run the scheduler.
+  auto ops_executed = RunScheduler("");
+
+  // A Recv without a matching Send is treated as an initially ready node.
+  EXPECT_GT(ops_executed.count("Recv"), 0);
+}
 }  // end namespace grappler
 }  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/grappler_item.cc b/tensorflow/core/grappler/grappler_item.cc
index 2f8549cf395f6b78154f7a6faf3fea06ea6c56c4..ad86356504e06d31ccc0a315fbd6991e49df0f19 100644
--- a/tensorflow/core/grappler/grappler_item.cc
+++ b/tensorflow/core/grappler/grappler_item.cc
@@ -32,6 +32,7 @@ GrapplerItem::GrapplerItem(const GrapplerItem& other, GraphDef&& graphDef) {
   feed = other.feed;
   fetch = other.fetch;
   init_ops = other.init_ops;
+  keep_ops = other.keep_ops;
   expected_init_time = other.expected_init_time;
   save_op = other.save_op;
   restore_op = other.restore_op;
@@ -82,6 +83,9 @@ std::unordered_set<string> GrapplerItem::NodesToPreserve() const {
   for (const auto& node : init_ops) {
     result.insert(NodeName(node));
   }
+  for (const auto& node : keep_ops) {
+    result.insert(NodeName(node));
+  }
   if (!save_op.empty()) {
     result.insert(NodeName(save_op));
   }
diff --git a/tensorflow/core/grappler/grappler_item.h b/tensorflow/core/grappler/grappler_item.h
index 302685972a7f2908278a881112db9dbfb53c1c1a..06bba544c315476219ee83684df59a3da8720eea 100644
--- a/tensorflow/core/grappler/grappler_item.h
+++ b/tensorflow/core/grappler/grappler_item.h
@@ -58,6 +58,11 @@ struct GrapplerItem {
   // Queue runner(s) required to run the queue(s) of this model.
   std::vector<QueueRunnerDef> queue_runners;
 
+  // List of op names to keep in the graph. This includes nodes that are
+  // referenced in various collections, and therefore must be preserved to
+  // ensure that the optimized metagraph can still be loaded.
+  std::vector<string> keep_ops;
+
   // Return the set of node evaluated during a regular train/inference step.
   std::vector<const NodeDef*> MainOpsFanin() const;
   // Return the set of node run to populate the queues (if any).
@@ -66,7 +71,8 @@ struct GrapplerItem {
   std::vector<const NodeDef*> InitOpsFanin() const;
   // Return the set of variables accessed during a regular train/inference step.
   std::vector<const NodeDef*> MainVariables() const;
-  // Return a set of node names that must be preserved.
+  // Return a set of node names that must be preserved. This includes feed and
+  // fetch nodes, keep_ops, and init_ops.
   std::unordered_set<string> NodesToPreserve() const;
 };
 
diff --git a/tensorflow/core/grappler/grappler_item_builder.cc b/tensorflow/core/grappler/grappler_item_builder.cc
index 7ba498dd06409635d7dfc282ab29f1133e299c9b..33ad426bbf0e82d8966b6df57abb39b6946ba880 100644
--- a/tensorflow/core/grappler/grappler_item_builder.cc
+++ b/tensorflow/core/grappler/grappler_item_builder.cc
@@ -168,12 +168,6 @@ std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
   // Fill in feed nodes from config, if any provided.
   for (const auto& feed_node : cfg.feed_nodes) {
     const string feed_name = NodeName(feed_node);
-    if (feed_name.empty()) {
-      LOG(ERROR) << "Invalid feed node name " << feed_node
-                 << ", skipping this input.";
-      return nullptr;
-    }
-    VLOG(1) << "Will use feed node " << feed_name;
     new_item->feed.emplace_back(feed_name, Tensor());
   }
 
@@ -182,17 +176,75 @@ std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
     const CollectionDef& nodes = meta_graph.collection_def().at("train_op");
     if (nodes.has_node_list()) {
       for (const auto& node : nodes.node_list().value()) {
-        const string name = NodeName(node);
-        if (name.empty()) {
-          LOG(ERROR) << "Invalid fetch node name " << node
-                     << ", skipping this input";
+        new_item->fetch.push_back(NodeName(node));
+      }
+    }
+  }
+
+  // Detect feed and fetch nodes from signature defs.
+  for (const auto& name_and_signature : meta_graph.signature_def()) {
+    for (const auto& name_and_input : name_and_signature.second.inputs()) {
+      const TensorInfo& input = name_and_input.second;
+      if (input.has_coo_sparse()) {
+        // Define the shapes following the comment of CooSparse.
+        PartialTensorShape partial_shape_1d({-1});
+        PartialTensorShape partial_shape_2d({-1, -1});
+        TensorShape shape_1d;
+        TensorShape shape_2d;
+        if (!partial_shape_1d.AsTensorShape(&shape_1d) ||
+            !partial_shape_2d.AsTensorShape(&shape_2d)) {
+          LOG(ERROR) << "Internal error when constructing tensor shapes.";
           return nullptr;
         }
-        VLOG(1) << "Will use fetch node " << name;
-        new_item->fetch.push_back(name);
+
+        new_item->feed.emplace_back(
+            NodeName(input.coo_sparse().values_tensor_name()),
+            Tensor(input.dtype(), shape_1d));
+        new_item->feed.emplace_back(
+            NodeName(input.coo_sparse().indices_tensor_name()),
+            Tensor(DT_INT64, shape_2d));
+        new_item->feed.emplace_back(
+            NodeName(input.coo_sparse().dense_shape_tensor_name()),
+            Tensor(DT_INT64, shape_1d));
+      } else {
+        new_item->feed.emplace_back(
+            NodeName(input.name()),
+            Tensor(input.dtype(), input.tensor_shape()));
       }
     }
+    for (const auto& name_and_output : name_and_signature.second.outputs()) {
+      const TensorInfo& output = name_and_output.second;
+      if (output.has_coo_sparse()) {
+        new_item->fetch.push_back(
+            NodeName(output.coo_sparse().values_tensor_name()));
+        new_item->fetch.push_back(
+            NodeName(output.coo_sparse().indices_tensor_name()));
+        new_item->fetch.push_back(
+            NodeName(output.coo_sparse().dense_shape_tensor_name()));
+      } else {
+        new_item->fetch.push_back(NodeName(output.name()));
+      }
+    }
+  }
+
+  for (const auto& feed : new_item->feed) {
+    if (feed.first.empty()) {
+      LOG(ERROR) << "Invalid feed node name, skipping this input";
+      return nullptr;
+    } else {
+      VLOG(1) << "Will use feed node " << feed.first;
+    }
   }
+
+  for (const auto& fetch : new_item->fetch) {
+    if (fetch.empty()) {
+      LOG(ERROR) << "Invalid fetch node name, skipping this input";
+      return nullptr;
+    } else {
+      VLOG(1) << "Will use fetch node " << fetch;
+    }
+  }
+
   if (new_item->fetch.empty()) {
     LOG(ERROR) << "Failed to detect the fetch node(s), skipping this input";
     return nullptr;
@@ -296,6 +348,14 @@ std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
     }
   }
 
+  // Add each node referenced in a collection to the list of nodes to keep.
+  for (const auto& col : meta_graph.collection_def()) {
+    const CollectionDef& collection = col.second;
+    for (const string& node : collection.node_list().value()) {
+      new_item->keep_ops.push_back(NodeName(node));
+    }
+  }
+
   for (auto& node : *new_item->graph.mutable_node()) {
     if (IsPlaceholder(node) && node.op() != "PlaceholderWithDefault") {
       if (node.attr().count("dtype") == 0) {
@@ -510,113 +570,5 @@ std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
   return new_item;
 }
 
-std::unique_ptr<GrapplerItem> GrapplerItemFromFunctionDef(
-    const FunctionDef& func,
-    const std::unordered_map<string, AttrValue>& func_attr,
-    const FunctionDefLibrary& library) {
-  if (func.signature().name().empty()) {
-    LOG(ERROR) << "function name must be specified.";
-    return nullptr;
-  }
-  std::unique_ptr<GrapplerItem> new_item(new GrapplerItem());
-  new_item->id = func.signature().name();
-
-  std::unordered_map<string, string> port_map;
-
-  // Add the function inputs as placeholder
-  for (const auto& inp : func.signature().input_arg()) {
-    NodeDef* ph = new_item->graph.add_node();
-    ph->set_name(inp.name());
-    ph->set_op("Placeholder");
-    if (inp.type() != DT_INVALID) {
-      (*ph->mutable_attr())["T"].set_type(inp.type());
-    } else {
-      auto it = func_attr.find(inp.type_attr());
-      if (it == func_attr.end()) {
-        LOG(ERROR) << "Unknown type attribute " << inp.type_attr()
-                   << " for function input " << inp.name();
-        return nullptr;
-      } else {
-        (*ph->mutable_attr())["T"] = it->second;
-      }
-    }
-    port_map[inp.name()] = inp.name();
-  }
-
-  // Add the function body to the graph.
-  FunctionLibraryDefinition func_def(OpRegistry::Global(), library);
-
-  for (const NodeDef& node : func.node_def()) {
-    NodeDef* new_node = new_item->graph.add_node();
-    *new_node = node;
-    // Replace the placeholder attribute values with the specified value.
-    for (auto& attr : *new_node->mutable_attr()) {
-      const string& ph_name = attr.second.placeholder();
-      auto it = func_attr.find(ph_name);
-      if (it != func_attr.end()) {
-        attr.second = it->second;
-      }
-    }
-
-    // Functions use a custom format to encode connectivity. Map these custom
-    // strings to regular ones.
-    const OpRegistrationData* registration;
-    Status status = func_def.LookUp(node.op(), &registration);
-    if (!status.ok()) {
-      LOG(ERROR) << "Op " << node.op() << " not registered: " << status;
-      return nullptr;
-    }
-
-    tensorflow::NameRangeMap inputs;
-    tensorflow::NameRangeMap outputs;
-    status = tensorflow::NameRangesForNode(node, registration->op_def, &inputs,
-                                           &outputs);
-    if (!status.ok()) {
-      LOG(ERROR) << "Op " << node.op() << " invalid: " << status;
-      return nullptr;
-    }
-    for (const auto& name_range : outputs) {
-      string port_prefix =
-          strings::StrCat(node.name(), ":", name_range.first, ":");
-      int index_start = name_range.second.first;
-      int index_end = name_range.second.second;
-      for (int i = index_start; i < index_end; ++i) {
-        string port_id = strings::StrCat(port_prefix, i - index_start);
-        string port_name = strings::StrCat(node.name(), ":", i);
-        port_map[port_id] = port_name;
-      }
-    }
-  }
-
-  for (auto& node : *new_item->graph.mutable_node()) {
-    // Rewrite the inputs to use the normal naming convention.
-    for (int i = 0; i < node.input_size(); ++i) {
-      const string& input = node.input(i);
-      if (IsControlInput(input)) {
-        // No need to remap control dependencies.
-        continue;
-      } else {
-        auto it = port_map.find(input);
-        if (it == port_map.end()) {
-          LOG(ERROR) << "Unknown input: " << input;
-          return nullptr;
-        }
-        node.set_input(i, it->second);
-      }
-    }
-  }
-
-  // Add the function outputs to the list of fetch nodes.
-  for (const auto& out : func.signature().output_arg()) {
-    new_item->fetch.emplace_back(out.name());
-  }
-  // Add the function inputs to the list of feeds.
-  for (const auto& inp : func.signature().input_arg()) {
-    new_item->feed.emplace_back(inp.name(), Tensor());
-  }
-
-  return new_item;
-}
-
 }  // end namespace grappler
 }  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/grappler_item_builder.h b/tensorflow/core/grappler/grappler_item_builder.h
index e892a3f556f7e9ccba91d5ce672a12d2eac49f5a..c877d911636d8620e7951a5d8279e426d109b2d3 100644
--- a/tensorflow/core/grappler/grappler_item_builder.h
+++ b/tensorflow/core/grappler/grappler_item_builder.h
@@ -58,13 +58,6 @@ struct ItemConfig {
 std::unique_ptr<GrapplerItem> GrapplerItemFromMetaGraphDef(
     const string& id, const MetaGraphDef& meta_graph, const ItemConfig& cfg);
 
-// Factory method for creating a GrapplerItem from a FunctionDef.
-// Returns nullptr if the given function def cannot be converted.
-std::unique_ptr<GrapplerItem> GrapplerItemFromFunctionDef(
-    const FunctionDef& func,
-    const std::unordered_map<string, AttrValue>& func_attr,
-    const FunctionDefLibrary& library);
-
 }  // end namespace grappler
 }  // end namespace tensorflow
 
diff --git a/tensorflow/core/grappler/grappler_item_builder_test.cc b/tensorflow/core/grappler/grappler_item_builder_test.cc
index 68437b60419f73419bca4467b409818bc0b11650..78cbff6c902f1dd83d6e2132887211ec8d5a1e15 100644
--- a/tensorflow/core/grappler/grappler_item_builder_test.cc
+++ b/tensorflow/core/grappler/grappler_item_builder_test.cc
@@ -280,203 +280,57 @@ TEST_F(GrapplerItemBuilderTest, GraphWithFunctions) {
   ASSERT_TRUE(item != nullptr);
 }
 
-TEST_F(GrapplerItemBuilderTest, FromSimpleFunctionDef) {
-  const Tensor kTwo = test::AsScalar<int64>(2);
-  FunctionDef func = FunctionDefHelper::Define(
-      // Name
-      "XTimesTwo",
-      // Args
-      {"x: T"},
-      // Return values
-      {"y: T"},
-      // Attr def
-      {"T: {float, double, int32, int64}"},
-      // Nodes
-      {
-          {{"two"}, "Const", {}, {{"value", kTwo}, {"dtype", DT_INT64}}},
-          {{"scale"}, "Cast", {"two"}, {{"SrcT", DT_INT64}, {"DstT", "$T"}}},
-          {{"y"}, "Mul", {"x", "scale"}, {{"T", "$T"}}},
-      });
+TEST_F(GrapplerItemBuilderTest, FromGraphWithSignatureDef) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  auto x = ops::Const(s.WithOpName("x"), 0);
+  auto y = ops::Const(s.WithOpName("y"), 1);
+  auto z = ops::Add(s.WithOpName("z"), x, y);
 
-  std::unordered_map<string, AttrValue> func_attr;
-  func_attr["T"].set_type(DT_FLOAT);
-  FunctionDefLibrary library;
-  std::unique_ptr<GrapplerItem> item =
-      GrapplerItemFromFunctionDef(func, func_attr, library);
-  CHECK(item);
-  EXPECT_EQ("XTimesTwo", item->id);
-  EXPECT_EQ(4, item->graph.node_size());
-  EXPECT_EQ(std::vector<string>({"y"}), item->fetch);
-  EXPECT_EQ(1, item->feed.size());
-  EXPECT_EQ("x", item->feed[0].first);
+  MetaGraphDef meta_graph;
+  TF_CHECK_OK(s.ToGraphDef(meta_graph.mutable_graph_def()));
 
-  for (const NodeDef &node : item->graph.node()) {
-    if (node.name() == "x") {
-      EXPECT_EQ("Placeholder", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
-      EXPECT_EQ(0, node.input_size());
-    } else if (node.name() == "two") {
-      EXPECT_EQ("Const", node.op());
-      EXPECT_EQ(0, node.input_size());
-    } else if (node.name() == "scale") {
-      EXPECT_EQ("Cast", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("DstT").type());
-      EXPECT_EQ(1, node.input_size());
-      EXPECT_EQ("two:0", node.input(0));
-    } else if (node.name() == "y") {
-      EXPECT_EQ("Mul", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("x", node.input(0));
-      EXPECT_EQ("scale:0", node.input(1));
-    }
-  }
-}
+  TensorInfo input, output;
+  input.set_name("x");
+  input.set_dtype(DT_FLOAT);
+  output.set_name("z");
+  SignatureDef serving_signature;
+  (*serving_signature.mutable_inputs())["input"] = input;
+  (*serving_signature.mutable_outputs())["output"] = output;
+  (*meta_graph.mutable_signature_def())["serving"] = serving_signature;
 
-TEST_F(GrapplerItemBuilderTest, FromFunctionDefWithMultiOutputNodes) {
-  // Gradient graph for the Subtract operation
-  std::vector<FunctionDefHelper::Node> nodes = {
-      {{"sx"}, "Shape", {"x"}},
-      {{"sy"}, "Shape", {"y"}},
-      {{"gx"}, "Identity", {"dz"}},
-      {{"gy"}, "Neg", {"dz"}},
-      {{"rx", "ry"}, "BroadcastGradientArgs", {"sx", "sy"}},
-      {{"sum_gx"}, "Sum", {"gx", "rx"}},
-      {{"dx"}, "Reshape", {"sum_gx", "sx"}},
-      {{"sum_gy"}, "Sum", {"gy", "ry"}},
-      {{"dy"}, "Reshape", {"sum_gy", "sy"}},
-  };
-
-  for (auto &n : nodes) {
-    // "BroadcastGradientArgs" doesn't need any attrs.
-    if (n.attr.empty() && n.op != "BroadcastGradientArgs") {
-      n.attr = {{"T", "$T"}};
-    }
-  }
-  FunctionDef func = FunctionDefHelper::Define(
-      // Name
-      "SubGrad",
-      // Arg defs
-      {"x: T", "y: T", "dz: T"},
-      // Ret val defs
-      {"dx: T", "dy: T"},
-      // Attr defs
-      {{"T: {half, float, double}"}},
-      // Nodes
-      nodes);
-
-  std::unordered_map<string, AttrValue> func_attr;
-  func_attr["T"].set_type(DT_FLOAT);
-  FunctionDefLibrary library;
   std::unique_ptr<GrapplerItem> item =
-      GrapplerItemFromFunctionDef(func, func_attr, library);
-  CHECK(item);
-  EXPECT_EQ("SubGrad", item->id);
-  EXPECT_EQ(12, item->graph.node_size());
-  EXPECT_EQ(std::vector<string>({"dx", "dy"}), item->fetch);
-  EXPECT_EQ(3, item->feed.size());
-  EXPECT_EQ("x", item->feed[0].first);
-  EXPECT_EQ("y", item->feed[1].first);
-  EXPECT_EQ("dz", item->feed[2].first);
+      GrapplerItemFromMetaGraphDef("0", meta_graph, ItemConfig());
+  ASSERT_TRUE(item != nullptr);
 
-  for (const NodeDef &node : item->graph.node()) {
-    if (node.name() == "x" || node.name() == "y" || node.name() == "dz") {
-      EXPECT_EQ("Placeholder", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
-      EXPECT_EQ(0, node.input_size());
-    } else if (node.name() == "rx") {
-      EXPECT_EQ("BroadcastGradientArgs", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("sx:0", node.input(0));
-      EXPECT_EQ("sy:0", node.input(1));
-    } else if (node.name() == "sum_gx") {
-      EXPECT_EQ("Sum", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("gx:0", node.input(0));
-      EXPECT_EQ("rx:0", node.input(1));
-    } else if (node.name() == "sum_gy") {
-      EXPECT_EQ("Sum", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("gy:0", node.input(0));
-      EXPECT_EQ("rx:1", node.input(1));
-    }
-  }
+  EXPECT_EQ(item->feed[0].first, "x");
+  EXPECT_EQ(item->fetch[0], "z");
 }
 
-TEST_F(GrapplerItemBuilderTest, FromFunctionDefWithNestedFuncs) {
-  FunctionDefLibrary library;
-  *library.add_function() = FunctionDefHelper::Define(
-      // Name
-      "Swap",
-      // Args
-      {"i0: T", "i1: T"},
-      // Return values
-      {"o0: T", "o1: T"},
-      // Attr def
-      {"T: {float, double}"},
-      // Nodes
-      {{{"o0"}, "Identity", {"i1"}, {{"T", "$T"}}},
-       {{"o1"}, "Identity", {"i0"}, {{"T", "$T"}}}});
-
-  FunctionDef func = FunctionDefHelper::Create(
-      // Name
-      "ManySwapsFirst",
-      // Args
-      {"x: float", "y: float"},
-      // Return values
-      {"o: float"},
-      // attr def
-      {},
-      // Nodes
-      // o = x*x + y*y.  Furthermore, The 1st swap depends on x2, and
-      // y2 depends on the 2nd swap.  The 2nd swap has data dependency
-      // on the 1st swap.
-      {{{"a0"}, "Swap", {"x", "y"}, {{"T", DT_FLOAT}}, {"x2"}},
-       {{"a1"}, "Swap", {"a0:o0:0", "a0:o1:0"}, {{"T", DT_FLOAT}}},
-       {{"x2"}, "Mul", {"x", "x"}, {{"T", DT_FLOAT}}},
-       {{"y2"}, "Mul", {"y", "y"}, {{"T", DT_FLOAT}}, {"a1"}},
-       {{"o"}, "Add", {"x2:z:0", "y2:z:0"}, {{"T", DT_FLOAT}}}},
-      {{"o", "o:z:0"}});
-
-  std::unordered_map<string, AttrValue> func_attr;
-  func_attr["T"].set_type(DT_FLOAT);
-  std::unique_ptr<GrapplerItem> item =
-      GrapplerItemFromFunctionDef(func, func_attr, library);
+TEST_F(GrapplerItemBuilderTest, FromGraphWithIncompleteSignatureDef) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  auto x = ops::Const(s.WithOpName("x"), 0);
+  auto y = ops::Const(s.WithOpName("y"), 1);
 
-  for (const NodeDef &node : item->graph.node()) {
-    if (node.name() == "x" || node.name() == "y") {
-      EXPECT_EQ("Placeholder", node.op());
-      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
-      EXPECT_EQ(0, node.input_size());
-    } else if (node.name() == "a0") {
-      EXPECT_EQ("Swap", node.op());
-      EXPECT_EQ(3, node.input_size());
-      EXPECT_EQ("x", node.input(0));
-      EXPECT_EQ("y", node.input(1));
-      EXPECT_EQ("^x2", node.input(2));
-    } else if (node.name() == "a1") {
-      EXPECT_EQ("Swap", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("a0:0", node.input(0));
-      EXPECT_EQ("a0:1", node.input(1));
-    } else if (node.name() == "x2") {
-      EXPECT_EQ("Mul", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("x", node.input(0));
-      EXPECT_EQ("x", node.input(1));
-    } else if (node.name() == "y2") {
-      EXPECT_EQ("Mul", node.op());
-      EXPECT_EQ(3, node.input_size());
-      EXPECT_EQ("y", node.input(0));
-      EXPECT_EQ("y", node.input(1));
-      EXPECT_EQ("^a1", node.input(2));
-    } else if (node.name() == "o") {
-      EXPECT_EQ("Add", node.op());
-      EXPECT_EQ(2, node.input_size());
-      EXPECT_EQ("x2:0", node.input(0));
-      EXPECT_EQ("y2:0", node.input(1));
-    }
-  }
+  MetaGraphDef meta_graph;
+  TF_CHECK_OK(s.ToGraphDef(meta_graph.mutable_graph_def()));
+
+  CollectionDef train_op;
+  train_op.mutable_node_list()->add_value("y");
+  (*meta_graph.mutable_collection_def())["train_op"] = train_op;
+
+  TensorInfo input, output;
+  input.set_name("x");
+  input.set_dtype(DT_FLOAT);
+  // Its coo_sparse proto is incomplete.
+  output.mutable_coo_sparse()->set_values_tensor_name("z");
+  SignatureDef serving_signature;
+  (*serving_signature.mutable_inputs())["input"] = input;
+  (*serving_signature.mutable_outputs())["output"] = output;
+  (*meta_graph.mutable_signature_def())["serving"] = serving_signature;
+
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromMetaGraphDef("0", meta_graph, ItemConfig());
+  ASSERT_TRUE(item == nullptr);
 }
 
 }  // namespace
diff --git a/tensorflow/core/grappler/op_types.cc b/tensorflow/core/grappler/op_types.cc
index fdf4540540b4b9f3d64ea767240ca4ea0c353d48..fb46b584b231df185bb189e16dec68e11ca8dab5 100644
--- a/tensorflow/core/grappler/op_types.cc
+++ b/tensorflow/core/grappler/op_types.cc
@@ -256,6 +256,10 @@ bool IsRestore(const NodeDef& node) {
           node.op() == "RestoreSlice");
 }
 
+bool IsReverse(const NodeDef& node) {
+  return node.op() == "Reverse" || node.op() == "ReverseV2";
+}
+
 bool IsReverseV2(const NodeDef& node) { return node.op() == "ReverseV2"; }
 
 bool IsRsqrtGrad(const NodeDef& node) { return node.op() == "RsqrtGrad"; }
@@ -272,6 +276,10 @@ bool IsShape(const NodeDef& node) { return node.op() == "Shape"; }
 
 bool IsShapeN(const NodeDef& node) { return node.op() == "ShapeN"; }
 
+bool IsShuffle(const NodeDef& node) {
+  return node.op() == "Shuffle" || node.op() == "RandomShuffle";
+}
+
 bool IsSigmoidGrad(const NodeDef& node) { return node.op() == "SigmoidGrad"; }
 
 bool IsSlice(const NodeDef& node) { return node.op() == "Slice"; }
@@ -292,6 +300,19 @@ bool IsSquaredDifference(const NodeDef& node) {
 
 bool IsSqueeze(const NodeDef& node) { return node.op() == "Squeeze"; }
 
+bool IsStackOp(const NodeDef& node) {
+  return node.op() == "Stack" || node.op() == "StackV2";
+}
+bool IsStackCloseOp(const NodeDef& node) {
+  return node.op() == "StackClose" || node.op() == "StackCloseV2";
+}
+bool IsStackPushOp(const NodeDef& node) {
+  return node.op() == "StackPush" || node.op() == "StackPushV2";
+}
+bool IsStackPopOp(const NodeDef& node) {
+  return node.op() == "StackPop" || node.op() == "StackPopV2";
+}
+
 bool IsStopGradient(const NodeDef& node) {
   const auto& op = node.op();
   return op == "StopGradient" || op == "PreventGradient";
@@ -346,7 +367,8 @@ bool IsFreeOfSideEffect(const NodeDef& node) {
     return false;
   }
   const OpDef* op_def = nullptr;
-  Status status = OpRegistry::Global()->LookUpOpDef(node.op(), &op_def);
+  const string& op_name = node.op();
+  Status status = OpRegistry::Global()->LookUpOpDef(op_name, &op_def);
   if (!status.ok()) {
     return false;
   }
@@ -360,7 +382,8 @@ bool IsFreeOfSideEffect(const NodeDef& node) {
     }
   }
   // Some nodes do in-place updates on regular tensor inputs.
-  if (GetBoolAttr(node, "in_place") || GetBoolAttr(node, "inplace")) {
+  if (GetBoolAttr(node, "in_place") || GetBoolAttr(node, "inplace") ||
+      StringPiece(op_name).starts_with("Inplace")) {
     return false;
   }
   return true;
diff --git a/tensorflow/core/grappler/op_types.h b/tensorflow/core/grappler/op_types.h
index 9cda40c0a6515caa9754d0c2f4f50a32f9fe8d98..a7c33ef97bca5c20417e6d5f24bcb40572e15b0a 100644
--- a/tensorflow/core/grappler/op_types.h
+++ b/tensorflow/core/grappler/op_types.h
@@ -100,6 +100,7 @@ bool IsRecv(const NodeDef& node);
 bool IsReduction(const NodeDef& node);
 bool IsReshape(const NodeDef& node);
 bool IsRestore(const NodeDef& node);
+bool IsReverse(const NodeDef& node);
 bool IsReverseV2(const NodeDef& node);
 bool IsRsqrtGrad(const NodeDef& node);
 bool IsSelect(const NodeDef& node);
@@ -108,6 +109,7 @@ bool IsSend(const NodeDef& node);
 bool IsSlice(const NodeDef& node);
 bool IsShape(const NodeDef& node);
 bool IsShapeN(const NodeDef& node);
+bool IsShuffle(const NodeDef& node);
 bool IsSigmoidGrad(const NodeDef& node);
 bool IsSoftplusGrad(const NodeDef& node);
 bool IsSoftsignGrad(const NodeDef& node);
@@ -116,6 +118,10 @@ bool IsSplitV(const NodeDef& node);
 bool IsSqrtGrad(const NodeDef& node);
 bool IsSquaredDifference(const NodeDef& node);
 bool IsSqueeze(const NodeDef& node);
+bool IsStackOp(const NodeDef& node);
+bool IsStackCloseOp(const NodeDef& node);
+bool IsStackPushOp(const NodeDef& node);
+bool IsStackPopOp(const NodeDef& node);
 bool IsStopGradient(const NodeDef& node);
 bool IsStridedSlice(const NodeDef& node);
 bool IsStridedSliceGrad(const NodeDef& node);
diff --git a/tensorflow/core/grappler/optimizers/BUILD b/tensorflow/core/grappler/optimizers/BUILD
index e839630605a96f1528114f98b88e90a7a20b0a3a..fd790d0888df4ea2ae0e5e64f1ccd473c6a77a6d 100644
--- a/tensorflow/core/grappler/optimizers/BUILD
+++ b/tensorflow/core/grappler/optimizers/BUILD
@@ -1,6 +1,9 @@
 licenses(["notice"])  # Apache 2.0
 
 load("//tensorflow:tensorflow.bzl", "tf_cc_test")
+load("//tensorflow:tensorflow.bzl", "tf_cc_test_gpu")
+load("//tensorflow:tensorflow.bzl", "tf_kernel_library")
+load("@local_config_cuda//cuda:build_defs.bzl", "if_cuda")
 
 filegroup(
     name = "all_files",
@@ -129,6 +132,45 @@ tf_cc_test(
     ],
 )
 
+cc_library(
+    name = "function_optimizer",
+    srcs = ["function_optimizer.cc"],
+    hdrs = [
+        "function_optimizer.h",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_optimizer",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:op_types",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/utils:functions",
+    ],
+)
+
+tf_cc_test(
+    name = "function_optimizer_test",
+    srcs = ["function_optimizer_test.cc"],
+    deps = [
+        ":function_optimizer",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/cc:cc_ops_internal",
+        "//tensorflow/core:all_kernels",
+        "//tensorflow/core:core_cpu",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:direct_session",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/utils:grappler_test",
+    ],
+)
+
 cc_library(
     name = "graph_rewriter",
     srcs = ["graph_rewriter.cc"],
@@ -157,6 +199,18 @@ cc_library(
     ],
 )
 
+cc_library(
+    name = "custom_graph_optimizer",
+    hdrs = [
+        "custom_graph_optimizer.h",
+    ],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":graph_optimizer",
+        "//tensorflow/core:lib",
+    ],
+)
+
 cc_library(
     name = "arithmetic_optimizer",
     srcs = ["arithmetic_optimizer.cc"],
@@ -270,9 +324,36 @@ tf_cc_test(
     ],
 )
 
+tf_kernel_library(
+    name = "gpu_swapping_kernels",
+    srcs = [
+        "gpu_swapping_kernels.cc",
+    ],
+    deps = [
+        "//tensorflow/core:core_cpu_base",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+    ],
+)
+
+cc_library(
+    name = "gpu_swapping_ops",
+    srcs = [
+        "gpu_swapping_ops.cc",
+    ],
+    deps = [
+        "//tensorflow/core:core_cpu_base",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+    ],
+    alwayslink = 1,
+)
+
 cc_library(
     name = "memory_optimizer",
-    srcs = ["memory_optimizer.cc"],
+    srcs = [
+        "memory_optimizer.cc",
+    ],
     hdrs = [
         "memory_optimizer.h",
     ],
@@ -282,6 +363,7 @@ cc_library(
         ":graph_rewriter",
         ":static_schedule",
         "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core/grappler:graph_view",
         "//tensorflow/core/grappler:grappler_item",
@@ -292,12 +374,16 @@ cc_library(
         "//tensorflow/core/grappler/costs:graph_properties",
         "//tensorflow/core/grappler/utils:topological_sort",
         "//tensorflow/core/grappler/utils:traversal",
-    ],
+    ] + if_cuda([
+        ":gpu_swapping_kernels",
+        ":gpu_swapping_ops",
+    ]),
 )
 
-tf_cc_test(
+tf_cc_test_gpu(
     name = "memory_optimizer_test",
     srcs = ["memory_optimizer_test.cc"],
+    tags = ["no_cuda_on_cpu_tap"],
     deps = [
         ":memory_optimizer",
         "//tensorflow/cc:cc_ops",
@@ -368,7 +454,10 @@ cc_library(
         ":arithmetic_optimizer",
         ":auto_parallel",
         ":constant_folding",
+        ":custom_graph_optimizer",
+        ":custom_graph_optimizer_registry",
         ":dependency_optimizer",
+        ":function_optimizer",
         ":graph_optimizer",
         ":layout_optimizer",
         ":loop_optimizer",
@@ -382,6 +471,48 @@ cc_library(
     ],
 )
 
+tf_cc_test(
+    name = "meta_optimizer_test",
+    srcs = ["meta_optimizer_test.cc"],
+    deps = [
+        ":custom_graph_optimizer",
+        ":custom_graph_optimizer_registry",
+        ":meta_optimizer",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:tensorflow",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:utils",
+        "//tensorflow/core/grappler/inputs:trivial_test_graph_input_yielder",
+    ],
+)
+
+cc_library(
+    name = "custom_graph_optimizer_registry",
+    srcs = ["custom_graph_optimizer_registry.cc"],
+    hdrs = ["custom_graph_optimizer_registry.h"],
+    visibility = ["//visibility:public"],
+    deps = [
+        ":custom_graph_optimizer",
+        "//tensorflow/core:lib",
+    ],
+)
+
+tf_cc_test(
+    name = "custom_graph_optimizer_registry_test",
+    size = "small",
+    srcs = ["custom_graph_optimizer_registry_test.cc"],
+    deps = [
+        ":custom_graph_optimizer",
+        ":custom_graph_optimizer_registry",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+    ],
+)
+
 cc_library(
     name = "loop_optimizer",
     srcs = ["loop_optimizer.cc"],
@@ -390,6 +521,7 @@ cc_library(
     ],
     visibility = ["//visibility:public"],
     deps = [
+        ":constant_folding",
         ":graph_optimizer",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
@@ -399,6 +531,7 @@ cc_library(
         "//tensorflow/core/grappler:op_types",
         "//tensorflow/core/grappler:utils",
         "//tensorflow/core/grappler/costs:graph_properties",
+        "//tensorflow/core/grappler/utils:frame",
     ],
 )
 
@@ -406,6 +539,10 @@ tf_cc_test(
     name = "loop_optimizer_test",
     size = "small",
     srcs = ["loop_optimizer_test.cc"],
+    tags = [
+        "manual",
+        "no_oss",  # b/74111495
+    ],
     deps = [
         ":loop_optimizer",
         "//tensorflow/cc:cc_ops",
diff --git a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
index 9c544c82bf7f77760e5a2090ca947fd7185e27b7..709a434e40e887502cac1317870eb0db8e0c2910 100644
--- a/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/arithmetic_optimizer.cc
@@ -45,45 +45,6 @@ namespace tensorflow {
 namespace grappler {
 namespace {
 
-template <typename T>
-bool SafeSetTensorValue(double value, Tensor* tensor) {
-  using RealType = typename Eigen::NumTraits<T>::Real;
-  if (value > std::numeric_limits<RealType>::max() ||
-      value < std::numeric_limits<RealType>::min()) {
-    return false;
-  }
-  tensor->flat<T>()(0) = static_cast<T>(value);
-  return true;
-}
-
-#define HANDLE_CASE(DTYPE)                                          \
-  case DTYPE:                                                       \
-    if (!SafeSetTensorValue<EnumToDataType<DTYPE>::Type>(           \
-            static_cast<double>(value), tensor)) {                  \
-      return errors::InvalidArgument("Cannot store value ", value,  \
-                                     " in tensor of type " #DTYPE); \
-    }                                                               \
-    break
-
-Status SetTensorValue(DataType dtype, int value, Tensor* tensor) {
-  switch (dtype) {
-    //    HANDLE_CASE(DT_HALF);
-    HANDLE_CASE(DT_FLOAT);
-    HANDLE_CASE(DT_DOUBLE);
-    HANDLE_CASE(DT_UINT8);
-    HANDLE_CASE(DT_INT8);
-    HANDLE_CASE(DT_UINT16);
-    HANDLE_CASE(DT_INT16);
-    HANDLE_CASE(DT_INT32);
-    HANDLE_CASE(DT_INT64);
-    HANDLE_CASE(DT_COMPLEX64);
-    HANDLE_CASE(DT_COMPLEX128);
-    default:
-      return errors::InvalidArgument("Unexpected type ", DataTypeString(dtype));
-  }
-  return Status::OK();
-}
-
 template <typename T>
 bool AreInversePermutations(const std::vector<T>& a, const std::vector<T>& b) {
   if (a.size() != b.size()) {
@@ -870,8 +831,13 @@ string ArithmeticOptimizer::TrySimplifyAndReplaceUses(
       }
       TensorValue value(&t);
       NodeDef* new_const_node = AddNode(*node, "const", /*copy_node=*/false);
-      *new_const_node =
-          ConstantFolding::CreateNodeDef(new_const_node->name(), value);
+      status = ConstantFolding::CreateNodeDef(new_const_node->name(), value,
+                                              new_const_node);
+      if (!status.ok()) {
+        LOG(WARNING) << "Failed to create const node: "
+                     << status.error_message();
+        return "";
+      }
       new_const_node->set_device(node->device());
       nodes_to_simplify->PushBack(new_const_node);
 
@@ -1077,7 +1043,12 @@ Status ArithmeticOptimizer::SimplifyArithmeticOps() {
       // consumers of `node` are already redirected to `simplified_tensor`.
       // Re-push the consumers into `nodes_to_simplify` for further
       // optimizations.
-      std::set<NodeDef*> consumers = node_map_->GetOutputs(node->name());
+      const std::set<NodeDef*> outputs = node_map_->GetOutputs(node->name());
+      std::vector<NodeDef*> consumers(outputs.begin(), outputs.end());
+      std::sort(consumers.begin(), consumers.end(),
+                [](const NodeDef* n1, const NodeDef* n2) {
+                  return n1->name() < n2->name();
+                });
       for (NodeDef* consumer : consumers) {
         // Update `consumer`'s use of `node` to `input`'s operand.
         for (int i = 0; i < consumer->input_size(); ++i) {
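Note on the hunk above: a `std::set<NodeDef*>` iterates in pointer order, which can differ from run to run, so the patch copies the consumers into a vector and sorts them by node name before visiting them. The same pattern is sketched below with a stand-in type. `Node` and `SortedByName` are hypothetical names used only for illustration; they are not TensorFlow types or functions.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Stand-in for NodeDef; only the name matters for ordering.
struct Node {
  std::string name;
};

// Copies the pointers out of the set and orders them by node name, so the
// traversal order no longer depends on where the nodes live in memory.
std::vector<Node*> SortedByName(const std::set<Node*>& nodes) {
  std::vector<Node*> sorted(nodes.begin(), nodes.end());
  std::sort(sorted.begin(), sorted.end(),
            [](const Node* a, const Node* b) { return a->name < b->name; });
  return sorted;
}
```

The sketch assumes node names are unique, as they are within a graph; with duplicate names the relative order of equal elements would still be unspecified.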
diff --git a/tensorflow/core/grappler/optimizers/constant_folding.cc b/tensorflow/core/grappler/optimizers/constant_folding.cc
index b8a21ea5a15ec1556db47b13db43d19bf070c266..77804142e61366eda980022d0525463a8e0e91c8 100644
--- a/tensorflow/core/grappler/optimizers/constant_folding.cc
+++ b/tensorflow/core/grappler/optimizers/constant_folding.cc
@@ -35,7 +35,9 @@ limitations under the License.
 #include "tensorflow/core/lib/gtl/inlined_vector.h"
 #include "tensorflow/core/lib/strings/numbers.h"
 #include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/denormal.h"
 #include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/platform/setround.h"
 #include "tensorflow/core/platform/tensor_coding.h"
 #include "tensorflow/core/public/version.h"
 #include "tensorflow/core/util/bcast.h"
@@ -51,7 +53,14 @@ class EigenThreadPoolWrapper : public Eigen::ThreadPoolInterface {
   explicit EigenThreadPoolWrapper(thread::ThreadPool* pool) : pool_(pool) {}
   ~EigenThreadPoolWrapper() override {}
   void Schedule(std::function<void()> fn) override {
-    pool_->Schedule(std::move(fn));
+    auto wrapped = [=]() {
+      // TensorFlow flushes denormals to zero and rounds to nearest, so we do
+      // the same here.
+      port::ScopedFlushDenormal flush;
+      port::ScopedSetRound round(FE_TONEAREST);
+      fn();
+    };
+    pool_->Schedule(std::move(wrapped));
   }
   int NumThreads() const override { return pool_->NumThreads(); }
   int CurrentThreadId() const override { return pool_->CurrentThreadId(); }
@@ -292,16 +301,16 @@ Status ConstantFolding::MaterializeShapes(const GraphProperties& properties) {
   // graph.
   const int node_count = graph_->node_size();
   for (int i = 0; i < node_count; ++i) {
-    NodeDef& node = *graph_->mutable_node(i);
-    const string op = node.op();
+    NodeDef* node = graph_->mutable_node(i);
+    const string op = node->op();
     if (op != "Shape" && op != "Size" && op != "Rank" && op != "ShapeN") {
       continue;
     }
 
     const std::vector<OpInfo::TensorProperties>& output =
-        properties.GetOutputProperties(node.name());
+        properties.GetOutputProperties(node->name());
     const std::vector<OpInfo::TensorProperties>& input =
-        properties.GetInputProperties(node.name());
+        properties.GetInputProperties(node->name());
     if (input.empty() || output.empty()) {
       continue;
     }
@@ -328,35 +337,35 @@ Status ConstantFolding::MaterializeShapes(const GraphProperties& properties) {
         // could have multiple outputs).
         if (op == "Shape" || op == "Size" || op == "Rank") {
           // Replace the node with the corresponding constant.
-          node.set_op("Const");
-          node.clear_attr();
-          (*node.mutable_attr())["dtype"].set_type(type);
+          node->set_op("Const");
+          node->clear_attr();
+          (*node->mutable_attr())["dtype"].set_type(type);
           value.AsProtoTensorContent(
-              (*node.mutable_attr())["value"].mutable_tensor());
+              (*node->mutable_attr())["value"].mutable_tensor());
 
           // Turn the data input into a control dependency: this is needed to
           // ensure that the constant value will only be run in the
           // cases where the shape/rank/size would have been run in
           // the original graph. Additional inputs are extra control
           string ctrl_dep =
-              AddControlDependency(node.input(0), graph_, node_map_.get());
-          node.set_input(0, ctrl_dep);
-          node_map_->AddOutput(NodeName(ctrl_dep), node.name());
+              AddControlDependency(node->input(0), graph_, node_map_.get());
+          node->set_input(0, ctrl_dep);
+          node_map_->AddOutput(NodeName(ctrl_dep), node->name());
         } else {
-          auto outputs = node_map_->GetOutputs(node.name());
+          auto outputs = node_map_->GetOutputs(node->name());
           for (const auto& output : outputs) {
             for (int k = 0; k < output->input_size(); ++k) {
               int port;
               string node_name = ParseNodeName(output->input(k), &port);
-              if (node_name == node.name() && port == j) {
+              if (node_name == node->name() && port == j) {
                 // Create a const node as ShapeN's output if not already.
                 const string const_name =
-                    OptimizedNodeName(node, strings::StrCat("-matshapes-", j));
+                    OptimizedNodeName(*node, strings::StrCat("-matshapes-", j));
                 if (node_map_->GetNode(const_name) == nullptr) {
                   NodeDef* added_node = graph_->add_node();
                   added_node->set_name(const_name);
                   added_node->set_op("Const");
-                  added_node->set_device(node.device());
+                  added_node->set_device(node->device());
                   node_map_->AddNode(added_node->name(), added_node);
                   (*added_node->mutable_attr())["dtype"].set_type(type);
                   value.AsProtoTensorContent(
@@ -364,7 +373,7 @@ Status ConstantFolding::MaterializeShapes(const GraphProperties& properties) {
                   // We add a control dependency to the original ShapeN node,
                   // so that the node will only be run if all inputs of the
                   // original ShapeN node are run.
-                  string ctrl_dep = AddControlDependency(node.name(), graph_,
+                  string ctrl_dep = AddControlDependency(node->name(), graph_,
                                                          node_map_.get());
                   *added_node->add_input() = ctrl_dep;
                   node_map_->AddOutput(NodeName(ctrl_dep), added_node->name());
@@ -529,7 +538,8 @@ Status ConstantFolding::MaterializeBroadcastGradientArgs(
     out[j] = node_map_->GetNode(const_name);
     if (out[j] == nullptr) {
       out[j] = graph_->add_node();
-      *out[j] = CreateNodeDef(const_name, TensorValue(&value));
+      TF_RETURN_IF_ERROR(
+          CreateNodeDef(const_name, TensorValue(&value), out[j]));
       out[j]->set_device(node.device());
       node_map_->AddNode(const_name, out[j]);
       string ctrl_dep =
@@ -637,7 +647,8 @@ Status ConstantFolding::MaterializeReductionIndices(
       value.vec<int64>()(i) = i;
     }
   }
-  *reduction_indices = CreateNodeDef(const_name, TensorValue(&value));
+  TF_RETURN_IF_ERROR(
+      CreateNodeDef(const_name, TensorValue(&value), reduction_indices));
   reduction_indices->set_device(node->device());
   string ctrl_dep =
       AddControlDependency(node->input(1), graph_, node_map_.get());
@@ -677,7 +688,7 @@ bool ConstantFolding::IsFoldable(const NodeDef& node) const {
       nodes_whitelist_.find(node.name()) == nodes_whitelist_.end()) {
     return false;
   }
-  // Skip control flow nodes, they can't be folded
+  // Skip control flow nodes, they can't be folded.
   if (ModifiesFrameInfo(node)) {
     return false;
   }
@@ -686,12 +697,16 @@ bool ConstantFolding::IsFoldable(const NodeDef& node) const {
     return false;
   }
 
-  // Skips ops that don't benefit from folding.
-  const string& op = node.op();
+  // Don't fold stateful ops such as TruncatedNormal.
+  if (!IsFreeOfSideEffect(node)) {
+    return false;
+  }
 
-  if (op.find("Placeholder") == 0) {
+  // Skip ops that don't benefit from folding.
+  if (IsPlaceholder(node)) {
     return false;
   }
+  const string& op = node.op();
   if (op.find("Save") != string::npos || op.find("Restore") != string::npos ||
       op.find("Reader") != string::npos) {
     return false;
@@ -703,16 +718,12 @@ bool ConstantFolding::IsFoldable(const NodeDef& node) const {
     return false;
   }
 
-  // Don't fold stateful ops such as TruncatedNormal.
   const OpDef* op_def = nullptr;
   Status status = OpRegistry::Global()->LookUpOpDef(node.op(), &op_def);
   if (!status.ok()) {
     return false;
   }
-  if (op_def->is_stateful()) {
-    return false;
-  }
-
+  // Don't fold ops without outputs.
   if (op_def->output_arg_size() == 0) {
     return false;
   }
@@ -777,8 +788,11 @@ Status CreateConstantTensorAttrValue(DataType type, double value,
     SET_TENSOR_VAL_CASE(DT_FLOAT, float, float);
     SET_TENSOR_VAL_CASE(DT_DOUBLE, double, double);
     SET_TENSOR_VAL_CASE(DT_INT64, int64, int64);
+    SET_TENSOR_VAL_CASE(DT_UINT64, int64, int64);
     SET_TENSOR_VAL_CASE(DT_INT32, int32, int);
+    SET_TENSOR_VAL_CASE(DT_UINT32, int32, int);
     SET_TENSOR_VAL_CASE(DT_INT16, int32, int);
+    SET_TENSOR_VAL_CASE(DT_UINT16, int32, int);
     SET_TENSOR_VAL_CASE(DT_INT8, int32, int);
     SET_TENSOR_VAL_CASE(DT_UINT8, int32, int);
     SET_TENSOR_VAL_CASE(DT_BOOL, bool, bool);
@@ -792,59 +806,74 @@ Status CreateConstantTensorAttrValue(DataType type, double value,
 }  // namespace
 
 // static
-NodeDef ConstantFolding::CreateNodeDef(const string& name,
-                                       const TensorValue& tensor) {
-  NodeDef node;
-  node.set_name(name);
-  node.set_op("Const");
+Status ConstantFolding::CreateNodeDef(const string& name,
+                                      const TensorValue& tensor,
+                                      NodeDef* node) {
+  node->set_name(name);
+  node->set_op("Const");
 
   AttrValue attr_type;
   attr_type.set_type(tensor->dtype());
-  node.mutable_attr()->insert({"dtype", attr_type});
+  node->mutable_attr()->insert({"dtype", attr_type});
 
   AttrValue attr_tensor;
   TensorProto* t = attr_tensor.mutable_tensor();
   bool optimized = false;
+  size_t encoded_size;
   // Use the packed representation whenever possible to avoid generating large
   // graphdefs. Moreover, avoid repeating the last values if they're equal.
   if (tensor->NumElements() > 4) {
-#define POPULATE_TENSOR_PROTO(tensor, t, TYPE, NAME)                \
-  const TYPE* val_ptr = tensor->flat<TYPE>().data();                \
-  TYPE last = *val_ptr;                                             \
-  int64 last_index = 0;                                             \
-  for (int64 i = 0; i < tensor->NumElements(); ++i) {               \
-    TYPE cur = *val_ptr++;                                          \
-    if (cur != last) {                                              \
-      last = cur;                                                   \
-      last_index = i;                                               \
-    }                                                               \
-  }                                                                 \
-  if (last_index < kint32max) {                                     \
-    optimized = true;                                               \
-    t->mutable_##NAME##_val()->Reserve(last_index + 1);             \
-    t->mutable_##NAME##_val()->AddNAlreadyReserved(last_index + 1); \
-    val_ptr = tensor->flat<TYPE>().data();                          \
-    for (int64 i = 0; i <= last_index; ++i) {                       \
-      t->set_##NAME##_val(i, *val_ptr++);                           \
-    }                                                               \
-  }
-
-    if (tensor->dtype() == DT_FLOAT) {
-      POPULATE_TENSOR_PROTO(tensor, t, float, float)
-    } else if (tensor->dtype() == DT_DOUBLE) {
-      POPULATE_TENSOR_PROTO(tensor, t, double, double)
-    } else if (tensor->dtype() == DT_INT64) {
-      POPULATE_TENSOR_PROTO(tensor, t, int64, int64)
-    } else if (tensor->dtype() == DT_INT32) {
-      POPULATE_TENSOR_PROTO(tensor, t, int32, int)
-    } else if (tensor->dtype() == DT_INT16) {
-      POPULATE_TENSOR_PROTO(tensor, t, int16, int)
-    } else if (tensor->dtype() == DT_INT8) {
-      POPULATE_TENSOR_PROTO(tensor, t, int8, int)
-    } else if (tensor->dtype() == DT_UINT8) {
-      POPULATE_TENSOR_PROTO(tensor, t, uint8, int)
-    } else if (tensor->dtype() == DT_BOOL) {
-      POPULATE_TENSOR_PROTO(tensor, t, bool, bool)
+#define POPULATE_TENSOR_PROTO(tensor, t, TYPE, NAME)                  \
+  {                                                                   \
+    const TYPE* val_ptr = tensor->flat<TYPE>().data();                \
+    TYPE last = *val_ptr;                                             \
+    int64 last_index = 0;                                             \
+    for (int64 i = 0; i < tensor->NumElements(); ++i) {               \
+      TYPE cur = *val_ptr++;                                          \
+      if (cur != last) {                                              \
+        last = cur;                                                   \
+        last_index = i;                                               \
+      }                                                               \
+    }                                                                 \
+    if (last_index < kint32max) {                                     \
+      optimized = true;                                               \
+      encoded_size = (last_index + 1) * sizeof(NAME);                 \
+      t->mutable_##NAME##_val()->Reserve(last_index + 1);             \
+      t->mutable_##NAME##_val()->AddNAlreadyReserved(last_index + 1); \
+      val_ptr = tensor->flat<TYPE>().data();                          \
+      for (int64 i = 0; i <= last_index; ++i) {                       \
+        t->set_##NAME##_val(i, *val_ptr++);                           \
+      }                                                               \
+    }                                                                 \
+  }                                                                   \
+  break
+
+    switch (tensor->dtype()) {
+      case DT_FLOAT:
+        POPULATE_TENSOR_PROTO(tensor, t, float, float);
+      case DT_DOUBLE:
+        POPULATE_TENSOR_PROTO(tensor, t, double, double);
+      case DT_INT64:
+        POPULATE_TENSOR_PROTO(tensor, t, int64, int64);
+      case DT_UINT64:
+        POPULATE_TENSOR_PROTO(tensor, t, uint64, int64);
+      case DT_INT32:
+        POPULATE_TENSOR_PROTO(tensor, t, int32, int);
+      case DT_UINT32:
+        POPULATE_TENSOR_PROTO(tensor, t, uint32, int);
+      case DT_INT16:
+        POPULATE_TENSOR_PROTO(tensor, t, int16, int);
+      case DT_UINT16:
+        POPULATE_TENSOR_PROTO(tensor, t, uint16, int);
+      case DT_INT8:
+        POPULATE_TENSOR_PROTO(tensor, t, int8, int);
+      case DT_UINT8:
+        POPULATE_TENSOR_PROTO(tensor, t, uint8, int);
+      case DT_BOOL:
+        POPULATE_TENSOR_PROTO(tensor, t, bool, bool);
+      default:
+        /* Do nothing. */
+        break;
     }
   }
   if (optimized) {
@@ -853,9 +882,15 @@ NodeDef ConstantFolding::CreateNodeDef(const string& name,
     tensor->shape().AsProto(t->mutable_tensor_shape());
   } else {
     tensor->AsProtoTensorContent(t);
+    encoded_size = t->tensor_content().size();
+  }
+  node->mutable_attr()->insert({"value", attr_tensor});
+
+  if (encoded_size < 10 * 1024 * 1024) {
+    return Status::OK();
   }
-  node.mutable_attr()->insert({"value", attr_tensor});
-  return node;
+  return errors::InvalidArgument(
+      strings::StrCat("Can't fold ", name, ", its size would be too large"));
 }
 
 Status ConstantFolding::EvaluateNode(const NodeDef& node,
@@ -929,17 +964,19 @@ Status ConstantFolding::EvaluateOneFoldable(const NodeDef& node,
     return Status(error::INVALID_ARGUMENT, "Expected at least one output.");
   }
 
+  outputs->resize(output_tensors.size());
   for (size_t i = 0; i < output_tensors.size(); i++) {
     string node_name = OptimizedNodeName(node, "-folded");
     if (output_tensors.size() > 1) {
       node_name = strings::StrCat(node_name, "-", i);
     }
     if (output_tensors[i].tensor) {
-      outputs->push_back(CreateNodeDef(node_name, output_tensors[i]));
+      TF_RETURN_IF_ERROR(
+          CreateNodeDef(node_name, output_tensors[i], &outputs->at(i)));
     } else {
       // Create an empty NodeDef to identify dead outputs (e.g. the output of a
       // switch that's not selected by the switch predicate).
-      outputs->push_back(NodeDef());
+      outputs->at(i) = NodeDef();
     }
   }
   return Status::OK();
@@ -1147,9 +1184,8 @@ Status ConstantFolding::FoldGraph(GraphDef* output) {
   std::unordered_set<string> processed_nodes;
   std::deque<NodeDef*> queue;
   for (int i = 0; i < graph_->node_size(); i++) {
-    auto node = graph_->mutable_node(i);
-    if (IsFoldable(*node)) {
-      queue.push_back(node);
+    if (IsFoldable(graph_->node(i))) {
+      queue.push_back(graph_->mutable_node(i));
     }
   }
   while (!queue.empty()) {
@@ -1159,14 +1195,20 @@ Status ConstantFolding::FoldGraph(GraphDef* output) {
       continue;
     }
     // We need to record a copy of output nodes before FoldNode() modifies it.
-    std::set<NodeDef*> outputs = node_map_->GetOutputs(node->name());
+    // We also need to ensure that the fanout is sorted deterministically.
+    const std::set<NodeDef*>& outputs = node_map_->GetOutputs(node->name());
+    std::vector<NodeDef*> fanout(outputs.begin(), outputs.end());
+    std::sort(fanout.begin(), fanout.end(),
+              [](const NodeDef* n1, const NodeDef* n2) {
+                return n1->name() < n2->name();
+              });
 
     Status s = FoldNode(node, output);
     processed_nodes.insert(node->name());
     if (!s.ok()) {
       VLOG(1) << "Failed to fold node " << node->name() << ": " << s;
     } else {
-      for (auto& output : outputs) {
+      for (auto& output : fanout) {
         if (IsFoldable(*output)) {
           queue.push_back(output);
         }
@@ -1178,8 +1220,8 @@ Status ConstantFolding::FoldGraph(GraphDef* output) {
   int last = output->node_size() - 1;
   for (int i = output->node_size() - 1; i >= 0; --i) {
     const NodeDef& node = output->node(i);
-    auto outputs = node_map_->GetOutputs(node.name());
-    if (outputs.empty()) {
+    auto fanout = node_map_->GetOutputs(node.name());
+    if (fanout.empty()) {
       output->mutable_node()->SwapElements(i, last);
       last--;
     }
@@ -1191,8 +1233,8 @@ Status ConstantFolding::FoldGraph(GraphDef* output) {
     // If no fetch nodes is provided, we conservatively
     // keep all nodes in the original graph in case users need to fetch
     // their values.
-    auto outputs = node_map_->GetOutputs(node.name());
-    if (!outputs.empty() || !has_fetch_ ||
+    auto fanout = node_map_->GetOutputs(node.name());
+    if (!fanout.empty() || !has_fetch_ ||
         nodes_to_preserve_.find(node.name()) != nodes_to_preserve_.end()) {
       auto added_node = output->add_node();
       *added_node = node;
@@ -1306,14 +1348,14 @@ bool ConstantFolding::IsOnes(const NodeDef& node) const {
     //    IS_ONES_CASE(DT_HALF);
     IS_ONES_CASE(DT_FLOAT);
     IS_ONES_CASE(DT_DOUBLE);
+    IS_ONES_CASE(DT_COMPLEX64);
+    IS_ONES_CASE(DT_COMPLEX128);
     IS_ONES_CASE(DT_UINT8);
     IS_ONES_CASE(DT_INT8);
     IS_ONES_CASE(DT_UINT16);
     IS_ONES_CASE(DT_INT16);
     IS_ONES_CASE(DT_INT32);
     IS_ONES_CASE(DT_INT64);
-    IS_ONES_CASE(DT_COMPLEX64);
-    IS_ONES_CASE(DT_COMPLEX128);
     default:
       VLOG(1) << "Unsupported type " << DataTypeString(dtype);
       return false;
@@ -1337,14 +1379,14 @@ bool ConstantFolding::IsZeros(const NodeDef& node) const {
     //    IS_ZEROS_CASE(DT_HALF);
     IS_ZEROS_CASE(DT_FLOAT);
     IS_ZEROS_CASE(DT_DOUBLE);
+    IS_ZEROS_CASE(DT_COMPLEX64);
+    IS_ZEROS_CASE(DT_COMPLEX128);
     IS_ZEROS_CASE(DT_UINT8);
     IS_ZEROS_CASE(DT_INT8);
     IS_ZEROS_CASE(DT_UINT16);
     IS_ZEROS_CASE(DT_INT16);
     IS_ZEROS_CASE(DT_INT32);
     IS_ZEROS_CASE(DT_INT64);
-    IS_ZEROS_CASE(DT_COMPLEX64);
-    IS_ZEROS_CASE(DT_COMPLEX128);
     default:
       VLOG(1) << "Unsupported type " << DataTypeString(dtype);
       return false;
@@ -1409,6 +1451,17 @@ void ConstantFolding::ReplaceDivisionOfOnesByReciprocal(NodeDef* node,
   graph_modified_ = true;
 }
 
+void ConstantFolding::ReplaceSubtractionFromZeroByNegation(NodeDef* node,
+                                                           GraphDef* graph) {
+  node->set_op("Neg");
+  node->mutable_input()->SwapElements(0, 1);
+  const string ctrl_dep =
+      AddControlDependency(node->input(1), graph, node_map_.get());
+  node_map_->UpdateInput(node->name(), node->input(1), ctrl_dep);
+  node->set_input(1, ctrl_dep);
+  graph_modified_ = true;
+}
+
 Status ConstantFolding::ReplaceOperationWithConstant(
     double value, const TensorShapeProto& shape, NodeDef* node,
     GraphDef* graph) {
@@ -1440,6 +1493,123 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
   const bool is_aggressive = opt_level_ == RewriterConfig::AGGRESSIVE;
   for (int i = 0; i < output->node_size(); ++i) {
     NodeDef* node = output->mutable_node(i);
+
+    // Remove Shuffle, Reverse, or Transpose ops whose input has one element.
+    if (use_shape_info &&
+        (IsShuffle(*node) || IsReverse(*node) || IsTranspose(*node))) {
+      const auto& shape =
+          properties.GetInputProperties(node->name())[0].shape();
+      // The node is replaceable iff
+      // unknown_rank == false && (dim_size == 0 || all dims have size 1)
+      bool replaceable = !shape.unknown_rank();
+      for (int j = 0; j < shape.dim_size(); ++j) {
+        replaceable &= shape.dim(j).size() == 1;
+      }
+      if (replaceable) {
+        ReplaceOperationWithIdentity(0, node, output);
+      }
+    }
+
+    // Switch(x, x) will always feed false to its false branch and true to
+    // its true branch. By rewriting the graph a bit, we can propagate these
+    // constants down the two output branches, and just use control dependencies
+    // to trigger the selected one at runtime. For example,
+    //
+    //     +------+
+    // x-->|Switch|-->a  (in practice there may be multiple consumers of each
+    // x-->|      |-->b   output branch.)
+    //     +------+
+    //
+    // Is rewritten as
+    //
+    //     +------+
+    // x-->|Switch|-->Identity--^>Const(false)-->a
+    // x-->|      |-->Identity--^>Const(true)-->b
+    //     +------+
+    if (node->op() == "Switch" && node->input(0) == node->input(1) &&
+        !OptimizedNodeExists(*node, "_const_false") &&
+        !OptimizedNodeExists(*node, "_const_true")) {
+      bool already_optimized = true;
+      // If the optimization was already applied, the switch would have exactly
+      // one Identity node consuming each of its outputs, each without any
+      // non-control outputs.
+      auto fanouts = node_map_->GetOutputs(node->name());
+      if (fanouts.size() == 2) {
+        for (NodeDef* fanout : fanouts) {
+          if (!IsIdentity(*fanout) ||
+              NumNonControlOutputs(*fanout, *node_map_) > 0) {
+            already_optimized = false;
+            break;
+          }
+        }
+      }
+      Tensor false_t(DT_BOOL, TensorShape({}));
+      Tensor true_t(DT_BOOL, TensorShape({}));
+      // Make sure we don't proceed if this switch node was already optimized.
+      if (!already_optimized && SetTensorValue(DT_BOOL, true, &true_t).ok() &&
+          SetTensorValue(DT_BOOL, false, &false_t).ok()) {
+        // Copy the set of consumers of the switch as they will be manipulated
+        // below.
+        const std::set<NodeDef*>& consumer_set =
+            node_map_->GetOutputs(node->name());
+        std::vector<NodeDef*> consumers(consumer_set.begin(),
+                                        consumer_set.end());
+        std::sort(consumers.begin(), consumers.end(),
+                  [](const NodeDef* n1, const NodeDef* n2) {
+                    return n1->name() < n2->name();
+                  });
+        // Create constant false & true nodes.
+        NodeDef* false_node = output->add_node();
+        false_node->set_name(OptimizedNodeName(*node, "_const_false"));
+        if (!CreateNodeDef(false_node->name(), TensorValue(&false_t),
+                           false_node)
+                 .ok()) {
+          continue;
+        }
+        false_node->set_device(node->device());
+
+        NodeDef* true_node = output->add_node();
+        true_node->set_name(OptimizedNodeName(*node, "_const_true"));
+        if (!CreateNodeDef(true_node->name(), TensorValue(&true_t), true_node)
+                 .ok()) {
+          continue;
+        }
+        true_node->set_device(node->device());
+
+        // Add controls from the switch ports to the constants, and connect the
+        // constants to the original switch outputs.
+        const string false_port = node->name();
+        const string true_port = strings::StrCat(node->name(), ":1");
+        const string false_ctrl_dep =
+            AddControlDependency(false_port, output, node_map_.get());
+        false_node->add_input(false_ctrl_dep);
+        const string true_ctrl_dep =
+            AddControlDependency(true_port, output, node_map_.get());
+        true_node->add_input(true_ctrl_dep);
+
+        node_map_->AddNode(false_node->name(), false_node);
+        node_map_->AddNode(true_node->name(), true_node);
+        node_map_->AddOutput(NodeName(false_ctrl_dep), false_node->name());
+        node_map_->AddOutput(NodeName(true_ctrl_dep), true_node->name());
+
+        for (NodeDef* consumer : consumers) {
+          for (int i = 0; i < consumer->input_size(); ++i) {
+            const string& input = consumer->input(i);
+            if (input == false_port) {
+              consumer->set_input(i, false_node->name());
+              node_map_->UpdateInput(consumer->name(), false_port,
+                                     false_node->name());
+            } else if (input == true_port) {
+              consumer->set_input(i, true_node->name());
+              node_map_->UpdateInput(consumer->name(), true_port,
+                                     true_node->name());
+            }
+          }
+        }
+        graph_modified_ = true;
+        continue;
+      }
+    }
     if (IsSimplifiableReduction(*node)) {
       // Replace the reduction node with an identity node, that can be further
       // optimized by the model pruner.
@@ -1495,12 +1665,17 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
       const bool y_matches_output_shape = ShapesEqual(output_shape, y_shape);
       if (y_matches_output_shape &&
           ((is_mul && x_is_one) || (is_add && x_is_zero))) {
-        // TODO(rmlarsen): Handle subtraction 0 - y.
         // 1 * y = y or 0 + y = y.
         ReplaceOperationWithSnapshot(1, node, output);
         continue;
       }
 
+      if (y_matches_output_shape && (is_sub && x_is_zero)) {
+        // Replace 0 - y with Neg(y).
+        ReplaceSubtractionFromZeroByNegation(node, output);
+        continue;
+      }
+
       // Replace 1 / y with Reciprocal op.
       if (y_matches_output_shape && is_any_div && x_is_one) {
         DataType type = node->attr().at("T").type();
@@ -1515,9 +1690,8 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
       const bool y_is_zero = IsZeros(*y);
       const bool y_is_one = IsOnes(*y);
       const bool x_matches_output_shape = ShapesEqual(output_shape, x_shape);
-      if (x_matches_output_shape &&
-          (((is_mul || is_any_div) && y_is_one) ||
-           ((is_add || is_sub) && y_is_zero))) {
+      if (x_matches_output_shape && (((is_mul || is_any_div) && y_is_one) ||
+                                     ((is_add || is_sub) && y_is_zero))) {
         // x * 1 = x or x / 1 = x or x +/- 0 = x
         ReplaceOperationWithSnapshot(0, node, output);
         continue;
@@ -1569,8 +1743,7 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
       }
       // Insert new reciprocal op and change node from Div to Mul.
       NodeDef* reciprocal_node = output->add_node();
-      reciprocal_node->set_name(AddPrefixToNodeName(
-          strings::StrCat(node->name(), "_recip"), kConstantFoldingConst));
+      reciprocal_node->set_name(OptimizedNodeName(*node, "_recip"));
       reciprocal_node->set_op("Reciprocal");
       reciprocal_node->set_device(node->device());
       node->set_op("Mul");
@@ -1667,8 +1840,137 @@ Status ConstantFolding::SimplifyGraph(GraphDef* output,
       std::swap(*node->mutable_input(parent_const_input),
                 *op_child_node->mutable_input(non_const_leaf_input));
       graph_modified_ = true;
+      continue;
+    }
+
+    // Partial constant propagation through IdentityN.
+    if (IsIdentityN(*node) && NumNonControlInputs(*node) > 0) {
+      const std::set<NodeDef*>& tmp = node_map_->GetOutputs(node->name());
+      const std::vector<NodeDef*> consumers(tmp.begin(), tmp.end());
+      for (int input_idx = 0; input_idx < node->input_size(); ++input_idx) {
+        const string& input = node->input(input_idx);
+        if (IsControlInput(input)) {
+          break;
+        }
+        const NodeDef* input_node = node_map_->GetNode(NodeName(input));
+        if (input_node == nullptr) {
+          LOG(ERROR) << "Bad input: " << input;
+          break;
+        }
+        // Forward constant inputs to outputs and add a control dependency on
+        // the IdentityN node.
+        if (IsReallyConstant(*input_node)) {
+          // Update each consumer.
+          for (NodeDef* consumer : consumers) {
+            bool add_dep = false;
+            for (int consumer_input_idx = 0;
+                 consumer_input_idx < consumer->input_size();
+                 ++consumer_input_idx) {
+              const string& consumer_input =
+                  consumer->input(consumer_input_idx);
+              if (IsControlInput(consumer_input)) {
+                break;
+              }
+              int output_idx;
+              const string input_node_name =
+                  ParseNodeName(consumer_input, &output_idx);
+              if (input_node_name == node->name() && output_idx == input_idx) {
+                consumer->set_input(consumer_input_idx, input);
+                // We will keep the input from IdentityN through a control
+                // dependency, so we only need to add the consumer as an output
+                // for the constant input node.
+                node_map_->AddOutput(NodeName(input), consumer->name());
+                add_dep = true;
+              }
+            }
+            if (add_dep) {
+              consumer->add_input(AsControlDependency(node->name()));
+            }
+          }
+        }
+      }
+      for (NodeDef* consumer : consumers) {
+        DedupControlInputs(consumer);
+      }
+    }
+
+    // Partial constant folding for associative operators:
+    // Split AddN/AccumulateNV2 to enable partial
+    // folding of ops when more than one but not all inputs are constant.
+    // For AddN and AccumulateNV2, we may furthermore reorder inputs, since
+    // addition is commutative.
+    // TODO(rmlarsen): Handle Concat/Pack/ParallelConcat, which are not
+    // commutative, so we have to preserve order and can only push consecutive
+    // runs of constant inputs into sub-nodes.
+    if (IsAggregate(*node) && IsCommutative(*node) &&
+        NumNonControlInputs(*node) > 2) {
+      const int num_control_inputs =
+          node->input_size() - NumNonControlInputs(*node);
+      std::vector<int> const_inputs;
+      std::vector<int> nonconst_inputs;
+      for (int i = 0; i < node->input_size(); ++i) {
+        const string& input = node->input(i);
+        const NodeDef* input_node = node_map_->GetNode(NodeName(input));
+        CHECK(input_node != nullptr) << input;
+        if (!IsControlInput(input) && IsReallyConstant(*input_node)) {
+          const_inputs.push_back(i);
+        } else {
+          // Non-const and control inputs.
+          nonconst_inputs.push_back(i);
+        }
+      }
+      // Promote AccumulateNV2 with all constant inputs to AddN, since it is
+      // a fake node that cannot be constant folded by itself.
+      if (const_inputs.size() == NumNonControlInputs(*node) &&
+          node->op() == "AccumulateNV2") {
+        node->set_op("AddN");
+        node->mutable_attr()->erase("shape");
+        graph_modified_ = true;
+        continue;
+      }
+      const string new_node_name = OptimizedNodeName(
+          *node, strings::StrCat("_partial_split_", const_inputs.size()));
+      if (1 < const_inputs.size() &&
+          const_inputs.size() < NumNonControlInputs(*node) &&
+          !node_map_->NodeExists(new_node_name)) {
+        NodeDef* added_node = output->add_node();
+        *added_node = *node;
+        // Always use AddN for the constant node: AccumulateNV2 is a fake node
+        // that has no kernel and therefore cannot be constant folded.
+        added_node->set_op("AddN");
+        added_node->mutable_attr()->erase("shape");
+        added_node->set_name(new_node_name);
+        node_map_->AddNode(added_node->name(), added_node);
+        added_node->clear_input();
+        for (int i : const_inputs) {
+          added_node->add_input(node->input(i));
+          node_map_->UpdateOutput(NodeName(node->input(i)), node->name(),
+                                  added_node->name());
+        }
+
+        // Overwrite the first const input with the added node.
+        node->set_input(const_inputs[0], added_node->name());
+        node_map_->AddOutput(added_node->name(), node->name());
+        nonconst_inputs.push_back(const_inputs[0]);
+        // Compact the remaining inputs to the original node.
+        std::sort(nonconst_inputs.begin(), nonconst_inputs.end());
+        int idx = 0;
+        for (int i : nonconst_inputs) {
+          if (idx != i) {
+            node->set_input(idx, node->input(i));
+          }
+          ++idx;
+        }
+        node->mutable_input()->DeleteSubrange(nonconst_inputs.size(),
+                                              const_inputs.size() - 1);
+        (*node->mutable_attr())["N"].set_i(node->input_size() -
+                                           num_control_inputs);
+        (*added_node->mutable_attr())["N"].set_i(const_inputs.size());
+        graph_modified_ = true;
+      }
     }
   }
+
   return Status::OK();
 }
 
@@ -1707,11 +2009,17 @@ Status ConstantFolding::RunOptimizationPass(Cluster* cluster,
   TF_RETURN_IF_ERROR(FoldGraph(output));
   node_map_.reset(new NodeMap(output));
   TF_RETURN_IF_ERROR(SimplifyGraph(output, properties, can_use_shape_info));
+
   return Status::OK();
 }
 
 Status ConstantFolding::Optimize(Cluster* cluster, const GrapplerItem& item,
                                  GraphDef* output) {
+  // TensorFlow flushes denormals to zero and rounds to nearest, so we do
+  // the same here.
+  port::ScopedFlushDenormal flush;
+  port::ScopedSetRound round(FE_TONEAREST);
+
   nodes_to_preserve_ = item.NodesToPreserve();
   for (const auto& feed : item.feed) {
     feed_nodes_.insert(NodeName(feed.first));
@@ -1746,5 +2054,5 @@ void ConstantFolding::Feedback(Cluster* cluster, const GrapplerItem& item,
   // Nothing to do for ConstantFolding.
 }
 
-}  // end namespace grappler
-}  // end namespace tensorflow
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/constant_folding.h b/tensorflow/core/grappler/optimizers/constant_folding.h
index e4078514af11174788bc5a436125efeb3fa37177..2fd59c7f9ccf3f94e683d7ec41a5848b9eec4a8f 100644
--- a/tensorflow/core/grappler/optimizers/constant_folding.h
+++ b/tensorflow/core/grappler/optimizers/constant_folding.h
@@ -33,7 +33,8 @@ const char kConstantFoldingCtrl[] = "ConstantFoldingCtrl";
 // Constant folding optimization for a graph.
 class ConstantFolding : public GraphOptimizer {
  public:
-  static NodeDef CreateNodeDef(const string& name, const TensorValue& tensor);
+  static Status CreateNodeDef(const string& name, const TensorValue& tensor,
+                              NodeDef* node);
   static string AddControlDependency(const string& input_name, GraphDef* graph,
                                      NodeMap* node_map);
 
@@ -81,6 +82,7 @@ class ConstantFolding : public GraphOptimizer {
                                     GraphDef* graph);
   void ReplaceOperationWithSnapshot(int input_to_forward, NodeDef* node,
                                     GraphDef* graph);
+  void ReplaceSubtractionFromZeroByNegation(NodeDef* node, GraphDef* graph);
   Status ReplaceOperationWithConstant(double value,
                                       const TensorShapeProto& shape,
                                       NodeDef* node, GraphDef* graph);
diff --git a/tensorflow/core/grappler/optimizers/constant_folding_test.cc b/tensorflow/core/grappler/optimizers/constant_folding_test.cc
index d8df19fe6a0a5daafefd11e5ac39c8e3bc50e6e1..29dc93c257f44752c94b007aba4288a2d93f1ea5 100644
--- a/tensorflow/core/grappler/optimizers/constant_folding_test.cc
+++ b/tensorflow/core/grappler/optimizers/constant_folding_test.cc
@@ -187,20 +187,21 @@ TEST_F(ConstantFoldingTest, NeutralElement) {
     Output bias_add2 = ops::BiasAdd(s.WithOpName("bias_add2"), zeros, bias);
     Output sub1 = ops::Sub(s.WithOpName("sub1"), x, zeros);
     Output sub2 = ops::Sub(s.WithOpName("sub2"), zeros, y);
-    Output addn =
-        ops::AddN(s.WithOpName("addn"),
-                  {mul1, mul2, mul3, mul4, mul5, mul6, div1, div2, matmul1,
-                   matmul2, add1, add2, bias_add1, bias_add2, sub1, sub2});
+    Output concat =
+        ops::Concat(s.WithOpName("concat"),
+                    {mul1, mul2, mul3, mul4, mul5, mul6, div1, div2, matmul1,
+                     matmul2, add1, add2, bias_add1, bias_add2, sub1, sub2},
+                    0);
     GrapplerItem item;
     TF_CHECK_OK(s.ToGraphDef(&item.graph));
-    item.fetch = {"addn", "matmul3", "matmul4"};
+    item.fetch = {"concat", "matmul3", "matmul4"};
 
     ConstantFolding optimizer(nullptr /* cpu_device */);
     GraphDef output;
     Status status = optimizer.Optimize(nullptr, item, &output);
     TF_EXPECT_OK(status);
 
-    EXPECT_EQ(27, output.node_size());
+    EXPECT_EQ(28, output.node_size());
     for (int i = 0; i < output.node_size(); ++i) {
       const NodeDef& node = output.node(i);
       const string& name = node.name();
@@ -286,10 +287,9 @@ TEST_F(ConstantFoldingTest, NeutralElement) {
         EXPECT_EQ("x", node.input(0));
         EXPECT_EQ("^zeros", node.input(1));
       } else if (name == "sub2") {
-        // We don't handle this case yet.
-        EXPECT_EQ("Sub", node.op());
-        EXPECT_EQ("zeros", node.input(0));
-        EXPECT_EQ("y", node.input(1));
+        EXPECT_EQ("Neg", node.op());
+        EXPECT_EQ("y", node.input(0));
+        EXPECT_EQ("^zeros", node.input(1));
       }
       const std::set<string> square_zero_const{"mul1", "mul2",    "mul5",
                                                "mul6", "matmul1", "matmul2"};
@@ -415,7 +415,6 @@ TEST_F(ConstantFoldingTest, NeutralElement_PartialShape_UnknownOutputShape) {
   GraphDef output;
   Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
-  LOG(INFO) << output.DebugString();
 
   EXPECT_EQ(15, output.node_size());
   for (int i = 0; i < output.node_size(); ++i) {
@@ -469,7 +468,6 @@ TEST_F(ConstantFoldingTest, NeutralElement_PartialShape_KnownOutputShape) {
   GraphDef output;
   Status status = optimizer.Optimize(nullptr, item, &output);
   TF_EXPECT_OK(status);
-  LOG(INFO) << output.DebugString();
 
   EXPECT_EQ(10, output.node_size());
   for (int i = 0; i < output.node_size(); ++i) {
@@ -991,8 +989,10 @@ TEST_F(ConstantFoldingTest, SwitchNodesEmptyFetch) {
   EXPECT_EQ(present_nodes.size(), output.node_size());
   int found = 0;
   for (const auto& node : output.node()) {
-    EXPECT_TRUE(present_nodes.find(node.name()) != present_nodes.end());
-    EXPECT_TRUE(not_present_nodes.find(node.name()) == not_present_nodes.end());
+    EXPECT_TRUE(present_nodes.find(node.name()) != present_nodes.end())
+        << node.name();
+    EXPECT_TRUE(not_present_nodes.find(node.name()) == not_present_nodes.end())
+        << node.name();
     present_nodes.erase(node.name());
     not_present_nodes.erase(node.name());
     if (node.name() == "rank") {
@@ -1177,8 +1177,43 @@ TEST_F(ConstantFoldingTest, MergeNodes) {
   EXPECT_EQ(2, out_idx.flat<int32>()(0));
 }
 
+TEST_F(ConstantFoldingTest, ShuffleReverseOnScalarRemoval) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+
+  Output in1 =
+      ops::Variable(scope.WithOpName("in1"), TensorShape({}), DT_FLOAT);
+  Output in2 =
+      ops::Variable(scope.WithOpName("in2"), TensorShape({}), DT_FLOAT);
+  ops::RandomShuffle s1(scope.WithOpName("s1"), in1);
+  ops::RandomShuffle s2(scope.WithOpName("s2").WithControlDependencies({in1}),
+                        in2);
+
+  ops::Add out1(scope.WithOpName("out1"), s1, s2);
+  ops::Identity out2(scope.WithOpName("out2"), s2);
+
+  GrapplerItem item;
+  item.fetch = {"out1", "out2"};
+  TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+
+  ConstantFolding fold(nullptr /* cpu_device */);
+  GraphDef got;
+  Status status = fold.Optimize(nullptr, item, &got);
+  TF_EXPECT_OK(status);
+
+  GraphDef want;
+  AddNode("in1", "VariableV2", {}, &want);
+  AddNode("in2", "VariableV2", {}, &want);
+  AddNode("s1", "Identity", {"in1"}, &want);
+  AddNode("s2", "Identity", {"in2", AsControlDependency("in1")}, &want);
+  AddNode("out1", "Add", {"s1", "s2"}, &want);
+  AddNode("out2", "Identity", {"s2"}, &want);
+
+  CompareGraphs(want, got);
+}
+
 TEST_F(ConstantFoldingTest, NoOpReduction) {
-  // Build a simple graph with a reduction that can be reduced to the identity.
+  // Build a simple graph with a reduction that can be reduced to the
+  // identity.
   tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
 
   Output v = ops::Variable(scope.WithOpName("v"), {3, 5, 7}, DT_FLOAT);
@@ -1304,8 +1339,8 @@ TEST_F(ConstantFoldingTest, Packing) {
   TF_EXPECT_OK(status);
 
   // Make sure that the representation of the folded constant is space
-  // efficient: in particular, the whole message should be smaller than 8k (the
-  // size needed to naively encode 1000 floats folded twice).
+  // efficient: in particular, the whole message should be smaller than 8k
+  // (the size needed to naively encode 1000 floats folded twice).
   EXPECT_GT(8000, output.ByteSizeLong());
 }
 
@@ -1422,8 +1457,248 @@ TEST_F(ConstantFoldingTest, MaterializeReductionIndices) {
   EXPECT_EQ(3, found);
 }
 
+TEST_F(ConstantFoldingTest, LargeConstant) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+  // Generate a 4k by 4k constant matrix.
+  Output mat_diag =
+      ops::Const(scope.WithOpName("mat_diag"), 3.14f, TensorShape({1024 * 4}));
+  Output mat = ops::Diag(scope.WithOpName("mat"), mat_diag);
+  Output out = ops::Identity(scope.WithOpName("out"), mat);
+
+  GrapplerItem item;
+  TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+  item.fetch.push_back("out");
+
+  ConstantFolding fold(nullptr /* cpu_device */);
+  GraphDef output;
+  Status status = fold.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  // Make sure the diag node hasn't been folded, since it would use too much
+  // memory to encode the corresponding constant.
+  int found = 0;
+  for (const NodeDef& node : output.node()) {
+    if (node.name() == "out") {
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("mat", node.input(0));
+      ++found;
+    } else if (node.name() == "mat") {
+      EXPECT_EQ("Diag", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("mat_diag", node.input(0));
+      ++found;
+    }
+  }
+  EXPECT_EQ(2, found);
+
+  EXPECT_GT(1024 * 1024, output.ByteSizeLong());
+}
+
+TEST_F(ConstantFoldingTest, SwitchIdenticalInputs) {
+  tensorflow::Scope s = tensorflow::Scope::NewRootScope();
+  Output x = ops::Placeholder(s.WithOpName("x"), DT_BOOL,
+                              ops::Placeholder::Shape(TensorShape({})));
+  ops::Switch sw = ops::Switch(s.WithOpName("switch"), x, x);
+  Output id_false = ops::LogicalNot(s.WithOpName("id_false"), sw.output_false);
+  Output id_true = ops::LogicalNot(s.WithOpName("id_true"), sw.output_true);
+
+  GrapplerItem item;
+  item.fetch.push_back("id_false");
+  item.fetch.push_back("id_true");
+  TF_CHECK_OK(s.ToGraphDef(&item.graph));
+
+  ConstantFolding fold(nullptr /* cpu_device */);
+  GraphDef output;
+  Status status = fold.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  EXPECT_EQ(6, output.node_size());
+  int found = 0;
+  for (const auto& node : output.node()) {
+    if (node.name() == "switch" || node.name() == "x") {
+      ++found;
+    }
+    if (node.name() == "id_false") {
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("^ConstantFoldingCtrl/switch_0", node.input(0));
+      ++found;
+    }
+    if (node.name() == "id_true") {
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("^ConstantFoldingCtrl/switch_1", node.input(0));
+      ++found;
+    }
+    if (node.name() == "ConstantFoldingCtrl/switch_0") {
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("switch", node.input(0));
+      ++found;
+    }
+    if (node.name() == "ConstantFoldingCtrl/switch_1") {
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("switch:1", node.input(0));
+      ++found;
+    }
+  }
+  EXPECT_EQ(6, found);
+}
+
+TEST_F(ConstantFoldingTest, PartialFolding_AssociativeAndCommutative) {
+  std::function<Output(const Scope&, InputList)> addn_fun =
+      [](const Scope& scope, InputList inputs) {
+        return ops::AddN(scope, inputs);
+      };
+  std::function<Output(const Scope&, InputList)> accumulate_fun =
+      [](const Scope& scope, InputList inputs) {
+        return ops::AccumulateNV2(scope, inputs, TensorShape({2, 2}));
+      };
+  for (bool use_add_n : {true, false}) {
+    auto fun = use_add_n ? addn_fun : accumulate_fun;
+    const string op_name = use_add_n ? "AddN" : "AccumulateNV2";
+    Scope s = Scope::NewRootScope();
+    Output x = ops::Placeholder(s.WithOpName("x"), DT_FLOAT,
+                                ops::Placeholder::Shape(TensorShape({2, 2})));
+    Output y = ops::Placeholder(s.WithOpName("y"), DT_FLOAT,
+                                ops::Placeholder::Shape(TensorShape({2, 2})));
+    Output z = ops::Placeholder(s.WithOpName("z"), DT_FLOAT,
+                                ops::Placeholder::Shape(TensorShape({2, 2})));
+    Output c1 = ops::Const(s.WithOpName("c1"), 1.0f, {2, 2});
+    Output c2 = ops::Const(s.WithOpName("c2"), 2.0f, {2, 2});
+    Output c3 = ops::Const(s.WithOpName("c3"), 3.0f, {2, 2});
+    Output acc0 = fun(s.WithOpName("acc0"), {c1, c2, c3});
+    Output acc1 = fun(s.WithOpName("acc1"), {x, y, z});
+    Output acc2 = fun(s.WithOpName("acc2"), {c1, x, y});
+    Output acc3 = fun(s.WithOpName("acc3"), {c1, c2, z});
+    Output acc4 = fun(s.WithOpName("acc4"), {c1, y, c2});
+    Output acc5 = fun(s.WithOpName("acc5"), {x, c1, c2});
+    Output acc6 = fun(s.WithOpName("acc6"), {x, c1, y, c2});
+    Output concat = ops::Concat(s.WithOpName("concat"),
+                                {acc0, acc1, acc2, acc3, acc4, acc5, acc6}, 0);
+
+    GrapplerItem item;
+    TF_CHECK_OK(s.ToGraphDef(&item.graph));
+    item.fetch = {"concat"};
+
+    ConstantFolding optimizer(nullptr /* cpu_device */);
+    GraphDef output;
+    Status status = optimizer.Optimize(nullptr, item, &output);
+    TF_EXPECT_OK(status);
+
+    EXPECT_EQ(17, output.node_size());
+    for (const NodeDef& node : output.node()) {
+      if (node.name() == "acc0") {
+        EXPECT_EQ("Const", node.op());
+      }
+      if (node.name() == "acc1") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(3, node.input_size());
+        EXPECT_EQ("x", node.input(0));
+        EXPECT_EQ("y", node.input(1));
+        EXPECT_EQ("z", node.input(2));
+      }
+      if (node.name() == "acc2") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(3, node.input_size());
+        EXPECT_EQ("c1", node.input(0));
+        EXPECT_EQ("x", node.input(1));
+        EXPECT_EQ("y", node.input(2));
+      }
+      if (node.name() == "acc3") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(2, node.input_size());
+        EXPECT_EQ("ConstantFolding/acc3_partial_split_2", node.input(0));
+        EXPECT_EQ("z", node.input(1));
+      }
+      if (node.name() == "acc4") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(2, node.input_size());
+        EXPECT_EQ("ConstantFolding/acc4_partial_split_2", node.input(0));
+        EXPECT_EQ("y", node.input(1));
+      }
+      if (node.name() == "acc5") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(2, node.input_size());
+        EXPECT_EQ("x", node.input(0));
+        EXPECT_EQ("ConstantFolding/acc5_partial_split_2", node.input(1));
+      }
+      if (node.name() == "acc6") {
+        EXPECT_EQ(op_name, node.op());
+        EXPECT_EQ(3, node.input_size());
+        EXPECT_EQ("x", node.input(0));
+        EXPECT_EQ("ConstantFolding/acc6_partial_split_2", node.input(1));
+        EXPECT_EQ("y", node.input(2));
+      }
+      if (StringPiece(node.name()).starts_with("ConstantFolding/")) {
+        EXPECT_EQ("Const", node.op());
+      }
+    }
+
+    std::vector<string> fetch = {"acc0"};
+    auto tensors_expected = EvaluateNodes(item.graph, fetch);
+    auto tensors = EvaluateNodes(output, fetch);
+    EXPECT_EQ(1, tensors_expected.size());
+    EXPECT_EQ(1, tensors.size());
+    test::ExpectTensorNear<float>(tensors_expected[0], tensors[0], 1e-6);
+  }
+}
+
+TEST_F(ConstantFoldingTest, IdenticalN) {
+  tensorflow::Scope scope = tensorflow::Scope::NewRootScope();
+  Output x = ops::Placeholder(scope.WithOpName("x"), DT_FLOAT,
+                              ops::Placeholder::Shape(TensorShape({})));
+  Output c1 = ops::Const(scope.WithOpName("c1"), 1.0f, {2, 2});
+  Output c2 = ops::Const(scope.WithOpName("c2"), 2.0f, {2, 2});
+  auto id_n = ops::IdentityN(scope.WithOpName("id_n"), {c1, x, c2});
+  auto id0 = ops::Identity(scope.WithOpName("id0"), id_n[1]);
+  auto id1 = ops::Identity(scope.WithOpName("id1"), id_n[0]);
+  auto add0 = ops::Add(scope.WithOpName("add0"), id_n[0], id_n[1]);
+  auto add1 = ops::Add(scope.WithOpName("add1"), id_n[0], id_n[2]);
+
+  GrapplerItem item;
+  TF_CHECK_OK(scope.ToGraphDef(&item.graph));
+  item.fetch.push_back("id0");
+  item.fetch.push_back("id1");
+  item.fetch.push_back("add0");
+  item.fetch.push_back("add1");
+
+  ConstantFolding fold(nullptr /* cpu_device */);
+  GraphDef output;
+  Status status = fold.Optimize(nullptr, item, &output);
+
+  TF_EXPECT_OK(status);
+  EXPECT_EQ(8, output.node_size());
+  // id_n should remain unchanged.
+  EXPECT_EQ("id_n", output.node(3).name());
+  EXPECT_EQ(3, output.node(3).input_size());
+  EXPECT_EQ("c1", output.node(3).input(0));
+  EXPECT_EQ("x", output.node(3).input(1));
+  EXPECT_EQ("c2", output.node(3).input(2));
+  // id0 is unchanged.
+  EXPECT_EQ("id0", output.node(4).name());
+  EXPECT_EQ(1, output.node(4).input_size());
+  // id1 should have the constant input forwarded to it,
+  // and a control dependency from id_n.
+  EXPECT_EQ("id1", output.node(5).name());
+  EXPECT_EQ(2, output.node(5).input_size());
+  EXPECT_EQ("c1", output.node(5).input(0));
+  EXPECT_EQ("^id_n", output.node(5).input(1));
+
+  EXPECT_EQ("add0", output.node(6).name());
+  EXPECT_EQ(2, output.node(6).input_size());
+  EXPECT_EQ("c1", output.node(6).input(0));
+  EXPECT_EQ("id_n:1", output.node(6).input(1));
+
+  EXPECT_EQ("add1", output.node(7).name());
+  EXPECT_EQ(3, output.node(7).input_size());
+  EXPECT_EQ("c1", output.node(7).input(0));
+  EXPECT_EQ("c2", output.node(7).input(1));
+  EXPECT_EQ("^id_n", output.node(7).input(2));
+}
+
 }  // namespace
 }  // namespace grappler
 }  // namespace tensorflow
-
-//  LocalWords:  NewRootScope
diff --git a/tensorflow/core/grappler/optimizers/custom_graph_optimizer.h b/tensorflow/core/grappler/optimizers/custom_graph_optimizer.h
new file mode 100644
index 0000000000000000000000000000000000000000..a80d46f416d8c1f43c46c3183f19e4e582dec8ec
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/custom_graph_optimizer.h
@@ -0,0 +1,35 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_GRAPPLER_OPTIMIZERS_CUSTOM_GRAPH_OPTIMIZER_H_
+#define TENSORFLOW_GRAPPLER_OPTIMIZERS_CUSTOM_GRAPH_OPTIMIZER_H_
+
+#include "tensorflow/core/grappler/optimizers/graph_optimizer.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+namespace grappler {
+
+// A custom optimizer that can be registered.
+class CustomGraphOptimizer : public GraphOptimizer {
+ public:
+  virtual ~CustomGraphOptimizer() {}
+  virtual Status Init() = 0;
+};
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_GRAPPLER_OPTIMIZERS_CUSTOM_GRAPH_OPTIMIZER_H_
diff --git a/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc b/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc
new file mode 100644
index 0000000000000000000000000000000000000000..6eed43c2b132c02b58a0088c30dd5648fe80d212
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc
@@ -0,0 +1,61 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
+
+#include <string>
+#include <unordered_map>
+
+#include "tensorflow/core/platform/logging.h"
+
+namespace tensorflow {
+namespace grappler {
+
+namespace {
+typedef std::unordered_map<string, CustomGraphOptimizerRegistry::Creator>
+    RegistrationMap;
+RegistrationMap* registered_optimizers = nullptr;
+RegistrationMap* GetRegistrationMap() {
+  if (registered_optimizers == nullptr)
+    registered_optimizers = new RegistrationMap;
+  return registered_optimizers;
+}
+}  // namespace
+
+std::unique_ptr<CustomGraphOptimizer>
+CustomGraphOptimizerRegistry::CreateByNameOrNull(const string& name) {
+  const auto it = GetRegistrationMap()->find(name);
+  if (it == GetRegistrationMap()->end()) return nullptr;
+  return std::unique_ptr<CustomGraphOptimizer>(it->second());
+}
+
+std::vector<string> CustomGraphOptimizerRegistry::GetRegisteredOptimizers() {
+  std::vector<string> optimizer_names;
+  optimizer_names.reserve(GetRegistrationMap()->size());
+  for (const auto& opt : *GetRegistrationMap())
+    optimizer_names.emplace_back(opt.first);
+  return optimizer_names;
+}
+
+void CustomGraphOptimizerRegistry::RegisterOptimizerOrDie(
+    const Creator& optimizer_creator, const string& name) {
+  const auto it = GetRegistrationMap()->find(name);
+  if (it != GetRegistrationMap()->end()) {
+    LOG(FATAL) << "CustomGraphOptimizer is registered twice: " << name;
+  }
+  GetRegistrationMap()->insert({name, optimizer_creator});
+}
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h b/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h
new file mode 100644
index 0000000000000000000000000000000000000000..796da913737b9db1fe4e5cb00b235bf0f5979593
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h
@@ -0,0 +1,65 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#ifndef TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_CUSTOM_GRAPH_OPTIMIZER_REGISTRY_H_
+#define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_CUSTOM_GRAPH_OPTIMIZER_REGISTRY_H_
+
+#include <functional>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer.h"
+
+namespace tensorflow {
+namespace grappler {
+
+class CustomGraphOptimizerRegistry {
+ public:
+  static std::unique_ptr<CustomGraphOptimizer> CreateByNameOrNull(
+      const string& name);
+
+  static std::vector<string> GetRegisteredOptimizers();
+
+  typedef std::function<CustomGraphOptimizer*()> Creator;
+  // Register graph optimizer which can be called during program initialization.
+  // This class is not thread-safe.
+  static void RegisterOptimizerOrDie(const Creator& optimizer_creator,
+                                     const string& name);
+};
+
+class CustomGraphOptimizerRegistrar {
+ public:
+  explicit CustomGraphOptimizerRegistrar(
+      const CustomGraphOptimizerRegistry::Creator& creator,
+      const string& name) {
+    CustomGraphOptimizerRegistry::RegisterOptimizerOrDie(creator, name);
+  }
+};
+
+#define REGISTER_GRAPH_OPTIMIZER_AS(MyCustomGraphOptimizerClass, name) \
+  namespace {                                                          \
+  static CustomGraphOptimizerRegistrar                                 \
+      MyCustomGraphOptimizerClass##_registrar(                         \
+          []() { return new MyCustomGraphOptimizerClass; }, (name));   \
+  }  // namespace
+
+#define REGISTER_GRAPH_OPTIMIZER(MyCustomGraphOptimizerClass) \
+  REGISTER_GRAPH_OPTIMIZER_AS(MyCustomGraphOptimizerClass,    \
+                              #MyCustomGraphOptimizerClass)
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_CUSTOM_GRAPH_OPTIMIZER_REGISTRY_H_
diff --git a/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry_test.cc b/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..629f5e83c12e91a7cc0f68dc9993e0f7c0117d3c
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry_test.cc
@@ -0,0 +1,87 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
+
+#include <algorithm>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer.h"
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+static const char* kTestOptimizerName = "Test";
+
+class TestGraphOptimizer : public CustomGraphOptimizer {
+ public:
+  Status Init() override { return Status::OK(); }
+  string name() const override { return kTestOptimizerName; }
+  Status Optimize(Cluster* cluster, const GrapplerItem& item,
+                  GraphDef* optimized_graph) override {
+    return Status::OK();
+  }
+  void Feedback(Cluster* cluster, const GrapplerItem& item,
+                const GraphDef& optimized_graph, double result) override {}
+};
+
+REGISTER_GRAPH_OPTIMIZER_AS(TestGraphOptimizer, "StaticRegister");
+
+TEST(CustomGraphOptimizerRegistryTest, DynamicRegistration) {
+  std::vector<string> optimizers =
+      CustomGraphOptimizerRegistry::GetRegisteredOptimizers();
+  std::unique_ptr<const CustomGraphOptimizer> test_optimizer;
+  ASSERT_EQ(
+      0, std::count(optimizers.begin(), optimizers.end(), "DynamicRegister"));
+  test_optimizer =
+      CustomGraphOptimizerRegistry::CreateByNameOrNull("DynamicRegister");
+  EXPECT_EQ(nullptr, test_optimizer);
+  CustomGraphOptimizerRegistry::RegisterOptimizerOrDie(
+      []() { return new TestGraphOptimizer; }, "DynamicRegister");
+  optimizers = CustomGraphOptimizerRegistry::GetRegisteredOptimizers();
+  ASSERT_EQ(
+      1, std::count(optimizers.begin(), optimizers.end(), "DynamicRegister"));
+  test_optimizer =
+      CustomGraphOptimizerRegistry::CreateByNameOrNull("DynamicRegister");
+  ASSERT_NE(nullptr, test_optimizer);
+  EXPECT_EQ(kTestOptimizerName, test_optimizer->name());
+}
+
+TEST(CustomGraphOptimizerRegistryTest, StaticRegistration) {
+  const std::vector<string> optimizers =
+      CustomGraphOptimizerRegistry::GetRegisteredOptimizers();
+  EXPECT_EQ(1,
+            std::count(optimizers.begin(), optimizers.end(), "StaticRegister"));
+  std::unique_ptr<const CustomGraphOptimizer> test_optimizer =
+      CustomGraphOptimizerRegistry::CreateByNameOrNull("StaticRegister");
+  ASSERT_NE(nullptr, test_optimizer);
+  EXPECT_EQ(kTestOptimizerName, test_optimizer->name());
+}
+
+TEST(CustomGraphOptimizerRegistryTest, CrashesOnDuplicateRegistration) {
+  const auto creator = []() { return new TestGraphOptimizer; };
+  EXPECT_DEATH(CustomGraphOptimizerRegistry::RegisterOptimizerOrDie(
+                   creator, "StaticRegister"),
+               "twice");
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/dependency_optimizer.cc b/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
index edb0db65e987318e1e64bf0288b6ef18a7b9d662..b47cba5ff79acf428e2f65f3008a9130df61970f 100644
--- a/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/dependency_optimizer.cc
@@ -286,7 +286,10 @@ void DependencyOptimizer::OptimizeNode(int node_idx,
     std::vector<NodeDef*> input_nodes;
     for (int i = 0; i < num_inputs; ++i) {
       NodeDef* input_node = node_map_->GetNode(node->input(i));
-      CHECK_NE(input_node, nullptr);
+      if (input_node == nullptr) {
+        LOG(ERROR) << "Invalid input " << node->input(i);
+        return;
+      }
       input_nodes.push_back(input_node);
     }
 
diff --git a/tensorflow/core/grappler/optimizers/function_optimizer.cc b/tensorflow/core/grappler/optimizers/function_optimizer.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4b830bcc6e7891c9affacdf788280f3e1543afaa
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/function_optimizer.cc
@@ -0,0 +1,176 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/function_optimizer.h"
+#include <unordered_map>
+#include "tensorflow/core/framework/function.pb.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/op_def.pb.h"
+#include "tensorflow/core/framework/versions.pb.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/functions.h"
+
+namespace tensorflow {
+namespace grappler {
+
+Status InlineFunction(const NodeDef& node, const FunctionDef& func,
+                      const FunctionDefLibrary& library, GraphDef* graph) {
+  const std::unordered_map<string, AttrValue> attr(node.attr().begin(),
+                                                   node.attr().end());
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, attr, library);
+  if (!item) {
+    return errors::InvalidArgument("Failed to inline function ", node.op(),
+                                   " instantiated by ", node.name());
+  }
+
+  std::unordered_map<string, int> input_nodes;
+  for (int i = 0; i < func.signature().input_arg_size(); ++i) {
+    const OpDef::ArgDef& arg = func.signature().input_arg(i);
+    input_nodes[arg.name()] = i;
+  }
+
+  // Add an IdentityN op to hook the function inputs to: this ensures that
+  // they're all evaluated before the evaluation of the function body starts.
+  NodeDef* func_inputs = graph->add_node();
+  func_inputs->set_name(strings::StrCat(node.name(), "/", "inlined_inputs"));
+  func_inputs->set_op("IdentityN");
+  func_inputs->set_device(node.device());
+  *func_inputs->mutable_input() = node.input();
+  AttrValue::ListValue* type_list =
+      (*func_inputs->mutable_attr())["T"].mutable_list();
+  for (const OpDef::ArgDef& arg : func.signature().input_arg()) {
+    if (arg.type() != DT_INVALID) {
+      type_list->add_type(arg.type());
+    } else {
+      auto it = attr.find(arg.type_attr());
+      if (it == attr.end()) {
+        return errors::InvalidArgument("Invalid input argument ", arg.name(),
+                                       " for function ", node.op(),
+                                       " instantiated by ", node.name());
+      }
+      type_list->add_type(it->second.type());
+    }
+  }
+
+  for (NodeDef& func_body_node : *item->graph.mutable_node()) {
+    if (input_nodes.find(func_body_node.name()) != input_nodes.end()) {
+      // Turn input placeholders into identity nodes
+      if (IsPlaceholder(func_body_node)) {
+        func_body_node.set_op("Identity");
+      }
+      CHECK_EQ(0, func_body_node.input_size());
+      int input_id = input_nodes[func_body_node.name()];
+      func_body_node.add_input(
+          strings::StrCat(func_inputs->name(), ":", input_id));
+    } else {
+      // Update the input names.
+      for (string& input : *func_body_node.mutable_input()) {
+        input = AddPrefixToNodeName(input, node.name());
+      }
+    }
+
+    // Add the node name as a prefix to avoid collisions after inlining
+    func_body_node.set_name(
+        strings::StrCat(node.name(), "/", func_body_node.name()));
+
+    // Make sure the node is placed
+    func_body_node.set_device(node.device());
+
+    // Move the node to the main graph
+    graph->add_node()->Swap(&func_body_node);
+  }
+
+  // Add an IdentityN op to hook the function outputs to: this ensures that the
+  // function body is fully evaluated before its fanout gets scheduled.
+  NodeDef* func_outputs = graph->add_node();
+  func_outputs->set_name(node.name());
+  func_outputs->set_op("IdentityN");
+  func_outputs->set_device(node.device());
+  type_list = (*func_outputs->mutable_attr())["T"].mutable_list();
+  for (int i = 0; i < func.signature().output_arg_size(); ++i) {
+    const OpDef::ArgDef& arg = func.signature().output_arg(i);
+    if (arg.type() != DT_INVALID) {
+      type_list->add_type(arg.type());
+    } else {
+      auto it = attr.find(arg.type_attr());
+      if (it == attr.end()) {
+        return errors::InvalidArgument("Invalid output argument ", arg.name(),
+                                       " for function ", node.op(),
+                                       " instantiated by ", node.name());
+      }
+      type_list->add_type(it->second.type());
+    }
+    // Use the fetch names since they take into account the output mapping.
+    func_outputs->add_input(strings::StrCat(node.name(), "/", item->fetch[i]));
+  }
+
+  return Status::OK();
+}
+
+Status FunctionOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
+                                   GraphDef* optimized_graph) {
+  std::unordered_map<string, const FunctionDef*> functions;
+  for (const FunctionDef& func : item.graph.library().function()) {
+    // Don't inline functions marked as noinline
+    if (func.attr().count("_noinline") != 0) {
+      continue;
+    }
+    // Can't create IdentityN nodes with no input or output: skip these
+    // functions for now.
+    if (func.signature().input_arg_size() == 0 ||
+        func.signature().output_arg_size() == 0) {
+      continue;
+    }
+    functions[func.signature().name()] = &func;
+  }
+
+  // Nothing to do.
+  if (functions.empty()) {
+    *optimized_graph = item.graph;
+    return Status::OK();
+  }
+
+  // Inline functions when possible.
+  for (const NodeDef& node : item.graph.node()) {
+    auto it = functions.find(node.op());
+    if (it == functions.end()) {
+      *optimized_graph->add_node() = node;
+    } else {
+      TF_RETURN_IF_ERROR(InlineFunction(node, *it->second, item.graph.library(),
+                                        optimized_graph));
+    }
+  }
+
+  // TODO(bsteiner): specialize the implementation of functions that can't be
+  // inlined based on the context in which they're instantiated.
+
+  // TODO(bsteiner): trim the library to remove unused function definitions
+  *optimized_graph->mutable_library() = item.graph.library();
+  *optimized_graph->mutable_versions() = item.graph.versions();
+
+  return Status::OK();
+}
+
+void FunctionOptimizer::Feedback(Cluster* cluster, const GrapplerItem& item,
+                                 const GraphDef& optimized_graph,
+                                 double result) {
+  // Nothing to do for FunctionOptimizer.
+}
+
+}  // end namespace grappler
+}  // end namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/function_optimizer.h b/tensorflow/core/grappler/optimizers/function_optimizer.h
new file mode 100644
index 0000000000000000000000000000000000000000..5c80226e9dbf57908f8942f31051761f743265b8
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/function_optimizer.h
@@ -0,0 +1,43 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_GRAPPLER_OPTIMIZERS_FUNCTION_OPTIMIZER_H_
+#define TENSORFLOW_GRAPPLER_OPTIMIZERS_FUNCTION_OPTIMIZER_H_
+
+#include "tensorflow/core/grappler/optimizers/graph_optimizer.h"
+
+namespace tensorflow {
+namespace grappler {
+
+// Inlines function calls into the main graph. The inputs and outputs of each
+// inlined function body are routed through IdentityN nodes.
+class FunctionOptimizer : public GraphOptimizer {
+ public:
+  FunctionOptimizer() {}
+  ~FunctionOptimizer() override {}
+
+  string name() const override { return "function_optimizer"; }
+
+  Status Optimize(Cluster* cluster, const GrapplerItem& item,
+                  GraphDef* optimized_graph) override;
+
+  void Feedback(Cluster* cluster, const GrapplerItem& item,
+                const GraphDef& optimized_graph, double result) override;
+};
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_GRAPPLER_OPTIMIZERS_FUNCTION_OPTIMIZER_H_
diff --git a/tensorflow/core/grappler/optimizers/function_optimizer_test.cc b/tensorflow/core/grappler/optimizers/function_optimizer_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..8db9b7f77adadc6a6404d34fbd63b9fa840c5006
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/function_optimizer_test.cc
@@ -0,0 +1,378 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/function_optimizer.h"
+#include "tensorflow/core/framework/function_testlib.h"
+#include "tensorflow/core/framework/tensor_testutil.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/utils/grappler_test.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+class FunctionOptimizerTest : public GrapplerTest {};
+
+TEST_F(FunctionOptimizerTest, SimpleFunction) {
+  // Build a graph to compute y = XTimesTwo(x)
+  GrapplerItem item;
+  constexpr char device[] = "/device:CPU:0";
+  item.graph = test::function::GDef(
+      {test::function::NDef("x", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("y", "XTimesTwo", {"x"}, {{"T", DT_FLOAT}}, device),
+       test::function::NDef("z", "Identity", {"y"}, {{"T", DT_FLOAT}}, device)},
+      // FunctionLib
+      {
+          test::function::XTimesTwo(),
+      });
+
+  FunctionOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  int count = 0;
+  for (const NodeDef& node : output.node()) {
+    if (node.name() == "y/inlined_inputs") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+    } else if (node.name() == "y/x") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/inlined_inputs:0", node.input(0));
+    } else if (node.name() == "y/two") {
+      count++;
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(device, node.device());
+    } else if (node.name() == "y/scale") {
+      count++;
+      EXPECT_EQ("Cast", node.op());
+      EXPECT_EQ(device, node.device());
+    } else if (node.name() == "y/y") {
+      count++;
+      EXPECT_EQ("Mul", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("y/x", node.input(0));
+      EXPECT_EQ("y/scale:0", node.input(1));
+    } else if (node.name() == "y") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/y:0", node.input(0));
+    } else if (node.name() == "z") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y", node.input(0));
+    }
+  }
+  EXPECT_EQ(7, count);
+
+  item.fetch = {"z"};
+  Tensor pi(DT_FLOAT, {});
+  pi.flat<float>()(0) = 3.14f;
+  item.feed.emplace_back("x", pi);
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+}
+
+TEST_F(FunctionOptimizerTest, FixedTypeFunction) {
+  // Create and instantiate a version of the XTimesTwo function that only
+  // accepts floats as inputs.
+  const Tensor kTwo = test::AsScalar<float>(2.0f);
+  FunctionDef x_times_two = FunctionDefHelper::Define(
+      // Name
+      "XTimesTwo",
+      // Args
+      {"x: float"},
+      // Return values
+      {"y: float"},
+      // Attr def
+      {},
+      // Nodes
+      {
+          {{"two"}, "Const", {}, {{"value", kTwo}, {"dtype", DT_FLOAT}}},
+          {{"y"}, "Mul", {"x", "two"}, {{"T", DT_FLOAT}}},
+      });
+
+  constexpr char device[] = "/device:CPU:0";
+  GrapplerItem item;
+  item.graph = test::function::GDef(
+      {test::function::NDef("x", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("y", "XTimesTwo", {"x"}, {}, device),
+       test::function::NDef("z", "Identity", {"y"}, {{"T", DT_FLOAT}}, device)},
+      // FunctionLib
+      {
+          x_times_two,
+      });
+
+  FunctionOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  int count = 0;
+  for (const NodeDef& node : output.node()) {
+    if (node.name() == "y/inlined_inputs") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+    } else if (node.name() == "y/x") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/inlined_inputs:0", node.input(0));
+    } else if (node.name() == "y/two") {
+      count++;
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(device, node.device());
+    } else if (node.name() == "y/y") {
+      count++;
+      EXPECT_EQ("Mul", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("y/x", node.input(0));
+      EXPECT_EQ("y/two:0", node.input(1));
+    } else if (node.name() == "y") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/y:0", node.input(0));
+    } else if (node.name() == "z") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y", node.input(0));
+    }
+  }
+  EXPECT_EQ(6, count);
+
+  item.fetch = {"z"};
+  Tensor pi(DT_FLOAT, {});
+  pi.flat<float>()(0) = 3.14f;
+  item.feed.emplace_back("x", pi);
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+}
+
+TEST_F(FunctionOptimizerTest, FunctionWithOutputMapping) {
+  FunctionDef func = FunctionDefHelper::Create(
+      // Name
+      "Exp_func",
+      // Args
+      {"in: float"},
+      // Return values
+      {"out: float"},
+      // Attr def
+      {},
+      // Nodes
+      {{{"Linear_func"}, "Identity", {"in"}, {{"T", DT_FLOAT}}},
+       {{"Exp"}, "Exp", {"Linear_func:output:0"}, {{"T", DT_FLOAT}}}},
+      // Mapping
+      {{"out", "Exp:y:0"}});
+
+  GrapplerItem item;
+  constexpr char device[] = "/device:CPU:0";
+  item.graph = test::function::GDef(
+      {test::function::NDef("x", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("y", "Exp_func", {"x"}, {}, device),
+       test::function::NDef("z", "Identity", {"y"}, {{"T", DT_FLOAT}}, device)},
+      // FunctionLib
+      {
+          func,
+      });
+
+  FunctionOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  int count = 0;
+  for (const NodeDef& node : output.node()) {
+    if (node.name() == "y/inlined_inputs") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+    } else if (node.name() == "y/in") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/inlined_inputs:0", node.input(0));
+    } else if (node.name() == "y/Linear_func") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/in", node.input(0));
+    } else if (node.name() == "y/Exp") {
+      count++;
+      EXPECT_EQ("Exp", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/Linear_func:0", node.input(0));
+    } else if (node.name() == "y") {
+      count++;
+      EXPECT_EQ("IdentityN", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y/Exp:0", node.input(0));
+    } else if (node.name() == "z") {
+      count++;
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(device, node.device());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("y", node.input(0));
+    }
+  }
+  EXPECT_EQ(6, count);
+
+  item.fetch = {"z"};
+  Tensor pi(DT_FLOAT, {});
+  pi.flat<float>()(0) = 3.14f;
+  item.feed.emplace_back("x", pi);
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+}
+
+TEST_F(FunctionOptimizerTest, FunctionWithInputForwarding) {
+  FunctionDef func = FunctionDefHelper::Create(
+      // Name
+      "ForwardInputs",
+      // Args
+      {"in0: float", "in1: float", "arg2: float", "arg3: int32", "arg4: float"},
+      // Return values
+      {"out0: float", "arg2: float", "arg3: int32"},
+      // Attr def
+      {},
+      // Nodes
+      {},
+      // Mapping
+      {{"out0", "in0"}, {"arg2", "arg2"}, {"arg3", "arg3"}});
+
+  GrapplerItem item;
+  constexpr char device[] = "/device:CPU:0";
+  item.graph = test::function::GDef(
+      {test::function::NDef("x0", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("x1", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("x2", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("x3", "Placeholder", {}, {{"dtype", DT_INT32}},
+                            device),
+       test::function::NDef("x4", "Placeholder", {}, {{"dtype", DT_FLOAT}},
+                            device),
+       test::function::NDef("y", "ForwardInputs",
+                            {"x0", "x1", "x2", "x3", "x4"}, {}, device),
+       test::function::NDef("z0", "Identity", {"y:0"}, {{"T", DT_FLOAT}},
+                            device),
+       test::function::NDef("z1", "Identity", {"y:1"}, {{"T", DT_FLOAT}},
+                            device),
+       test::function::NDef("z2", "Identity", {"y:2"}, {{"T", DT_INT32}},
+                            device)},
+      // FunctionLib
+      {
+          func,
+      });
+
+  FunctionOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  item.fetch = {"z0", "z1", "z2"};
+  Tensor in(DT_FLOAT, {});
+  in.flat<float>()(0) = 3.14f;
+  item.feed.emplace_back("x0", in);
+  in.flat<float>()(0) = 2.7f;
+  item.feed.emplace_back("x1", in);
+  in.flat<float>()(0) = 1.0f;
+  item.feed.emplace_back("x2", in);
+  in.flat<float>()(0) = -1.0f;
+  item.feed.emplace_back("x4", in);
+  Tensor in_int(DT_INT32, {});
+  in_int.flat<int>()(0) = 1234;
+  item.feed.emplace_back("x3", in_int);
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+  test::ExpectTensorEqual<float>(tensors_expected[1], tensors[1]);
+  test::ExpectTensorEqual<int>(tensors_expected[2], tensors[2]);
+}
+
+TEST_F(FunctionOptimizerTest, FunctionWithoutInput) {
+  const Tensor kTwo = test::AsScalar<int64>(2);
+  FunctionDef func = FunctionDefHelper::Define(
+      // Name
+      "GenerateTwo",
+      // Args
+      {},
+      // Return value
+      {"o: T"},
+      // Attr def
+      {"T: {float, double}"},
+      // Nodes
+      {{{"two"}, "Const", {}, {{"value", kTwo}, {"dtype", DT_INT64}}},
+       {{"o"}, "Cast", {"two"}, {{"SrcT", DT_INT64}, {"DstT", "$T"}}}});
+
+  GrapplerItem item;
+  constexpr char device[] = "/device:CPU:0";
+  item.graph = test::function::GDef(
+      {test::function::NDef("y", "GenerateTwo", {}, {}, device),
+       test::function::NDef("z", "Identity", {"y"}, {{"T", DT_FLOAT}}, device)},
+      // FunctionLib
+      {
+          func,
+      });
+
+  FunctionOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  // For now we won't inline the function.
+  EXPECT_EQ(item.graph.DebugString(), output.DebugString());
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/gpu_swapping_kernels.cc b/tensorflow/core/grappler/optimizers/gpu_swapping_kernels.cc
new file mode 100644
index 0000000000000000000000000000000000000000..1820af6844215475d2bfccba93891a52029218b2
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/gpu_swapping_kernels.cc
@@ -0,0 +1,88 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Op kernels used to swap data in and out of GPU memory.
+
+#include "tensorflow/core/common_runtime/device.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+namespace {
+
+class CopyFromGpuToHostKernel : public AsyncOpKernel {
+ public:
+  explicit CopyFromGpuToHostKernel(OpKernelConstruction* context)
+      : AsyncOpKernel(context) {}
+  void ComputeAsync(OpKernelContext* ctx, DoneCallback done) override {
+    const Tensor& input = ctx->input(0);
+    OP_REQUIRES_ASYNC(
+        ctx, !ctx->input_alloc_attr(0).on_host(),
+        errors::Internal("The input tensor to the _CopyFromGpuToHost kernel "
+                         "must reside on the device."),
+        done);
+
+    AllocatorAttributes alloc_attrs;
+    alloc_attrs.set_gpu_compatible(true);
+    alloc_attrs.set_on_host(true);
+    Tensor* output;
+    OP_REQUIRES_OK_ASYNC(
+        ctx, ctx->allocate_output(0, input.shape(), &output, alloc_attrs),
+        done);
+
+    ctx->op_device_context()->CopyDeviceTensorToCPU(
+        &input, "CopyFromGpuToHost", static_cast<Device*>(ctx->device()),
+        output, [ctx, done](const Status& s) {
+          ctx->SetStatus(s);
+          done();
+        });
+  }
+};
+
+REGISTER_KERNEL_BUILDER(
+    Name("_CopyFromGpuToHost").Device(DEVICE_GPU).HostMemory("output"),
+    CopyFromGpuToHostKernel);
+
+class CopyFromHostToGpuKernel : public AsyncOpKernel {
+ public:
+  explicit CopyFromHostToGpuKernel(OpKernelConstruction* context)
+      : AsyncOpKernel(context) {}
+  void ComputeAsync(OpKernelContext* ctx, DoneCallback done) override {
+    const Tensor& input = ctx->input(0);
+    OP_REQUIRES_ASYNC(
+        ctx, ctx->input_alloc_attr(0).on_host(),
+        errors::Internal("The input tensor to the _CopyFromHostToGpu kernel "
+                         "must reside on the host."),
+        done);
+
+    Tensor* output;
+    OP_REQUIRES_OK_ASYNC(ctx, ctx->allocate_output(0, input.shape(), &output),
+                         done);
+
+    ctx->op_device_context()->CopyCPUTensorToDevice(
+        &input, static_cast<Device*>(ctx->device()), output,
+        [ctx, done](const Status& s) {
+          ctx->SetStatus(s);
+          done();
+        });
+  }
+};
+
+REGISTER_KERNEL_BUILDER(
+    Name("_CopyFromHostToGpu").Device(DEVICE_GPU).HostMemory("input"),
+    CopyFromHostToGpuKernel);
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/gpu_swapping_ops.cc b/tensorflow/core/grappler/optimizers/gpu_swapping_ops.cc
new file mode 100644
index 0000000000000000000000000000000000000000..46828346da608a237528da2a2a8070c57946f762
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/gpu_swapping_ops.cc
@@ -0,0 +1,58 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+// Definition for the ops used to swap data in and out of GPU memory.
+
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/shape_inference.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+namespace {
+
+// The _CopyFromGpuToHost op copies its input tensor to the host. The input must
+// reside on GPU. The op itself must be placed on GPU.
+REGISTER_OP("_CopyFromGpuToHost")
+    .Input("input: T")
+    .Output("output: T")
+    .Attr("T: type")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      c->set_output(0, c->input(0));
+      auto* handle_data = c->input_handle_shapes_and_types(0);
+      if (handle_data != nullptr) {
+        c->set_output_handle_shapes_and_types(0, *handle_data);
+      }
+      return Status::OK();
+    })
+    .Doc("Copies the input tensor from the GPU to the host.");
+
+// The _CopyFromHostToGpu op copies its input tensor from the host to the GPU.
+// The input must reside on CPU. The op itself must be placed on GPU.
+REGISTER_OP("_CopyFromHostToGpu")
+    .Input("input: T")
+    .Output("output: T")
+    .Attr("T: type")
+    .SetShapeFn([](shape_inference::InferenceContext* c) {
+      c->set_output(0, c->input(0));
+      auto* handle_data = c->input_handle_shapes_and_types(0);
+      if (handle_data != nullptr) {
+        c->set_output_handle_shapes_and_types(0, *handle_data);
+      }
+      return Status::OK();
+    })
+    .Doc("Copies the input tensor from the host to the GPU.");
+
+}  // namespace
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/loop_optimizer.cc b/tensorflow/core/grappler/optimizers/loop_optimizer.cc
index 102526e22f4742cb90757a1daf55467dd16afc3e..91a090f803094bf76480268111d940d0fb939eaa 100644
--- a/tensorflow/core/grappler/optimizers/loop_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/loop_optimizer.cc
@@ -15,25 +15,465 @@ limitations under the License.
 
 #include "tensorflow/core/grappler/optimizers/loop_optimizer.h"
 
+#include <algorithm>
+#include <deque>
+#include <limits>
 #include <unordered_map>
 #include <unordered_set>
+#include <vector>
 
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/node_def.pb.h"
 #include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/tensor_shape.pb.h"
+#include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/grappler/costs/graph_properties.h"
 #include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/grappler/op_types.h"
+#include "tensorflow/core/grappler/optimizers/constant_folding.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/frame.h"
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/lib/core/stringpiece.h"
 #include "tensorflow/core/lib/strings/strcat.h"
+#include "tensorflow/core/platform/tensor_coding.h"
+#include "tensorflow/core/util/device_name_utils.h"
+#include "tensorflow/core/util/saved_tensor_slice_util.h"
+
+using tensorflow::strings::StrCat;
 
 namespace tensorflow {
 namespace grappler {
+namespace {
+
+std::vector<int> GetStackPushNodesToConvert(const SimpleGraphView& graph_view,
+                                            int stack_node_idx) {
+  VLOG(1) << "Stack node: " << graph_view.graph()->node(stack_node_idx).name();
+  const std::unordered_set<string> op_types_to_traverse(
+      {"Stack", "StackV2", "Enter", "RefEnter", "Switch", "RefSwitch",
+       "Identity", "RefIdentity"});
+  std::vector<int> nodes_to_convert;
+  std::set<int> fanout;
+  graph_view.DepthFirstSearch(op_types_to_traverse, stack_node_idx, &fanout);
+  for (int fanout_idx : fanout) {
+    const NodeDef& fanout_node = graph_view.graph()->node(fanout_idx);
+    VLOG(1) << "Fanout " << fanout_idx << " : " << fanout_node.name();
+    if (IsStackPushOp(fanout_node)) {
+      nodes_to_convert.push_back(fanout_idx);
+    } else if (IsStackOp(fanout_node) || IsStackCloseOp(fanout_node) ||
+               op_types_to_traverse.find(fanout_node.op()) !=
+                   op_types_to_traverse.end()) {
+      continue;
+    } else {
+      // The node is either a StackPop node or an unexpected op that may hide
+      // a StackPop node behind it, so we leave the graph alone.
+      nodes_to_convert.clear();
+      break;
+    }
+  }
+  return nodes_to_convert;
+}
+
+Status RemoveStackOps(const GraphDef& graph, GraphDef* optimized_graph) {
+  *optimized_graph = graph;
+  NodeMap node_map(optimized_graph);
+  SimpleGraphView graph_view;
+  TF_RETURN_IF_ERROR(graph_view.Initialize(graph));
+  for (int node_idx = 0; node_idx < graph.node_size(); ++node_idx) {
+    if (IsStackOp(graph.node(node_idx))) {
+      for (int push_node_idx :
+           GetStackPushNodesToConvert(graph_view, node_idx)) {
+        // We found push nodes without corresponding pops. Convert them to
+        // Identity passing the data through and add a control dependency from
+        // the op supplying the handle.
+        NodeDef* push_node = optimized_graph->mutable_node(push_node_idx);
+        VLOG(1) << "Converting " << push_node_idx << " : "
+                << push_node->DebugString();
+        if (push_node->attr().count("swap_memory") != 0) {
+          push_node->mutable_attr()->erase("swap_memory");
+        }
+        push_node->set_op("Identity");
+        push_node->mutable_input()->SwapElements(0, 1);
+        const string ctrl_dep = ConstantFolding::AddControlDependency(
+            push_node->input(1), optimized_graph, &node_map);
+        push_node->set_input(1, ctrl_dep);
+        VLOG(1) << "After converting: " << push_node->DebugString();
+      }
+    }
+  }
+  return Status::OK();
+}
+
+}  // namespace
+
+Status LoopOptimizer::LINMHandleInvariantEnter(NodeDef* node,
+                                               const int num_outputs) {
+  auto consumers = node_map_->GetOutputs(node->name());
+  std::vector<string> enter_control_inputs;
+  string enter_input;
+  for (auto& input : node->input()) {
+    if (IsControlInput(input)) {
+      enter_control_inputs.push_back(input);
+    } else {
+      enter_input = input;
+    }
+  }
+  for (auto* consumer : consumers) {
+    if (invariant_nodes_.count(consumer)) {
+      for (int i = 0; i < consumer->input_size(); ++i) {
+        if (NodeName(consumer->input(i)) == node->name()) {
+          consumer->set_input(i, enter_input);
+          node_map_->AddOutput(NodeName(enter_input), consumer->name());
+          node_map_->RemoveOutput(node->name(), consumer->name());
+        }
+      }
+      for (auto& control_input : enter_control_inputs) {
+        consumer->add_input(control_input);
+        node_map_->AddOutput(NodeName(control_input), consumer->name());
+      }
+    }
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::LINMHandleConst(NodeDef* node,
+    const int num_outputs, const int frame_id) {
+  NodeDef* const_node;
+  if (num_outputs == 0) {
+    // All successor nodes are invariant, so the const node itself can be
+    // moved out of the frame (into the parent frame). Remove its control
+    // inputs from this frame when moving it.
+    const_node = node;
+    node_map_->RemoveInputs(node->name());
+    node->clear_input();
+  } else {
+    // Some successor nodes are variant, so the const node has to stay in the
+    // frame. Create a copy outside the frame (in the parent frame) and rewire
+    // the invariant consumers to it.
+    const_node = optimized_graph_->add_node();
+    const_node->set_name(AddPrefixToNodeName(node->name(), kLoopOptimizer));
+    const_node->set_op("Const");
+    const_node->set_device(node->device());
+    *const_node->mutable_attr() = node->attr();
+    node_map_->AddNode(const_node->name(), const_node);
+    auto consumers = node_map_->GetOutputs(node->name());
+    for (auto* consumer : consumers) {
+      if (invariant_nodes_.count(consumer)) {
+        for (int i = 0; i < consumer->input_size(); ++i) {
+          if (NodeName(consumer->input(i)) == node->name()) {
+            if (IsControlInput(consumer->input(i))) {
+              *consumer->mutable_input(i) = AsControlDependency(*const_node);
+            } else {
+              *consumer->mutable_input(i) = const_node->name();
+            }
+            node_map_->AddOutput(const_node->name(), consumer->name());
+            node_map_->RemoveOutput(node->name(), consumer->name());
+          }
+        }
+      }
+    }
+  }
+  // Add a control input from the parent frame.
+  auto parent_it = frame_parent_.find(frame_id);
+  if (parent_it != frame_parent_.end()) {
+    int parent_id = parent_it->second;
+    auto loop_cond_it = loop_cond_.find(parent_id);
+    if (loop_cond_it == loop_cond_.end()) {
+      return errors::InvalidArgument(
+          "Frame ", parent_id, " doesn't have a LoopCond node");
+    }
+    auto& loop_cond_name = loop_cond_it->second->name();
+    NodeDef* switch_node = nullptr;
+    for (auto* node : node_map_->GetOutputs(loop_cond_name)) {
+      if (node->op() == "Switch") {
+        switch_node = node;
+        break;
+      }
+    }
+    if (!switch_node) {
+      return errors::InvalidArgument(
+          "LoopCond node of Frame ", parent_id,
+          " doesn't connect to any Switch node");
+    }
+    string switch_output = StrCat(switch_node->name(), ":1");
+    const string ctrl_dep = ConstantFolding::AddControlDependency(
+        switch_output, optimized_graph_, node_map_.get());
+    const_node->add_input(ctrl_dep);
+    node_map_->AddOutput(NodeName(ctrl_dep), const_node->name());
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::LINMHandleInvariantNode(NodeDef* node,
+    const int num_outputs, const int frame_id) {
+  // Remove control inputs to the invariant node when moving it out of this
+  // frame. Iterate backwards so no swapped-in control input is skipped.
+  for (int i = node->input_size() - 1; i >= 0; --i) {
+    if (IsControlInput(node->input(i))) {
+      node->mutable_input()->SwapElements(i, node->input_size() - 1);
+      node->mutable_input()->RemoveLast();
+    }
+  }
+  if (num_outputs == 0) {
+    return Status::OK();
+  }
+
+  DataTypeVector input_types;
+  DataTypeVector output_types;
+  OpRegistryInterface* op_registry = OpRegistry::Global();
+  const OpRegistrationData* op_reg_data = nullptr;
+  TF_RETURN_IF_ERROR(
+      op_registry->LookUp(node->op(), &op_reg_data));
+  TF_RETURN_IF_ERROR(
+      InOutTypesForNode(*node, op_reg_data->op_def,
+                        &input_types, &output_types));
+
+  auto consumers = node_map_->GetOutputs(node->name());
+  string fname = invariant_enters_[frame_id][0]->attr().at("frame_name").s();
+  int piterations = invariant_enters_[frame_id][0]
+                    ->attr().at("parallel_iterations").i();
+  for (auto* consumer : consumers) {
+    if (!invariant_nodes_.count(consumer)) {
+      for (int i = 0; i < consumer->input_size(); ++i) {
+        int port;
+        string node_name = ParseNodeName(consumer->input(i), &port);
+        if (node_name != node->name()) {
+          continue;
+        }
+        if (port < 0) {
+          return errors::InvalidArgument(
+              "Invariant node should not have control outputs "
+              "to variant node");
+        }
+        DataType output_type = output_types[port];
+        NodeDef* new_enter = optimized_graph_->add_node();
+        new_enter->set_op("Enter");
+        new_enter->set_device(node->device());
+        new_enter->set_name(AddPrefixToNodeName(
+            StrCat(fname, "_enter_", new_enter_id_++), kLoopOptimizer));
+        AttrValue data_type;
+        data_type.set_type(output_type);
+        new_enter->mutable_attr()->insert({"T", data_type});
+        AttrValue frame_name;
+        frame_name.set_s(fname);
+        new_enter->mutable_attr()->insert({"frame_name", frame_name});
+        AttrValue is_const;
+        is_const.set_b(true);
+        new_enter->mutable_attr()->insert({"is_constant", is_const});
+        AttrValue parallel_iterations;
+        parallel_iterations.set_i(piterations);
+        new_enter->mutable_attr()->insert(
+            {"parallel_iterations", parallel_iterations});
+        new_enter->add_input(consumer->input(i));
+        *consumer->mutable_input(i) = new_enter->name();
+        node_map_->AddNode(new_enter->name(), new_enter);
+        node_map_->AddOutput(node->name(), new_enter->name());
+        node_map_->AddOutput(new_enter->name(), consumer->name());
+      }
+    }
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::MoveInvariantNodes(const int frame_id) {
+  for (auto iter = invariant_nodes_.begin();
+       iter != invariant_nodes_.end(); ++iter) {
+    auto* invariant_node = iter->first;
+    const int num_outputs = iter->second;
+    if (IsEnter(*invariant_node)) {
+      TF_RETURN_IF_ERROR(
+          LINMHandleInvariantEnter(invariant_node, num_outputs));
+    } else if (IsConstant(*invariant_node)) {
+      TF_RETURN_IF_ERROR(
+          LINMHandleConst(invariant_node, num_outputs, frame_id));
+    } else {
+      TF_RETURN_IF_ERROR(
+          LINMHandleInvariantNode(invariant_node, num_outputs, frame_id));
+    }
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::RevertInvariantNodes() {
+  std::deque<const NodeDef*> reverted_nodes;
+  for (auto iter = invariant_nodes_.begin(); iter != invariant_nodes_.end();) {
+    bool erased = false;
+    const auto* node = iter->first;
+    if (!IsConstant(*node) && !IsEnter(*node) && iter->second > 0) {
+      auto& consumers = node_map_->GetOutputs(node->name());
+      for (auto* consumer : consumers) {
+        if (!invariant_nodes_.count(consumer)) {
+          for (const auto& input : consumer->input()) {
+            if (IsControlInput(input) && NodeName(input) == node->name()) {
+              reverted_nodes.push_back(node);
+              invariant_nodes_.erase(iter++);
+              erased = true;
+              break;
+            }
+          }
+          if (erased) break;
+        }
+      }
+    }
+    if (!erased) ++iter;
+  }
+  while (!reverted_nodes.empty()) {
+    const auto* node = reverted_nodes.front();
+    reverted_nodes.pop_front();
+    std::set<NodeDef*> producers;
+    for (const auto& input : node->input()) {
+      auto* producer = node_map_->GetNode(input);
+      auto iter = invariant_nodes_.find(producer);
+      if (iter != invariant_nodes_.end()) {
+        if (IsControlInput(input) &&
+            !IsConstant(*producer) && !IsEnter(*producer)) {
+          reverted_nodes.push_back(producer);
+          invariant_nodes_.erase(iter);
+        } else {
+          producers.insert(producer);
+        }
+      }
+    }
+    for (auto* producer : producers) {
+      auto iter = invariant_nodes_.find(producer);
+      if (iter != invariant_nodes_.end()) {
+        ++iter->second;
+      }
+    }
+    for (auto* consumer : node_map_->GetOutputs(node->name())) {
+      auto iter = invariant_nodes_.find(consumer);
+      if (iter != invariant_nodes_.end()) {
+        reverted_nodes.push_back(consumer);
+        invariant_nodes_.erase(iter);
+      }
+    }
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::FindInvariantNodes(NodeDef* node) {
+  auto consumers = node_map_->GetOutputs(node->name());
+  invariant_nodes_.insert(std::make_pair(node, consumers.size()));
+  for (auto* consumer : consumers) {
+    if (invariant_nodes_.count(consumer) ||
+        ModifiesFrameInfo(*consumer)) {
+      continue;
+    }
+    bool is_invariant = true;
+    for (const auto& input : consumer->input()) {
+      if (!IsControlInput(input)) {
+        const auto& name = NodeName(input);
+        auto* producer = node_map_->GetNode(name);
+        if (!invariant_nodes_.count(producer)) {
+          if (IsConstant(*producer)) {
+            invariant_nodes_.insert(
+                std::make_pair(producer, node_map_->GetOutputs(name).size()));
+          } else {
+            is_invariant = false;
+            break;
+          }
+        }
+      }
+    }
+    if (is_invariant) {
+      std::set<NodeDef*> producers;
+      for (const auto& input : consumer->input()) {
+        auto* producer = node_map_->GetNode(input);
+        producers.insert(producer);
+      }
+      for (auto* producer : producers) {
+        auto iter = invariant_nodes_.find(producer);
+        if (iter != invariant_nodes_.end()) {
+          --iter->second;
+        }
+      }
+      TF_RETURN_IF_ERROR(FindInvariantNodes(consumer));
+    }
+  }
+  return Status::OK();
+}
+
+Status LoopOptimizer::LoopInvariantNodeMotion() {
+  std::deque<int> worklist;
+  for (auto iter = frame_map_.begin(); iter != frame_map_.end(); ++iter) {
+    auto* node = iter->first;
+    auto& frame_ids = iter->second;
+    if (frame_ids.size() >= 3) {
+      for (unsigned int i = 1; i < frame_ids.size() - 1; ++i) {
+        frame_parent_[frame_ids[i]] = frame_ids[i - 1];
+        frame_children_[frame_ids[i]].insert(frame_ids[i + 1]);
+      }
+    }
+    if (frame_ids.size() >= 2) {
+      frame_children_[frame_ids[0]].insert(frame_ids[1]);
+      frame_parent_[frame_ids.back()] = frame_ids[frame_ids.size() - 2];
+    }
+    if (frame_ids.size() >= 1) {
+      frame_children_.insert(std::make_pair(frame_ids.back(), empty_set_));
+      if (node->op() == "LoopCond") {
+        if (loop_cond_.count(frame_ids.back())) {
+          return errors::InvalidArgument(
+              "Loop ", frame_ids.back(),
+              " has more than one LoopCond node: ", node->name(), " and ",
+              loop_cond_[frame_ids.back()]->name());
+        }
+        loop_cond_[frame_ids.back()] = node;
+      }
+      if (IsEnter(*node) && node->attr().at("is_constant").b()) {
+        invariant_enters_[frame_ids.back()].push_back(
+            const_cast<NodeDef*>(node));
+      }
+    }
+  }
+
+  for (auto it = frame_children_.begin(); it != frame_children_.end(); ++it) {
+    if (it->second.size() == 0) {
+      worklist.push_back(it->first);
+    }
+  }
+
+  while (!worklist.empty()) {
+    int frame_id = worklist.front();
+    new_enter_id_ = 0;
+    worklist.pop_front();
+    auto parent_it = frame_parent_.find(frame_id);
+    if (parent_it != frame_parent_.end()) {
+      int parent_id = parent_it->second;
+      frame_children_[parent_id].erase(frame_id);
+      if (frame_children_[parent_id].size() == 0) {
+        worklist.push_back(parent_id);
+      }
+    }
+
+    if (invariant_enters_[frame_id].empty()) {
+      continue;
+    }
+    invariant_nodes_.clear();
+    for (auto* enter : invariant_enters_[frame_id]) {
+      TF_RETURN_IF_ERROR(FindInvariantNodes(enter));
+    }
+
+    // Revert invariant nodes that have control outputs to variant nodes.
+    TF_RETURN_IF_ERROR(RevertInvariantNodes());
+
+    TF_RETURN_IF_ERROR(MoveInvariantNodes(frame_id));
+  }
+  return Status::OK();
+}
 
 Status LoopOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
                                GraphDef* optimized_graph) {
-  *optimized_graph = item.graph;
+  TF_RETURN_IF_ERROR(RemoveStackOps(item.graph, optimized_graph));
+
+  optimized_graph_ = optimized_graph;
 
-  return Status::OK();
+  // Set up helper data structures.
+  node_map_.reset(new NodeMap(optimized_graph_));
+  int num_frames;
+  TF_RETURN_IF_ERROR(IdentifyFramesWithNodeMap(*optimized_graph_, *node_map_,
+                                               &frame_map_, &num_frames));
+
+  return LoopInvariantNodeMotion();
 }
 
 void LoopOptimizer::Feedback(Cluster* /*cluster*/, const GrapplerItem& /*item*/,
diff --git a/tensorflow/core/grappler/optimizers/loop_optimizer.h b/tensorflow/core/grappler/optimizers/loop_optimizer.h
index 106d4628ae68f3c92ab597f903f96a6af8a64b8d..b5944cd30bf53f0c512d8aa6ea8350af73c8d038 100644
--- a/tensorflow/core/grappler/optimizers/loop_optimizer.h
+++ b/tensorflow/core/grappler/optimizers/loop_optimizer.h
@@ -17,13 +17,17 @@ limitations under the License.
 #define TENSORFLOW_CORE_GRAPPLER_OPTIMIZERS_LOOP_OPTIMIZER_H_
 
 #include <unordered_set>
+#include "tensorflow/core/grappler/costs/graph_properties.h"
 #include "tensorflow/core/grappler/optimizers/graph_optimizer.h"
 #include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/grappler/utils/frame.h"
 #include "tensorflow/core/protobuf/rewriter_config.pb.h"
 
 namespace tensorflow {
 namespace grappler {
 
+constexpr char kLoopOptimizer[] = "LoopOptimizer";
+
 class LoopOptimizer : public GraphOptimizer {
  public:
   LoopOptimizer() : opt_level_(RewriterConfig::ON) {}
@@ -40,7 +44,29 @@ class LoopOptimizer : public GraphOptimizer {
                 const GraphDef& optimized_graph, double result) override;
 
  private:
+  Status LoopInvariantNodeMotion();
+  Status FindInvariantNodes(NodeDef* node);
+  Status RevertInvariantNodes();
+  Status MoveInvariantNodes(const int frame_id);
+  Status LINMHandleInvariantNode(NodeDef* node, const int num_outputs,
+      const int frame_id);
+  Status LINMHandleConst(NodeDef* node, const int num_outputs,
+      const int frame_id);
+  Status LINMHandleInvariantEnter(NodeDef* node, const int num_outputs);
+
+  std::map<NodeDef*, int> invariant_nodes_;
+  std::set<int> empty_set_;
+  std::map<int, std::set<int>> frame_children_;
+  std::map<int, int> frame_parent_;
+  std::map<int, const NodeDef*> loop_cond_;
+  std::map<int, std::vector<NodeDef*>> invariant_enters_;
+  int new_enter_id_;
   RewriterConfig::Toggle opt_level_;
+
+  std::unique_ptr<NodeMap> node_map_;
+  FrameMap frame_map_;
+  std::unique_ptr<GraphProperties> graph_properties_;
+  GraphDef* optimized_graph_;  // Not owned.
 };
 
 }  // end namespace grappler
diff --git a/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc b/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc
index c09434f60916b9bf269b0f5006b8a3732afaa5fc..1b1955db0effa1b93ac922d1fee687e146da85dd 100644
--- a/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/loop_optimizer_test.cc
@@ -26,7 +26,494 @@ namespace tensorflow {
 namespace grappler {
 namespace {
 
-class LoopOptimizerTest : public ::testing::Test {};
+class LoopOptimizerTest : public ::testing::Test {
+ protected:
+  static NodeDef CreateNode(const string& name,
+                            const std::vector<string>& inputs) {
+    return CreateNode(name, "Identity", "", false, 0, inputs);
+  }
+  static NodeDef CreateNode(const string& name, const string& op,
+                            const std::vector<string>& inputs) {
+    return CreateNode(name, op, "", false, 0, inputs);
+  }
+  static NodeDef CreateNode(const string& name, const string& op,
+                            const string& frame,
+                            const bool is_constant,
+                            const int piterations,
+                            const std::vector<string>& inputs) {
+    NodeDef node;
+    node.set_name(name);
+    if (!op.empty()) {
+      node.set_op(op);
+    }
+    if (!frame.empty()) {
+      AttrValue frame_name;
+      frame_name.set_s(frame);
+      node.mutable_attr()->insert({"frame_name", frame_name});
+    }
+    if (op == "Enter") {
+      AttrValue is_const;
+      is_const.set_b(is_constant);
+      node.mutable_attr()->insert({"is_constant", is_const});
+      AttrValue parallel_iterations;
+      parallel_iterations.set_i(piterations);
+      node.mutable_attr()->insert(
+          {"parallel_iterations", parallel_iterations});
+    }
+    AttrValue type;
+    type.set_type(DT_FLOAT);
+    node.mutable_attr()->insert({"T", type});
+    for (const string& input : inputs) {
+      node.add_input(input);
+    }
+    return node;
+  }
+};
+
+TEST_F(LoopOptimizerTest, Basic) {
+  GraphDef graph;
+  *graph.add_node() = CreateNode("0", {});
+  *graph.add_node() = CreateNode(
+      "InvariantEnter", "Enter", "while/while_context", true, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"});
+  *graph.add_node() = CreateNode(
+      "VariantAdd", "Add", {"InvariantAdd", "Identity"});
+  *graph.add_node() = CreateNode(
+      "VariantEnter", "Enter", "while/while_context", false, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "Merge", "Merge", {"VariantEnter", "NextIteration"});
+  *graph.add_node() = CreateNode("Less/y", "Const", {"^Identity"});
+  *graph.add_node() = CreateNode("Less", "Less", {"VariantAdd", "Less/y"});
+  *graph.add_node() = CreateNode("LoopCond", "LoopCond", {"Less"});
+  *graph.add_node() = CreateNode("Switch", "Switch", {"Merge", "LoopCond"});
+  *graph.add_node() = CreateNode("Identity", {"Switch:1"});
+  *graph.add_node() = CreateNode(
+      "NextIteration", "NextIteration", {"VariantAdd"});
+  *graph.add_node() = CreateNode("Exit", "Exit", {"Switch"});
+  *graph.add_node() = CreateNode("1", {"Exit"});
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).back(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd")).back(), 0);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd")).back(), 0);
+}
+
+TEST_F(LoopOptimizerTest, Const) {
+  GraphDef graph;
+  *graph.add_node() = CreateNode("0", {});
+  *graph.add_node() = CreateNode(
+      "InvariantEnter", "Enter", "while/while_context", true, 1, {"0"});
+  *graph.add_node() = CreateNode("Const", "Const", {"^Identity"});
+  *graph.add_node() = CreateNode(
+      "InvariantAdd", "Add", {"InvariantEnter", "Const"});
+  *graph.add_node() = CreateNode(
+      "VariantAdd", "Add", {"InvariantAdd", "Identity"});
+  *graph.add_node() = CreateNode(
+      "VariantEnter", "Enter", "while/while_context", false, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "Merge", "Merge", {"VariantEnter", "NextIteration"});
+  *graph.add_node() = CreateNode("Less/y", "Const", {"^Identity"});
+  *graph.add_node() = CreateNode("Less", "Less", {"VariantAdd", "Less/y"});
+  *graph.add_node() = CreateNode("LoopCond", "LoopCond", {"Less"});
+  *graph.add_node() = CreateNode("Switch", "Switch", {"Merge", "LoopCond"});
+  *graph.add_node() = CreateNode("Identity", {"Switch:1"});
+  *graph.add_node() = CreateNode(
+      "NextIteration", "NextIteration", {"VariantAdd"});
+  *graph.add_node() = CreateNode("Exit", "Exit", {"Switch"});
+  *graph.add_node() = CreateNode("1", {"Exit"});
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).back(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const")).back(), 0);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const")).size(), 0);
+}
+
+TEST_F(LoopOptimizerTest, ControlOutput) {
+  GraphDef graph;
+  *graph.add_node() = CreateNode("0", {});
+  *graph.add_node() = CreateNode(
+      "InvariantEnter", "Enter", "while/while_context", true, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"});
+  *graph.add_node() = CreateNode(
+      "VariantAdd", "Add", {"InvariantAdd", "Identity"});
+  *graph.add_node() = CreateNode(
+      "VariantEnter", "Enter", "while/while_context", false, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "Merge", "Merge", {"VariantEnter", "NextIteration"});
+  *graph.add_node() = CreateNode("Less/y", "Const", {"^Identity"});
+  *graph.add_node() = CreateNode(
+      "Less", "Less", {"VariantAdd", "Less/y", "^InvariantAdd"});
+  *graph.add_node() = CreateNode("LoopCond", "LoopCond", {"Less"});
+  *graph.add_node() = CreateNode("Switch", "Switch", {"Merge", "LoopCond"});
+  *graph.add_node() = CreateNode("Identity", {"Switch:1"});
+  *graph.add_node() = CreateNode(
+      "NextIteration", "NextIteration", {"VariantAdd"});
+  *graph.add_node() = CreateNode("Exit", "Exit", {"Switch"});
+  *graph.add_node() = CreateNode("1", {"Exit"});
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).back(), 0);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).back(), 0);
+}
+
+TEST_F(LoopOptimizerTest, NestedLoop1) {
+  GraphDef graph;
+  *graph.add_node() = CreateNode("0", {});
+  *graph.add_node() = CreateNode(
+      "InvariantEnter", "Enter", "while/while_context", true, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"});
+  *graph.add_node() = CreateNode(
+      "VariantAdd", "Add", {"InvariantAdd", "Identity"});
+  *graph.add_node() = CreateNode(
+      "VariantEnter", "Enter", "while/while_context", false, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "Merge", "Merge", {"VariantEnter", "NextIteration"});
+  *graph.add_node() = CreateNode("Less/y", "Const", {"^Identity"});
+  *graph.add_node() = CreateNode("Less", "Less", {"Exit2", "Less/y"});
+  *graph.add_node() = CreateNode("LoopCond", "LoopCond", {"Less"});
+  *graph.add_node() = CreateNode("Switch", "Switch", {"Merge", "LoopCond"});
+  *graph.add_node() = CreateNode("Identity", {"Switch:1"});
+  *graph.add_node() = CreateNode(
+      "NextIteration", "NextIteration", {"Exit2"});
+  *graph.add_node() = CreateNode("Exit", "Exit", {"Switch"});
+  *graph.add_node() = CreateNode("1", {"Exit"});
+
+  *graph.add_node() = CreateNode(
+      "InvariantEnter2", "Enter", "while/while/while_context", true, 1,
+      {"VariantAdd"});
+  *graph.add_node() = CreateNode(
+      "InvariantAdd2", "Add", {"InvariantEnter2", "InvariantEnter2"});
+  *graph.add_node() = CreateNode(
+      "VariantAdd2", "Add", {"InvariantAdd2", "Identity2"});
+  *graph.add_node() = CreateNode(
+      "VariantEnter2", "Enter", "while/while/while_context", false, 1,
+      {"VariantEnter"});
+  *graph.add_node() = CreateNode(
+      "Merge2", "Merge", {"VariantEnter2", "NextIteration2"});
+  *graph.add_node() = CreateNode("Less2/y", "Const", {"^Identity2"});
+  *graph.add_node() = CreateNode("Less2", "Less", {"VariantAdd2", "Less2/y"});
+  *graph.add_node() = CreateNode("LoopCond2", "LoopCond", {"Less2"});
+  *graph.add_node() = CreateNode("Switch2", "Switch", {"Merge2", "LoopCond2"});
+  *graph.add_node() = CreateNode("Identity2", {"Switch2:1"});
+  *graph.add_node() = CreateNode(
+      "NextIteration2", "NextIteration", {"VariantAdd2"});
+  *graph.add_node() = CreateNode("Exit2", "Exit", {"Switch2"});
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).back(), 0);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd")).size(), 0);
+}
+
+TEST_F(LoopOptimizerTest, NestedLoop2) {
+  GraphDef graph;
+  *graph.add_node() = CreateNode("0", {});
+  *graph.add_node() = CreateNode(
+      "InvariantEnter", "Enter", "while/while_context", true, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"});
+  *graph.add_node() = CreateNode(
+      "VariantAdd", "Add", {"InvariantAdd", "Identity"});
+  *graph.add_node() = CreateNode(
+      "VariantEnter", "Enter", "while/while_context", false, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "Merge", "Merge", {"VariantEnter", "NextIteration"});
+  *graph.add_node() = CreateNode("Less/y", "Const", {"^Identity"});
+  *graph.add_node() = CreateNode("Less", "Less", {"Exit2", "Less/y"});
+  *graph.add_node() = CreateNode("LoopCond", "LoopCond", {"Less"});
+  *graph.add_node() = CreateNode("Switch", "Switch", {"Merge", "LoopCond"});
+  *graph.add_node() = CreateNode("Identity", {"Switch:1"});
+  *graph.add_node() = CreateNode(
+      "NextIteration", "NextIteration", {"Exit2"});
+  *graph.add_node() = CreateNode("Exit", "Exit", {"Switch"});
+  *graph.add_node() = CreateNode("1", {"Exit"});
+
+  *graph.add_node() = CreateNode(
+      "InvariantEnter2", "Enter", "while/while/while_context", true, 1,
+      {"InvariantAdd"});
+  *graph.add_node() = CreateNode(
+      "InvariantAdd2", "Add", {"InvariantEnter2", "InvariantEnter2"});
+  *graph.add_node() = CreateNode(
+      "VariantAdd2", "Add", {"InvariantAdd2", "Identity2"});
+  *graph.add_node() = CreateNode(
+      "VariantEnter2", "Enter", "while/while/while_context", false, 1,
+      {"VariantEnter"});
+  *graph.add_node() = CreateNode(
+      "Merge2", "Merge", {"VariantEnter2", "NextIteration2"});
+  *graph.add_node() = CreateNode("Less2/y", "Const", {"^Identity2"});
+  *graph.add_node() = CreateNode("Less2", "Less", {"VariantAdd2", "Less2/y"});
+  *graph.add_node() = CreateNode("LoopCond2", "LoopCond", {"Less2"});
+  *graph.add_node() = CreateNode("Switch2", "Switch", {"Merge2", "LoopCond2"});
+  *graph.add_node() = CreateNode("Identity2", {"Switch2:1"});
+  *graph.add_node() = CreateNode(
+      "NextIteration2", "NextIteration", {"VariantAdd2"});
+  *graph.add_node() = CreateNode("Exit2", "Exit", {"Switch2"});
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).back(), 1);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("VariantAdd2")).back(), 1);
+}
+
+TEST_F(LoopOptimizerTest, NestedLoopConst1) {
+  GraphDef graph;
+  *graph.add_node() = CreateNode("0", {});
+  *graph.add_node() = CreateNode(
+      "InvariantEnter", "Enter", "while/while_context", true, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"});
+  *graph.add_node() = CreateNode(
+      "VariantAdd", "Add", {"InvariantAdd", "Identity"});
+  *graph.add_node() = CreateNode(
+      "VariantEnter", "Enter", "while/while_context", false, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "Merge", "Merge", {"VariantEnter", "NextIteration"});
+  *graph.add_node() = CreateNode("Less/y", "Const", {"^Identity"});
+  *graph.add_node() = CreateNode("Less", "Less", {"Exit2", "Less/y"});
+  *graph.add_node() = CreateNode("LoopCond", "LoopCond", {"Less"});
+  *graph.add_node() = CreateNode("Switch", "Switch", {"Merge", "LoopCond"});
+  *graph.add_node() = CreateNode("Identity", {"Switch:1"});
+  *graph.add_node() = CreateNode(
+      "NextIteration", "NextIteration", {"Exit2"});
+  *graph.add_node() = CreateNode("Exit", "Exit", {"Switch"});
+  *graph.add_node() = CreateNode("1", {"Exit"});
+
+  *graph.add_node() = CreateNode(
+      "InvariantEnter2", "Enter", "while/while/while_context", true, 1,
+      {"VariantAdd"});
+  *graph.add_node() = CreateNode("Const2", "Const", {"^Identity2"});
+  *graph.add_node() = CreateNode(
+      "InvariantAdd2", "Add", {"InvariantEnter2", "Const2"});
+  *graph.add_node() = CreateNode(
+      "VariantAdd2", "Add", {"InvariantAdd2", "Identity2"});
+  *graph.add_node() = CreateNode(
+      "VariantEnter2", "Enter", "while/while/while_context", false, 1,
+      {"VariantEnter"});
+  *graph.add_node() = CreateNode(
+      "Merge2", "Merge", {"VariantEnter2", "NextIteration2"});
+  *graph.add_node() = CreateNode("Less2/y", "Const", {"^Identity2"});
+  *graph.add_node() = CreateNode("Less2", "Less", {"VariantAdd2", "Less2/y"});
+  *graph.add_node() = CreateNode("LoopCond2", "LoopCond", {"Less2"});
+  *graph.add_node() = CreateNode("Switch2", "Switch", {"Merge2", "LoopCond2"});
+  *graph.add_node() = CreateNode("Identity2", {"Switch2:1"});
+  *graph.add_node() = CreateNode(
+      "NextIteration2", "NextIteration", {"VariantAdd2"});
+  *graph.add_node() = CreateNode("Exit2", "Exit", {"Switch2"});
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).back(), 1);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).size(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).back(), 0);
+}
+
+TEST_F(LoopOptimizerTest, NestedLoopConst2) {
+  GraphDef graph;
+  *graph.add_node() = CreateNode("0", {});
+  *graph.add_node() = CreateNode(
+      "InvariantEnter", "Enter", "while/while_context", true, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "InvariantAdd", "Add", {"InvariantEnter", "InvariantEnter"});
+  *graph.add_node() = CreateNode(
+      "VariantAdd", "Add", {"InvariantAdd", "Identity"});
+  *graph.add_node() = CreateNode(
+      "VariantEnter", "Enter", "while/while_context", false, 1, {"0"});
+  *graph.add_node() = CreateNode(
+      "Merge", "Merge", {"VariantEnter", "NextIteration"});
+  *graph.add_node() = CreateNode("Less/y", "Const", {"^Identity"});
+  *graph.add_node() = CreateNode("Less", "Less", {"Exit2", "Less/y"});
+  *graph.add_node() = CreateNode("LoopCond", "LoopCond", {"Less"});
+  *graph.add_node() = CreateNode("Switch", "Switch", {"Merge", "LoopCond"});
+  *graph.add_node() = CreateNode("Identity", {"Switch:1"});
+  *graph.add_node() = CreateNode(
+      "NextIteration", "NextIteration", {"Exit2"});
+  *graph.add_node() = CreateNode("Exit", "Exit", {"Switch"});
+  *graph.add_node() = CreateNode("1", {"Exit"});
+
+  *graph.add_node() = CreateNode(
+      "InvariantEnter2", "Enter", "while/while/while_context", true, 1,
+      {"InvariantAdd"});
+  *graph.add_node() = CreateNode("Const2", "Const", {"^Identity2"});
+  *graph.add_node() = CreateNode(
+      "InvariantAdd2", "Add", {"InvariantEnter2", "Const2"});
+  *graph.add_node() = CreateNode(
+      "VariantAdd2", "Add", {"InvariantAdd2", "Identity2"});
+  *graph.add_node() = CreateNode(
+      "VariantEnter2", "Enter", "while/while/while_context", false, 1,
+      {"VariantEnter"});
+  *graph.add_node() = CreateNode(
+      "Merge2", "Merge", {"VariantEnter2", "NextIteration2"});
+  *graph.add_node() = CreateNode("Less2/y", "Const", {"^Identity2"});
+  *graph.add_node() = CreateNode("Less2", "Less", {"VariantAdd2", "Less2/y"});
+  *graph.add_node() = CreateNode("LoopCond2", "LoopCond", {"Less2"});
+  *graph.add_node() = CreateNode("Switch2", "Switch", {"Merge2", "LoopCond2"});
+  *graph.add_node() = CreateNode("Identity2", {"Switch2:1"});
+  *graph.add_node() = CreateNode(
+      "NextIteration2", "NextIteration", {"VariantAdd2"});
+  *graph.add_node() = CreateNode("Exit2", "Exit", {"Switch2"});
+
+  GrapplerItem item;
+  item.graph = graph;
+
+  LoopOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+
+  std::unique_ptr<NodeMap> node_map;
+  std::unordered_map<const NodeDef*, std::vector<int>> frames;
+  int num_frames;
+
+  node_map.reset(new NodeMap(&graph));
+  EXPECT_TRUE(IdentifyFrames(graph, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).back(), 1);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).size(), 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).back(), 1);
+
+  node_map.reset(new NodeMap(&output));
+  EXPECT_TRUE(IdentifyFrames(output, &frames, &num_frames).ok());
+  EXPECT_EQ(num_frames, 2);
+  EXPECT_EQ(frames.at(node_map->GetNode("InvariantAdd2")).size(), 0);
+  EXPECT_EQ(frames.at(node_map->GetNode("Const2")).size(), 0);
+}
 
 void VerifyGraphsEqual(const GraphDef& original_graph,
                        const GraphDef& optimized_graph, const string& func) {
@@ -57,6 +544,93 @@ TEST_F(LoopOptimizerTest, NoOp) {
   VerifyGraphsEqual(item.graph, output, __FUNCTION__);
 }
 
+namespace {
+NodeDef* AddNode(const string& name, const string& op,
+                 const std::vector<string>& inputs,
+                 const std::vector<std::pair<string, AttrValue>>& attributes,
+                 GraphDef* graph) {
+  NodeDef* node = graph->add_node();
+  node->set_name(name);
+  node->set_op(op);
+  for (const string& input : inputs) {
+    node->add_input(input);
+  }
+  for (const auto& attr : attributes) {
+    (*node->mutable_attr())[attr.first] = attr.second;
+  }
+  return node;
+}
+}  // namespace
+
+TEST_F(LoopOptimizerTest, RemovePush_NoOp) {
+  GrapplerItem item;
+  AttrValue frame_name;
+  frame_name.set_s("foo");
+  AttrValue type;
+  type.set_type(DT_RESOURCE);
+  GraphDef& graph = item.graph;
+  AddNode("c", "Const", {}, {}, &graph);
+  // Stack with corresponding push/pop.
+  AddNode("stack1", "StackV2", {}, {}, &graph);
+  AddNode("push1", "StackPushV2", {"stack1", "c"}, {}, &graph);
+  AddNode("pop1", "StackPopV2", {"stack1"}, {}, &graph);
+  // Stack with corresponding push/pop behind Enter.
+  AddNode("stack2", "StackV2", {}, {}, &graph);
+  AddNode("push_enter", "Enter", {"stack2"},
+          {{"T", type}, {"frame_name", frame_name}}, &graph);
+  AddNode("push2", "StackPushV2", {"push_enter", "c"}, {}, &graph);
+  AddNode("pop_enter", "Enter", {"stack2"},
+          {{"T", type}, {"frame_name", frame_name}}, &graph);
+  AddNode("pop2", "StackPopV2", {"pop_enter"}, {}, &graph);
+  // Stack with unexpected op type in fanout of Stack.
+  AddNode("stack3", "StackV2", {}, {}, &graph);
+  AddNode("push3", "StackPushV2", {"stack3", "c"}, {}, &graph);
+  AddNode("stop", "StopGradient", {"stack3"}, {}, &graph);
+  LoopOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+  VerifyGraphsEqual(item.graph, output, __FUNCTION__);
+}
+
+TEST_F(LoopOptimizerTest, RemovePushWithoutMatchingPop) {
+  GrapplerItem item;
+  GraphDef& graph = item.graph;
+  AttrValue frame_name;
+  frame_name.set_s("foo");
+  AttrValue type;
+  type.set_type(DT_RESOURCE);
+  AddNode("c", "Const", {}, {}, &graph);
+  AddNode("stack1", "StackV2", {}, {}, &graph);
+  AddNode("push1", "StackPushV2", {"stack1", "c"}, {}, &graph);
+  AddNode("stack2", "StackV2", {}, {}, &graph);
+  AddNode("push_enter", "Enter", {"stack2"},
+          {{"T", type}, {"frame_name", frame_name}}, &graph);
+  AddNode("push2", "StackPushV2", {"push_enter", "c"}, {}, &graph);
+  LoopOptimizer optimizer;
+  GraphDef output;
+  Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+  EXPECT_EQ(6, output.node_size());
+  for (int i = 0; i < output.node_size(); ++i) {
+    const NodeDef& node = output.node(i);
+    if (node.name() == "push1") {
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("c", node.input(0));
+      EXPECT_EQ("^stack1", node.input(1));
+    } else if (node.name() == "push2") {
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("c", node.input(0));
+      EXPECT_EQ("^push_enter", node.input(1));
+    } else {
+      const NodeDef& orig_node = item.graph.node(i);
+      EXPECT_EQ(orig_node.ShortDebugString(), node.ShortDebugString());
+    }
+  }
+}
+
 }  // namespace
 }  // namespace grappler
 }  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/memory_optimizer.cc b/tensorflow/core/grappler/optimizers/memory_optimizer.cc
index 3057ee5fa14bd209ad4bb6a9ad690d57435601f4..27e9d2c78d0456e61d31f7f772172fb8d17a11ac 100644
--- a/tensorflow/core/grappler/optimizers/memory_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/memory_optimizer.cc
@@ -413,7 +413,7 @@ void RecomputeSubgraph(
 }
 
 void RecomputationRewritingPass(RewriterConfig::MemOptType optimization_level,
-                                const string& recomputation_targets_name_prefix,
+                                const string& recomputation_targets_name_scope,
                                 GraphDef* graph, const GrapplerItem& item) {
   if (optimization_level != RewriterConfig::RECOMPUTATION_HEURISTICS &&
       optimization_level != RewriterConfig::HEURISTICS &&
@@ -438,15 +438,14 @@ void RecomputationRewritingPass(RewriterConfig::MemOptType optimization_level,
     feeds.insert(NodeName(feed.first));
   }
   std::function<bool(const NodeDef&)> is_target =
-      [&recomputation_targets_name_prefix](const NodeDef& node) {
-        // Nodes whose inputs we may want to recompute. Typically targets will
-        // be gradients (recomputation_targets_name_prefix="gradients/"),
-        // although the prefix is configurable since gradients may be created
-        // in a name scope.
-        // TODO(allenl): Use a static schedule
-        // (grappler::EstimateEarliestExecutionTimes) to recompute only nodes
-        // whose outputs will sit around for a while.
-        return node.name().find(recomputation_targets_name_prefix) == 0;
+      [&recomputation_targets_name_scope](const NodeDef& node) {
+        // Nodes whose inputs we may want to recompute. This matches node names
+        // that contain recomputation_targets_name_scope as a name scope,
+        // meaning the name either begins with or contains the name scope.
+        // Defaults to "gradients/", which matches any node name that begins
+        // with "gradients/" or contains "/gradients/".
+        return node.name().find(recomputation_targets_name_scope) == 0 ||
+               node.name().find("/" + recomputation_targets_name_scope) != -1;
       };
 
   if (optimization_level == RewriterConfig::RECOMPUTATION_HEURISTICS ||
@@ -720,18 +719,19 @@ Status BuildSwapPair(NodeDef* node, int input_to_swap,
   // Force the tensor to be copied to cpu.
   NodeDef* swap_out_node = graph->add_node();
   swap_out_node->set_name(swap_out_name);
-  swap_out_node->set_op("Identity");
-  swap_out_node->set_device("/device:CPU:0");
+  swap_out_node->set_op("_CopyFromGpuToHost");
 
   // Force the tensor to be restored to the device.
   NodeDef* swap_in_node = graph->add_node();
   swap_in_node->set_name(swap_in_name);
-  swap_in_node->set_op("Identity");
+  swap_in_node->set_op("_CopyFromHostToGpu");
   *swap_in_node->add_input() = swap_out_node->name();
 
-  // Colocate the swap_in_ node with the node itself.
+  // Colocate the swap_out_ and swap_in_ nodes with the node itself.
+  swap_out_node->set_device(node->device());
   swap_in_node->set_device(node->device());
   string coloc_group = strings::StrCat("loc@", tensor_to_swap);
+  (*swap_out_node->mutable_attr())["_class"].mutable_list()->add_s(coloc_group);
   (*swap_in_node->mutable_attr())["_class"].mutable_list()->add_s(coloc_group);
   (*node->mutable_attr())["_class"].mutable_list()->add_s(coloc_group);
 
@@ -1104,7 +1104,8 @@ bool SwappingPass(RewriterConfig::MemOptType optimization_level,
                   Cluster* cluster, GrapplerItem* item,
                   std::unordered_set<string>* skip_list) {
   std::unordered_map<NodeDef*, SwapInfo> nodes_to_swap;
-  if (optimization_level == RewriterConfig::SWAPPING_HEURISTICS ||
+  if (optimization_level == RewriterConfig::DEFAULT_MEM_OPT ||
+      optimization_level == RewriterConfig::SWAPPING_HEURISTICS ||
       optimization_level == RewriterConfig::HEURISTICS) {
     // Use heuristics to figure out what needs to be swapped;
     IdentifySwappingCandidates(cluster, item, skip_list, &nodes_to_swap);
@@ -1223,8 +1224,8 @@ Status MemoryOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
   *optimized_graph = item.graph;
 
   RecomputationRewritingPass(optimization_level_,
-                             recomputation_targets_name_prefix_,
-                             optimized_graph, item);
+                             recomputation_targets_name_scope_, optimized_graph,
+                             item);
 
   GrapplerItem optimized_item(item, std::move(*optimized_graph));
   std::unordered_set<string> skip_list;
@@ -1240,7 +1241,8 @@ Status MemoryOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
       updated_graph |= SchedulingPass(cluster, &optimized_item);
     }
 
-    if ((optimization_level_ == RewriterConfig::SWAPPING_HEURISTICS ||
+    if ((optimization_level_ == RewriterConfig::DEFAULT_MEM_OPT ||
+         optimization_level_ == RewriterConfig::SWAPPING_HEURISTICS ||
          optimization_level_ == RewriterConfig::HEURISTICS ||
          optimization_level_ == RewriterConfig::MANUAL) &&
         cluster != nullptr) {
diff --git a/tensorflow/core/grappler/optimizers/memory_optimizer.h b/tensorflow/core/grappler/optimizers/memory_optimizer.h
index c3dd0c45c6c524ef850ce7cfb9f6543d22e783ec..5c555a26746b759500f3d778ce137d6d9bedb67b 100644
--- a/tensorflow/core/grappler/optimizers/memory_optimizer.h
+++ b/tensorflow/core/grappler/optimizers/memory_optimizer.h
@@ -27,14 +27,14 @@ class MemoryOptimizer : public GraphOptimizer {
  public:
   // optimization_level: Controls the level of autonomy for the memory
   //   optimizer. See RewriterConfig::memory_optimization.
-  // recomputation_targets_name_prefix: Name prefix for potential outputs of
+  // recomputation_targets_name_scope: Name scope for potential outputs of
   //   recomputations. See
-  //   RewriterConfig::memory_optimizer_target_node_name_prefix.
+  //   RewriterConfig::memory_optimizer_target_node_name_scope.
   explicit MemoryOptimizer(
       RewriterConfig::MemOptType optimization_level,
-      const string& recomputation_targets_name_prefix = "gradients/")
+      const string& recomputation_targets_name_scope = "gradients/")
       : optimization_level_(optimization_level),
-        recomputation_targets_name_prefix_(recomputation_targets_name_prefix) {}
+        recomputation_targets_name_scope_(recomputation_targets_name_scope) {}
   ~MemoryOptimizer() override {}
 
   string name() const override { return "memory_optimizer"; };
@@ -47,7 +47,7 @@ class MemoryOptimizer : public GraphOptimizer {
 
  private:
   RewriterConfig::MemOptType optimization_level_;
-  string recomputation_targets_name_prefix_;
+  string recomputation_targets_name_scope_;
 };
 
 }  // end namespace grappler
diff --git a/tensorflow/core/grappler/optimizers/memory_optimizer_test.cc b/tensorflow/core/grappler/optimizers/memory_optimizer_test.cc
index 5d7913e0c018ecf14cc09ab91d3a71125c720aa5..9595936e9e6158045a13ebede95d63b9291ca434 100644
--- a/tensorflow/core/grappler/optimizers/memory_optimizer_test.cc
+++ b/tensorflow/core/grappler/optimizers/memory_optimizer_test.cc
@@ -221,16 +221,20 @@ TEST_F(MemoryOptimizerTest, SimpleSwapping) {
   // Build a simple graph with an op that's marked for swapping.
   tensorflow::Scope s = tensorflow::Scope::NewRootScope();
 
-  Output a = ops::Variable(s.WithOpName("a"), {10, 10}, DT_FLOAT);
-  Output b = ops::AddN(s.WithOpName("b"), {a});
-  Output c = ops::AddN(s.WithOpName("c"), {b});
-  Output d = ops::AddN(s.WithOpName("d"), {c});
-  Output e = ops::AddN(s.WithOpName("e"), {b, d});
+  Output a =
+      ops::Variable(s.WithOpName("a").WithDevice("/gpu:0"), {10, 10}, DT_FLOAT);
+  Output b = ops::AddN(s.WithOpName("b").WithDevice("/gpu:0"), {a});
+  Output c = ops::AddN(s.WithOpName("c").WithDevice("/gpu:0"), {b});
+  Output d = ops::AddN(s.WithOpName("d").WithDevice("/gpu:0"), {c});
+  Output e = ops::AddN(s.WithOpName("e").WithDevice("/gpu:0"), {b, d});
+
+  Output constant = ops::Const(s.WithOpName("constant"), 0.0f, {10, 10});
+  Output init = ops::Assign(s.WithOpName("init"), a, constant);
 
   GrapplerItem item;
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
 
-  EXPECT_EQ(5, item.graph.node_size());
+  EXPECT_EQ(7, item.graph.node_size());
   EXPECT_EQ(NodeName(e.name()), item.graph.node(4).name());
   AttrValue& val =
       (*item.graph.mutable_node(4)->mutable_attr())["_swap_to_host"];
@@ -243,32 +247,43 @@ TEST_F(MemoryOptimizerTest, SimpleSwapping) {
   Status status = optimizer.Optimize(cluster.get(), item, &output);
   TF_EXPECT_OK(status);
 
-  EXPECT_EQ(7, output.node_size());
-  const NodeDef& new_e = output.node(4);
+  EXPECT_EQ(9, output.node_size());
+  const NodeDef& new_e = output.node(6);
   EXPECT_EQ(NodeName(e.name()), new_e.name());
 
   EXPECT_EQ(2, new_e.input_size());
   EXPECT_EQ(NodeName(d.name()), new_e.input(1));
   EXPECT_EQ("swap_in_e_0", new_e.input(0));
 
-  const NodeDef& swap_out = output.node(5);
+  const NodeDef& swap_out = output.node(7);
   EXPECT_EQ("swap_out_e_0", swap_out.name());
+  EXPECT_EQ("_CopyFromGpuToHost", swap_out.op());
 
-  const NodeDef& swap_in = output.node(6);
+  const NodeDef& swap_in = output.node(8);
   EXPECT_EQ("swap_in_e_0", swap_in.name());
+  EXPECT_EQ("_CopyFromHostToGpu", swap_in.op());
 
   EXPECT_EQ(NodeName(b.name()), swap_out.input(0));
   EXPECT_EQ(NodeName(swap_out.name()), swap_in.input(0));
   EXPECT_EQ("^c", swap_in.input(1));
 
-  const NodeDef& new_c = output.node(2);
+  const NodeDef& new_c = output.node(4);
   EXPECT_EQ(NodeName(c.name()), new_c.name());
   EXPECT_EQ("^swap_out_e_0", new_c.input(1));
 
   // Run the optimizer a second time to ensure it's idempotent.
-  item.graph.Swap(&output);
-  status = optimizer.Optimize(cluster.get(), item, &output);
+  GrapplerItem item_copy(item, std::move(output));
+  status = optimizer.Optimize(cluster.get(), item_copy, &output);
   TF_EXPECT_OK(status);
+
+#if GOOGLE_CUDA
+  item.fetch = {"e"};
+  item.init_ops = {init.name()};
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+#endif
 }
 
 TEST_F(MemoryOptimizerTest, SwappingHeuristics) {
@@ -287,9 +302,13 @@ TEST_F(MemoryOptimizerTest, SwappingHeuristics) {
   Output h = ops::Exp(s.WithOpName("h").WithDevice("/gpu:0"), c);
   Output i = ops::Log(s.WithOpName("i").WithDevice("/gpu:0"), d);
 
+  Output constant = ops::Const(s.WithOpName("constant"), 0.0f, {128, 128, 8});
+  Output init = ops::Assign(s.WithOpName("init"), v, constant);
+
   GrapplerItem item;
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
   item.fetch = {"e", "f", "g", "h", "i"};
+  item.init_ops = {init.name()};
 
   std::unique_ptr<VirtualCluster> cluster(CreateVirtualCluster());
 
@@ -308,6 +327,15 @@ TEST_F(MemoryOptimizerTest, SwappingHeuristics) {
       EXPECT_EQ("axis", node.input(4));
     }
   }
+
+#if GOOGLE_CUDA
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  for (size_t idx = 0; idx < item.fetch.size(); ++idx) {
+    test::ExpectTensorEqual<float>(tensors_expected[idx], tensors[idx]);
+  }
+#endif
 }
 
 TEST_F(MemoryOptimizerTest, UnswappableInputs) {
@@ -325,9 +353,13 @@ TEST_F(MemoryOptimizerTest, UnswappableInputs) {
   Output e =
       ops::Concat(s.WithOpName("e").WithDevice("/gpu:0"), {b, c, d}, axis);
 
+  Output constant = ops::Const(s.WithOpName("constant"), 0.0f, {128, 128, 8});
+  Output init = ops::Assign(s.WithOpName("init"), v, constant);
+
   GrapplerItem item;
   TF_CHECK_OK(s.ToGraphDef(&item.graph));
   item.fetch = {"e"};
+  item.init_ops = {init.name()};
 
   std::unique_ptr<VirtualCluster> cluster(CreateVirtualCluster());
 
@@ -344,6 +376,13 @@ TEST_F(MemoryOptimizerTest, UnswappableInputs) {
       EXPECT_EQ("^swap_out_d_2", node.input(4));
     }
   }
+
+#if GOOGLE_CUDA
+  auto tensors_expected = EvaluateFetchNodes(item);
+  GrapplerItem optimized(item, std::move(output));
+  auto tensors = EvaluateFetchNodes(optimized);
+  test::ExpectTensorEqual<float>(tensors_expected[0], tensors[0]);
+#endif
 }
 
 TEST_F(MemoryOptimizerTest, AccumulationRewrites) {
diff --git a/tensorflow/core/grappler/optimizers/meta_optimizer.cc b/tensorflow/core/grappler/optimizers/meta_optimizer.cc
index e27b9df6206c652e4503bb064366201a2b90f13a..6fa8c035485d7d8208e95041afbe0dfa689e60cf 100644
--- a/tensorflow/core/grappler/optimizers/meta_optimizer.cc
+++ b/tensorflow/core/grappler/optimizers/meta_optimizer.cc
@@ -19,7 +19,9 @@ limitations under the License.
 #include "tensorflow/core/grappler/optimizers/arithmetic_optimizer.h"
 #include "tensorflow/core/grappler/optimizers/auto_parallel.h"
 #include "tensorflow/core/grappler/optimizers/constant_folding.h"
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
 #include "tensorflow/core/grappler/optimizers/dependency_optimizer.h"
+#include "tensorflow/core/grappler/optimizers/function_optimizer.h"
 #include "tensorflow/core/grappler/optimizers/graph_optimizer.h"
 #include "tensorflow/core/grappler/optimizers/layout_optimizer.h"
 #include "tensorflow/core/grappler/optimizers/loop_optimizer.h"
@@ -55,6 +57,9 @@ std::unique_ptr<GraphOptimizer> MetaOptimizer::NewOptimizer(
   if (optimizer == "pruning") {
     graph_optimizer.reset(new ModelPruner());
   }
+  if (optimizer == "function") {
+    graph_optimizer.reset(new FunctionOptimizer());
+  }
   if (optimizer == "constfold") {
     graph_optimizer.reset(new ConstantFolding(cpu_device_));
   }
@@ -72,13 +77,13 @@ std::unique_ptr<GraphOptimizer> MetaOptimizer::NewOptimizer(
     graph_optimizer.reset(
         new AutoParallel(cfg_.auto_parallel().num_replicas()));
   }
+  if (optimizer == "loop") {
+    graph_optimizer.reset(new LoopOptimizer(cfg_.loop_optimization()));
+  }
   if (optimizer == "dependency") {
     graph_optimizer.reset(
         new DependencyOptimizer(cfg_.dependency_optimization()));
   }
-  if (optimizer == "loop") {
-    graph_optimizer.reset(new LoopOptimizer(cfg_.loop_optimization()));
-  }
   return graph_optimizer;
 }
 
@@ -89,6 +94,10 @@ Status MetaOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
     if (!cfg_.disable_model_pruning()) {
       optimizers.push_back(std::unique_ptr<GraphOptimizer>(new ModelPruner()));
     }
+    if (cfg_.function_optimization() == RewriterConfig::ON) {
+      optimizers.push_back(
+          std::unique_ptr<GraphOptimizer>(new FunctionOptimizer()));
+    }
     if (cfg_.constant_folding() != RewriterConfig::OFF) {
       optimizers.push_back(std::unique_ptr<GraphOptimizer>(
           new ConstantFolding(cfg_.constant_folding(), cpu_device_)));
@@ -97,20 +106,20 @@ Status MetaOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
       optimizers.push_back(std::unique_ptr<GraphOptimizer>(
           new ArithmeticOptimizer(cfg_.arithmetic_optimization())));
     }
-    if (cfg_.dependency_optimization() != RewriterConfig::OFF) {
+    if (cfg_.loop_optimization() == RewriterConfig::ON) {
       optimizers.push_back(std::unique_ptr<GraphOptimizer>(
-          new DependencyOptimizer(cfg_.dependency_optimization())));
+          new LoopOptimizer(cfg_.loop_optimization())));
     }
-    if (cfg_.loop_optimization() != RewriterConfig::OFF) {
+    if (cfg_.dependency_optimization() != RewriterConfig::OFF) {
       optimizers.push_back(std::unique_ptr<GraphOptimizer>(
-          new LoopOptimizer(cfg_.loop_optimization())));
+          new DependencyOptimizer(cfg_.dependency_optimization())));
     }
     if (cfg_.layout_optimizer() != RewriterConfig::OFF) {
       optimizers.push_back(
           std::unique_ptr<GraphOptimizer>(new LayoutOptimizer()));
     }
     if (cfg_.memory_optimization() != RewriterConfig::NO_MEM_OPT) {
-      if (cfg_.memory_optimizer_target_node_name_prefix().empty()) {
+      if (cfg_.memory_optimizer_target_node_name_scope().empty()) {
         optimizers.push_back(std::unique_ptr<GraphOptimizer>(
             // Use the default target node name prefix "gradients/"
             new MemoryOptimizer(cfg_.memory_optimization())));
@@ -118,7 +127,7 @@ Status MetaOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
         optimizers.push_back(
             std::unique_ptr<GraphOptimizer>(new MemoryOptimizer(
                 cfg_.memory_optimization(),
-                cfg_.memory_optimizer_target_node_name_prefix())));
+                cfg_.memory_optimizer_target_node_name_scope())));
       }
     }
     if (cfg_.auto_parallel().enable()) {
@@ -126,14 +135,26 @@ Status MetaOptimizer::Optimize(Cluster* cluster, const GrapplerItem& item,
           new AutoParallel(cfg_.auto_parallel().num_replicas())));
     }
   } else {
-    std::set<string> available_optimizers = {
-        "pruning",      "constfold",  "layout",     "memory",
-        "autoparallel", "arithmetic", "dependency", "loop"};
-    for (const auto& optimizer : cfg_.optimizers()) {
-      if (available_optimizers.find(optimizer) != available_optimizers.end()) {
-        optimizers.push_back(NewOptimizer(optimizer));
+    const std::set<string> available_optimizers = {
+        "pruning",      "function",   "constfold", "layout",    "memory",
+        "autoparallel", "arithmetic", "loop",      "dependency"};
+    std::vector<string> custom_optimizer_names;
+    for (const auto& optimizer_name : cfg_.optimizers()) {
+      if (available_optimizers.find(optimizer_name) !=
+          available_optimizers.end()) {
+        optimizers.push_back(NewOptimizer(optimizer_name));
+      } else {
+        custom_optimizer_names.push_back(optimizer_name);
       }
     }
+    // Now create and initialize the custom optimizers.
+    for (const auto& optimizer_name : custom_optimizer_names) {
+      std::unique_ptr<CustomGraphOptimizer> opt =
+          CustomGraphOptimizerRegistry::CreateByNameOrNull(optimizer_name);
+      if (opt == nullptr) continue;
+      TF_RETURN_IF_ERROR(opt->Init());
+      optimizers.push_back(std::move(opt));
+    }
   }
 
   if (optimizers.empty()) {
@@ -210,10 +231,11 @@ void MetaOptimizer::Feedback(Cluster* cluster, const GrapplerItem& item,
 bool MetaOptimizerEnabled(const RewriterConfig& cfg) {
   return !cfg.disable_model_pruning() ||
          cfg.layout_optimizer() != RewriterConfig::OFF ||
+         cfg.function_optimization() == RewriterConfig::ON ||
          cfg.constant_folding() != RewriterConfig::OFF ||
-         cfg.dependency_optimization() != RewriterConfig::OFF ||
-         cfg.loop_optimization() == RewriterConfig::ON ||
          cfg.arithmetic_optimization() != RewriterConfig::OFF ||
+         cfg.loop_optimization() == RewriterConfig::ON ||
+         cfg.dependency_optimization() != RewriterConfig::OFF ||
          cfg.auto_parallel().enable() ||
          cfg.memory_optimization() != RewriterConfig::NO_MEM_OPT ||
          !cfg.optimizers().empty();
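
Review note on the explicit-optimizer branch of `MetaOptimizer::Optimize`: names in `cfg_.optimizers()` that match a built-in optimizer are instantiated first, in the order given. The remaining names are then looked up in the custom registry, and names the registry does not know are skipped without an error. The sketch below models only that ordering with plain strings. `ResolveOrder` and its parameters are hypothetical names chosen for illustration; they are not part of TensorFlow.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of the selection logic: returns the names of the
// optimizers in the order in which they would be appended to the pipeline.
std::vector<std::string> ResolveOrder(
    const std::vector<std::string>& requested,
    const std::set<std::string>& builtin,
    const std::set<std::string>& registered) {
  std::vector<std::string> order;
  std::vector<std::string> custom;
  for (const auto& name : requested) {
    if (builtin.count(name) > 0) {
      order.push_back(name);   // Built-ins keep their requested order.
    } else {
      custom.push_back(name);  // Deferred until all built-ins are placed.
    }
  }
  for (const auto& name : custom) {
    // Names missing from the registry are dropped silently.
    if (registered.count(name) == 0) continue;
    order.push_back(name);
  }
  return order;
}
```

One consequence worth keeping in mind: a custom optimizer listed before a built-in one still ends up after it, and a misspelled name produces no diagnostic under this scheme.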
diff --git a/tensorflow/core/grappler/optimizers/meta_optimizer_test.cc b/tensorflow/core/grappler/optimizers/meta_optimizer_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..536347d8348738e1755e920f3f08c2d4858cb256
--- /dev/null
+++ b/tensorflow/core/grappler/optimizers/meta_optimizer_test.cc
@@ -0,0 +1,77 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/optimizers/meta_optimizer.h"
+
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+#include "tensorflow/core/grappler/inputs/trivial_test_graph_input_yielder.h"
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer.h"
+#include "tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.h"
+#include "tensorflow/core/grappler/utils.h"
+#include "tensorflow/core/lib/core/status_test_util.h"
+#include "tensorflow/core/platform/test.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+class TestOptimizer : public CustomGraphOptimizer {
+ public:
+  static void SetOptimized(const bool flag_value) { optimized_ = flag_value; }
+  static bool IsOptimized() { return optimized_; }
+
+  TestOptimizer() {}
+  string name() const override { return "test_optimizer"; }
+
+  Status Init() override { return Status::OK(); }
+
+  Status Optimize(Cluster* cluster, const GrapplerItem& item,
+                  GraphDef* optimized_graph) override {
+    optimized_ = true;
+    *optimized_graph = item.graph;
+    return Status::OK();
+  }
+
+  void Feedback(Cluster* cluster, const GrapplerItem& item,
+                const GraphDef& optimized_graph, double result) override {}
+
+ private:
+  static bool optimized_;
+};
+
+bool TestOptimizer::optimized_;
+
+REGISTER_GRAPH_OPTIMIZER(TestOptimizer);
+
+TEST(MetaOptimizerTest, RunsCustomOptimizer) {
+  TrivialTestGraphInputYielder fake_input(4, 1, 10, false, {"CPU:0"});
+  GrapplerItem item;
+  CHECK(fake_input.NextItem(&item));
+
+  TestOptimizer::SetOptimized(false);
+  RewriterConfig rewriter_config;
+  rewriter_config.add_optimizers("TestOptimizer");
+
+  MetaOptimizer optimizer(nullptr, rewriter_config);
+  GraphDef output;
+  const Status status = optimizer.Optimize(nullptr, item, &output);
+  TF_EXPECT_OK(status);
+  EXPECT_TRUE(TestOptimizer::IsOptimized());
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/optimizers/model_pruner.cc b/tensorflow/core/grappler/optimizers/model_pruner.cc
index f52a2ab86288adacefec6796ceed4cea73d9b632..3311e970108d94d34a92842d51aca8f0c99d904c 100644
--- a/tensorflow/core/grappler/optimizers/model_pruner.cc
+++ b/tensorflow/core/grappler/optimizers/model_pruner.cc
@@ -50,7 +50,7 @@ bool IsTrivialOp(const NodeDef& node, const GraphRewriter& rewriter) {
 
 Status ModelPruner::Optimize(Cluster* cluster, const GrapplerItem& item,
                              GraphDef* pruned_graph) {
-  const std::unordered_set<string>& nodes_to_preserve = item.NodesToPreserve();
+  const std::unordered_set<string> nodes_to_preserve = item.NodesToPreserve();
 
   // Prune all the nodes that won't be executed, ie all the nodes that aren't in
   // the fanin of a fetch node. If fetch nodes aren't specified, we'll assume
@@ -59,6 +59,7 @@ Status ModelPruner::Optimize(Cluster* cluster, const GrapplerItem& item,
   if (!nodes_to_preserve.empty()) {
     std::vector<string> terminal_nodes(nodes_to_preserve.begin(),
                                        nodes_to_preserve.end());
+    std::sort(terminal_nodes.begin(), terminal_nodes.end());
     bool ill_formed = false;
     std::vector<const NodeDef*> keep =
         ComputeTransitiveFanin(item.graph, terminal_nodes, &ill_formed);
diff --git a/tensorflow/core/grappler/utils.cc b/tensorflow/core/grappler/utils.cc
index eb5a2c48dc8b12f7b4090e80c403e238a526e122..eb1f882ff1d21034e233987fb778d295f00bec85 100644
--- a/tensorflow/core/grappler/utils.cc
+++ b/tensorflow/core/grappler/utils.cc
@@ -29,6 +29,18 @@ limitations under the License.
 
 namespace tensorflow {
 namespace grappler {
+namespace {
+template <typename T>
+bool SafeSetScalarTensorValue(double value, Tensor* tensor) {
+  using RealType = typename Eigen::NumTraits<T>::Real;
+  if (value > std::numeric_limits<RealType>::max() ||
+      value < std::numeric_limits<RealType>::lowest()) {
+    return false;
+  }
+  tensor->flat<T>()(0) = static_cast<T>(value);
+  return true;
+}
+}  // namespace
 
 NodeMap::NodeMap(GraphDef* graph) {
   CHECK(graph != nullptr);
@@ -336,6 +348,7 @@ inline void STLSortAndRemoveDuplicates(T* v) {
 
 Status SimpleGraphView::Initialize(const GraphDef& graph, bool dedup_inputs,
                                    bool dedup_outputs) {
+  graph_ = &graph;
   const int num_nodes = graph.node_size();
   inputs_.clear();
   inputs_.resize(num_nodes);
@@ -382,6 +395,22 @@ Status SimpleGraphView::Initialize(const GraphDef& graph, bool dedup_inputs,
   return Status::OK();
 }
 
+void SimpleGraphView::DepthFirstSearch(
+    const std::unordered_set<string>& op_types_to_traverse, int node_idx,
+    std::set<int>* nodes_found) const {
+  if (nodes_found->find(node_idx) != nodes_found->end()) {
+    return;
+  }
+  nodes_found->insert(node_idx);
+  const string& op_type = graph_->node(node_idx).op();
+  if (op_types_to_traverse.find(op_type) == op_types_to_traverse.end()) {
+    return;
+  }
+  for (auto output_idx : this->outputs(node_idx)) {
+    DepthFirstSearch(op_types_to_traverse, output_idx, nodes_found);
+  }
+}
+
 string SimpleGraphView::PrintToString() const {
   string str;
   for (int i = 0; i < num_nodes(); ++i) {
@@ -402,5 +431,43 @@ string SimpleGraphView::PrintToString() const {
   return str;
 }
 
+#define HANDLE_CASE(DTYPE)                                          \
+  case DTYPE:                                                       \
+    if (!SafeSetScalarTensorValue<EnumToDataType<DTYPE>::Type>(     \
+            static_cast<double>(value), tensor)) {                  \
+      return errors::InvalidArgument("Cannot store value ", value,  \
+                                     " in tensor of type " #DTYPE); \
+    }                                                               \
+    break
+
+Status SetTensorValue(DataType dtype, int value, Tensor* tensor) {
+  // TODO(rmlarsen): Support more general shapes.
+  if (tensor->NumElements() != 1) {
+    return errors::InvalidArgument(
+        "Expected scalar tensor, got num_elements = ", tensor->NumElements());
+  }
+  switch (dtype) {
+    // TODO(rmlarsen): Handle DT_HALF.
+    //    HANDLE_CASE(DT_HALF);
+    HANDLE_CASE(DT_BOOL);
+    HANDLE_CASE(DT_FLOAT);
+    HANDLE_CASE(DT_DOUBLE);
+    HANDLE_CASE(DT_UINT8);
+    HANDLE_CASE(DT_INT8);
+    HANDLE_CASE(DT_UINT16);
+    HANDLE_CASE(DT_INT16);
+    HANDLE_CASE(DT_INT32);
+    HANDLE_CASE(DT_INT64);
+    HANDLE_CASE(DT_COMPLEX64);
+    HANDLE_CASE(DT_COMPLEX128);
+    default:
+      return errors::InvalidArgument("Unsupported type ",
+                                     DataTypeString(dtype));
+  }
+  return Status::OK();
+}
+
+#undef HANDLE_CASE
+
 }  // end namespace grappler
 }  // end namespace tensorflow
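
Review note on the range check in `SafeSetScalarTensorValue`: for floating-point types `std::numeric_limits<T>::min()` is the smallest positive normalized value, whereas `lowest()` is the most negative finite value, so a lower-bound test generally wants `lowest()`. The following is a minimal standalone sketch of such a check. `FitsInRange` is a hypothetical helper and is not part of TensorFlow.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: reports whether `value` lies within the finite range
// of the arithmetic type T.
template <typename T>
bool FitsInRange(double value) {
  // lowest() is the most negative finite value of T. For floating-point T,
  // min() would instead be the smallest positive normalized value, and using
  // it as a lower bound would reject zero and every negative number.
  return value <= static_cast<double>(std::numeric_limits<T>::max()) &&
         value >= static_cast<double>(std::numeric_limits<T>::lowest());
}
```

For integer types `min()` and `lowest()` coincide, so the distinction only matters for `float`, `double` and the real parts of the complex types.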
diff --git a/tensorflow/core/grappler/utils.h b/tensorflow/core/grappler/utils.h
index 4ecb28f681507f50ad5909f15cf1b408ed6e2979..fbd38c1531e3945091fcb328633a750c6a71ce2e 100644
--- a/tensorflow/core/grappler/utils.h
+++ b/tensorflow/core/grappler/utils.h
@@ -23,6 +23,7 @@ limitations under the License.
 
 #include "tensorflow/core/framework/graph.pb.h"
 #include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/types.h"
 #include "tensorflow/core/lib/core/status.h"
 #include "tensorflow/core/lib/core/threadpool.h"
@@ -167,6 +168,8 @@ NodeDef* GetTailOfChain(const NodeDef& source, const NodeMap& node_map,
 void PermuteNodesInPlace(GraphDef* graph, std::vector<int>* permutation,
                          bool invert_permutation);
 
+Status SetTensorValue(DataType dtype, int value, Tensor* tensor);
+
 class SimpleGraphView {
  public:
   Status Initialize(const GraphDef& graph) {
@@ -175,6 +178,7 @@ class SimpleGraphView {
   Status Initialize(const GraphDef& graph, bool dedup_inputs,
                     bool dedup_outputs);
 
+  const GraphDef* graph() const { return graph_; }
   inline int num_nodes() const { return index_to_name_.size(); }
   inline const int index(const string& node_name) const {
     const auto& it = name_to_index_.find(node_name);
@@ -191,9 +195,17 @@ class SimpleGraphView {
     return outputs_[node_idx];
   }
 
+  // Traverse the graph starting at `node_idx`, collecting the indices of all
+  // visited nodes in `nodes_found`. The walk continues to a node's outputs only
+  // if its op is in `op_types_to_traverse`. It is assumed that *graph_ was not
+  // modified after the call to Initialize().
+  void DepthFirstSearch(const std::unordered_set<string>& op_types_to_traverse,
+                        int node_idx, std::set<int>* nodes_found) const;
+
   string PrintToString() const;
 
  private:
+  const GraphDef* graph_;  // Not owned.
   std::vector<string> index_to_name_;
   std::unordered_map<string, int> name_to_index_;
   std::vector<gtl::InlinedVector<int, 4>> inputs_;
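
Review note on `SimpleGraphView::DepthFirstSearch`: a node is recorded in `nodes_found` even when its op type is not in `op_types_to_traverse`; only the descent into its outputs is suppressed. The sketch below reproduces that behavior on a plain adjacency list. `TinyGraph` is a hypothetical stand-in for the graph view and is not a TensorFlow type.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a graph view: ops[i] is the op type of node i and
// outputs[i] lists the indices of the nodes that consume node i.
struct TinyGraph {
  std::vector<std::string> ops;
  std::vector<std::vector<int>> outputs;
};

void DepthFirstSearch(
    const TinyGraph& g,
    const std::unordered_set<std::string>& op_types_to_traverse, int node_idx,
    std::set<int>* nodes_found) {
  assert(node_idx >= 0 && node_idx < static_cast<int>(g.ops.size()));
  // insert() reports whether the index was new; visited nodes are skipped.
  if (!nodes_found->insert(node_idx).second) return;
  // The node has been recorded, but the walk stops here unless its op type
  // is one of the types to traverse.
  if (op_types_to_traverse.count(g.ops[node_idx]) == 0) return;
  for (int output_idx : g.outputs[node_idx]) {
    DepthFirstSearch(g, op_types_to_traverse, output_idx, nodes_found);
  }
}
```

On a chain `Identity -> Identity -> MatMul -> Relu` with only `Identity` traversable, a walk from the first node collects the two `Identity` nodes and the `MatMul` that ends the chain, but not the `Relu` behind it.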
diff --git a/tensorflow/core/grappler/utils/BUILD b/tensorflow/core/grappler/utils/BUILD
index 0a9dbe22cfe3cd01c2c61661adcdd4839a957f03..3dbad40cae0e4b138d674854e9782713ec6cc530 100644
--- a/tensorflow/core/grappler/utils/BUILD
+++ b/tensorflow/core/grappler/utils/BUILD
@@ -142,6 +142,38 @@ cc_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:protos_all_cc",
         "//tensorflow/core:test",
+        "//tensorflow/core/grappler:grappler_item",
         "//tensorflow/core/grappler:utils",
     ],
 )
+
+cc_library(
+    name = "functions",
+    srcs = [
+        "functions.cc",
+    ],
+    hdrs = ["functions.h"],
+    visibility = ["//visibility:public"],
+    deps = [
+        "//tensorflow/core:framework",
+        "//tensorflow/core:framework_internal",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core/grappler:grappler_item",
+        "//tensorflow/core/grappler:utils",
+    ],
+)
+
+tf_cc_test(
+    name = "functions_test",
+    srcs = ["functions_test.cc"],
+    deps = [
+        ":functions",
+        "//tensorflow/cc:cc_ops",
+        "//tensorflow/core:all_kernels",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:protos_all_cc",
+        "//tensorflow/core:test",
+        "//tensorflow/core:test_main",
+        "//tensorflow/core:testlib",
+    ],
+)
diff --git a/tensorflow/core/grappler/utils/functions.cc b/tensorflow/core/grappler/utils/functions.cc
new file mode 100644
index 0000000000000000000000000000000000000000..4f286ce1c8bc3df4065f39c1744600d457173c2e
--- /dev/null
+++ b/tensorflow/core/grappler/utils/functions.cc
@@ -0,0 +1,153 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include "tensorflow/core/grappler/utils/functions.h"
+
+#include <unordered_map>
+
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/function.h"
+#include "tensorflow/core/framework/function.pb.h"
+#include "tensorflow/core/framework/graph_def_util.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/framework/types.pb.h"
+#include "tensorflow/core/grappler/utils.h"
+
+namespace tensorflow {
+namespace grappler {
+
+std::unique_ptr<GrapplerItem> GrapplerItemFromFunctionDef(
+    const FunctionDef& func,
+    const std::unordered_map<string, AttrValue>& func_attr,
+    const FunctionDefLibrary& library) {
+  if (func.signature().name().empty()) {
+    LOG(ERROR) << "function name must be specified.";
+    return nullptr;
+  }
+  std::unique_ptr<GrapplerItem> new_item(new GrapplerItem());
+  new_item->id = func.signature().name();
+
+  std::unordered_map<string, string> port_map;
+
+  // Add the function inputs as placeholders.
+  for (const auto& inp : func.signature().input_arg()) {
+    NodeDef* ph = new_item->graph.add_node();
+    ph->set_name(inp.name());
+    ph->set_op("Placeholder");
+    if (inp.type() != DT_INVALID) {
+      (*ph->mutable_attr())["T"].set_type(inp.type());
+    } else {
+      auto it = func_attr.find(inp.type_attr());
+      if (it == func_attr.end()) {
+        LOG(ERROR) << "Unknown type attribute " << inp.type_attr()
+                   << " for function input " << inp.name();
+        return nullptr;
+      } else {
+        (*ph->mutable_attr())["T"] = it->second;
+      }
+    }
+    port_map[inp.name()] = inp.name();
+  }
+
+  // Add the function body to the graph.
+  FunctionLibraryDefinition func_def(OpRegistry::Global(), library);
+
+  for (const NodeDef& node : func.node_def()) {
+    NodeDef* new_node = new_item->graph.add_node();
+    *new_node = node;
+    // Replace the placeholder attribute values with the specified value.
+    for (auto& attr : *new_node->mutable_attr()) {
+      const string& ph_name = attr.second.placeholder();
+      auto it = func_attr.find(ph_name);
+      if (it != func_attr.end()) {
+        attr.second = it->second;
+      }
+    }
+
+    // Functions use a custom format to encode connectivity. Map these custom
+    // strings to regular ones.
+    const OpRegistrationData* registration;
+    Status status = func_def.LookUp(node.op(), &registration);
+    if (!status.ok()) {
+      LOG(ERROR) << "Op " << node.op() << " not registered: " << status;
+      return nullptr;
+    }
+
+    tensorflow::NameRangeMap inputs;
+    tensorflow::NameRangeMap outputs;
+    status = tensorflow::NameRangesForNode(node, registration->op_def, &inputs,
+                                           &outputs);
+    if (!status.ok()) {
+      LOG(ERROR) << "Op " << node.op() << " invalid: " << status;
+      return nullptr;
+    }
+    for (const auto& name_range : outputs) {
+      string port_prefix =
+          strings::StrCat(node.name(), ":", name_range.first, ":");
+      int index_start = name_range.second.first;
+      int index_end = name_range.second.second;
+      for (int i = index_start; i < index_end; ++i) {
+        string port_id = strings::StrCat(port_prefix, i - index_start);
+        string port_name = strings::StrCat(node.name(), ":", i);
+        port_map[port_id] = port_name;
+      }
+    }
+  }
+
+  for (auto& node : *new_item->graph.mutable_node()) {
+    // Rewrite the inputs to use the normal naming convention.
+    for (int i = 0; i < node.input_size(); ++i) {
+      const string& input = node.input(i);
+      if (IsControlInput(input)) {
+        // No need to remap control dependencies.
+        continue;
+      } else {
+        auto it = port_map.find(input);
+        if (it == port_map.end()) {
+          LOG(ERROR) << "Unknown input: " << input;
+          return nullptr;
+        }
+        node.set_input(i, it->second);
+      }
+    }
+  }
+
+  // Add the function outputs to the list of fetch nodes, taking into account
+  // the output mapping if any.
+  for (const auto& out : func.signature().output_arg()) {
+    auto it = func.ret().find(out.name());
+    if (it != func.ret().end()) {
+      auto it2 = port_map.find(it->second);
+      if (it2 == port_map.end()) {
+        LOG(ERROR) << "Unknown output mapping: " << it->first << " to "
+                   << it->second;
+        return nullptr;
+      } else {
+        new_item->fetch.emplace_back(it2->second);
+      }
+    } else {
+      new_item->fetch.emplace_back(out.name());
+    }
+  }
+  // Add the function inputs to the list of feeds.
+  for (const auto& inp : func.signature().input_arg()) {
+    new_item->feed.emplace_back(inp.name(), Tensor());
+  }
+
+  return new_item;
+}
+
+}  // end namespace grappler
+}  // end namespace tensorflow
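
Review note on the port remapping in `GrapplerItemFromFunctionDef`: function bodies refer to a node output as `node:output_arg:k`, where `k` counts within that output argument, while a regular graph uses the flat form `node:index`. The sketch below shows the same remapping with standard containers only. `AddOutputPorts` is a hypothetical helper, and the `NameRangeMap` alias here is assumed to mirror the half-open `[start, end)` ranges that the loop in `functions.cc` reads from `NameRangesForNode`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

// Maps each output argument name to its half-open range [first, second) of
// flat output indices (an assumption of this sketch).
using NameRangeMap = std::map<std::string, std::pair<int, int>>;

// Hypothetical helper: records "node:arg:k" -> "node:flat_index" for every
// output port of a single node.
void AddOutputPorts(const std::string& node_name, const NameRangeMap& outputs,
                    std::unordered_map<std::string, std::string>* port_map) {
  for (const auto& name_range : outputs) {
    const int index_start = name_range.second.first;
    const int index_end = name_range.second.second;
    assert(index_start <= index_end);
    for (int i = index_start; i < index_end; ++i) {
      // k is the position inside this output argument; i is the flat index.
      const std::string port_id = node_name + ":" + name_range.first + ":" +
                                  std::to_string(i - index_start);
      (*port_map)[port_id] = node_name + ":" + std::to_string(i);
    }
  }
}
```

For a node named `rx` with two single-tensor outputs `r0` and `r1`, this yields `rx:r0:0 -> rx:0` and `rx:r1:0 -> rx:1`, which is consistent with the `rx:1` input that `functions_test.cc` expects on `sum_gy`.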
diff --git a/tensorflow/core/grappler/utils/functions.h b/tensorflow/core/grappler/utils/functions.h
new file mode 100644
index 0000000000000000000000000000000000000000..8f9b7d848a89435e1839e540f33d87213beb8a45
--- /dev/null
+++ b/tensorflow/core/grappler/utils/functions.h
@@ -0,0 +1,40 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_GRAPPLER_UTILS_FUNCTIONS_H_
+#define TENSORFLOW_GRAPPLER_UTILS_FUNCTIONS_H_
+
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include "tensorflow/core/framework/attr_value.pb.h"
+#include "tensorflow/core/framework/function.pb.h"
+#include "tensorflow/core/grappler/grappler_item.h"
+
+namespace tensorflow {
+
+namespace grappler {
+
+// Factory method for creating a GrapplerItem from a FunctionDef.
+// Returns nullptr if the given function def cannot be converted.
+std::unique_ptr<GrapplerItem> GrapplerItemFromFunctionDef(
+    const FunctionDef& func,
+    const std::unordered_map<string, AttrValue>& func_attr,
+    const FunctionDefLibrary& library);
+
+}  // end namespace grappler
+}  // end namespace tensorflow
+
+#endif  // TENSORFLOW_GRAPPLER_UTILS_FUNCTIONS_H_
diff --git a/tensorflow/core/grappler/utils/functions_test.cc b/tensorflow/core/grappler/utils/functions_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..6a7d766b1c6b49f8fc13b3b0294f3e3f8a74eb35
--- /dev/null
+++ b/tensorflow/core/grappler/utils/functions_test.cc
@@ -0,0 +1,350 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/grappler/utils/functions.h"
+#include "tensorflow/cc/ops/standard_ops.h"
+#include "tensorflow/core/framework/function_testlib.h"
+#include "tensorflow/core/framework/node_def.pb.h"
+#include "tensorflow/core/framework/node_def_util.h"
+#include "tensorflow/core/framework/tensor_testutil.h"
+#include "tensorflow/core/platform/test.h"
+#include "tensorflow/core/protobuf/meta_graph.pb.h"
+
+namespace tensorflow {
+namespace grappler {
+namespace {
+
+class FunctionsTest : public ::testing::Test {};
+
+TEST_F(FunctionsTest, FromSimpleFunctionDef) {
+  const Tensor kTwo = test::AsScalar<int64>(2);
+  FunctionDef func = FunctionDefHelper::Define(
+      // Name
+      "XTimesTwo",
+      // Args
+      {"x: T"},
+      // Return values
+      {"y: T"},
+      // Attr def
+      {"T: {float, double, int32, int64}"},
+      // Nodes
+      {
+          {{"two"}, "Const", {}, {{"value", kTwo}, {"dtype", DT_INT64}}},
+          {{"scale"}, "Cast", {"two"}, {{"SrcT", DT_INT64}, {"DstT", "$T"}}},
+          {{"y"}, "Mul", {"x", "scale"}, {{"T", "$T"}}},
+      });
+
+  std::unordered_map<string, AttrValue> func_attr;
+  func_attr["T"].set_type(DT_FLOAT);
+  FunctionDefLibrary library;
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+  CHECK(item);
+  EXPECT_EQ("XTimesTwo", item->id);
+  EXPECT_EQ(4, item->graph.node_size());
+  EXPECT_EQ(std::vector<string>({"y:0"}), item->fetch);
+  EXPECT_EQ(1, item->feed.size());
+  EXPECT_EQ("x", item->feed[0].first);
+
+  for (const NodeDef& node : item->graph.node()) {
+    if (node.name() == "x") {
+      EXPECT_EQ("Placeholder", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(0, node.input_size());
+    } else if (node.name() == "two") {
+      EXPECT_EQ("Const", node.op());
+      EXPECT_EQ(0, node.input_size());
+    } else if (node.name() == "scale") {
+      EXPECT_EQ("Cast", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("DstT").type());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("two:0", node.input(0));
+    } else if (node.name() == "y") {
+      EXPECT_EQ("Mul", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("scale:0", node.input(1));
+    }
+  }
+}
+
+TEST_F(FunctionsTest, FromFunctionDefWithMultiOutputNodes) {
+  // Gradient graph for the Subtract operation
+  std::vector<FunctionDefHelper::Node> nodes = {
+      {{"sx"}, "Shape", {"x"}},
+      {{"sy"}, "Shape", {"y"}},
+      {{"gx"}, "Identity", {"dz"}},
+      {{"gy"}, "Neg", {"dz"}},
+      {{"rx", "ry"}, "BroadcastGradientArgs", {"sx", "sy"}},
+      {{"sum_gx"}, "Sum", {"gx", "rx"}},
+      {{"dx"}, "Reshape", {"sum_gx", "sx"}},
+      {{"sum_gy"}, "Sum", {"gy", "ry"}},
+      {{"dy"}, "Reshape", {"sum_gy", "sy"}},
+  };
+
+  for (auto& n : nodes) {
+    // "BroadcastGradientArgs" doesn't need any attrs.
+    if (n.attr.empty() && n.op != "BroadcastGradientArgs") {
+      n.attr = {{"T", "$T"}};
+    }
+  }
+  FunctionDef func = FunctionDefHelper::Define(
+      // Name
+      "SubGrad",
+      // Arg defs
+      {"x: T", "y: T", "dz: T"},
+      // Ret val defs
+      {"dx: T", "dy: T"},
+      // Attr defs
+      {{"T: {half, float, double}"}},
+      // Nodes
+      nodes);
+
+  std::unordered_map<string, AttrValue> func_attr;
+  func_attr["T"].set_type(DT_FLOAT);
+  FunctionDefLibrary library;
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+  CHECK(item);
+  EXPECT_EQ("SubGrad", item->id);
+  EXPECT_EQ(12, item->graph.node_size());
+  EXPECT_EQ(std::vector<string>({"dx:0", "dy:0"}), item->fetch);
+  EXPECT_EQ(3, item->feed.size());
+  EXPECT_EQ("x", item->feed[0].first);
+  EXPECT_EQ("y", item->feed[1].first);
+  EXPECT_EQ("dz", item->feed[2].first);
+
+  for (const NodeDef &node : item->graph.node()) {
+    if (node.name() == "x" || node.name() == "y" || node.name() == "dz") {
+      EXPECT_EQ("Placeholder", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(0, node.input_size());
+    } else if (node.name() == "rx") {
+      EXPECT_EQ("BroadcastGradientArgs", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("sx:0", node.input(0));
+      EXPECT_EQ("sy:0", node.input(1));
+    } else if (node.name() == "sum_gx") {
+      EXPECT_EQ("Sum", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("gx:0", node.input(0));
+      EXPECT_EQ("rx:0", node.input(1));
+    } else if (node.name() == "sum_gy") {
+      EXPECT_EQ("Sum", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("gy:0", node.input(0));
+      EXPECT_EQ("rx:1", node.input(1));
+    }
+  }
+}
+
+TEST_F(FunctionsTest, FromFunctionDefWithNestedFuncs) {
+  FunctionDefLibrary library;
+  *library.add_function() = FunctionDefHelper::Define(
+      // Name
+      "Swap",
+      // Args
+      {"i0: T", "i1: T"},
+      // Return values
+      {"o0: T", "o1: T"},
+      // Attr def
+      {"T: {float, double}"},
+      // Nodes
+      {{{"o0"}, "Identity", {"i1"}, {{"T", "$T"}}},
+       {{"o1"}, "Identity", {"i0"}, {{"T", "$T"}}}});
+
+  FunctionDef func = FunctionDefHelper::Create(
+      // Name
+      "ManySwapsFirst",
+      // Args
+      {"x: float", "y: float"},
+      // Return values
+      {"o: float"},
+      // attr def
+      {},
+      // Nodes
+      // o = x*x + y*y.  Furthermore, the 1st swap depends on x2, and
+      // y2 depends on the 2nd swap.  The 2nd swap has data dependency
+      // on the 1st swap.
+      {{{"a0"}, "Swap", {"x", "y"}, {{"T", DT_FLOAT}}, {"x2"}},
+       {{"a1"}, "Swap", {"a0:o0:0", "a0:o1:0"}, {{"T", DT_FLOAT}}},
+       {{"x2"}, "Mul", {"x", "x"}, {{"T", DT_FLOAT}}},
+       {{"y2"}, "Mul", {"y", "y"}, {{"T", DT_FLOAT}}, {"a1"}},
+       {{"o"}, "Add", {"x2:z:0", "y2:z:0"}, {{"T", DT_FLOAT}}}},
+      // Output Mapping
+      {{"o", "o:z:0"}});
+
+  std::unordered_map<string, AttrValue> func_attr;
+  func_attr["T"].set_type(DT_FLOAT);
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+
+  for (const NodeDef &node : item->graph.node()) {
+    if (node.name() == "x" || node.name() == "y") {
+      EXPECT_EQ("Placeholder", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(0, node.input_size());
+    } else if (node.name() == "a0") {
+      EXPECT_EQ("Swap", node.op());
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("y", node.input(1));
+      EXPECT_EQ("^x2", node.input(2));
+    } else if (node.name() == "a1") {
+      EXPECT_EQ("Swap", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("a0:0", node.input(0));
+      EXPECT_EQ("a0:1", node.input(1));
+    } else if (node.name() == "x2") {
+      EXPECT_EQ("Mul", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("x", node.input(0));
+      EXPECT_EQ("x", node.input(1));
+    } else if (node.name() == "y2") {
+      EXPECT_EQ("Mul", node.op());
+      EXPECT_EQ(3, node.input_size());
+      EXPECT_EQ("y", node.input(0));
+      EXPECT_EQ("y", node.input(1));
+      EXPECT_EQ("^a1", node.input(2));
+    } else if (node.name() == "o") {
+      EXPECT_EQ("Add", node.op());
+      EXPECT_EQ(2, node.input_size());
+      EXPECT_EQ("x2:0", node.input(0));
+      EXPECT_EQ("y2:0", node.input(1));
+    }
+  }
+}
+
+TEST_F(FunctionsTest, FromFunctionDefWithOutputMappings) {
+  FunctionDef func = FunctionDefHelper::Create(
+      // Name
+      "Exp_func",
+      // Args
+      {"in: float"},
+      // Return values
+      {"out: float"},
+      // Attr def
+      {},
+      // Nodes
+      {{{"Linear_func"}, "Identity", {"in"}, {{"T", DT_FLOAT}}},
+       {{"Exp"}, "Exp", {"Linear_func:output:0"}, {{"T", DT_FLOAT}}}},
+      // Mapping
+      {{"out", "Exp:y:0"}});
+
+  std::unordered_map<string, AttrValue> func_attr;
+  FunctionDefLibrary library;
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+
+  EXPECT_EQ(1, item->fetch.size());
+  EXPECT_EQ("Exp:0", item->fetch[0]);
+
+  for (const NodeDef &node : item->graph.node()) {
+    if (node.name() == "in") {
+      EXPECT_EQ("Placeholder", node.op());
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+      EXPECT_EQ(0, node.input_size());
+    } else if (node.name() == "Linear_func") {
+      EXPECT_EQ("Identity", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("in", node.input(0));
+    } else if (node.name() == "Exp") {
+      EXPECT_EQ("Exp", node.op());
+      EXPECT_EQ(1, node.input_size());
+      EXPECT_EQ("Linear_func:0", node.input(0));
+    }
+  }
+}
+
+TEST_F(FunctionsTest, FromFunctionDefWithInputForwarding) {
+  FunctionDef func = FunctionDefHelper::Create(
+      // Name
+      "ForwardInputs",
+      // Args
+      {"in0: float", "in1: float", "arg2: float", "arg3: int32", "arg4: float"},
+      // Return values
+      {"out0: float", "arg2: float", "arg3: int32"},
+      // Attr def
+      {},
+      // Nodes
+      {},
+      // Mapping
+      {{"out0", "in0"}});
+
+  std::unordered_map<string, AttrValue> func_attr;
+  FunctionDefLibrary library;
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+
+  EXPECT_EQ(3, item->fetch.size());
+  EXPECT_EQ("in0", item->fetch[0]);
+  EXPECT_EQ("arg2", item->fetch[1]);
+  EXPECT_EQ("arg3", item->fetch[2]);
+
+  EXPECT_EQ(5, item->graph.node_size());
+  for (const NodeDef &node : item->graph.node()) {
+    EXPECT_TRUE(node.name() == "in0" || node.name() == "in1" ||
+                node.name() == "arg2" || node.name() == "arg3" ||
+                node.name() == "arg4");
+    EXPECT_EQ("Placeholder", node.op());
+    if (node.name() == "arg3") {
+      EXPECT_EQ(DT_INT32, node.attr().at("T").type());
+    } else {
+      EXPECT_EQ(DT_FLOAT, node.attr().at("T").type());
+    }
+  }
+}
+
+TEST_F(FunctionsTest, FromFunctionDefWithoutInput) {
+  const Tensor kTwo = test::AsScalar<int64>(2);
+  FunctionDef func = FunctionDefHelper::Define(
+      // Name
+      "GenerateTwo",
+      // Args
+      {},
+      // Return value
+      {"o: T"},
+      // Attr def
+      {"T: {float, double}"},
+      // Nodes
+      {{{"two"}, "Const", {}, {{"value", kTwo}, {"dtype", DT_INT64}}},
+       {{"o"}, "Cast", {"two"}, {{"SrcT", DT_INT64}, {"DstT", "$T"}}}});
+
+  std::unordered_map<string, AttrValue> func_attr;
+  func_attr["T"].set_type(DT_FLOAT);
+  FunctionDefLibrary library;
+  std::unique_ptr<GrapplerItem> item =
+      GrapplerItemFromFunctionDef(func, func_attr, library);
+
+  EXPECT_EQ(0, item->feed.size());
+  EXPECT_EQ(1, item->fetch.size());
+  EXPECT_EQ("o:0", item->fetch[0]);
+
+  EXPECT_EQ(2, item->graph.node_size());
+  const NodeDef &two = item->graph.node(0);
+  EXPECT_EQ("two", two.name());
+  EXPECT_EQ(0, two.input_size());
+  const NodeDef &cast = item->graph.node(1);
+  EXPECT_EQ("o", cast.name());
+  EXPECT_EQ(1, cast.input_size());
+  EXPECT_EQ("two:0", cast.input(0));
+
+  std::cout << item->graph.DebugString() << std::endl;
+}
+
+}  // namespace
+}  // namespace grappler
+}  // namespace tensorflow
diff --git a/tensorflow/core/grappler/utils/grappler_test.cc b/tensorflow/core/grappler/utils/grappler_test.cc
index 813f65f825759ca22dba2bdfd8433d946b7dd852..79b2aa2808a513e7db7c447d5deb26d6c319e083 100644
--- a/tensorflow/core/grappler/utils/grappler_test.cc
+++ b/tensorflow/core/grappler/utils/grappler_test.cc
@@ -35,5 +35,60 @@ std::vector<Tensor> GrapplerTest::EvaluateNodes(
   return output_tensors;
 }
 
+std::vector<Tensor> GrapplerTest::EvaluateFetchNodes(const GrapplerItem& item) {
+  SessionOptions options;
+  std::unique_ptr<tensorflow::Session> session(NewSession(options));
+  TF_CHECK_OK(session->Create(item.graph));
+  RunOptions run_options;
+  if (!item.init_ops.empty()) {
+    std::vector<Tensor> dummy;
+    TF_CHECK_OK(
+        session->Run(run_options, {}, {}, item.init_ops, &dummy, nullptr));
+  }
+  std::vector<Tensor> output_tensors;
+  TF_CHECK_OK(session->Run(run_options, item.feed, item.fetch, {},
+                           &output_tensors, nullptr));
+  TF_CHECK_OK(session->Close());
+  return output_tensors;
+}
+
+void GrapplerTest::AddNode(const string& name, const string& op,
+                           const std::vector<string>& inputs, GraphDef* graph) {
+  auto* node = graph->add_node();
+  node->set_name(name);
+  node->set_op(op);
+  for (const auto& input : inputs) {
+    node->add_input(input);
+  }
+}
+
+void GrapplerTest::CompareGraphs(GraphDef want, GraphDef got) {
+  auto comparator = [](const NodeDef& n1, const NodeDef& n2) -> bool {
+    return n1.name() < n2.name();
+  };
+  std::sort(want.mutable_node()->begin(), want.mutable_node()->end(),
+            comparator);
+  std::sort(got.mutable_node()->begin(), got.mutable_node()->end(), comparator);
+
+  for (int i = 0; i < want.node_size(); ++i) {
+    std::sort(want.mutable_node(i)->mutable_input()->begin(),
+              want.mutable_node(i)->mutable_input()->end());
+  }
+  for (int i = 0; i < got.node_size(); ++i) {
+    std::sort(got.mutable_node(i)->mutable_input()->begin(),
+              got.mutable_node(i)->mutable_input()->end());
+  }
+
+  ASSERT_EQ(want.node_size(), got.node_size());
+  for (int i = 0; i < want.node_size(); ++i) {
+    EXPECT_EQ(want.node(i).op(), got.node(i).op());
+    EXPECT_EQ(want.node(i).name(), got.node(i).name());
+    ASSERT_EQ(want.node(i).input_size(), got.node(i).input_size());
+    for (int j = 0; j < want.node(i).input_size(); ++j) {
+      EXPECT_TRUE(IsSameInput(want.node(i).input(j), got.node(i).input(j)));
+    }
+  }
+}
+
 }  // namespace grappler
 }  // namespace tensorflow
diff --git a/tensorflow/core/grappler/utils/grappler_test.h b/tensorflow/core/grappler/utils/grappler_test.h
index 46ce47c8c3b6bc18b6eac76bbdb8ec1f8a58fab2..fd6809b6e21b87bf5420898def17ea17bc0b427b 100644
--- a/tensorflow/core/grappler/utils/grappler_test.h
+++ b/tensorflow/core/grappler/utils/grappler_test.h
@@ -20,6 +20,7 @@ limitations under the License.
 
 #include "tensorflow/core/framework/graph.pb.h"
 #include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/grappler/grappler_item.h"
 #include "tensorflow/core/platform/test.h"
 
 namespace tensorflow {
@@ -29,6 +30,13 @@ class GrapplerTest : public ::testing::Test {
  protected:
   std::vector<Tensor> EvaluateNodes(const GraphDef& graph,
                                     const std::vector<string>& node_names);
+
+  std::vector<Tensor> EvaluateFetchNodes(const GrapplerItem& item);
+
+  void AddNode(const string& name, const string& op,
+               const std::vector<string>& inputs, GraphDef* graph);
+
+  void CompareGraphs(GraphDef want, GraphDef got);
 };
 
 }  // end namespace grappler
diff --git a/tensorflow/core/kernels/BUILD b/tensorflow/core/kernels/BUILD
index dc93c76eaee6c3408453a74bac98f5e365364247..10f4d42147b9a487aee918211cc186a34bfb1c56 100644
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -56,8 +56,8 @@ load("@local_config_cuda//cuda:build_defs.bzl", "if_cuda")
 
 config_setting(
     # Add "--define tensorflow_xsmm=1" to your build command to use libxsmm for
-    # convolutions (and possibly more in the future). You will also need
-    # appropriate -mavx*, as required by specific op you use.
+    # sparse matrix multiplications. You will also need appropriate -mavx*
+    # options, as required by specific op you use.
     name = "xsmm",
     values = {
         "define": "tensorflow_xsmm=1",
@@ -65,12 +65,23 @@ config_setting(
 )
 
 config_setting(
-    # Add "--define tensorflow_xsmm_backward=1" to your build command to use
-    # libxsmm for backward convolutions (and possibly more in the future). You
-    # will also need appropriate -mavx*, as required by specific op you use.
-    name = "xsmm_backward",
+    # Add "--define tensorflow_xsmm_convolutions=1" to your build command to
+    # use libxsmm for forward convolutions. You will also need appropriate
+    # -mavx* options, as required by specific op you use.
+    name = "xsmm_convolutions",
     values = {
-        "define": "tensorflow_xsmm_backward=1",
+        "define": "tensorflow_xsmm_convolutions=1",
+    },
+)
+
+config_setting(
+    # Add "--define tensorflow_xsmm_convolutions=1 --define
+    # tensorflow_xsmm_backward_convolutions=1" to your build command to use libxsmm for
+    # backward convolutions (and possibly more in the future). You will also
+    # need appropriate -mavx* options, as required by specific op you use.
+    name = "xsmm_backward_convolutions",
+    values = {
+        "define": "tensorflow_xsmm_backward_convolutions=1",
     },
 )
 
@@ -868,7 +879,7 @@ tf_kernel_library(
     hdrs = ["transpose_op.h"],
     deps = ARRAY_DEPS + if_mkl([
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
+        "@mkl_dnn",
     ]),
 )
 
@@ -1017,7 +1028,7 @@ tf_cc_test(
     name = "xsmm_conv2d_test",
     size = "small",
     srcs = select({
-        ":xsmm": ["xsmm_conv2d_test.cc"],
+        ":xsmm_convolutions": ["xsmm_conv2d_test.cc"],
         "//conditions:default": [],
     }),
     deps = [
@@ -1032,7 +1043,7 @@ tf_cc_test(
         "//tensorflow/core:test_main",
         "//tensorflow/core:testlib",
     ] + select({
-        ":xsmm": [
+        ":xsmm_convolutions": [
             "@libxsmm_archive//:xsmm_avx",
         ],
         "//conditions:default": [],
@@ -1891,9 +1902,9 @@ tf_kernel_library(
     srcs = ["resource_variable_ops.cc"],
     deps = [
         ":bounds_check",
-        ":critical_section",
         ":dense_update_functor",
         ":gather_functor",
+        ":mutex_ops",
         ":scatter_functor",
         ":state",
         ":training_op_helpers",
@@ -2799,7 +2810,7 @@ tf_kernel_library(
         "//conditions:default": [],
     }) + if_mkl([
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
+        "@mkl_dnn",
     ]) + if_cuda([
         "//tensorflow/core/platform/default/build_config:cublas_plugin",
     ]),
@@ -3138,7 +3149,7 @@ tf_kernel_library(
         "conv_grad_ops_3d.cc",
         "deep_conv2d.cc",
     ] + select({
-        ":xsmm": ["xsmm_conv2d.cc"],
+        ":xsmm_convolutions": ["xsmm_conv2d.cc"],
         "//conditions:default": [],
     }),
     hdrs = [
@@ -3148,7 +3159,7 @@ tf_kernel_library(
         "gemm_functors.h",
         "winograd_transform.h",
     ] + select({
-        ":xsmm": ["xsmm_conv2d.h"],
+        ":xsmm_convolutions": ["xsmm_conv2d.h"],
         "//conditions:default": [],
     }),
     # Override EIGEN_STRONG_INLINE to inline when --define=override_eigen_strong_inline=true,
@@ -3156,13 +3167,15 @@ tf_kernel_library(
     # on Windows. See https://github.com/tensorflow/tensorflow/issues/10521
     copts = if_override_eigen_strong_inline(["/DEIGEN_STRONG_INLINE=inline"]),
     defines = select({
-        ":xsmm": [
-            "TENSORFLOW_USE_LIBXSMM",
-            "EIGEN_USE_LIBXSMM",
+        ":xsmm_convolutions": [
+            "TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS",
         ],
         "//conditions:default": [],
     }) + select({
-        ":xsmm_backward": ["TENSORFLOW_USE_LIBXSMM_BACKWARD"],
+        ":xsmm": ["EIGEN_USE_LIBXSMM"],
+        "//conditions:default": [],
+    }) + select({
+        ":xsmm_backward_convolutions": ["TENSORFLOW_USE_LIBXSMM_BACKWARD_CONVOLUTIONS"],
         "//conditions:default": [],
     }),
     prefix = "conv_ops",
@@ -3179,7 +3192,7 @@ tf_kernel_library(
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:nn_ops_op_lib",
     ] + select({
-        ":xsmm": [
+        ":xsmm_convolutions": [
             "@libxsmm_archive//:xsmm_avx",
         ],
         "//conditions:default": [],
@@ -4094,9 +4107,9 @@ tf_kernel_library(
 )
 
 tf_kernel_library(
-    name = "critical_section",
-    prefix = "critical_section",
-    deps = STATE_DEPS + [":captured_function"],
+    name = "mutex_ops",
+    prefix = "mutex_ops",
+    deps = STATE_DEPS + [":ops_util"],
 )
 
 tf_cc_test(
@@ -4142,6 +4155,7 @@ cc_library(
         ":as_string_op",
         ":base64_ops",
         ":reduce_join_op",
+        ":regex_replace_op",
         ":string_join_op",
         ":string_split_op",
         ":string_to_hash_bucket_op",
@@ -4176,6 +4190,12 @@ tf_kernel_library(
     deps = STRING_DEPS,
 )
 
+tf_kernel_library(
+    name = "regex_replace_op",
+    prefix = "regex_replace_op",
+    deps = STRING_DEPS + ["@com_googlesource_code_re2//:re2"],
+)
+
 tf_kernel_library(
     name = "string_split_op",
     prefix = "string_split_op",
@@ -4868,7 +4888,7 @@ filegroup(
         "winograd_transform.h",
         ":android_extended_ops_headers",
     ] + select({
-        ":xsmm": [
+        ":xsmm_convolutions": [
             "xsmm_conv2d.h",
             "xsmm_conv2d.cc",
         ],
@@ -5048,8 +5068,9 @@ filegroup(
             # Excluded due to experimental status:
             "debug_ops.*",
             "scatter_nd_op*",
-            "critical_section.*",
+            "mutex_ops.*",
             "batch_kernels.*",
+            "regex_replace_op.cc",
         ],
     ),
     visibility = ["//visibility:public"],
@@ -5115,7 +5136,6 @@ tf_kernel_library(
     srcs = [
         "dequantize_op.cc",
         "meta_support.cc",
-        "quantization_utils.cc",
         "quantize_down_and_shrink_range.cc",
         "quantize_op.cc",
         "quantized_activation_ops.cc",
@@ -5136,7 +5156,6 @@ tf_kernel_library(
     ],
     hdrs = [
         "meta_support.h",
-        "quantization_utils.h",
         "reference_gemm.h",
     ],
     deps = [
@@ -5147,6 +5166,7 @@ tf_kernel_library(
         ":image_resizer_state",
         ":ops_util",
         ":pooling_ops",
+        ":quantization_utils",
         "//tensorflow/core:array_ops_op_lib",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:framework",
@@ -5693,6 +5713,16 @@ tf_kernel_library(
     ],
 )
 
+cc_library(
+    name = "quantization_utils",
+    srcs = ["quantization_utils.cc"],
+    hdrs = ["quantization_utils.h"],
+    deps = [
+        "//tensorflow/core:framework",
+        "@gemmlowp",
+    ],
+)
+
 cc_library(
     name = "remote_fused_graph_execute_utils",
     srcs = [
@@ -5829,10 +5859,9 @@ tf_mkl_kernel_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:nn_ops_op_lib",
-    ] + if_mkl([
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -5846,10 +5875,9 @@ tf_mkl_kernel_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:nn_ops_op_lib",
-    ] + if_mkl([
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -5877,6 +5905,7 @@ tf_mkl_kernel_library(
     ],
     hdrs = ["mkl_pooling_ops_common.h"],
     deps = [
+        ":bounds_check",
         ":ops_util",
         "//tensorflow/core:core_cpu",
         "//tensorflow/core:framework",
@@ -5898,10 +5927,10 @@ tf_mkl_kernel_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:nn_ops_op_lib",
-    ] + if_mkl([
+        "//third_party/eigen3",
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -5915,19 +5944,18 @@ tf_mkl_kernel_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:nn_ops_op_lib",
-    ] + if_mkl([
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
     name = "mkl_fused_batch_norm_op",
     srcs = ["mkl_fused_batch_norm_op.cc"],
-    deps = NN_DEPS + if_mkl([
+    deps = NN_DEPS + [
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -5941,10 +5969,10 @@ tf_mkl_kernel_library(
 tf_mkl_kernel_library(
     name = "mkl_concat_op",
     prefix = "mkl_concat_op",
-    deps = ARRAY_DEPS + if_mkl([
+    deps = ARRAY_DEPS + [
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -5958,19 +5986,19 @@ tf_mkl_kernel_library(
 tf_mkl_kernel_library(
     name = "mkl_identity_op",
     prefix = "mkl_identity_op",
-    deps = ARRAY_DEPS + if_mkl([
+    deps = ARRAY_DEPS + [
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
     name = "mkl_lrn_op",
     prefix = "mkl_lrn_op",
-    deps = NN_DEPS + if_mkl([
+    deps = NN_DEPS + [
         "//third_party/mkl:intel_binary_blob",
-        "@mkl_dnn//:mkl_dnn",
-    ]),
+        "@mkl_dnn",
+    ],
 )
 
 tf_mkl_kernel_library(
@@ -6068,7 +6096,6 @@ cc_library(
     srcs = [
         "cwise_ops_common.cc",
         "meta_support.cc",
-        "quantization_utils.cc",
     ],
     hdrs = [
         "cwise_ops.h",
@@ -6077,10 +6104,10 @@ cc_library(
         "cwise_ops_gpu_gradients.cu.h",
         "cwise_ops_gradients.h",
         "meta_support.h",
-        "quantization_utils.h",
     ],
     deps = [
         ":bounds_check",
+        ":quantization_utils",
         "//tensorflow/core:framework",
         "//tensorflow/core:lib",
         "//third_party/eigen3",
diff --git a/tensorflow/core/kernels/batch_kernels.cc b/tensorflow/core/kernels/batch_kernels.cc
index 546e51be53cee1833e8e1d4a15ea9b5be8a31506..8c99ded0a89e8065f4a7112db3b14eb2b27010c1 100644
--- a/tensorflow/core/kernels/batch_kernels.cc
+++ b/tensorflow/core/kernels/batch_kernels.cc
@@ -146,7 +146,7 @@ Status SplitCPU(OpKernelContext* context, const Tensor& input,
     suffix_dim_size *= input.shape().dim_size(i);
   }
   auto input_reshaped =
-      input.shaped<T, 3>({1, input.shape().dim_size(0), suffix_dim_size});
+      input.shaped<T, 2>({input.shape().dim_size(0), suffix_dim_size});
 
   int64 position = 0;
   for (const int64 size : sizes) {
@@ -155,13 +155,13 @@ Status SplitCPU(OpKernelContext* context, const Tensor& input,
     Tensor output;
     TF_RETURN_IF_ERROR(
         context->allocate_temp(input.dtype(), output_shape, &output));
-    auto output_shaped = output.shaped<T, 3>({1, size, suffix_dim_size});
+    auto output_shaped = output.shaped<T, 2>({size, suffix_dim_size});
 
-    Eigen::DSizes<Eigen::DenseIndex, 3> slice_indices{0, position, 0};
-    Eigen::DSizes<Eigen::DenseIndex, 3> slice_sizes{1, size, suffix_dim_size};
-    functor::Split<CPUDevice, T>()(context->eigen_device<CPUDevice>(),
-                                   output_shaped, input_reshaped, slice_indices,
-                                   slice_sizes);
+    Eigen::DSizes<Eigen::DenseIndex, 2> slice_indices{position, 0};
+    Eigen::DSizes<Eigen::DenseIndex, 2> slice_sizes{size, suffix_dim_size};
+    functor::Split<CPUDevice, T, 2>()(context->eigen_device<CPUDevice>(),
+                                      output_shaped, input_reshaped,
+                                      slice_indices, slice_sizes);
 
     outputs->emplace_back(output);
 
diff --git a/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h b/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h
index 25c5f9cf424fdb286922548ea7ab0a35157a3502..339d792302dd96e7a157c1df893d3ea62080c51a 100644
--- a/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h
+++ b/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler.h
@@ -19,7 +19,6 @@ limitations under the License.
 #include <algorithm>
 #include <functional>
 #include <memory>
-#include <queue>
 #include <random>
 #include <unordered_map>
 #include <vector>
@@ -44,49 +43,31 @@ template <typename TaskType>
 class ASBSQueue;
 }  // namespace internal
 
-// EXPERIMENTAL: API MAY BE SUBJECTED TO SUDDEN CHANGES.
-//
 // Shared batch scheduler designed to minimize latency. The scheduler keeps
 // track of a number of queues (one per model or model version) which are
 // continuously enqueuing requests. The scheduler groups the requests into
 // batches which it periodically sends off for processing (see
-// shared_batch_scheduler.h for more details). The AdaptiveSharedBatchScheduler
-// prioritizes batches by age (i.e. the batch's oldest request) irrespective of
-// queue or batch size.
-//
-// The scheduling decision currently exists in two flavors, controlled by the
-// option use_in_flight_batches_implementation. It is expected that setting this
-// option to true will give universally better results; after a period of
-// testing to confirm, the old implementation will be removed.
+// shared_batch_scheduler.h for more details). AdaptiveSharedBatchScheduler
+// (ASBS) prioritizes batches primarily by age (i.e. the batch's oldest request)
+// along with a configurable preference for scheduling larger batches first.
 //
-// If use_in_flight_batches_implementation is set to true, the scheduler
-// limits the number of batches which can be processed concurrently.  If a new
-// batch is created, and the number of in flight batches is below the limit,
-// the next (i.e. oldest) batch is immediately scheduled.  Similarly, when a
-// batch finishes processing, the limit is rechecked, and another batch may be
-// scheduled.  To avoid the need to carefully tune the limit for workload,
-// model type, platform, etc, it is dynamically adjusted in order to provide the
-// lowest latency.
 //
-// If use_in_flight_batches_implementation is set to false, the scheduler will
-// process the oldest batch at an adjustable rate, regardless of batch size.
-// The user can provide feedback to help set this rate to achieve some goal
-// (i.e. minimize overall latency, limit cpu usage, etc). The rate (or rather,
-// the corresponding period) is adjusted each time a batch is processed, using
-// an exponentially weighted moving average to smooth noisy feedback:
-// ewma_feedback = ((N - 1) * ewma_feedback + feedback()) / N
-// period *= (1 + K * emwa_feedback)
+// ASBS tries to keep the system busy by maintaining an adjustable number of
+// concurrently processed batches.  If a new batch is created, and the number of
+// in flight batches is below the target, the next (i.e. oldest) batch is
+// immediately scheduled.  Similarly, when a batch finishes processing, the
+// target is rechecked, and another batch may be scheduled.  To avoid the need
+// to carefully tune the target for workload, model type, platform, etc., it is
+// dynamically adjusted in order to provide the lowest average latency.
 //
 // Some potential use cases:
 // Hardware Accelerators (GPUs & TPUs) - If some phase of batch processing
 //   involves serial processing by a device, from a latency perspective it is
 //   desirable to keep the device evenly loaded, avoiding the need to wait for
 //   the device to process prior batches.
-//   feedback = num_pending_on_device() - desired_pending.
 // CPU utilization - If the batch processing is cpu dominated, you can reap
 //   latency gains when underutilized by increasing the processing rate, but
 //   back the rate off when the load increases to avoid overload.
-//   feedback = cpu_rate() - desired_cpu_rate.
 
 template <typename TaskType>
 class AdaptiveSharedBatchScheduler
@@ -101,13 +82,24 @@ class AdaptiveSharedBatchScheduler
   struct Options {
     // The name to use for the pool of batch threads.
     string thread_pool_name = {"batch_threads"};
-    // Number of batch processing threads; equivalently the maximum number of
-    // concurrently running batches.
+    // Number of batch processing threads - the maximum value of
+    // in_flight_batches_limit_.  It is recommended that this value be set by
+    // running the system under load, observing the learned value for
+    // in_flight_batches_limit_, and setting this maximum to ~ 2x the value.
+    // Under low load, in_flight_batches_limit_ has no substantial effect on
+    // latency and therefore undergoes a random walk.  Unreasonably large values
+    // for num_batch_threads allow for large in_flight_batches_limit_, which
+    // will harm latency for some time once load increases again.
     int64 num_batch_threads = port::NumSchedulableCPUs();
+    // Although batch selection is primarily based on age, this parameter
+    // specifies a preference for larger batches.  A full batch will be
+    // scheduled before an older, nearly empty batch as long as the age gap is
+    // less than full_batch_scheduling_boost_micros.  The optimal value for this
+    // parameter should be on the order of the batch processing latency, but
+    // must be chosen carefully, as too large a value will harm tail latency.
+    int64 full_batch_scheduling_boost_micros = 0;
     // The environment to use (typically only overridden by test code).
     Env* env = Env::Default();
-    // Which implementation to use (described in class comments above).
-    bool use_in_flight_batches_implementation = false;
     // Initial limit for number of batches being concurrently processed.
     // Non-integer values correspond to probabilistic limits - i.e. a value of
     // 3.2 results in an actual cap of 3 80% of the time, and 4 20% of the time.
@@ -116,28 +108,6 @@ class AdaptiveSharedBatchScheduler
     // numbers will give less noisy latency measurements, but will be less
     // responsive to changes in workload.
     int64 batches_to_average_over = 1000;
-
-    // TODO(kte): remove the rate based implementation and corresponding options
-    // below once testing confirms the superiority of the in flight batches
-    // implementation.
-    // Initial batch scheduling period in microseconds. Will be altered for
-    // non-zero rate_feedback.
-    double initial_scheduling_period_micros = 500;
-    // Minimum batch scheduling period in microseconds. Recommend setting this
-    // value greater than 0, otherwise it may take a while to recover from a
-    // sustained time of negative scheduling_period_feedback (which may occur
-    // under low load).
-    double min_scheduling_period_micros = 100;
-    // Maximum batch scheduling period in microseconds.
-    double max_scheduling_period_micros = 10000;
-    // Feedback function used to modify the scheduling period each time a batch
-    // is scheduled.  Should return values roughly O(1), with positive values
-    // resulting in an increased period.
-    std::function<double()> scheduling_period_feedback{[] { return 0.; }};
-    // To handle potentially noisy scheduling_period_feedback, the period is
-    // adjusted using an exponentially weighted moving average over the previous
-    // feedback_smoothing_batches batches.  Must be greater than 0.
-    int64 feedback_smoothing_batches = 10;
   };
 
   // Ownership is shared between the caller of Create() and any queues created
@@ -171,17 +141,11 @@ class AdaptiveSharedBatchScheduler
 
   explicit AdaptiveSharedBatchScheduler(const Options& options);
 
-  // Batch scheduling function which runs every scheduling_period_ microseconds.
-  // Only used when options_.use_in_flight_batches_implementation == false.
-  void ProcessOneBatch();
-
   // Tracks processing latency and adjusts in_flight_batches_limit to minimize.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   void CallbackWrapper(const internal::ASBSBatch<TaskType>* batch,
                        BatchProcessor callback);
 
   // Schedules batch if in_flight_batches_limit_ is not met.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   void MaybeScheduleNextBatch() EXCLUSIVE_LOCKS_REQUIRED(mu_);
 
   // Notifies scheduler of non-empty batch which is eligible for processing.
@@ -194,17 +158,9 @@ class AdaptiveSharedBatchScheduler
 
   const Options options_;
 
-  struct BatchCompare {
-    bool operator()(const internal::ASBSBatch<TaskType>* a,
-                    const internal::ASBSBatch<TaskType>* b);
-  };
-
   // Collection of batches added by AddBatch, ordered by age. Owned by scheduler
   // until they are released for processing.
-  std::priority_queue<const internal::ASBSBatch<TaskType>*,
-                      std::vector<const internal::ASBSBatch<TaskType>*>,
-                      BatchCompare>
-      batches_ GUARDED_BY(mu_);
+  std::vector<const internal::ASBSBatch<TaskType>*> batches_ GUARDED_BY(mu_);
 
   // Unowned queues and callbacks added by AddQueue.
   std::unordered_map<const internal::ASBSQueue<TaskType>*, BatchProcessor>
@@ -212,41 +168,22 @@ class AdaptiveSharedBatchScheduler
 
   mutex mu_;
 
-  // Responsible for running ProcessOneBatch. PeriodicFunction was used in order
-  // to check for deletion so that the thread can be shut down.
-  // Only used when options_.use_in_flight_batches_implementation == false.
-  std::unique_ptr<PeriodicFunction> scheduling_thread_;
-
   // Responsible for running the batch processing callbacks.
   std::unique_ptr<thread::ThreadPool> batch_thread_pool_;
 
-  // Time interval in microseconds between successive ProcessOneBatch calls.
-  // Only used when options_.use_in_flight_batches_implementation == false.
-  double scheduling_period_;
-
-  // Exponentially weighted moving average of
-  // options_.scheduling_period_feedback() evaluated in each ProcessOneBatch
-  // call.
-  // Only used when options_.use_in_flight_batches_implementation == false.
-  double ewma_feedback_ = 0;
-
   // Limit on number of batches which can be concurrently processed.
   // Non-integer values correspond to probabilistic limits - i.e. a value of 3.2
   // results in an actual cap of 3 80% of the time, and 4 20% of the time.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   double in_flight_batches_limit_ GUARDED_BY(mu_);
 
   // Number of batches currently being processed.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   int64 in_flight_batches_ GUARDED_BY(mu_) = 0;
 
   // RNG engine and distribution.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   std::default_random_engine rand_engine_;
   std::uniform_real_distribution<double> rand_double_;
 
   // Fields controlling the dynamic adjustment of in_flight_batches_limit_.
-  // Only used when options_.use_in_flight_batches_implementation == true.
   // Number of batches since the last in_flight_batches_limit_ adjustment.
   int64 batch_count_ GUARDED_BY(mu_) = 0;
   // Sum of processing latency for batches counted by batch_count_.
@@ -348,31 +285,10 @@ Status AdaptiveSharedBatchScheduler<TaskType>::Create(
     return errors::InvalidArgument("num_batch_threads must be positive; was ",
                                    options.num_batch_threads);
   }
-  if (options.min_scheduling_period_micros < 0) {
+  if (options.full_batch_scheduling_boost_micros < 0) {
     return errors::InvalidArgument(
-        "min_scheduling_period_micros must be >= 0; was ",
-        options.min_scheduling_period_micros);
-  }
-  if (options.min_scheduling_period_micros >
-      options.initial_scheduling_period_micros) {
-    return errors::InvalidArgument(
-        "initial_scheduling_period_micros (",
-        options.initial_scheduling_period_micros,
-        ") must be >= min_scheduling_period_micros (",
-        options.min_scheduling_period_micros, ")");
-  }
-  if (options.initial_scheduling_period_micros >
-      options.max_scheduling_period_micros) {
-    return errors::InvalidArgument(
-        "initial_scheduling_period_micros (",
-        options.initial_scheduling_period_micros,
-        ") must be <= max_scheduling_period_micros (",
-        options.max_scheduling_period_micros, ")");
-  }
-  if (options.feedback_smoothing_batches < 1) {
-    return errors::InvalidArgument(
-        "feedback_smoothing_batches must be positive; was ",
-        options.feedback_smoothing_batches);
+        "full_batch_scheduling_boost_micros can't be negative; was ",
+        options.full_batch_scheduling_boost_micros);
   }
   if (options.initial_in_flight_batches_limit > options.num_batch_threads) {
     return errors::InvalidArgument(
@@ -401,20 +317,12 @@ template <typename TaskType>
 AdaptiveSharedBatchScheduler<TaskType>::AdaptiveSharedBatchScheduler(
     const Options& options)
     : options_(options),
-      scheduling_period_(options.initial_scheduling_period_micros),
       in_flight_batches_limit_(options.initial_in_flight_batches_limit),
       rand_double_(0.0, 1.0) {
   std::random_device device;
   rand_engine_.seed(device());
-  PeriodicFunction::Options opts;
-  opts.thread_name_prefix = "scheduling_thread";
-  opts.env = GetEnv();
   batch_thread_pool_.reset(new thread::ThreadPool(
       GetEnv(), options.thread_pool_name, options.num_batch_threads));
-  if (!options.use_in_flight_batches_implementation) {
-    scheduling_thread_.reset(
-        new PeriodicFunction([this] { ProcessOneBatch(); }, 0, opts));
-  }
 }
 
 template <typename TaskType>
@@ -442,10 +350,8 @@ template <typename TaskType>
 void AdaptiveSharedBatchScheduler<TaskType>::AddBatch(
     const internal::ASBSBatch<TaskType>* batch) {
   mutex_lock l(mu_);
-  batches_.push(batch);
-  if (options_.use_in_flight_batches_implementation) {
-    MaybeScheduleNextBatch();
-  }
+  batches_.push_back(batch);
+  MaybeScheduleNextBatch();
 }
 
 template <typename TaskType>
@@ -462,10 +368,26 @@ void AdaptiveSharedBatchScheduler<TaskType>::MaybeScheduleNextBatch() {
   // Non-integer limit handled probabilistially.
   if (in_flight_batches_limit_ - in_flight_batches_ < 1 &&
       rand_double_(rand_engine_) >
-          (in_flight_batches_limit_ - in_flight_batches_))
+          in_flight_batches_limit_ - in_flight_batches_) {
     return;
-  const internal::ASBSBatch<TaskType>* batch = batches_.top();
-  batches_.pop();
+  }
+  auto best_it = batches_.begin();
+  double best_score =
+      (*best_it)->creation_time_micros() -
+      options_.full_batch_scheduling_boost_micros * (*best_it)->size() /
+          static_cast<double>((*best_it)->queue()->max_task_size());
+  for (auto it = batches_.begin() + 1; it != batches_.end(); it++) {
+    const double score =
+        (*it)->creation_time_micros() -
+        options_.full_batch_scheduling_boost_micros * (*it)->size() /
+            static_cast<double>((*it)->queue()->max_task_size());
+    if (score < best_score) {
+      best_score = score;
+      best_it = it;
+    }
+  }
+  const internal::ASBSBatch<TaskType>* batch = *best_it;
+  batches_.erase(best_it);
   // Queue may destroy itself after ReleaseBatch is called.
   batch->queue()->ReleaseBatch(batch);
   batch_thread_pool_->Schedule(
@@ -523,51 +445,6 @@ void AdaptiveSharedBatchScheduler<TaskType>::CallbackWrapper(
   MaybeScheduleNextBatch();
 }
 
-template <typename TaskType>
-void AdaptiveSharedBatchScheduler<TaskType>::ProcessOneBatch() {
-  static const double kFeedbackMultiplier = .001;
-  const internal::ASBSBatch<TaskType>* batch = nullptr;
-  BatchProcessor callback;
-  const int64 start_time_micros = GetEnv()->NowMicros();
-  {
-    mutex_lock l(mu_);
-    if (!batches_.empty()) {
-      batch = batches_.top();
-      batches_.pop();
-      callback = queues_and_callbacks_[batch->queue()];
-    }
-  }
-  if (batch != nullptr) {
-    double feedback = options_.scheduling_period_feedback();
-    const int64 N = options_.feedback_smoothing_batches;
-    ewma_feedback_ = ((N - 1) * ewma_feedback_ + feedback) / N;
-    scheduling_period_ *= (1 + kFeedbackMultiplier * ewma_feedback_);
-    if (scheduling_period_ < options_.min_scheduling_period_micros) {
-      scheduling_period_ = options_.min_scheduling_period_micros;
-    } else if (scheduling_period_ > options_.max_scheduling_period_micros) {
-      scheduling_period_ = options_.max_scheduling_period_micros;
-    }
-    // Queue may destroy itself after ReleaseBatch is called.
-    batch->queue()->ReleaseBatch(batch);
-    batch_thread_pool_->Schedule([callback, batch] {
-      callback(std::unique_ptr<Batch<TaskType>>(
-          const_cast<internal::ASBSBatch<TaskType>*>(batch)));
-    });
-  }
-  const int64 sleep_time =
-      scheduling_period_ - (GetEnv()->NowMicros() - start_time_micros);
-  if (sleep_time > 0) {
-    GetEnv()->SleepForMicroseconds(sleep_time);
-  }
-}
-
-template <typename TaskType>
-bool AdaptiveSharedBatchScheduler<TaskType>::BatchCompare::operator()(
-    const internal::ASBSBatch<TaskType>* a,
-    const internal::ASBSBatch<TaskType>* b) {
-  return a->creation_time_micros() > b->creation_time_micros();
-}
-
 // ---------------- ASBSQueue ----------------
 
 namespace internal {
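
Note on the selection rule introduced in MaybeScheduleNextBatch() above: each pending batch gets a score equal to its creation time minus full_batch_scheduling_boost_micros times its fullness (size divided by the queue's maximum), and the batch with the lowest score is released next. The following is a minimal standalone sketch of that rule, not the scheduler's actual code. PendingBatch, Score and PickNextBatch are hypothetical names chosen for illustration, and the struct is an assumed stand-in for internal::ASBSBatch that carries only the three values the score needs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for internal::ASBSBatch (illustration only).
struct PendingBatch {
  double creation_time_micros;  // time at which the batch was opened
  double size;                  // sum of task sizes currently in the batch
  double max_size;              // capacity reported by the owning queue
};

// Lower score is scheduled first: older batches and fuller batches win.
double Score(const PendingBatch& b, double boost_micros) {
  return b.creation_time_micros - boost_micros * b.size / b.max_size;
}

// Returns the index of the batch to release next via a linear scan.
// Precondition: `batches` is non-empty.
std::size_t PickNextBatch(const std::vector<PendingBatch>& batches,
                          double boost_micros) {
  assert(!batches.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < batches.size(); ++i) {
    if (Score(batches[i], boost_micros) < Score(batches[best], boost_micros)) {
      best = i;
    }
  }
  return best;
}
```

With a boost of 100 and pending batches (creation 0, 200 of 1000), (creation 20, 50 of 100) and (creation 65, 900 of 1000), the scores are -20, -30 and -25, so the half-full batch is picked first. With a boost of 0 the rule degrades to oldest-first. A linear scan costs O(n) per pick, which is the trade-off accepted when the priority queue is replaced by a plain vector.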
diff --git a/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler_test.cc b/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler_test.cc
index 8ae8ca02eca20b5d1184e6e588f013d59d10464a..1be0c1f5c65cad3245a9392e2a2db61cfb2dd904 100644
--- a/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler_test.cc
+++ b/tensorflow/core/kernels/batching_util/adaptive_shared_batch_scheduler_test.cc
@@ -64,59 +64,6 @@ std::unique_ptr<Thread> CreateFakeClockAdvancerThread(
       }));
 }
 
-TEST(AdaptiveSharedBatchSchedulerTest, Basic) {
-  for (const bool delete_scheduler_early : {false, true}) {
-    for (const bool delete_queue_1_early : {false, true}) {
-      int queue_0_tasks = 0;
-      auto queue_0_callback =
-          [&queue_0_tasks](std::unique_ptr<Batch<FakeTask>> batch) {
-            ASSERT_TRUE(batch->IsClosed());
-            EXPECT_GT(batch->num_tasks(), 0);
-            for (int i = 0; i < batch->num_tasks(); i++) {
-              queue_0_tasks += batch->task(i).size();
-            }
-          };
-      int queue_1_tasks = 0;
-      auto queue_1_callback =
-          [&queue_1_tasks](std::unique_ptr<Batch<FakeTask>> batch) {
-            ASSERT_TRUE(batch->IsClosed());
-            EXPECT_GT(batch->num_tasks(), 0);
-            for (int i = 0; i < batch->num_tasks(); i++) {
-              queue_1_tasks += batch->task(i).size();
-            }
-          };
-      {
-        std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
-        TF_ASSERT_OK(
-            AdaptiveSharedBatchScheduler<FakeTask>::Create({}, &scheduler));
-
-        // Create two queues.
-        std::unique_ptr<BatchScheduler<FakeTask>> queue_0;
-        TF_ASSERT_OK(scheduler->AddQueue({}, queue_0_callback, &queue_0));
-        std::unique_ptr<BatchScheduler<FakeTask>> queue_1;
-        TF_ASSERT_OK(scheduler->AddQueue({}, queue_1_callback, &queue_1));
-
-        if (delete_scheduler_early) {
-          // Delete our copy of the scheduler. The queues should keep it alive
-          // under the covers.
-          scheduler = nullptr;
-        }
-        // Submit tasks to the two queues, and (optionally) remove the queues.
-        TF_ASSERT_OK(ScheduleTask(1, queue_0.get()));
-        TF_ASSERT_OK(ScheduleTask(2, queue_1.get()));
-        TF_ASSERT_OK(ScheduleTask(3, queue_0.get()));
-        TF_ASSERT_OK(ScheduleTask(4, queue_1.get()));
-        if (delete_queue_1_early) {
-          queue_1 = nullptr;
-        }
-        TF_ASSERT_OK(ScheduleTask(5, queue_0.get()));
-      }
-      EXPECT_EQ(queue_0_tasks, 9);
-      EXPECT_EQ(queue_1_tasks, 6);
-    }
-  }
-}
-
 TEST(AdaptiveSharedBatchSchedulerTest, BadOptions) {
   using Scheduler = AdaptiveSharedBatchScheduler<FakeTask>;
   std::shared_ptr<Scheduler> scheduler;
@@ -124,24 +71,6 @@ TEST(AdaptiveSharedBatchSchedulerTest, BadOptions) {
   options.num_batch_threads = 0;
   EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
   options = Scheduler::Options();
-  options.min_scheduling_period_micros = 50;
-  options.max_scheduling_period_micros = 100;
-  options.initial_scheduling_period_micros = 1;
-  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
-  options = Scheduler::Options();
-  options.min_scheduling_period_micros = 50;
-  options.max_scheduling_period_micros = 100;
-  options.initial_scheduling_period_micros = 1000;
-  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
-  options = Scheduler::Options();
-  options.min_scheduling_period_micros = 100;
-  options.max_scheduling_period_micros = 50;
-  options.initial_scheduling_period_micros = 75;
-  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
-  options = Scheduler::Options();
-  options.feedback_smoothing_batches = 0;
-  EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
-  options = Scheduler::Options();
   options.initial_in_flight_batches_limit = 0.5;
   EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
   options = Scheduler::Options();
@@ -153,301 +82,8 @@ TEST(AdaptiveSharedBatchSchedulerTest, BadOptions) {
   EXPECT_FALSE(Scheduler::Create(options, &scheduler).ok());
 }
 
-TEST(AdaptiveSharedBatchSchedulerTest, ObeysQueueOptions) {
-  test_util::FakeClockEnv env(Env::Default());
-  Notification start_teardown, stop_teardown;
-  std::unique_ptr<Thread> teardown_thread =
-      CreateFakeClockAdvancerThread(&env, &start_teardown, &stop_teardown);
-  {
-    AdaptiveSharedBatchScheduler<FakeTask>::Options options;
-    options.initial_scheduling_period_micros = 1000;
-    options.env = &env;
-    std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
-    TF_ASSERT_OK(
-        AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
-    std::unique_ptr<BatchScheduler<FakeTask>> queue_0;
-    std::unique_ptr<BatchScheduler<FakeTask>> queue_1;
-    int queue_0_tasks = 0;
-    int queue_1_tasks = 0;
-    auto queue_0_callback = [&queue_0_tasks,
-                             &env](std::unique_ptr<Batch<FakeTask>> batch) {
-      ASSERT_TRUE(batch->IsClosed());
-      EXPECT_GT(batch->num_tasks(), 0);
-      for (int i = 0; i < batch->num_tasks(); i++) {
-        queue_0_tasks += batch->task(i).size();
-      }
-      env.SleepForMicroseconds(1);
-    };
-    auto queue_1_callback = [&queue_1_tasks,
-                             &env](std::unique_ptr<Batch<FakeTask>> batch) {
-      ASSERT_TRUE(batch->IsClosed());
-      EXPECT_GT(batch->num_tasks(), 0);
-      for (int i = 0; i < batch->num_tasks(); i++) {
-        queue_1_tasks += batch->task(i).size();
-      }
-      env.SleepForMicroseconds(1);
-    };
-    AdaptiveSharedBatchScheduler<FakeTask>::QueueOptions queue_options;
-    queue_options.max_batch_size = 10;
-    queue_options.max_enqueued_batches = 0;
-    // Queue must have max_enqueued_batchs > 1.
-    EXPECT_FALSE(
-        scheduler->AddQueue(queue_options, queue_0_callback, &queue_0).ok());
-    queue_options.max_enqueued_batches = 2;
-    TF_ASSERT_OK(
-        scheduler->AddQueue(queue_options, queue_0_callback, &queue_0));
-    EXPECT_EQ(10, queue_0->max_task_size());
-    queue_options.max_batch_size = 0;
-    // Queue must have max_batch_size > 0.
-    EXPECT_FALSE(
-        scheduler->AddQueue(queue_options, queue_1_callback, &queue_1).ok());
-    queue_options.max_batch_size = 2;
-    queue_options.max_enqueued_batches = 1;
-    TF_ASSERT_OK(
-        scheduler->AddQueue(queue_options, queue_1_callback, &queue_1));
-
-    // Wait for scheduling_thread to sleep.
-    env.BlockUntilThreadsAsleep(1);
-    // Task larger than max_batch_size shouldn't schedule.
-    EXPECT_FALSE(ScheduleTask(15, queue_0.get()).ok());
-    TF_ASSERT_OK(ScheduleTask(5, queue_0.get()));
-    TF_ASSERT_OK(ScheduleTask(5, queue_0.get()));
-    env.AdvanceByMicroseconds(1);
-
-    // Task larger than max_batch_size shouldn't schedule.
-    EXPECT_FALSE(ScheduleTask(3, queue_1.get()).ok());
-    TF_ASSERT_OK(ScheduleTask(1, queue_1.get()));
-    TF_ASSERT_OK(ScheduleTask(1, queue_1.get()));
-    env.AdvanceByMicroseconds(1);
-    // Exceeds max_enqueued_batches, shouldn't schedule.
-    EXPECT_FALSE(ScheduleTask(1, queue_1.get()).ok());
-
-    TF_ASSERT_OK(ScheduleTask(5, queue_0.get()));
-    // Exceeds max_enqueued_batches, shouldn't schedule.
-    EXPECT_FALSE(ScheduleTask(6, queue_0.get()).ok());
-    TF_ASSERT_OK(ScheduleTask(4, queue_0.get()));
-
-    // Batches should be processed in order from oldest to newest.
-    env.AdvanceByMicroseconds(1000);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(queue_0_tasks, 10);
-    EXPECT_EQ(queue_1_tasks, 0);
-
-    env.AdvanceByMicroseconds(1000);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(queue_0_tasks, 10);
-    EXPECT_EQ(queue_1_tasks, 2);
-
-    env.AdvanceByMicroseconds(1000);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(queue_0_tasks, 19);
-    EXPECT_EQ(queue_1_tasks, 2);
-    start_teardown.Notify();
-  }
-  stop_teardown.Notify();
-}
-
-TEST(AdaptiveSharedBatchSchedulerTest, RateFeedback) {
-  test_util::FakeClockEnv env(Env::Default());
-  Notification start_teardown, stop_teardown;
-  std::unique_ptr<Thread> teardown_thread =
-      CreateFakeClockAdvancerThread(&env, &start_teardown, &stop_teardown);
-  {
-    double feedback = 0;
-    AdaptiveSharedBatchScheduler<FakeTask>::Options options;
-    options.initial_scheduling_period_micros = 1000;
-    options.min_scheduling_period_micros = 200;
-    options.max_scheduling_period_micros = 2000;
-    options.env = &env;
-    options.scheduling_period_feedback = [&feedback] { return feedback; };
-    options.feedback_smoothing_batches = 1;
-    std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
-    TF_ASSERT_OK(
-        AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
-    std::unique_ptr<BatchScheduler<FakeTask>> queue;
-    int scheduled_items = 0;
-    auto queue_callback = [&scheduled_items,
-                           &env](std::unique_ptr<Batch<FakeTask>> batch) {
-      ASSERT_TRUE(batch->IsClosed());
-      EXPECT_GT(batch->num_tasks(), 0);
-      scheduled_items = 0;
-      for (int i = 0; i < batch->num_tasks(); i++) {
-        scheduled_items += batch->task(i).size();
-      }
-      env.SleepForMicroseconds(1);
-    };
-
-    TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
-
-    // Wait for scheduling_thread to sleep.
-    env.BlockUntilThreadsAsleep(1);
-    // Enqueue 6 batches.
-    for (int i = 0; i < 6; i++) {
-      TF_ASSERT_OK(ScheduleTask(900 + i, queue.get()));
-      env.AdvanceByMicroseconds(1);
-    }
-    feedback = -500;
-    env.AdvanceByMicroseconds(994);
-    env.BlockUntilThreadsAsleep(2);  // scheduling period = 500 usec.
-    EXPECT_EQ(scheduled_items, 900);
-    env.AdvanceByMicroseconds(500);
-    env.BlockUntilThreadsAsleep(2);  // scheduling period = 250 usec.
-    EXPECT_EQ(scheduled_items, 901);
-    feedback = 0;
-    env.AdvanceByMicroseconds(250);
-    env.BlockUntilThreadsAsleep(2);  // scheduling period = 250 usec.
-    EXPECT_EQ(scheduled_items, 902);
-    feedback = 10000;  // large feedback should hit max_scheduling_period.
-    env.AdvanceByMicroseconds(250);
-    env.BlockUntilThreadsAsleep(2);  // scheduling period = 2000 usec.
-    EXPECT_EQ(scheduled_items, 903);
-    feedback = -10000;  // large feedback should hit min_scheduling_period.
-    env.AdvanceByMicroseconds(1999);
-    // No callback scheduled, only scheduling thread sleeping.
-    env.BlockUntilThreadsAsleep(1);
-    EXPECT_EQ(scheduled_items, 903);
-    env.AdvanceByMicroseconds(1);
-    env.BlockUntilThreadsAsleep(2);  // scheduling period = 200 usec.
-    EXPECT_EQ(scheduled_items, 904);
-    env.AdvanceByMicroseconds(200);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(scheduled_items, 905);
-    start_teardown.Notify();
-  }
-  stop_teardown.Notify();
-}
-
-TEST(AdaptiveSharedBatchSchedulerTest, FeedbackSmoothing) {
-  test_util::FakeClockEnv env(Env::Default());
-  Notification start_teardown, stop_teardown;
-  std::unique_ptr<Thread> teardown_thread =
-      CreateFakeClockAdvancerThread(&env, &start_teardown, &stop_teardown);
-  {
-    double feedback = 0;
-    AdaptiveSharedBatchScheduler<FakeTask>::Options options;
-    options.initial_scheduling_period_micros = 1000;
-    options.env = &env;
-    options.scheduling_period_feedback = [&feedback] { return feedback; };
-    options.feedback_smoothing_batches = 3;
-    std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
-    TF_ASSERT_OK(
-        AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
-    std::unique_ptr<BatchScheduler<FakeTask>> queue;
-    int scheduled_items = 0;
-    auto queue_callback = [&scheduled_items,
-                           &env](std::unique_ptr<Batch<FakeTask>> batch) {
-      ASSERT_TRUE(batch->IsClosed());
-      EXPECT_GT(batch->num_tasks(), 0);
-      scheduled_items = 0;
-      for (int i = 0; i < batch->num_tasks(); i++) {
-        scheduled_items += batch->task(i).size();
-      }
-      env.SleepForMicroseconds(1);
-    };
-
-    TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
-
-    // Wait for scheduling_thread to sleep.
-    env.BlockUntilThreadsAsleep(1);
-    // Enqueue 4 batches.
-    for (int i = 0; i < 4; i++) {
-      TF_ASSERT_OK(ScheduleTask(900 + i, queue.get()));
-      env.AdvanceByMicroseconds(1);
-    }
-    feedback = -300;
-    env.AdvanceByMicroseconds(996);
-    env.BlockUntilThreadsAsleep(2);
-    // ewma_feedback = 100, scheduling_period = 900.
-    EXPECT_EQ(scheduled_items, 900);
-    env.AdvanceByMicroseconds(899);
-    // No callback scheduled, only scheduling thread sleeping.
-    env.BlockUntilThreadsAsleep(1);
-    EXPECT_EQ(scheduled_items, 900);
-    env.AdvanceByMicroseconds(1);
-    env.BlockUntilThreadsAsleep(2);
-    // ewma_feedback = 167, scheduling_period = 750.
-    EXPECT_EQ(scheduled_items, 901);
-    env.AdvanceByMicroseconds(749);
-    // No callback scheduled, only scheduling thread sleeping.
-    env.BlockUntilThreadsAsleep(1);
-    EXPECT_EQ(scheduled_items, 901);
-    feedback = 1000 / 3.;
-    env.AdvanceByMicroseconds(1);
-    env.BlockUntilThreadsAsleep(2);
-    // emwa_feedback = 0, scheduling_period = 750.
-    EXPECT_EQ(scheduled_items, 902);
-    env.AdvanceByMicroseconds(749);
-    // No callback scheduled, only scheduling thread sleeping.
-    env.BlockUntilThreadsAsleep(1);
-    EXPECT_EQ(scheduled_items, 902);
-    env.AdvanceByMicroseconds(1);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(scheduled_items, 903);
-    start_teardown.Notify();
-  }
-  stop_teardown.Notify();
-}
-
-TEST(AdaptiveSharedBatchSchedulerTest, QueueCapacityInfo) {
-  test_util::FakeClockEnv env(Env::Default());
-  Notification start_teardown, stop_teardown;
-  std::unique_ptr<Thread> teardown_thread =
-      CreateFakeClockAdvancerThread(&env, &start_teardown, &stop_teardown);
-  {
-    AdaptiveSharedBatchScheduler<FakeTask>::Options options;
-    options.initial_scheduling_period_micros = 1000;
-    options.env = &env;
-    std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
-    TF_ASSERT_OK(
-        AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
-    std::unique_ptr<BatchScheduler<FakeTask>> queue;
-    int scheduled_items = 0;
-    auto queue_callback = [&scheduled_items,
-                           &env](std::unique_ptr<Batch<FakeTask>> batch) {
-      ASSERT_TRUE(batch->IsClosed());
-      EXPECT_GT(batch->num_tasks(), 0);
-      scheduled_items = 0;
-      for (int i = 0; i < batch->num_tasks(); i++) {
-        scheduled_items += batch->task(i).size();
-      }
-      env.SleepForMicroseconds(1);
-    };
-    AdaptiveSharedBatchScheduler<FakeTask>::QueueOptions queue_options;
-    queue_options.max_batch_size = 10;
-    queue_options.max_enqueued_batches = 10;
-    TF_ASSERT_OK(scheduler->AddQueue(queue_options, queue_callback, &queue));
-
-    // Wait for scheduling_thread to sleep.
-    env.BlockUntilThreadsAsleep(1);
-    // Enqueue 3 tasks.
-    EXPECT_EQ(queue->NumEnqueuedTasks(), 0);
-    EXPECT_EQ(queue->SchedulingCapacity(), 100);
-    TF_ASSERT_OK(ScheduleTask(5, queue.get()));
-    EXPECT_EQ(queue->NumEnqueuedTasks(), 1);
-    EXPECT_EQ(queue->SchedulingCapacity(), 95);
-    env.AdvanceByMicroseconds(1);
-    TF_ASSERT_OK(ScheduleTask(6, queue.get()));
-    EXPECT_EQ(queue->NumEnqueuedTasks(), 2);
-    EXPECT_EQ(queue->SchedulingCapacity(), 84);
-    env.AdvanceByMicroseconds(1);
-    TF_ASSERT_OK(ScheduleTask(1, queue.get()));
-    EXPECT_EQ(queue->NumEnqueuedTasks(), 3);
-    EXPECT_EQ(queue->SchedulingCapacity(), 83);
-
-    env.AdvanceByMicroseconds(998);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(scheduled_items, 5);
-    env.AdvanceByMicroseconds(1000);
-    env.BlockUntilThreadsAsleep(2);
-    EXPECT_EQ(scheduled_items, 7);
-    start_teardown.Notify();
-  }
-  stop_teardown.Notify();
-}
-
-TEST(AdaptiveSharedBatchSchedulerTest, InFlightBatchesImplementation) {
+TEST(AdaptiveSharedBatchSchedulerTest, InFlightBatchesLimit) {
   AdaptiveSharedBatchScheduler<FakeTask>::Options options;
-  options.use_in_flight_batches_implementation = true;
   options.initial_in_flight_batches_limit = 2;
   options.batches_to_average_over = 1000;
   mutex mu;
@@ -476,7 +112,7 @@ TEST(AdaptiveSharedBatchSchedulerTest, InFlightBatchesImplementation) {
   std::unique_ptr<BatchScheduler<FakeTask>> queue;
   TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
 
-  // Enqueue 3 batches.
+  // Enqueue 3 tasks, should result in 3 batches.
   for (int i = 0; i < 3; i++) {
     TF_ASSERT_OK(ScheduleTask(100, queue.get()));
   }
@@ -490,7 +126,6 @@ TEST(AdaptiveSharedBatchSchedulerTest, InFlightBatchesLimitTuning) {
   {
     AdaptiveSharedBatchScheduler<FakeTask>::Options options;
     options.env = &env;
-    options.use_in_flight_batches_implementation = true;
     options.initial_in_flight_batches_limit = 2;
     options.batches_to_average_over = 1;
     auto queue_callback = [&env](std::unique_ptr<Batch<FakeTask>> batch) {
@@ -544,6 +179,196 @@ TEST(AdaptiveSharedBatchSchedulerTest, InFlightBatchesLimitTuning) {
   }
   stop_teardown.Notify();
 }
+
+TEST(AdaptiveSharedBatchSchedulerTest, FullBatchSchedulingBoostMicros) {
+  test_util::FakeClockEnv env(Env::Default());
+  Notification start_teardown, stop_teardown;
+  std::unique_ptr<Thread> teardown_thread =
+      CreateFakeClockAdvancerThread(&env, &start_teardown, &stop_teardown);
+  {
+    AdaptiveSharedBatchScheduler<FakeTask>::Options options;
+    options.env = &env;
+    options.initial_in_flight_batches_limit = 1;
+    options.batches_to_average_over = 1000;
+    options.full_batch_scheduling_boost_micros = 100;
+    mutex mu;
+    int processed_batches = 0;
+    Notification finish_processing;
+    auto queue_callback = [&mu, &processed_batches, &finish_processing](
+                              std::unique_ptr<Batch<FakeTask>> batch) {
+      ASSERT_TRUE(batch->IsClosed());
+      finish_processing.WaitForNotification();
+      mutex_lock l(mu);
+      processed_batches++;
+      switch (processed_batches) {
+        case 1:
+          EXPECT_EQ(100, batch->size());
+          break;
+        case 2:
+          EXPECT_EQ(50, batch->size());
+          break;
+        case 3:
+          EXPECT_EQ(900, batch->size());
+          break;
+        case 4:
+          EXPECT_EQ(200, batch->size());
+          break;
+        default:
+          EXPECT_TRUE(false) << "Should only have 4 batches";
+      }
+    };
+    std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
+    TF_ASSERT_OK(
+        AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
+    AdaptiveSharedBatchScheduler<FakeTask>::QueueOptions queue_options;
+    std::unique_ptr<BatchScheduler<FakeTask>> queue1;
+    std::unique_ptr<BatchScheduler<FakeTask>> queue2;
+    queue_options.max_batch_size = 1000;
+    TF_ASSERT_OK(scheduler->AddQueue(queue_options, queue_callback, &queue1));
+    queue_options.max_batch_size = 100;
+    TF_ASSERT_OK(scheduler->AddQueue(queue_options, queue_callback, &queue2));
+
+    // First batch immediately processed.
+    TF_ASSERT_OK(ScheduleTask(100, queue1.get()));
+
+    TF_ASSERT_OK(ScheduleTask(100, queue1.get()));
+    env.AdvanceByMicroseconds(10);
+    TF_ASSERT_OK(ScheduleTask(100, queue1.get()));
+    env.AdvanceByMicroseconds(10);
+
+    TF_ASSERT_OK(ScheduleTask(50, queue2.get()));
+    env.AdvanceByMicroseconds(45);
+
+    TF_ASSERT_OK(ScheduleTask(900, queue1.get()));
+
+    // Second batch - creation time: 0, fullness: 0.2, sched score: -20
+    // Third batch - creation time: 20, fullness: 0.5, sched score: -30
+    // Fourth batch - creation time: 65, fullness: 0.9, sched score: -25
+
+    finish_processing.Notify();
+    start_teardown.Notify();
+  }
+  stop_teardown.Notify();
+}
+
+TEST(AdaptiveSharedBatchSchedulerTest, DeleteQueue) {
+  AdaptiveSharedBatchScheduler<FakeTask>::Options options;
+  options.initial_in_flight_batches_limit = 1;
+  options.batches_to_average_over = 1000;
+  mutex mu;
+  int processed_batches = 0;
+  Notification finish_processing;
+  auto queue_callback = [&mu, &processed_batches, &finish_processing](
+                            std::unique_ptr<Batch<FakeTask>> batch) {
+    ASSERT_TRUE(batch->IsClosed());
+    EXPECT_GT(batch->num_tasks(), 0);
+    finish_processing.WaitForNotification();
+    mu.lock();
+    processed_batches++;
+    mu.unlock();
+  };
+
+  std::unique_ptr<Thread> queue_deleter;
+  std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
+  TF_ASSERT_OK(
+      AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
+  std::unique_ptr<BatchScheduler<FakeTask>> queue;
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
+
+  // Enqueue 2 tasks, should result in 2 batches.
+  for (int i = 0; i < 2; i++) {
+    TF_ASSERT_OK(ScheduleTask(100, queue.get()));
+  }
+  // Delete queue, should be kept alive until empty.
+  queue_deleter.reset(Env::Default()->StartThread(
+      {}, "QueueDeleterThread", [&queue, &mu, &processed_batches] {
+        queue.reset();
+        mutex_lock l(mu);
+        EXPECT_EQ(processed_batches, 2);
+      }));
+  // Give queue_deleter thread time to delete queue.
+  Env::Default()->SleepForMicroseconds(1000);
+  finish_processing.Notify();
+}
+
+TEST(AdaptiveSharedBatchSchedulerTest, DeleteScheduler) {
+  AdaptiveSharedBatchScheduler<FakeTask>::Options options;
+  options.initial_in_flight_batches_limit = 1;
+  options.batches_to_average_over = 1000;
+  mutex mu;
+  int processed_batches = 0;
+  Notification finish_processing;
+  auto queue_callback = [&mu, &processed_batches, &finish_processing](
+                            std::unique_ptr<Batch<FakeTask>> batch) {
+    ASSERT_TRUE(batch->IsClosed());
+    EXPECT_GT(batch->num_tasks(), 0);
+    finish_processing.WaitForNotification();
+    mu.lock();
+    processed_batches++;
+    mu.unlock();
+  };
+
+  std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
+  TF_ASSERT_OK(
+      AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
+  std::unique_ptr<BatchScheduler<FakeTask>> queue;
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
+
+  // Enqueue 2 tasks, should result in 2 batches.
+  for (int i = 0; i < 2; i++) {
+    TF_ASSERT_OK(ScheduleTask(100, queue.get()));
+  }
+  // Delete scheduler, should be kept alive until queues are empty.
+  scheduler.reset();
+  finish_processing.Notify();
+  while (true) {
+    mutex_lock l(mu);
+    if (processed_batches == 2) break;
+  }
+}
+
+TEST(AdaptiveSharedBatchSchedulerTest, QueueCapacityInfo) {
+  AdaptiveSharedBatchScheduler<FakeTask>::Options options;
+  options.initial_in_flight_batches_limit = 1;
+  options.batches_to_average_over = 1000;
+  mutex mu;
+  int processed_batches = 0;
+  Notification finish_processing;
+  auto queue_callback = [&mu, &processed_batches, &finish_processing](
+                            std::unique_ptr<Batch<FakeTask>> batch) {
+    ASSERT_TRUE(batch->IsClosed());
+    EXPECT_GT(batch->num_tasks(), 0);
+    mu.lock();
+    int batch_num = ++processed_batches;
+    mu.unlock();
+    if (batch_num == 1) {
+      finish_processing.WaitForNotification();
+    }
+  };
+  std::shared_ptr<AdaptiveSharedBatchScheduler<FakeTask>> scheduler;
+  TF_ASSERT_OK(
+      AdaptiveSharedBatchScheduler<FakeTask>::Create(options, &scheduler));
+  std::unique_ptr<BatchScheduler<FakeTask>> queue;
+  TF_ASSERT_OK(scheduler->AddQueue({}, queue_callback, &queue));
+
+  // Enqueue 2 tasks, should result in 2 batches.
+  for (int i = 0; i < 2; i++) {
+    TF_ASSERT_OK(ScheduleTask(100, queue.get()));
+  }
+  // First batch was immediately processed, no longer counts as enqueued.
+  EXPECT_EQ(queue->NumEnqueuedTasks(), 1);
+  EXPECT_EQ(queue->SchedulingCapacity(), 9 * 1000 + 900);
+  // Enqueue 2 more tasks, should fall in same batch.
+  TF_ASSERT_OK(ScheduleTask(100, queue.get()));
+  TF_ASSERT_OK(ScheduleTask(200, queue.get()));
+  EXPECT_EQ(queue->NumEnqueuedTasks(), 3);
+  EXPECT_EQ(queue->SchedulingCapacity(), 9 * 1000 + 600);
+  // Enqueue 1 more task, should create new batch.
+  TF_ASSERT_OK(ScheduleTask(700, queue.get()));
+  EXPECT_EQ(queue->NumEnqueuedTasks(), 4);
+  EXPECT_EQ(queue->SchedulingCapacity(), 8 * 1000 + 300);
+  finish_processing.Notify();
+}
 }  // namespace anonymous
 }  // namespace serving
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/check_numerics_op.cc b/tensorflow/core/kernels/check_numerics_op.cc
index 534527c6bdc9ab971cd4c6001dcef8ee59a13a8d..d3b67f4614e82d3efe851038f2ac8ba29a38521e 100644
--- a/tensorflow/core/kernels/check_numerics_op.cc
+++ b/tensorflow/core/kernels/check_numerics_op.cc
@@ -15,6 +15,8 @@ limitations under the License.
 
 // See docs in ../ops/array_ops.cc.
 
+#include "tensorflow/core/lib/bfloat16/bfloat16.h"
+
 #include <math.h>
 #include <algorithm>
 #include <numeric>
@@ -47,6 +49,8 @@ template <typename Device, typename T>
 class CheckNumericsOp;
 
 // Partial specialization for CPU
+// TODO(jeff,rmlarsen): We should make this variant be an AsyncOpKernel, as
+// was done for the GPU case below.
 template <typename T>
 class CheckNumericsOp<CPUDevice, T> : public OpKernel {
  public:
@@ -67,28 +71,32 @@ class CheckNumericsOp<CPUDevice, T> : public OpKernel {
     int fp_props =
         std::accumulate(data, data + size, 0, [](const int& x, const T& y) {
           int result = x;
-          if (Eigen::numext::isinf(y)) {
+          if (TF_PREDICT_TRUE(Eigen::numext::isfinite(y))) {
+            // Do nothing: common case
+          } else if (Eigen::numext::isinf(y)) {
             result |= kInfBit;
           } else if (Eigen::numext::isnan(y)) {
             result |= kNaNBit;
           }
           return result;
         });
-    string status;
-    if ((fp_props & kInfBit) && (fp_props & kNaNBit)) {
-      status = "Inf and NaN";
-    } else {
-      if (fp_props & kInfBit) {
-        status = "Inf";
+    if (fp_props != 0) {
+      string status;
+      if ((fp_props & kInfBit) && (fp_props & kNaNBit)) {
+        status = "Inf and NaN";
+      } else {
+        if (fp_props & kInfBit) {
+          status = "Inf";
+        }
+        if (fp_props & kNaNBit) {
+          status = "NaN";
+        }
       }
-      if (fp_props & kNaNBit) {
-        status = "NaN";
+      if (!status.empty()) {
+        context->SetStatus(errors::InvalidArgument(message_, " : Tensor had ",
+                                                   status, " values"));
       }
     }
-    if (!status.empty()) {
-      context->SetStatus(errors::InvalidArgument(message_, " : Tensor had ",
-                                                 status, " values"));
-    }
   }
 
  private:
@@ -213,6 +221,7 @@ class CheckNumericsOp<GPUDevice, T> : public AsyncOpKernel {
       Name("CheckNumerics").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
       CheckNumericsOp<CPUDevice, T>);
 TF_CALL_half(REGISTER_CPU_KERNEL);
+TF_CALL_bfloat16(REGISTER_CPU_KERNEL);
 TF_CALL_float(REGISTER_CPU_KERNEL);
 TF_CALL_double(REGISTER_CPU_KERNEL);
 
diff --git a/tensorflow/core/kernels/constant_op.cc b/tensorflow/core/kernels/constant_op.cc
index 4ab6fdbca1a3415937213d46fac3058097130f55..312c1a41d36245ae3ca5a09d2e76a430bc464953 100644
--- a/tensorflow/core/kernels/constant_op.cc
+++ b/tensorflow/core/kernels/constant_op.cc
@@ -102,9 +102,15 @@ REGISTER_KERNEL(GPU, float);
 REGISTER_KERNEL(GPU, double);
 REGISTER_KERNEL(GPU, uint8);
 REGISTER_KERNEL(GPU, int8);
+REGISTER_KERNEL(GPU, qint8);
 REGISTER_KERNEL(GPU, uint16);
 REGISTER_KERNEL(GPU, int16);
+REGISTER_KERNEL(GPU, qint16);
+REGISTER_KERNEL(GPU, quint16);
+REGISTER_KERNEL(GPU, uint32);
+REGISTER_KERNEL(GPU, qint32);
 REGISTER_KERNEL(GPU, int64);
+REGISTER_KERNEL(GPU, uint64);
 REGISTER_KERNEL(GPU, complex64);
 REGISTER_KERNEL(GPU, complex128);
 REGISTER_KERNEL(GPU, bool);
@@ -121,9 +127,15 @@ REGISTER_SYCL_KERNEL(SYCL, float);
 REGISTER_SYCL_KERNEL(SYCL, double);
 REGISTER_SYCL_KERNEL(SYCL, uint8);
 REGISTER_SYCL_KERNEL(SYCL, int8);
+REGISTER_SYCL_KERNEL(SYCL, qint8);
 REGISTER_SYCL_KERNEL(SYCL, uint16);
 REGISTER_SYCL_KERNEL(SYCL, int16);
+REGISTER_SYCL_KERNEL(SYCL, qint16);
+REGISTER_SYCL_KERNEL(SYCL, quint16);
+REGISTER_SYCL_KERNEL(SYCL, uint32);
+REGISTER_SYCL_KERNEL(SYCL, qint32);
 REGISTER_SYCL_KERNEL(SYCL, int64);
+REGISTER_SYCL_KERNEL(SYCL, uint64);
 REGISTER_SYCL_KERNEL(SYCL, bool);
 #undef REGISTER_SYCL_KERNEL
 #endif
diff --git a/tensorflow/core/kernels/conv_2d.h b/tensorflow/core/kernels/conv_2d.h
index 2142207b0d89a4b2f02c7f7b5d320c3b4b48462c..6949e5b5fd85f399473095f26314e9d58fa65464 100644
--- a/tensorflow/core/kernels/conv_2d.h
+++ b/tensorflow/core/kernels/conv_2d.h
@@ -54,10 +54,12 @@ struct InflatePadAndShuffle {
 template <typename Device, typename Input, typename Filter, typename Output>
 void SpatialConvolutionFunc(const Device& d, Output output, Input input,
                             Filter filter, int row_stride, int col_stride,
+                            int row_dilation, int col_dilation,
                             const Eigen::PaddingType& padding) {
   // Need to swap row/col when calling Eigen.
   output.device(d) =
-      Eigen::SpatialConvolution(input, filter, col_stride, row_stride, padding);
+      Eigen::SpatialConvolution(input, filter, col_stride, row_stride, padding,
+                                col_dilation, row_dilation);
 }
 
 template <typename Device, typename T>
@@ -65,9 +67,10 @@ struct SpatialConvolution {
   void operator()(const Device& d, typename TTypes<T, 4>::Tensor output,
                   typename TTypes<T, 4>::ConstTensor input,
                   typename TTypes<T, 4>::ConstTensor filter, int row_stride,
-                  int col_stride, const Eigen::PaddingType& padding) {
+                  int col_stride, int row_dilation, int col_dilation,
+                  const Eigen::PaddingType& padding) {
     SpatialConvolutionFunc(d, output, input, filter, row_stride, col_stride,
-                           padding);
+                           row_dilation, col_dilation, padding);
   }
 };
 
@@ -77,11 +80,12 @@ struct SpatialConvolution<Device, Eigen::half> {
                   typename TTypes<Eigen::half, 4>::Tensor output,
                   typename TTypes<Eigen::half, 4>::ConstTensor input,
                   typename TTypes<Eigen::half, 4>::ConstTensor filter,
-                  int row_stride, int col_stride,
-                  const Eigen::PaddingType& padding) {
+                  int row_stride, int col_stride, int row_dilation,
+                  int col_dilation, const Eigen::PaddingType& padding) {
     output.device(d) =
         Eigen::SpatialConvolution(input.cast<float>(), filter.cast<float>(),
-                                  col_stride, row_stride, padding)
+                                  col_stride, row_stride, padding, col_dilation,
+                                  row_dilation)
             .cast<Eigen::half>();
   }
 };
@@ -91,11 +95,13 @@ struct SpatialConvolutionBackwardInput {
   void operator()(const Device& d, typename TTypes<T, 4>::Tensor input_backward,
                   typename TTypes<T, 4>::ConstTensor kernel,
                   typename TTypes<T, 4>::ConstTensor output_backward,
-                  int row_stride, int col_stride) {
+                  int row_stride, int col_stride, int row_dilation,
+                  int col_dilation) {
     // Need to swap row/col when calling Eigen.
     input_backward.device(d) = Eigen::SpatialConvolutionBackwardInput(
         kernel, output_backward, input_backward.dimension(2),
-        input_backward.dimension(1), col_stride, row_stride);
+        input_backward.dimension(1), col_stride, row_stride, col_dilation,
+        row_dilation);
   }
 };
 
@@ -105,11 +111,13 @@ struct SpatialConvolutionBackwardFilter {
                   typename TTypes<T, 4>::Tensor kernel_backward,
                   typename TTypes<T, 4>::ConstTensor input,
                   typename TTypes<T, 4>::ConstTensor output_backward,
-                  int row_stride, int col_stride) {
+                  int row_stride, int col_stride, int row_dilation,
+                  int col_dilation) {
     // Need to swap row/col when calling Eigen.
     kernel_backward.device(d) = Eigen::SpatialConvolutionBackwardKernel(
         input, output_backward, kernel_backward.dimension(1),
-        kernel_backward.dimension(0), col_stride, row_stride);
+        kernel_backward.dimension(0), col_stride, row_stride, col_dilation,
+        row_dilation);
   }
 };
 
diff --git a/tensorflow/core/kernels/conv_grad_filter_ops.cc b/tensorflow/core/kernels/conv_grad_filter_ops.cc
index 512bcc6c01bf3eb4aed92f90eebb060abda8a7fc..e6ae59529107e529a9ccf7c790da0da62c90c199 100644
--- a/tensorflow/core/kernels/conv_grad_filter_ops.cc
+++ b/tensorflow/core/kernels/conv_grad_filter_ops.cc
@@ -31,7 +31,7 @@ limitations under the License.
 #include "tensorflow/core/framework/tensor_slice.h"
 #include "tensorflow/core/kernels/conv_2d.h"
 #include "tensorflow/core/kernels/fill_functor.h"
-#ifdef TENSORFLOW_USE_LIBXSMM
+#ifdef TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS
 #include "tensorflow/core/kernels/xsmm_conv2d.h"
 #endif
 #include "tensorflow/core/kernels/ops_util.h"
@@ -101,11 +101,12 @@ struct LaunchConv2DBackpropFilterOp<CPUDevice, T> {
     const CPUDevice& d = ctx->eigen_device<CPUDevice>();
     functor::SpatialConvolutionBackwardFilter<CPUDevice, T>()(
         d, filter_backprop->tensor<T, 4>(), input.tensor<T, 4>(),
-        out_backprop.tensor<T, 4>(), row_stride, col_stride);
+        out_backprop.tensor<T, 4>(), row_stride, col_stride,
+        /*row_dilation=*/1, /*col_dilation=*/1);
   }
 };
 
-#ifdef TENSORFLOW_USE_LIBXSMM
+#ifdef TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS
 template <typename Device, class T>
 struct LaunchXsmmBackwardFilter {
   bool operator()(OpKernelContext* context, const Device& d,
@@ -242,7 +243,8 @@ class Conv2DFastBackpropFilterOp : public OpKernel {
       return;
     }
 
-#if defined TENSORFLOW_USE_LIBXSMM && defined TENSORFLOW_USE_LIBXSMM_BACKWARD
+#if defined TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS && \
+    defined TENSORFLOW_USE_LIBXSMM_BACKWARD_CONVOLUTIONS
     int64 pad_top, pad_bottom;
     int64 pad_left, pad_right;
     OP_REQUIRES_OK(
@@ -370,7 +372,8 @@ class Conv2DCustomBackpropFilterOp : public OpKernel {
             dims.spatial_dims[1].input_size, dims.spatial_dims[1].filter_size,
             dims.spatial_dims[1].stride, padding_,
             &dims.spatial_dims[1].output_size, &pad_left, &pad_right));
-#if defined TENSORFLOW_USE_LIBXSMM && defined TENSORFLOW_USE_LIBXSMM_BACKWARD
+#if defined TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS && \
+    defined TENSORFLOW_USE_LIBXSMM_BACKWARD_CONVOLUTIONS
     if (pad_left == pad_right && pad_top == pad_bottom) {
       if (LaunchXsmmBackwardFilter<Device, T>()(
               context, context->eigen_device<Device>(), input.tensor<T, 4>(),
diff --git a/tensorflow/core/kernels/conv_grad_input_ops.cc b/tensorflow/core/kernels/conv_grad_input_ops.cc
index 0356ff4c0f4240ec806d1e337546cfce6771d92f..15c55e4d9903b3bbd53e1b6e1c95571ef7834015 100644
--- a/tensorflow/core/kernels/conv_grad_input_ops.cc
+++ b/tensorflow/core/kernels/conv_grad_input_ops.cc
@@ -30,7 +30,7 @@ limitations under the License.
 #include "tensorflow/core/framework/tensor_shape.h"
 #include "tensorflow/core/framework/tensor_slice.h"
 #include "tensorflow/core/kernels/conv_2d.h"
-#ifdef TENSORFLOW_USE_LIBXSMM
+#ifdef TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS
 #include "tensorflow/core/kernels/xsmm_conv2d.h"
 #endif
 #include "tensorflow/core/kernels/ops_util.h"
@@ -106,11 +106,12 @@ struct LaunchConv2DBackpropInputOp<CPUDevice, T> {
     const CPUDevice& d = ctx->eigen_device<CPUDevice>();
     functor::SpatialConvolutionBackwardInput<CPUDevice, T>()(
         d, in_backprop->tensor<T, 4>(), filter.tensor<T, 4>(),
-        out_backprop.tensor<T, 4>(), row_stride, col_stride);
+        out_backprop.tensor<T, 4>(), row_stride, col_stride,
+        /*row_dilation=*/1, /*col_dilation=*/1);
   }
 };
 
-#ifdef TENSORFLOW_USE_LIBXSMM
+#ifdef TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS
 template <typename Device, class T>
 struct LaunchXsmmBackwardInputConvolution {
   bool operator()(OpKernelContext* context, const Device& d,
@@ -245,7 +246,8 @@ class Conv2DFastBackpropInputOp : public OpKernel {
       return;
     }
 
-#if defined TENSORFLOW_USE_LIBXSMM && defined TENSORFLOW_USE_LIBXSMM_BACKWARD
+#if defined TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS && \
+    defined TENSORFLOW_USE_LIBXSMM_BACKWARD_CONVOLUTIONS
     int64 pad_top, pad_bottom;
     int64 pad_left, pad_right;
     OP_REQUIRES_OK(
@@ -362,7 +364,8 @@ class Conv2DCustomBackpropInputOp : public OpKernel {
 
 // TODO(andydavis) Consider moving code shared with
 // Conv2DCustomBackpropFilterOp into a shared helper function.
-#if defined TENSORFLOW_USE_LIBXSMM && defined TENSORFLOW_USE_LIBXSMM_BACKWARD
+#if defined TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS && \
+    defined TENSORFLOW_USE_LIBXSMM_BACKWARD_CONVOLUTIONS
     int64 pad_top, pad_bottom;
     int64 pad_left, pad_right;
     OP_REQUIRES_OK(
diff --git a/tensorflow/core/kernels/conv_ops.cc b/tensorflow/core/kernels/conv_ops.cc
index dbddaf3dc640dcf2cad8f6ba7dd00aaa33a30e0c..47f6907c04b4e48607e66b5c9601cd9030fa9001 100644
--- a/tensorflow/core/kernels/conv_ops.cc
+++ b/tensorflow/core/kernels/conv_ops.cc
@@ -32,7 +32,7 @@ limitations under the License.
 #include "tensorflow/core/kernels/conv_2d.h"
 #include "tensorflow/core/kernels/deep_conv2d.h"
 #include "tensorflow/core/kernels/ops_util.h"
-#ifdef TENSORFLOW_USE_LIBXSMM
+#ifdef TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS
 #include "tensorflow/core/kernels/xsmm_conv2d.h"
 #endif
 #include "tensorflow/core/lib/core/errors.h"
@@ -60,8 +60,8 @@ template <typename Device, typename T>
 struct LaunchGeneric {
   void operator()(OpKernelContext* ctx, const Tensor& input,
                   const Tensor& filter, int row_stride, int col_stride,
-                  const Padding& padding, Tensor* output,
-                  TensorFormat data_format) {
+                  int row_dilation, int col_dilation, const Padding& padding,
+                  Tensor* output, TensorFormat data_format) {
     CHECK(data_format == FORMAT_NHWC) << "Generic conv implementation only "
                                          "supports NHWC tensor format for now.";
     if (filter.dim_size(0) == 1 && filter.dim_size(1) == 1 && row_stride == 1 &&
@@ -86,7 +86,8 @@ struct LaunchGeneric {
           filter.shaped<T, 2>({filter.dim_size(2), filter.dim_size(3)}),
           dim_pair);
     } else if (filter.dim_size(0) == input.dim_size(1) &&
-               filter.dim_size(1) == input.dim_size(2) && padding == VALID) {
+               filter.dim_size(1) == input.dim_size(2) && row_dilation == 1 &&
+               col_dilation == 1 && padding == VALID) {
       // If the input data and filter have the same height/width,
       // the 2D convolution is reduced to matrix multiplication.
       const int k =  // Length of reduction dimension.
@@ -103,7 +104,7 @@ struct LaunchGeneric {
       functor::SpatialConvolution<Device, T>()(
           ctx->eigen_device<Device>(), output->tensor<T, 4>(),
           input.tensor<T, 4>(), filter.tensor<T, 4>(), row_stride, col_stride,
-          BrainPadding2EigenPadding(padding));
+          row_dilation, col_dilation, BrainPadding2EigenPadding(padding));
     }
   }
 };
@@ -122,15 +123,9 @@ struct LaunchConv2DOp<CPUDevice, T> {
                                 "NHWC tensor format for now."));
       return;
     }
-    // TODO(yangzihao): Add the CPU implementation of dilated conv 2D.
-    if (row_dilation > 1 || col_dilation > 1) {
-      ctx->SetStatus(
-          errors::Unimplemented("Generic conv implementation only supports "
-                                "dilated rate of 1 for now."));
-      return;
-    }
     LaunchGeneric<CPUDevice, T>()(ctx, input, filter, row_stride, col_stride,
-                                  padding, output, data_format);
+                                  row_dilation, col_dilation, padding, output,
+                                  data_format);
   }
 };
 
@@ -190,7 +185,7 @@ class LaunchDeepConvOp<CPUDevice, float> {
   }
 };
 
-#ifdef TENSORFLOW_USE_LIBXSMM
+#ifdef TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS
 template <typename Device, typename T>
 class LaunchXsmmConvOp {
  public:
@@ -406,7 +401,7 @@ class Conv2DOp : public BinaryOp<T> {
       return;
     }
 
-#ifdef TENSORFLOW_USE_LIBXSMM
+#ifdef TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS
     if (LaunchXsmmConvOp<Device, T>::Run(
             context, input, filter, batch, input_rows, input_cols, in_depth,
             filter_rows, filter_cols, pad_rows, pad_cols, out_rows, out_cols,
@@ -792,7 +787,8 @@ namespace functor {
       const GPUDevice& d, typename TTypes<T, 4>::Tensor output,              \
       typename TTypes<T, 4>::ConstTensor input,                              \
       typename TTypes<T, 4>::ConstTensor filter, int row_stride,             \
-      int col_stride, const Eigen::PaddingType& padding);                    \
+      int col_stride, int row_dilation, int col_dilation,                    \
+      const Eigen::PaddingType& padding);                                    \
   extern template struct SpatialConvolution<GPUDevice, T>;                   \
   template <>                                                                \
   void MatMulConvFunctor<GPUDevice, T>::operator()(                          \
diff --git a/tensorflow/core/kernels/critical_section.cc b/tensorflow/core/kernels/critical_section.cc
deleted file mode 100644
index 30a9abf4ee78cdb336e4c25c217239daf89bae11..0000000000000000000000000000000000000000
--- a/tensorflow/core/kernels/critical_section.cc
+++ /dev/null
@@ -1,246 +0,0 @@
-/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-==============================================================================*/
-
-#define EIGEN_USE_THREADS
-
-#include <deque>
-#include <utility>
-
-#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
-#include "tensorflow/core/framework/resource_mgr.h"
-#include "tensorflow/core/kernels/captured_function.h"
-#include "tensorflow/core/lib/core/errors.h"
-#include "tensorflow/core/platform/macros.h"
-#include "tensorflow/core/platform/mutex.h"
-#include "tensorflow/core/platform/types.h"
-
-namespace tensorflow {
-
-class CriticalSection : public ResourceBase {
- public:
-  explicit CriticalSection() : is_locked_(false) {}
-  ~CriticalSection() override {
-    // Wait for all closures to finish running.
-    mutex_lock lock(mu_);
-    while (!closures_.empty()) {
-      queue_empty_cv_.wait(lock);
-    }
-  }
-
- private:
-  friend class ExecuteInCriticalSectionOp;
-
-  void Acquire(std::function<void()> closure) {
-    std::function<void()> next;
-    {
-      mutex_lock ml(mu_);
-      if (is_locked_) {
-        closures_.push_back(std::move(closure));
-      } else {
-        // This branch is the common case.  Avoid the queue.
-        is_locked_ = true;
-        next = std::move(closure);
-      }
-    }
-    if (next) {
-      next();
-    }
-  }
-
-  void Release() {
-    std::function<void()> next;
-    {
-      mutex_lock ml(mu_);
-      CHECK(is_locked_);
-      if (!closures_.empty()) {
-        // if queue is not empty, start the next entry off the queue.
-        std::swap(next, closures_.front());
-        closures_.pop_front();
-      } else {
-        is_locked_ = false;
-        queue_empty_cv_.notify_all();
-      }
-    }
-    if (next) {
-      next();
-    }
-  }
-
-  string DebugString() override {
-    tf_shared_lock ml(mu_);
-    return strings::StrCat("CriticalSection(locked: ", is_locked_,
-                           " queue_size: ", closures_.size(), ")");
-  }
-
- private:
-  mutex mu_;
-  std::deque<std::function<void()>> closures_ GUARDED_BY(mu_);
-  bool is_locked_ GUARDED_BY(mu_);
-  condition_variable queue_empty_cv_ GUARDED_BY(mu_);
-};
-
-class ExecuteInCriticalSectionOp : public AsyncOpKernel {
- public:
-  explicit ExecuteInCriticalSectionOp(OpKernelConstruction* c)
-      : AsyncOpKernel(c) {
-    OP_REQUIRES_OK(c, c->GetAttr("f", &func_));
-  }
-
- public:
-  void ComputeAsync(OpKernelContext* c, DoneCallback done) override {
-    CriticalSection* critical_section = nullptr;
-    OP_REQUIRES_OK_ASYNC(c,
-                         LookupOrCreateResource<CriticalSection>(
-                             c, HandleFromInput(c, 0), &critical_section,
-                             [this, c](CriticalSection** ptr) {
-                               *ptr = new CriticalSection;
-                               return Status::OK();
-                             }),
-                         done);
-    // No need to Unref critical_section; the Closure below will take
-    // care of the Unref associated with this execution.
-
-    auto* execution = new Closure{std::move(done), c, critical_section, &func_};
-    execution->Start();
-  }
-
- private:
-  class Closure {
-   public:
-    AsyncOpKernel::DoneCallback done_;
-    OpKernelContext* ctx_;
-    CriticalSection* cs_;
-    FunctionLibraryRuntime::Handle handle_;
-    FunctionLibraryRuntime::Options opts_;
-    std::vector<Tensor> arguments_t_;
-    std::vector<Tensor> output_t_;
-    NameAttrList* func_;
-
-    explicit Closure(AsyncOpKernel::DoneCallback done, OpKernelContext* ctx,
-                     CriticalSection* critical_section, NameAttrList* func)
-        : done_(std::move(done)),
-          ctx_(ctx),
-          cs_(critical_section),
-          handle_(-1),
-          func_(func) {}
-
-    ~Closure();
-
-    void Start() {
-      // Perform ExecuteFunction isnide a separate thread to avoid
-      // having lightweight Functions be inlined in this thread.
-      // That inlining would in turn inline DoneAndDelete inside the
-      // same thread.  Since DoneAndDelete can call the next
-      // ExecuteFunction in the CriticalSection, this can cause a
-      // stack overflow.
-      cs_->Acquire(
-          [this]() { (*ctx_->runner())([this]() { ExecuteFunction(); }); });
-    }
-
-   private:
-    void ExecuteFunction();
-    void DoneAndDelete(const Status& status);
-  };
-
-  NameAttrList func_;
-};
-
-void ExecuteInCriticalSectionOp::Closure::ExecuteFunction() {
-  // Arguments to a Function are in the order:
-  //   concat(<formal arguments>, <captured arguments>)
-  OpInputList arguments;
-  Status s = ctx_->input_list("arguments", &arguments);
-  if (!s.ok()) {
-    DoneAndDelete(s);
-    return;
-  }
-
-  arguments_t_.reserve(arguments.size());
-  for (const Tensor& t : arguments) {
-    arguments_t_.push_back(t);
-  }
-
-  auto* function_library = ctx_->function_library();
-  s = function_library->Instantiate(func_->name(), AttrSlice(&func_->attr()),
-                                    &handle_);
-  if (!s.ok()) {
-    DoneAndDelete(s);
-    return;
-  }
-
-  opts_.step_id = CapturedFunction::generate_step_id();
-  auto* step_container =
-      new ScopedStepContainer(opts_.step_id, [this](const string& name) {
-        ctx_->resource_manager()->Cleanup(name).IgnoreError();
-      });
-  opts_.cancellation_manager = ctx_->cancellation_manager();
-  opts_.step_container = step_container;
-  opts_.runner = ctx_->runner();
-
-  function_library->Run(opts_, handle_, arguments_t_, &output_t_,
-                        [this](const Status& s) { DoneAndDelete(s); });
-}
-
-void ExecuteInCriticalSectionOp::Closure::DoneAndDelete(const Status& status) {
-  cs_->Release();
-
-  if (!status.ok()) {
-    ctx_->SetStatus(status);
-  } else {
-    OpOutputList output;
-    const Status s = ctx_->output_list("outputs", &output);
-    if (!s.ok()) {
-      ctx_->SetStatus(s);
-    } else if (output_t_.size() != output.size()) {
-      ctx_->SetStatus(errors::Internal(
-          "Could not set all outputs.  Expected output size is ", output.size(),
-          " but function set ", output_t_.size(), " output values."));
-    } else {
-      for (int i = 0; i < output_t_.size(); ++i) {
-        output.set(i, output_t_[i]);
-      }
-    }
-  }
-
-  delete opts_.step_container;
-  opts_.step_container = nullptr;
-  done_();
-  cs_->Unref();
-  delete this;
-}
-
-ExecuteInCriticalSectionOp::Closure::~Closure() {
-  CHECK(!opts_.step_container)
-      << "Initialized closure destroyed without calling Done";
-}
-
-REGISTER_KERNEL_BUILDER(Name("ExecuteInCriticalSection").Device(DEVICE_CPU),
-                        ExecuteInCriticalSectionOp);
-
-REGISTER_KERNEL_BUILDER(Name("CriticalSectionOp").Device(DEVICE_CPU),
-                        ResourceHandleOp<CriticalSection>);
-
-// TODO(ebrevdo): Re-enable once the cross-device function execution works.
-#if GOOGLE_CUDA
-REGISTER_KERNEL_BUILDER(Name("ExecuteInCriticalSection")
-                            .Device(DEVICE_GPU)
-                            .HostMemory("critical_section"),
-                        ExecuteInCriticalSectionOp);
-REGISTER_KERNEL_BUILDER(
-    Name("CriticalSectionOp").Device(DEVICE_GPU).HostMemory("resource"),
-    ResourceHandleOp<CriticalSection>);
-#endif  // GOOGLE_CUDA
-
-}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/cwise_op_add_1.cc b/tensorflow/core/kernels/cwise_op_add_1.cc
index bf32c8a54b34586e43d34cf8890ed37fe64b8c34..9e4ffe950c9a88d22a3bfc081adc4703fd9e6b65 100644
--- a/tensorflow/core/kernels/cwise_op_add_1.cc
+++ b/tensorflow/core/kernels/cwise_op_add_1.cc
@@ -16,10 +16,10 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER5(BinaryOp, CPU, "Add", functor::add, float, Eigen::half, double, int32,
-          int64);
-REGISTER5(BinaryOp, CPU, "AddV2", functor::add, float, Eigen::half, double,
-          int32, int64);
+REGISTER6(BinaryOp, CPU, "Add", functor::add, float, Eigen::half, double, int32,
+          int64, bfloat16);
+REGISTER6(BinaryOp, CPU, "AddV2", functor::add, float, Eigen::half, double,
+          int32, int64, bfloat16);
 
 #if GOOGLE_CUDA
 REGISTER3(BinaryOp, GPU, "Add", functor::add, float, Eigen::half, double);
diff --git a/tensorflow/core/kernels/cwise_op_isnan.cc b/tensorflow/core/kernels/cwise_op_isnan.cc
index aa180c247e7d01ef0f2898b4a50a71c3c3bc6941..707dc9e49ca1dd5b9872c2e5d3184e11eddd7a1f 100644
--- a/tensorflow/core/kernels/cwise_op_isnan.cc
+++ b/tensorflow/core/kernels/cwise_op_isnan.cc
@@ -16,7 +16,8 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER3(UnaryOp, CPU, "IsNan", functor::isnan, float, Eigen::half, double);
+REGISTER4(UnaryOp, CPU, "IsNan", functor::isnan, float, Eigen::half, double,
+          bfloat16);
 
 #if GOOGLE_CUDA
 REGISTER3(UnaryOp, GPU, "IsNan", functor::isnan, float, Eigen::half, double);
diff --git a/tensorflow/core/kernels/cwise_op_mul_1.cc b/tensorflow/core/kernels/cwise_op_mul_1.cc
index 0e8d2e37350dbbb942bd5ed6b16392b6288313fe..cff0407b83a4bafd27573325615322f92e594d46 100644
--- a/tensorflow/core/kernels/cwise_op_mul_1.cc
+++ b/tensorflow/core/kernels/cwise_op_mul_1.cc
@@ -17,8 +17,8 @@ limitations under the License.
 
 namespace tensorflow {
 
-REGISTER5(BinaryOp, CPU, "Mul", functor::mul, float, Eigen::half, double, uint8,
-          int32);
+REGISTER6(BinaryOp, CPU, "Mul", functor::mul, float, Eigen::half, double, uint8,
+          int32, bfloat16);
 #if defined(__ANDROID_TYPES_SLIM__)
 // We only register the first type when we have multi-argument calls in the
 // case where we're trying to reduce executable size, but it turns out that the
diff --git a/tensorflow/core/kernels/cwise_op_square.cc b/tensorflow/core/kernels/cwise_op_square.cc
index 7fc2f6bf08b2c825f471123e1ab58bd060f6070a..84f695ddc29d7f8d3afc12ea81515e80a8a75255 100644
--- a/tensorflow/core/kernels/cwise_op_square.cc
+++ b/tensorflow/core/kernels/cwise_op_square.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER7(UnaryOp, CPU, "Square", functor::square, float, Eigen::half, double,
-          int32, int64, complex64, complex128);
+REGISTER8(UnaryOp, CPU, "Square", functor::square, float, Eigen::half, double,
+          int32, int64, complex64, complex128, bfloat16);
 
 #if GOOGLE_CUDA
 REGISTER4(UnaryOp, GPU, "Square", functor::square, float, Eigen::half, double,
diff --git a/tensorflow/core/kernels/cwise_op_sub.cc b/tensorflow/core/kernels/cwise_op_sub.cc
index 025041946ac71f0e8f4724f9432d5e2901e348cc..eb27bddb78dfd8679b010f7f2cb67d2049a22a4b 100644
--- a/tensorflow/core/kernels/cwise_op_sub.cc
+++ b/tensorflow/core/kernels/cwise_op_sub.cc
@@ -16,8 +16,8 @@ limitations under the License.
 #include "tensorflow/core/kernels/cwise_ops_common.h"
 
 namespace tensorflow {
-REGISTER7(BinaryOp, CPU, "Sub", functor::sub, float, Eigen::half, double, int32,
-          int64, complex64, complex128);
+REGISTER8(BinaryOp, CPU, "Sub", functor::sub, float, Eigen::half, double, int32,
+          int64, bfloat16, complex64, complex128);
 #if !defined(__ANDROID_TYPES_SLIM__)
 // Sub op for int8, uint8, int16, uint16
 REGISTER4(BinaryOp, CPU, "Sub", functor::sub, int8, uint8, int16, uint16);
diff --git a/tensorflow/core/kernels/cwise_ops_common.h b/tensorflow/core/kernels/cwise_ops_common.h
index 8295fa939ee1aabf78a7d7b7f4677d851b407573..e32eccf547e07b71678abf0e75ac20973ecbf380 100644
--- a/tensorflow/core/kernels/cwise_ops_common.h
+++ b/tensorflow/core/kernels/cwise_ops_common.h
@@ -20,6 +20,8 @@ limitations under the License.
 
 #define EIGEN_USE_THREADS
 
+#include "tensorflow/core/lib/bfloat16/bfloat16.h"
+
 #ifdef TENSORFLOW_USE_SYCL
 #include "tensorflow/core/kernels/cwise_ops_sycl_common.h"
 #endif
diff --git a/tensorflow/core/kernels/cwise_ops_test.cc b/tensorflow/core/kernels/cwise_ops_test.cc
index 39f497e71612fc08a085e410edae73669fc9993a..696d5840e8ce39c1bf210b54b9f28ae83cf232c7 100644
--- a/tensorflow/core/kernels/cwise_ops_test.cc
+++ b/tensorflow/core/kernels/cwise_ops_test.cc
@@ -231,14 +231,22 @@ BM_BIAS_ADD_GRAD_ALL(gpu, NHWC, half, DT_HALF);
 
 Graph* BcastAdd(int rows, int cols, int dim) {
   Graph* g = new Graph(OpRegistry::Global());
-  Tensor lhs(DT_FLOAT, TensorShape({rows, cols}));
-  lhs.flat<float>().setRandom();
-  TensorShape rhs_shape;
-  if (dim == 0) {
+  TensorShape lhs_shape, rhs_shape;
+  if (dim == 0) {  // row
+    lhs_shape = TensorShape({rows, cols});
     rhs_shape = TensorShape({rows, 1});
-  } else {
+  } else if (dim == 1) {  // col
+    lhs_shape = TensorShape({rows, cols});
     rhs_shape = TensorShape({cols});
+  } else if (dim == 2) {  // cross_rc
+    lhs_shape = TensorShape({rows, 1});
+    rhs_shape = TensorShape({1, cols});
+  } else {  // cross_cr
+    lhs_shape = TensorShape({1, cols});
+    rhs_shape = TensorShape({rows, 1});
   }
+  Tensor lhs(DT_FLOAT, lhs_shape);
+  lhs.flat<float>().setRandom();
   Tensor rhs(DT_FLOAT, rhs_shape);
   rhs.flat<float>().setRandom();
   test::graph::Binary(g, "Add", test::graph::Constant(g, lhs),
@@ -298,5 +306,59 @@ BM_BCAST_ADD_COL_ALL(sycl);
 #undef BM_BCAST_ADD_COL_ALL
 #undef BM_BCAST_ADD_COL
 
+#define BM_BCAST_ADD_CROSS_RC(DEVICE, R, C)                            \
+  void BM_##DEVICE##_BcastAddCrossRC_R##R##_C##C(int iters, int arg) { \
+    const int rows = RowsFromArg(arg);                                 \
+    const int cols = ColsFromArg(arg);                                 \
+    const int64 tot = static_cast<int64>(iters) * rows * cols;         \
+    testing::ItemsProcessed(tot);                                      \
+    testing::BytesProcessed(tot * sizeof(float));                      \
+    test::Benchmark(#DEVICE, BcastAdd(rows, cols, 2)).Run(iters);      \
+  }                                                                    \
+  BENCHMARK(BM_##DEVICE##_BcastAddCrossRC_R##R##_C##C)                 \
+      ->Arg(RowsAndColsArg(R, C));
+
+#define BM_BCAST_ADD_CROSS_RC_ALL(DEVICE)   \
+  BM_BCAST_ADD_CROSS_RC(DEVICE, 512, 2048); \
+  BM_BCAST_ADD_CROSS_RC(DEVICE, 512, 4096); \
+  BM_BCAST_ADD_CROSS_RC(DEVICE, 2048, 512); \
+  BM_BCAST_ADD_CROSS_RC(DEVICE, 4096, 512);
+BM_BCAST_ADD_CROSS_RC_ALL(cpu);
+#if GOOGLE_CUDA
+BM_BCAST_ADD_CROSS_RC_ALL(gpu);
+#endif  // GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+BM_BCAST_ADD_CROSS_RC_ALL(sycl);
+#endif  // TENSORFLOW_USE_SYCL
+#undef BM_BCAST_ADD_CROSS_RC_ALL
+#undef BM_BCAST_ADD_CROSS_RC
+
+#define BM_BCAST_ADD_CROSS_CR(DEVICE, R, C)                            \
+  void BM_##DEVICE##_BcastAddCrossCR_R##R##_C##C(int iters, int arg) { \
+    const int rows = RowsFromArg(arg);                                 \
+    const int cols = ColsFromArg(arg);                                 \
+    const int64 tot = static_cast<int64>(iters) * rows * cols;         \
+    testing::ItemsProcessed(tot);                                      \
+    testing::BytesProcessed(tot * sizeof(float));                      \
+    test::Benchmark(#DEVICE, BcastAdd(rows, cols, 3)).Run(iters);      \
+  }                                                                    \
+  BENCHMARK(BM_##DEVICE##_BcastAddCrossCR_R##R##_C##C)                 \
+      ->Arg(RowsAndColsArg(R, C));
+
+#define BM_BCAST_ADD_CROSS_CR_ALL(DEVICE)   \
+  BM_BCAST_ADD_CROSS_CR(DEVICE, 512, 2048); \
+  BM_BCAST_ADD_CROSS_CR(DEVICE, 512, 4096); \
+  BM_BCAST_ADD_CROSS_CR(DEVICE, 2048, 512); \
+  BM_BCAST_ADD_CROSS_CR(DEVICE, 4096, 512);
+BM_BCAST_ADD_CROSS_CR_ALL(cpu);
+#if GOOGLE_CUDA
+BM_BCAST_ADD_CROSS_CR_ALL(gpu);
+#endif  // GOOGLE_CUDA
+#ifdef TENSORFLOW_USE_SYCL
+BM_BCAST_ADD_CROSS_CR_ALL(sycl);
+#endif  // TENSORFLOW_USE_SYCL
+#undef BM_BCAST_ADD_CROSS_CR_ALL
+#undef BM_BCAST_ADD_CROSS_CR
+
 }  // namespace
 }  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/BUILD b/tensorflow/core/kernels/data/BUILD
index 1e3b0c231f35c12d2e9e23d8d503b3a7492ab676..253399c1e4ec7fe8edeeeee161ef3413d1dbea09 100644
--- a/tensorflow/core/kernels/data/BUILD
+++ b/tensorflow/core/kernels/data/BUILD
@@ -209,6 +209,19 @@ tf_kernel_library(
     ],
 )
 
+tf_kernel_library(
+    name = "generator_dataset_op",
+    srcs = ["generator_dataset_op.cc"],
+    deps = [
+        ":captured_function",
+        "//tensorflow/core:core_cpu_internal",
+        "//tensorflow/core:dataset_ops_op_lib",
+        "//tensorflow/core:framework",
+        "//tensorflow/core:lib",
+        "//tensorflow/core:lib_internal",
+    ],
+)
+
 tf_kernel_library(
     name = "scan_dataset_op",
     srcs = ["scan_dataset_op.cc"],
@@ -498,18 +511,6 @@ tf_kernel_library(
     ],
 )
 
-tf_kernel_library(
-    name = "unique_dataset_op",
-    srcs = ["unique_dataset_op.cc"],
-    deps = [
-        ":dataset",
-        "//tensorflow/core:dataset_ops_op_lib",
-        "//tensorflow/core:framework",
-        "//tensorflow/core:lib",
-        "//tensorflow/core:lib_internal",
-    ],
-)
-
 tf_kernel_library(
     name = "dataset_ops",
     deps = [
@@ -519,6 +520,7 @@ tf_kernel_library(
         ":dense_to_sparse_batch_dataset_op",
         ":filter_dataset_op",
         ":flat_map_dataset_op",
+        ":generator_dataset_op",
         ":group_by_window_dataset_op",
         ":interleave_dataset_op",
         ":iterator_ops",
@@ -543,7 +545,6 @@ tf_kernel_library(
         ":tensor_dataset_op",
         ":tensor_queue_dataset_op",
         ":tensor_slice_dataset_op",
-        ":unique_dataset_op",
         ":zip_dataset_op",
     ],
 )
diff --git a/tensorflow/core/kernels/data/captured_function.cc b/tensorflow/core/kernels/data/captured_function.cc
index c4aa9ec26545a2792c1e741af69f61a292fcc216..dd61b7daee153bf2f3be3c72dd5c8e6032d0080b 100644
--- a/tensorflow/core/kernels/data/captured_function.cc
+++ b/tensorflow/core/kernels/data/captured_function.cc
@@ -256,6 +256,62 @@ Status CapturedFunction::RunWithBorrowedArgs(IteratorContext* ctx,
   return frame.ConsumeRetvals(rets);
 }
 
+Status CapturedFunction::Instantiate(IteratorContext* ctx) {
+  FunctionLibraryRuntime::Handle unused_handle;
+  TF_RETURN_IF_ERROR(MaybeInstantiate(ctx, &unused_handle));
+  mutex_lock l(mu_);
+  if (captured_runner_ == nullptr) {
+    captured_runner_ = *ctx->runner();
+  }
+  return Status::OK();
+}
+
+Status CapturedFunction::RunInstantiated(const std::vector<Tensor>& args,
+                                         std::vector<Tensor>* rets) {
+  FunctionLibraryRuntime* lib;
+  FunctionLibraryRuntime::Handle handle;
+  std::function<void(std::function<void()>)>* runner;
+  {
+    tf_shared_lock l(mu_);
+    if (lib_ == nullptr) {
+      return errors::FailedPrecondition(
+          "`CapturedFunction::Instantiate()` must be called before a call to "
+          "`CapturedFunction::RunInstantiated()`.");
+    }
+    lib = lib_;
+    handle = f_handle_;
+    runner = &captured_runner_;
+  }
+
+  FunctionLibraryRuntime::Options f_opts;
+  f_opts.step_id = CapturedFunction::generate_step_id();
+  ScopedStepContainer step_container(f_opts.step_id, [lib](const string& name) {
+    lib->device()->resource_manager()->Cleanup(name).IgnoreError();
+  });
+  f_opts.step_container = &step_container;
+  f_opts.runner = runner;
+  // TODO(mrry): Add cancellation manager support to IteratorContext
+  // so that we can cancel running map functions. The local
+  // cancellation manager here is created so that we can run kernels
+  // (such as queue kernels) that depend on the non-nullness of
+  // `OpKernelContext::cancellation_manager()`, but additional effort
+  // will be required to plumb it through the `IteratorContext`.
+  CancellationManager c_mgr;
+  f_opts.cancellation_manager = &c_mgr;
+
+  BorrowedArgsCallFrame frame(args, &captured_inputs_, ret_types_);
+  Notification n;
+  Status s;
+
+  lib->Run(f_opts, handle, &frame, [&n, &s](Status func_status) {
+    s.Update(func_status);
+    n.Notify();
+  });
+  n.WaitForNotification();
+  TF_RETURN_IF_ERROR(s);
+  return frame.ConsumeRetvals(rets);
+}
+
 void CapturedFunction::RunAsync(IteratorContext* ctx,
                                 std::vector<Tensor>&& args,
                                 std::vector<Tensor>* rets,
diff --git a/tensorflow/core/kernels/data/captured_function.h b/tensorflow/core/kernels/data/captured_function.h
index 32d2bc3aaebf440584934231a8555199026074ae..490f5cd1e3b6676decc6646df9dfb722524d58e8 100644
--- a/tensorflow/core/kernels/data/captured_function.h
+++ b/tensorflow/core/kernels/data/captured_function.h
@@ -64,6 +64,21 @@ class CapturedFunction {
                              const std::vector<Tensor>& args,
                              std::vector<Tensor>* rets);
 
+  // Explicitly instantiates this function for use in the given
+  // context. This method and the context-less `RunInstantiated()`
+  // below are useful for calling a captured function in cases where
+  // an `IteratorContext*` is not available (such as in a
+  // destructor).
+  Status Instantiate(IteratorContext* ctx);
+
+  // Synchronously runs the captured function on the given `args`, and stores
+  // the results in `*rets`. Prefer to use `Run()` or `RunAsync()` when
+  // possible.
+  //
+  // REQUIRES: `this->Instantiate()` must have been called before this method.
+  Status RunInstantiated(const std::vector<Tensor>& args,
+                         std::vector<Tensor>* rets);
+
   // Asynchronously runs the captured function on the given `args`, stores
   // the results in `*rets`, and calls the given `done` callback when the
   // function returns. This method takes ownership of the tensors in `args`,
@@ -99,6 +114,7 @@ class CapturedFunction {
   FunctionLibraryRuntime::Handle f_handle_ GUARDED_BY(mu_);
   const std::vector<Tensor> captured_inputs_;
   DataTypeSlice ret_types_;
+  std::function<void(std::function<void()>)> captured_runner_ = nullptr;
 
   TF_DISALLOW_COPY_AND_ASSIGN(CapturedFunction);
 };
diff --git a/tensorflow/core/kernels/data/generator_dataset_op.cc b/tensorflow/core/kernels/data/generator_dataset_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..3f1e441b91d0102b112523a46ac75ce415eacdd7
--- /dev/null
+++ b/tensorflow/core/kernels/data/generator_dataset_op.cc
@@ -0,0 +1,201 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+#include <iterator>
+#include <vector>
+
+#include "tensorflow/core/framework/dataset.h"
+#include "tensorflow/core/framework/partial_tensor_shape.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/kernels/data/captured_function.h"
+#include "tensorflow/core/lib/random/random.h"
+
+namespace tensorflow {
+
+namespace {
+
+// See documentation in ../ops/dataset_ops.cc for a high-level
+// description of the following op.
+
+class GeneratorDatasetOp : public DatasetOpKernel {
+ public:
+  explicit GeneratorDatasetOp(OpKernelConstruction* ctx)
+      : DatasetOpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("init_func", &init_func_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("next_func", &next_func_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("finalize_func", &finalize_func_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_types", &output_types_));
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("output_shapes", &output_shapes_));
+  }
+
+  void MakeDataset(OpKernelContext* ctx, DatasetBase** output) override {
+    OpInputList init_func_other_args_input;
+    OP_REQUIRES_OK(ctx, ctx->input_list("init_func_other_args",
+                                        &init_func_other_args_input));
+    std::vector<Tensor> init_func_other_args;
+    init_func_other_args.reserve(init_func_other_args_input.size());
+    for (const Tensor& t : init_func_other_args_input) {
+      init_func_other_args.push_back(t);
+    }
+    std::unique_ptr<CapturedFunction> init_func;
+    OP_REQUIRES_OK(
+        ctx, CapturedFunction::Create(
+                 init_func_, std::move(init_func_other_args), &init_func));
+
+    OpInputList next_func_other_args_input;
+    OP_REQUIRES_OK(ctx, ctx->input_list("next_func_other_args",
+                                        &next_func_other_args_input));
+    std::vector<Tensor> next_func_other_args;
+    next_func_other_args.reserve(next_func_other_args_input.size());
+    for (const Tensor& t : next_func_other_args_input) {
+      next_func_other_args.push_back(t);
+    }
+    std::unique_ptr<CapturedFunction> next_func;
+    OP_REQUIRES_OK(
+        ctx, CapturedFunction::Create(
+                 next_func_, std::move(next_func_other_args), &next_func));
+
+    OpInputList finalize_func_other_args_input;
+    OP_REQUIRES_OK(ctx, ctx->input_list("finalize_func_other_args",
+                                        &finalize_func_other_args_input));
+    std::vector<Tensor> finalize_func_other_args;
+    finalize_func_other_args.reserve(finalize_func_other_args_input.size());
+    for (const Tensor& t : finalize_func_other_args_input) {
+      finalize_func_other_args.push_back(t);
+    }
+    std::unique_ptr<CapturedFunction> finalize_func;
+    OP_REQUIRES_OK(ctx, CapturedFunction::Create(
+                            finalize_func_, std::move(finalize_func_other_args),
+                            &finalize_func));
+
+    *output =
+        new Dataset(ctx, std::move(init_func), std::move(next_func),
+                    std::move(finalize_func), output_types_, output_shapes_);
+  }
+
+ private:
+  class Dataset : public GraphDatasetBase {
+   public:
+    Dataset(OpKernelContext* ctx, std::unique_ptr<CapturedFunction> init_func,
+            std::unique_ptr<CapturedFunction> next_func,
+            std::unique_ptr<CapturedFunction> finalize_func,
+            const DataTypeVector& output_types,
+            const std::vector<PartialTensorShape>& output_shapes)
+        : GraphDatasetBase(ctx),
+          init_func_(std::move(init_func)),
+          next_func_(std::move(next_func)),
+          finalize_func_(std::move(finalize_func)),
+          output_types_(output_types),
+          output_shapes_(output_shapes) {}
+
+    std::unique_ptr<IteratorBase> MakeIterator(
+        const string& prefix) const override {
+      return std::unique_ptr<IteratorBase>(
+          new Iterator({this, strings::StrCat(prefix, "::Generator")}));
+    }
+
+    const DataTypeVector& output_dtypes() const override {
+      return output_types_;
+    }
+    const std::vector<PartialTensorShape>& output_shapes() const override {
+      return output_shapes_;
+    }
+
+    string DebugString() override { return "GeneratorDatasetOp::Dataset"; }
+
+   private:
+    class Iterator : public DatasetIterator<Dataset> {
+     public:
+      explicit Iterator(const Params& params)
+          : DatasetIterator<Dataset>(params) {}
+
+      ~Iterator() override {
+        if (initialized_ && !finalized_) {
+          std::vector<Tensor> ignored;
+          Status s =
+              dataset()->finalize_func_->RunInstantiated(state_, &ignored);
+          if (!s.ok()) {
+            LOG(WARNING)
+                << "Error occurred when finalizing GeneratorDataset iterator: "
+                << s;
+          }
+        }
+      }
+
+      Status GetNextInternal(IteratorContext* ctx,
+                             std::vector<Tensor>* out_tensors,
+                             bool* end_of_sequence) override {
+        mutex_lock l(mu_);
+
+        if (!initialized_) {
+          TF_RETURN_IF_ERROR(
+              dataset()->init_func_->RunWithBorrowedArgs(ctx, {}, &state_));
+          // Explicitly instantiate the finalize function here so that
+          // we can invoke it in the destructor.
+          TF_RETURN_IF_ERROR(dataset()->finalize_func_->Instantiate(ctx));
+          initialized_ = true;
+        }
+
+        if (finalized_) {
+          *end_of_sequence = true;
+          return Status::OK();
+        }
+
+        Status s = dataset()->next_func_->RunWithBorrowedArgs(ctx, state_,
+                                                              out_tensors);
+        if (s.ok()) {
+          *end_of_sequence = false;
+        } else if (errors::IsOutOfRange(s)) {
+          // `next_func` may deliberately raise `errors::OutOfRange`
+          // to indicate that we should terminate the iteration.
+          s = Status::OK();
+          *end_of_sequence = true;
+
+          // NOTE(mrry): We ignore any tensors returned by the
+          // finalize function.
+          std::vector<Tensor> ignored;
+          TF_RETURN_IF_ERROR(
+              dataset()->finalize_func_->RunInstantiated(state_, &ignored));
+          finalized_ = true;
+        }
+        return s;
+      }
+
+     private:
+      mutex mu_;
+      bool initialized_ GUARDED_BY(mu_) = false;
+      bool finalized_ GUARDED_BY(mu_) = false;
+      std::vector<Tensor> state_ GUARDED_BY(mu_);
+    };
+
+    const std::unique_ptr<CapturedFunction> init_func_;
+    const std::unique_ptr<CapturedFunction> next_func_;
+    const std::unique_ptr<CapturedFunction> finalize_func_;
+    const DataTypeVector output_types_;
+    const std::vector<PartialTensorShape> output_shapes_;
+  };
+
+  DataTypeVector output_types_;
+  std::vector<PartialTensorShape> output_shapes_;
+  NameAttrList init_func_;
+  NameAttrList next_func_;
+  NameAttrList finalize_func_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("GeneratorDataset").Device(DEVICE_CPU),
+                        GeneratorDatasetOp);
+
+}  // namespace
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/kernels/data/parallel_map_dataset_op.cc b/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
index bc4426a9fdbab971a4e49d57ffcea6896fc037a7..33053b1bd9d7878016ebaf96b75c5c4b30130c4b 100644
--- a/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
+++ b/tensorflow/core/kernels/data/parallel_map_dataset_op.cc
@@ -199,7 +199,14 @@ class ParallelMapDatasetOp : public UnaryDatasetOpKernel {
           }
         }
         ++num_outputs_consumed_;
-        return result->status;
+        if (errors::IsOutOfRange(result->status)) {
+          // `f` may deliberately raise `errors::OutOfRange` to indicate
+          // that we should terminate the iteration early.
+          *end_of_sequence = true;
+          return Status::OK();
+        } else {
+          return result->status;
+        }
       }
 
      protected:
diff --git a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
index 505d33046ecf1ab676899cf2c22415fddb07bf95..94989089ec9cdf9314860b43f67691f39f33c31f 100644
--- a/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/depthwise_conv_op_gpu.cu.cc
@@ -186,6 +186,8 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNHWCSmall(
   const int pad_height = args.pad_rows;
   const int pad_width = args.pad_cols;
 
+  assert(blockDim.x == kBlockDepth);
+  assert(blockDim.y == args.in_cols);
   const int block_height = blockDim.z;
 
   // These values are the same for all threads and could
@@ -465,6 +467,8 @@ __global__ __launch_bounds__(1024, 2) void DepthwiseConv2dGPUKernelNCHWSmall(
   const int pad_width = args.pad_cols;
 
   // Fixed blockDim.z, tailored for maximum grid size for images of size 16x16.
+  assert(blockDim.x == args.in_cols);
+  assert(blockDim.z == kBlockDepth);
   const int block_height = blockDim.y;
 
   // These values are the same for all threads and could
@@ -588,20 +592,30 @@ void LaunchDepthwiseConv2dGPUSmall(const GpuDevice& device,
                                    TensorFormat data_format) {
   const int block_height = (args.in_rows + 1) / 2;
   dim3 block_dim;
+  int block_count;
   void (*kernel)(const DepthwiseArgs, const T*, const T*, T*);
-  if (data_format == FORMAT_NHWC) {
-    block_dim = dim3(kBlockDepth, args.in_cols, block_height);
-    kernel = DepthwiseConv2dGPUKernelNHWCSmall<T, kDirection, kKnownFilterWidth,
-                                               kKnownFilterHeight, kBlockDepth,
-                                               kKnownEvenHeight>;
-  } else if (data_format == FORMAT_NCHW) {
-    block_dim = dim3(args.in_cols, block_height, kBlockDepth);
-    kernel = DepthwiseConv2dGPUKernelNCHWSmall<T, kDirection, kKnownFilterWidth,
-                                               kKnownFilterHeight, kBlockDepth,
-                                               kKnownEvenHeight>;
-  } else {
-    assert(false && "Incorrect data format");
-    return;
+  switch (data_format) {
+    case FORMAT_NHWC:
+      block_dim = dim3(kBlockDepth, args.in_cols, block_height);
+      block_count =
+          args.batch * DivUp(args.out_depth, kBlockDepth) * kBlockDepth;
+      kernel =
+          DepthwiseConv2dGPUKernelNHWCSmall<T, kDirection, kKnownFilterWidth,
+                                            kKnownFilterHeight, kBlockDepth,
+                                            kKnownEvenHeight>;
+      break;
+    case FORMAT_NCHW:
+      block_dim = dim3(args.in_cols, block_height, kBlockDepth);
+      block_count =
+          DivUp(args.batch * args.out_depth, kBlockDepth) * kBlockDepth;
+      kernel =
+          DepthwiseConv2dGPUKernelNCHWSmall<T, kDirection, kKnownFilterWidth,
+                                            kKnownFilterHeight, kBlockDepth,
+                                            kKnownEvenHeight>;
+      break;
+    case FORMAT_NCHW_VECT_C:
+      LOG(ERROR) << "FORMAT_NCHW_VECT_C is not supported";
+      return;
   }
   const int tile_width = args.in_cols + args.filter_cols - 1;
   const int tile_height = block_height * 2 + args.filter_rows - 1;
@@ -609,11 +623,10 @@ void LaunchDepthwiseConv2dGPUSmall(const GpuDevice& device,
   const int filter_pixels = args.filter_rows * args.filter_cols;
   const int shared_memory_size =
       kBlockDepth * (tile_pixels + filter_pixels) * sizeof(T);
-  const int num_outputs =
-      args.batch * args.out_rows * args.out_cols * args.out_depth;
-  CudaLaunchConfig config =
-      GetCudaLaunchConfig(num_outputs, device, kernel, shared_memory_size,
-                          block_dim.x * block_dim.y * block_dim.z);
+  const int num_outputs = args.out_rows * args.out_cols * block_count;
+  CudaLaunchConfig config = GetCudaLaunchConfigFixedBlockSize(
+      num_outputs, device, kernel, shared_memory_size,
+      block_dim.x * block_dim.y * block_dim.z);
   kernel<<<config.block_count, block_dim, shared_memory_size,
            device.stream()>>>(args, input, filter, output);
 }
@@ -666,17 +679,20 @@ void LaunchDepthwiseConv2dGPU(const GpuDevice& device,
                               const T* filter, T* output,
                               TensorFormat data_format) {
   void (*kernel)(const DepthwiseArgs, const T*, const T*, T*, int);
-  if (data_format == FORMAT_NHWC) {
-    kernel =
-        DepthwiseConv2dGPUKernelNHWC<T, kKnownFilterWidth, kKnownFilterHeight,
-                                     kKnownDepthMultiplier>;
-  } else if (data_format == FORMAT_NCHW) {
-    kernel =
-        DepthwiseConv2dGPUKernelNCHW<T, kKnownFilterWidth, kKnownFilterHeight,
-                                     kKnownDepthMultiplier>;
-  } else {
-    assert(false && "Incorrect data format");
-    return;
+  switch (data_format) {
+    case FORMAT_NHWC:
+      kernel =
+          DepthwiseConv2dGPUKernelNHWC<T, kKnownFilterWidth, kKnownFilterHeight,
+                                       kKnownDepthMultiplier>;
+      break;
+    case FORMAT_NCHW:
+      kernel =
+          DepthwiseConv2dGPUKernelNCHW<T, kKnownFilterWidth, kKnownFilterHeight,
+                                       kKnownDepthMultiplier>;
+      break;
+    case FORMAT_NCHW_VECT_C:
+      LOG(ERROR) << "FORMAT_NCHW_VECT_C is not supported";
+      return;
   }
   const int num_outputs =
       args.batch * args.out_rows * args.out_cols * args.out_depth;
@@ -894,15 +910,18 @@ void LaunchDepthwiseConv2dBackpropInputGPU(const GpuDevice& device,
                                            const T* filter, T* in_backprop,
                                            TensorFormat data_format) {
   void (*kernel)(const DepthwiseArgs, const T*, const T*, T*, int);
-  if (data_format == FORMAT_NHWC) {
-    kernel = DepthwiseConv2dBackpropInputGPUKernelNHWC<
-        T, kKnownFilterWidth, kKnownFilterHeight, kKnownDepthMultiplier>;
-  } else if (data_format == FORMAT_NCHW) {
-    kernel = DepthwiseConv2dBackpropInputGPUKernelNCHW<
-        T, kKnownFilterWidth, kKnownFilterHeight, kKnownDepthMultiplier>;
-  } else {
-    assert(false && "Incorrect data format");
-    return;
+  switch (data_format) {
+    case FORMAT_NHWC:
+      kernel = DepthwiseConv2dBackpropInputGPUKernelNHWC<
+          T, kKnownFilterWidth, kKnownFilterHeight, kKnownDepthMultiplier>;
+      break;
+    case FORMAT_NCHW:
+      kernel = DepthwiseConv2dBackpropInputGPUKernelNCHW<
+          T, kKnownFilterWidth, kKnownFilterHeight, kKnownDepthMultiplier>;
+      break;
+    case FORMAT_NCHW_VECT_C:
+      LOG(ERROR) << "FORMAT_NCHW_VECT_C is not supported";
+      return;
   }
   const int num_in_backprop =
       args.batch * args.in_rows * args.in_cols * args.in_depth;
@@ -1113,6 +1132,8 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall(
   const int pad_height = args.pad_rows;
   const int pad_width = args.pad_cols;
 
+  assert(blockDim.x == kBlockDepth);
+  assert(blockDim.y == args.in_cols);
   const int block_height = blockDim.z;
 
   // These values are the same for all threads and could
@@ -1381,6 +1402,8 @@ __launch_bounds__(1024, 2) void DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall(
   const int pad_height = args.pad_rows;
   const int pad_width = args.pad_cols;
 
+  assert(blockDim.x == args.in_cols);
+  assert(blockDim.z == kBlockDepth);
   const int block_height = blockDim.y;
 
   // These values are the same for all threads and could
@@ -1519,24 +1542,31 @@ bool TryLaunchDepthwiseConv2dBackpropFilterGPUSmall(
   }
 
   dim3 block_dim;
+  int block_count;
   void (*kernel)(const DepthwiseArgs, const T*, const T*, T*);
-  if (data_format == FORMAT_NHWC) {
-    block_dim = dim3(kBlockDepth, args.in_cols, block_height);
-    kernel = DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall<
-        T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, kAccumPixels>;
-  } else if (data_format == FORMAT_NCHW) {
-    block_dim = dim3(args.in_cols, block_height, kBlockDepth);
-    kernel = DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall<
-        T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, kAccumPixels>;
-  } else {
-    assert(false && "Incorrect data format");
-    return false;
+  switch (data_format) {
+    case FORMAT_NHWC:
+      block_dim = dim3(kBlockDepth, args.in_cols, block_height);
+      block_count =
+          args.batch * DivUp(args.out_depth, kBlockDepth) * kBlockDepth;
+      kernel = DepthwiseConv2dBackpropFilterGPUKernelNHWCSmall<
+          T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, kAccumPixels>;
+      break;
+    case FORMAT_NCHW:
+      block_dim = dim3(args.in_cols, block_height, kBlockDepth);
+      block_count =
+          DivUp(args.batch * args.out_depth, kBlockDepth) * kBlockDepth;
+      kernel = DepthwiseConv2dBackpropFilterGPUKernelNCHWSmall<
+          T, kKnownFilterWidth, kKnownFilterHeight, kBlockDepth, kAccumPixels>;
+      break;
+    case FORMAT_NCHW_VECT_C:
+      LOG(ERROR) << "FORMAT_NCHW_VECT_C is not supported";
+      return false;
   }
-  const int num_out_backprop =
-      args.batch * args.out_rows * args.out_cols * args.out_depth;
-  CudaLaunchConfig config =
-      GetCudaLaunchConfig(num_out_backprop, device, kernel, shared_memory_size,
-                          block_dim.x * block_dim.y * block_dim.z);
+  const int num_out_backprop = args.out_rows * args.out_cols * block_count;
+  CudaLaunchConfig config = GetCudaLaunchConfigFixedBlockSize(
+      num_out_backprop, device, kernel, shared_memory_size,
+      block_dim.x * block_dim.y * block_dim.z);
   kernel<<<config.block_count, block_dim, shared_memory_size,
            device.stream()>>>(args, out_backprop, input, filter_backprop);
   return true;
@@ -1623,15 +1653,18 @@ void LaunchDepthwiseConv2dBackpropFilterGPU(const GpuDevice& device,
                                             const T* input, T* filter_backprop,
                                             TensorFormat data_format) {
   void (*kernel)(const DepthwiseArgs, const T*, const T*, T*, int);
-  if (data_format == FORMAT_NHWC) {
-    kernel = DepthwiseConv2dBackpropFilterGPUKernelNHWC<
-        T, kKnownFilterWidth, kKnownFilterHeight, kKnownDepthMultiplier>;
-  } else if (data_format == FORMAT_NCHW) {
-    kernel = DepthwiseConv2dBackpropFilterGPUKernelNCHW<
-        T, kKnownFilterWidth, kKnownFilterHeight, kKnownDepthMultiplier>;
-  } else {
-    assert(false && "Incorrect data format");
-    return;
+  switch (data_format) {
+    case FORMAT_NHWC:
+      kernel = DepthwiseConv2dBackpropFilterGPUKernelNHWC<
+          T, kKnownFilterWidth, kKnownFilterHeight, kKnownDepthMultiplier>;
+      break;
+    case FORMAT_NCHW:
+      kernel = DepthwiseConv2dBackpropFilterGPUKernelNCHW<
+          T, kKnownFilterWidth, kKnownFilterHeight, kKnownDepthMultiplier>;
+      break;
+    case FORMAT_NCHW_VECT_C:
+      LOG(ERROR) << "FORMAT_NCHW_VECT_C is not supported";
+      return;
   }
   const int num_out_backprop =
       args.batch * args.out_rows * args.out_cols * args.out_depth;
diff --git a/tensorflow/core/kernels/function_ops.cc b/tensorflow/core/kernels/function_ops.cc
index 9d4bc35ba890c251b0800f266e7845e411e7a835..a094ebe5e2d1d78ec8f5514dca7b7ebeec4e6b57 100644
--- a/tensorflow/core/kernels/function_ops.cc
+++ b/tensorflow/core/kernels/function_ops.cc
@@ -32,7 +32,9 @@ limitations under the License.
 
 namespace tensorflow {
 
-static const char* const kGradientOp = "SymbolicGradient";
+static const char* const kArgOp = FunctionLibraryDefinition::kArgOp;
+static const char* const kRetOp = FunctionLibraryDefinition::kRetOp;
+static const char* const kGradientOp = FunctionLibraryDefinition::kGradientOp;
 
 class ArgOp : public OpKernel {
  public:
@@ -89,26 +91,25 @@ class RetvalOp : public OpKernel {
   TF_DISALLOW_COPY_AND_ASSIGN(RetvalOp);
 };
 
-REGISTER_SYSTEM_KERNEL_BUILDER(Name("_Arg").Device(DEVICE_CPU), ArgOp);
-REGISTER_SYSTEM_KERNEL_BUILDER(Name("_Retval").Device(DEVICE_CPU), RetvalOp);
+REGISTER_SYSTEM_KERNEL_BUILDER(Name(kArgOp).Device(DEVICE_CPU), ArgOp);
+REGISTER_SYSTEM_KERNEL_BUILDER(Name(kRetOp).Device(DEVICE_CPU), RetvalOp);
 
 #if TENSORFLOW_USE_SYCL
 #define REGISTER(type)     \
   REGISTER_KERNEL_BUILDER( \
-      Name("_Arg").Device(DEVICE_SYCL).TypeConstraint<type>("T"), ArgOp);
+      Name(kArgOp).Device(DEVICE_SYCL).TypeConstraint<type>("T"), ArgOp);
 TF_CALL_NUMBER_TYPES_NO_INT32(REGISTER)
-TF_CALL_bool(REGISTER) REGISTER_KERNEL_BUILDER(Name("_Arg")
+TF_CALL_bool(REGISTER) REGISTER_KERNEL_BUILDER(Name(kArgOp)
                                                    .Device(DEVICE_SYCL)
                                                    .HostMemory("output")
                                                    .TypeConstraint<int32>("T"),
                                                ArgOp);
 #undef REGISTER
-#define REGISTER(type)                                               \
-  REGISTER_KERNEL_BUILDER(                                           \
-      Name("_Retval").Device(DEVICE_SYCL).TypeConstraint<type>("T"), \
-      RetvalOp);
+#define REGISTER(type)     \
+  REGISTER_KERNEL_BUILDER( \
+      Name(kRetOp).Device(DEVICE_SYCL).TypeConstraint<type>("T"), RetvalOp);
 TF_CALL_NUMBER_TYPES_NO_INT32(REGISTER)
-TF_CALL_bool(REGISTER) REGISTER_KERNEL_BUILDER(Name("_Retval")
+TF_CALL_bool(REGISTER) REGISTER_KERNEL_BUILDER(Name(kRetOp)
                                                    .Device(DEVICE_SYCL)
                                                    .HostMemory("input")
                                                    .TypeConstraint<int32>("T"),
@@ -118,16 +119,16 @@ TF_CALL_bool(REGISTER) REGISTER_KERNEL_BUILDER(Name("_Retval")
 
 #define REGISTER(type)     \
   REGISTER_KERNEL_BUILDER( \
-      Name("_Arg").Device(DEVICE_GPU).TypeConstraint<type>("T"), ArgOp);
+      Name(kArgOp).Device(DEVICE_GPU).TypeConstraint<type>("T"), ArgOp);
 TF_CALL_NUMBER_TYPES_NO_INT32(REGISTER)
-TF_CALL_bool(REGISTER) REGISTER_KERNEL_BUILDER(Name("_Arg")
+TF_CALL_bool(REGISTER) REGISTER_KERNEL_BUILDER(Name(kArgOp)
                                                    .Device(DEVICE_GPU)
                                                    .HostMemory("output")
                                                    .TypeConstraint<int32>("T"),
                                                ArgOp);
 #undef REGISTER
 
-REGISTER_KERNEL_BUILDER(Name("_Arg")
+REGISTER_KERNEL_BUILDER(Name(kArgOp)
                             .Device(DEVICE_GPU)
                             .HostMemory("output")
                             .TypeConstraint<ResourceHandle>("T"),
@@ -135,9 +136,9 @@ REGISTER_KERNEL_BUILDER(Name("_Arg")
 
 #define REGISTER(type)     \
   REGISTER_KERNEL_BUILDER( \
-      Name("_Retval").Device(DEVICE_GPU).TypeConstraint<type>("T"), RetvalOp);
+      Name(kRetOp).Device(DEVICE_GPU).TypeConstraint<type>("T"), RetvalOp);
 TF_CALL_NUMBER_TYPES_NO_INT32(REGISTER)
-TF_CALL_bool(REGISTER) REGISTER_KERNEL_BUILDER(Name("_Retval")
+TF_CALL_bool(REGISTER) REGISTER_KERNEL_BUILDER(Name(kRetOp)
                                                    .Device(DEVICE_GPU)
                                                    .HostMemory("input")
                                                    .TypeConstraint<int32>("T"),
@@ -287,7 +288,8 @@ REGISTER_KERNEL_BUILDER(Name(kGradientOp).Device(DEVICE_SYCL),
 class RemoteCallOp : public AsyncOpKernel {
  public:
   explicit RemoteCallOp(OpKernelConstruction* ctx) : AsyncOpKernel(ctx) {
-    OP_REQUIRES_OK(ctx, ctx->GetAttr("f", &func_));
+    OP_REQUIRES_OK(ctx,
+                   ctx->GetAttr(FunctionLibraryDefinition::kFuncAttr, &func_));
   }
 
   ~RemoteCallOp() override {}
diff --git a/tensorflow/core/kernels/logging_ops.cc b/tensorflow/core/kernels/logging_ops.cc
index bacf3e77408a12a8a95bf7e7ab8f3a580e675675..6b6a14e9a7383b0a0720782acf69e0896df2444e 100644
--- a/tensorflow/core/kernels/logging_ops.cc
+++ b/tensorflow/core/kernels/logging_ops.cc
@@ -90,4 +90,23 @@ class PrintOp : public OpKernel {
 
 REGISTER_KERNEL_BUILDER(Name("Print").Device(DEVICE_CPU), PrintOp);
 
+class TimestampOp : public OpKernel {
+ public:
+  explicit TimestampOp(OpKernelConstruction* context) : OpKernel(context) {}
+
+  void Compute(OpKernelContext* context) override {
+    TensorShape output_shape;  // Default shape is 0 dim, 1 element
+    Tensor* output_tensor = nullptr;
+    OP_REQUIRES_OK(context,
+                   context->allocate_output(0, output_shape, &output_tensor));
+
+    auto output_scalar = output_tensor->scalar<double>();
+    double now_us = static_cast<double>(Env::Default()->NowMicros());
+    double now_s = now_us / 1000000;
+    output_scalar() = now_s;
+  }
+};
+
+REGISTER_KERNEL_BUILDER(Name("Timestamp").Device(DEVICE_CPU), TimestampOp);
+
 }  // end namespace tensorflow
diff --git a/tensorflow/core/kernels/logging_ops_test.cc b/tensorflow/core/kernels/logging_ops_test.cc
index 9cf669a7efc973a7be4f3139b2180d4e3b07797b..5e6958f364dbbfd6ff6cf112a6cef544202ee955 100644
--- a/tensorflow/core/kernels/logging_ops_test.cc
+++ b/tensorflow/core/kernels/logging_ops_test.cc
@@ -13,6 +13,9 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include <chrono>
+#include <thread>
+
 #include "tensorflow/core/framework/fake_input.h"
 #include "tensorflow/core/framework/node_def_builder.h"
 #include "tensorflow/core/framework/tensor.h"
@@ -96,5 +99,27 @@ TEST_F(PrintingGraphTest, FirstNSuccess) {
   test::ExpectTensorEqual<int32>(expected, *GetOutput(0));
 }
 
+class TimestampTest : public OpsTestBase {
+ protected:
+  Status Init() {
+    TF_CHECK_OK(NodeDefBuilder("op", "Timestamp").Finalize(node_def()));
+    return InitOp();
+  }
+};
+
+TEST_F(TimestampTest, WaitAtLeast) {
+  TF_ASSERT_OK(Init());
+  TF_ASSERT_OK(RunOpKernel());
+  double ts1 = *((*GetOutput(0)).flat<double>().data());
+
+  // wait 1 second
+  std::this_thread::sleep_for(std::chrono::seconds(1));
+
+  TF_ASSERT_OK(RunOpKernel());
+  double ts2 = *((*GetOutput(0)).flat<double>().data());
+
+  EXPECT_LE(1.0, ts2 - ts1);
+}
+
 }  // end namespace
 }  // end namespace tensorflow
diff --git a/tensorflow/core/kernels/mkl_batch_matmul_op.cc b/tensorflow/core/kernels/mkl_batch_matmul_op.cc
index c48a2038f921986635795575a69606cbab24f12a..723b445a7568775a13b89c9fbf0e7dc70c4b8b8c 100644
--- a/tensorflow/core/kernels/mkl_batch_matmul_op.cc
+++ b/tensorflow/core/kernels/mkl_batch_matmul_op.cc
@@ -198,12 +198,10 @@ class BatchMatMulMkl : public OpKernel {
   void MklCblasGemmBatch(const CBLAS_LAYOUT Layout, const bool TransA,
                          const bool TransB, const MKL_INT *M_Array,
                          const MKL_INT *N_Array, const MKL_INT *K_Array,
-                         const complex128 **A_Array,
-                         const MKL_INT *lda_Array,
-                         const complex128 **B_Array,
-                         const MKL_INT *ldb_Array, complex128 **C_Array,
-                         const MKL_INT *ldc_Array, const MKL_INT group_count,
-                         const MKL_INT *group_size) {
+                         const complex128 **A_Array, const MKL_INT *lda_Array,
+                         const complex128 **B_Array, const MKL_INT *ldb_Array,
+                         complex128 **C_Array, const MKL_INT *ldc_Array,
+                         const MKL_INT group_count, const MKL_INT *group_size) {
     std::vector<CBLAS_TRANSPOSE> TransA_array(
         group_size[0], TransA ? CblasConjTrans : CblasNoTrans);
     std::vector<CBLAS_TRANSPOSE> TransB_array(
diff --git a/tensorflow/core/kernels/mkl_concat_op.cc b/tensorflow/core/kernels/mkl_concat_op.cc
index f1f267e849aa39b43c153b857493160e0d103970..aa3ea890b04358d6176b44558fed014ef29259e3 100644
--- a/tensorflow/core/kernels/mkl_concat_op.cc
+++ b/tensorflow/core/kernels/mkl_concat_op.cc
@@ -519,9 +519,11 @@ class MklConcatOp : public OpKernel {
     mkl_tensor_tf_shape.AddDim(
         SIZE_OF_MKL_SERIAL_DATA(mkl_tensor_mkl_shape.GetDimension()));
     int tf_output_index = 0;
-    context->allocate_output(
+    // TODO(jktomer): replace this with OP_REQUIRES_OK and clean up this file
+    // to propagate the status up the call stack.
+    TF_CHECK_OK(context->allocate_output(
         GetTensorMetaDataIndex(tf_output_index, context->num_outputs()),
-        mkl_tensor_tf_shape, &mkl_tensor);
+        mkl_tensor_tf_shape, &mkl_tensor));
     mkl_tensor_mkl_shape.SerializeMklShape(
         mkl_tensor->flat<uint8>().data(),
         mkl_tensor->flat<uint8>().size() * sizeof(uint8));
@@ -549,9 +551,11 @@ class MklConcatOp : public OpKernel {
     mkl_tensor_tf_shape.AddDim(
         SIZE_OF_MKL_SERIAL_DATA(mkl_tensor_mkl_shape.GetDimension()));
     int tf_output_index = 0;
-    context->allocate_output(
+    // TODO(jktomer): replace this with OP_REQUIRES_OK and clean up this file
+    // to propagate the status up the call stack.
+    TF_CHECK_OK(context->allocate_output(
         GetTensorMetaDataIndex(tf_output_index, context->num_outputs()),
-        mkl_tensor_tf_shape, &mkl_tensor);
+        mkl_tensor_tf_shape, &mkl_tensor));
     mkl_tensor_mkl_shape.SerializeMklShape(
         mkl_tensor->flat<uint8>().data(),
         mkl_tensor->flat<uint8>().size() * sizeof(uint8));
diff --git a/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc b/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
index 25c2573741265d4d33c9c91474792be241dd3b32..d23027a54d169b5e597bd26a63f26d38a23239ae 100644
--- a/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_grad_bias_ops.cc
@@ -79,8 +79,9 @@ class MklConv2DCustomBackpropBiasOp : public OpKernel {
     } else if (data_format_ == FORMAT_NHWC || data_format_ == FORMAT_NCHW) {
       mkl_context.c_size = GetTensorDim(input, data_format_, 'C');
     } else {
-      errors::InvalidArgument("Unknown format ",
-                              " Format must be either NCHW or NHWC. ");
+      context->CtxFailure(errors::InvalidArgument(
+          "Unknown format ", " Format must be either NCHW or NHWC. "));
+      return;
     }
     TensorShape output_shape{mkl_context.c_size};
 
diff --git a/tensorflow/core/kernels/mkl_conv_ops.cc b/tensorflow/core/kernels/mkl_conv_ops.cc
index 2953426d5824064952858124882126c154fe6725..1440da8f8221167c79b9eb3880a9f22fd0f1426f 100644
--- a/tensorflow/core/kernels/mkl_conv_ops.cc
+++ b/tensorflow/core/kernels/mkl_conv_ops.cc
@@ -294,8 +294,10 @@ class MklConv2DOp : public OpKernel {
     mkl_filter_output_mkl_shape.SetMklLayout(mkl_context.prim_fwd,
                                              dnnResourceFilter);
 
-    size_t filter_sizes[4] = {filter.dim_size(0), filter.dim_size(1),
-                              filter.dim_size(2), filter.dim_size(3)};
+    size_t filter_sizes[4] = {static_cast<size_t>(filter.dim_size(0)),
+                              static_cast<size_t>(filter.dim_size(1)),
+                              static_cast<size_t>(filter.dim_size(2)),
+                              static_cast<size_t>(filter.dim_size(3))};
     mkl_filter_output_mkl_shape.SetTfLayout(filter.dims(), filter_sizes,
                                             mkl_context.filter_strides);
 
diff --git a/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc b/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc
index 8313224d7fe3e2d307d3642ced5b277b95c85cdb..9e564b016f54b476f1d5e1d91f291c1ce3e3fda2 100644
--- a/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc
+++ b/tensorflow/core/kernels/mkl_fused_batch_norm_op.cc
@@ -262,7 +262,6 @@ class MklFusedBatchNormOp : public OpKernel {
     }
 
     void MklCreateInputLayout(OpKernelContext* context) {
-      const Tensor& input = MklGetInput(context, 0);
       bool input_in_mkl_format = mkl_shape_input_shape.IsMklTensor();
       if (input_in_mkl_format) {
         mkl_lt_input =
@@ -1110,19 +1109,12 @@ class MklFusedBatchNormGradOp : public OpKernel {
         return;
       }
 
-      if (dnn_shape_src.IsMklTensor())
-        depth_ = dnn_shape_src.DimSize(MklDnnDims::Dim_C);
-      else
-        ExtractParams(context);
-
-      memory::format format_m;
       if (dnn_shape_src.IsMklTensor()) {
-        if (dnn_shape_src.IsTensorInNCHWFormat())
-          format_m = memory::format::nchw;
-        else
-          format_m = memory::format::nhwc;
+        depth_ = dnn_shape_src.DimSize(MklDnnDims::Dim_C);
+      } else if (dnn_shape_diff_dst.IsMklTensor()) {
+        depth_ = dnn_shape_diff_dst.DimSize(MklDnnDims::Dim_C);
       } else {
-        format_m = TFDataFormatToMklDnnDataFormat(tensor_format_);
+        ExtractParams(context);
       }
 
       MklDnnData<T> src(&cpu_engine);
@@ -1146,20 +1138,20 @@ class MklFusedBatchNormGradOp : public OpKernel {
         diff_dst_dims =
             TFShapeToMklDnnDimsInNCHW(diff_dst_tensor.shape(), tensor_format_);
 
-      // set src and diff_dst primitives
+      // set src and diff_dst primitives according to input layout
       memory::desc src_md({}, memory::data_undef, memory::format_undef);
       memory::desc diff_dst_md({}, memory::data_undef, memory::format_undef);
-      if (dnn_shape_src.IsMklTensor() || dnn_shape_diff_dst.IsMklTensor()) {
-        if (dnn_shape_src.IsMklTensor()) {
-          src_md = dnn_shape_src.GetMklLayout();
-          diff_dst_md = src_md;
-        } else {
-          diff_dst_md = dnn_shape_diff_dst.GetMklLayout();
-          src_md = diff_dst_md;
-        }
+      if (dnn_shape_src.IsMklTensor()) {
+        src_md = dnn_shape_src.GetMklLayout();
       } else {
-        src_md = memory::desc(src_dims, MklDnnType<T>(), format_m);
-        diff_dst_md = src_md;
+        src_md = memory::desc(src_dims, MklDnnType<T>(),
+                TFDataFormatToMklDnnDataFormat(tensor_format_));
+      }
+      if (dnn_shape_diff_dst.IsMklTensor()) {
+        diff_dst_md = dnn_shape_diff_dst.GetMklLayout();
+      } else {
+        diff_dst_md = memory::desc(diff_dst_dims, MklDnnType<T>(),
+                TFDataFormatToMklDnnDataFormat(tensor_format_));
       }
       src.SetUsrMem(src_md, &src_tensor);
       diff_dst.SetUsrMem(diff_dst_md, &diff_dst_tensor);
@@ -1211,28 +1203,64 @@ class MklFusedBatchNormGradOp : public OpKernel {
       // allocate diff_src tensor
       MklDnnShape dnn_shape_diff_src;
       TensorShape tf_shape_diff_src;
-      if (dnn_shape_src.IsMklTensor()) {
+
+      // MKL-DNN's BN primitive does not provide an API to fetch its internal
+      // format, so use common_md as the OpMem layout:
+      // src and diff_dst are reordered to common_md,
+      // and diff_src is set to common_md.
+      memory::desc common_md({}, memory::data_undef, memory::format_undef);
+      if (dnn_shape_src.IsMklTensor() || dnn_shape_diff_dst.IsMklTensor()) {
+        if (dnn_shape_src.IsMklTensor()) {
+          common_md = dnn_shape_src.GetMklLayout();
+        } else {
+          common_md = dnn_shape_diff_dst.GetMklLayout();
+        }
+      } else {
+        common_md = memory::desc(src_dims, MklDnnType<T>(),
+                TFDataFormatToMklDnnDataFormat(tensor_format_));
+      }
+      // if either src or diff_dst is in mkl layout,
+      // then we set diff_src to mkl layout as well
+      if (dnn_shape_src.IsMklTensor() ||
+              dnn_shape_diff_dst.IsMklTensor()) {
         dnn_shape_diff_src.SetMklTensor(true);
-        auto diff_src_pd = bnrm_fwd_pd.dst_primitive_desc();
+        // set diff_src's mkl layout as common_md
+        auto diff_src_pd = memory::primitive_desc(common_md, cpu_engine);
         dnn_shape_diff_src.SetMklLayout(&diff_src_pd);
         dnn_shape_diff_src.SetElemType(MklDnnType<T>());
-        dnn_shape_diff_src.SetTfLayout(dnn_shape_src.GetDimension(), src_dims,
-                                       format_m);
-        dnn_shape_diff_src.SetTfDimOrder(dnn_shape_src.GetDimension(),
-                                         tensor_format_);
+        if (dnn_shape_src.IsMklTensor()) {
+          dnn_shape_diff_src.SetTfLayout(
+                  dnn_shape_src.GetDimension(),
+                  src_dims,
+                  dnn_shape_src.GetTfDataFormat());
+          dnn_shape_diff_src.SetTfDimOrder(
+                  dnn_shape_src.GetDimension(),
+                  tensor_format_);
+        } else {
+          dnn_shape_diff_src.SetTfLayout(
+                  dnn_shape_diff_dst.GetDimension(),
+                  src_dims,
+                  dnn_shape_diff_dst.GetTfDataFormat());
+          dnn_shape_diff_src.SetTfDimOrder(
+                  dnn_shape_diff_dst.GetDimension(),
+                  tensor_format_);
+        }
         tf_shape_diff_src.AddDim(diff_src_pd.get_size() / sizeof(T));
       } else {
         dnn_shape_diff_src.SetMklTensor(false);
+        // both src and diff_dst are in TensorFlow layout,
+        // so it is OK to use the TensorFlow shape.
         tf_shape_diff_src = src_tensor.shape();
       }
       AllocateOutputSetMklShape(context, kDiffSrcIndex, &diff_src_tensor,
                                 tf_shape_diff_src, dnn_shape_diff_src);
 
-      diff_src.SetUsrMem(src_md, diff_src_tensor);
+      // set diff_src
+      diff_src.SetUsrMem(common_md, diff_src_tensor);
 
       prop_kind pk = prop_kind::backward;
       auto bnrm_bwd_desc = batch_normalization_backward::desc(
-          pk, diff_src.GetUsrMemDesc(), src.GetUsrMemDesc(), epsilon_,
+          pk, common_md, common_md, epsilon_,
           /* for inference, specify use_global_stats
              1. on fwd prop, use mean and variance
                 provided as inputs
@@ -1245,11 +1273,16 @@ class MklFusedBatchNormGradOp : public OpKernel {
       auto bnrm_bwd_pd = batch_normalization_backward::primitive_desc(
           bnrm_bwd_desc, cpu_engine, bnrm_fwd_pd);
 
+      std::vector<primitive> net;
+      src.CheckReorderToOpMem(memory::primitive_desc(common_md,
+                                   cpu_engine), &net);
+      diff_dst.CheckReorderToOpMem(memory::primitive_desc(common_md,
+                                   cpu_engine), &net);
+
       auto bnrm_bwd_op = batch_normalization_backward(
           bnrm_bwd_pd, src.GetOpMem(), mean.GetOpMem(), variance.GetOpMem(),
           diff_dst.GetOpMem(), weights_m, diff_src.GetOpMem(), diff_weights_m);
 
-      std::vector<primitive> net;
       net.push_back(bnrm_bwd_op);
       stream(stream::kind::eager).submit(net).wait();
 
diff --git a/tensorflow/core/kernels/mkl_lrn_op.cc b/tensorflow/core/kernels/mkl_lrn_op.cc
index 5f0a12a1fb9bff3086e05918e23b8396196eb389..282012c719fe3045e880ef0dc9027a50c0f23fec 100644
--- a/tensorflow/core/kernels/mkl_lrn_op.cc
+++ b/tensorflow/core/kernels/mkl_lrn_op.cc
@@ -88,7 +88,8 @@ class MklLRNOp : public OpKernel {
     OP_REQUIRES_OK(context, context->GetAttr("alpha", &alpha_));
     OP_REQUIRES_OK(context, context->GetAttr("beta", &beta_));
     workspace_enabled_ = false;
-    context->GetAttr("workspace_enabled", &workspace_enabled_);
+    OP_REQUIRES_OK(context,
+                   context->GetAttr("workspace_enabled", &workspace_enabled_));
   }
 
   void Compute(OpKernelContext* context) override {
@@ -357,7 +358,8 @@ class MklLRNGradOp : public OpKernel {
     OP_REQUIRES_OK(context, context->GetAttr("alpha", &alpha_));
     OP_REQUIRES_OK(context, context->GetAttr("beta", &beta_));
     workspace_enabled_ = false;
-    context->GetAttr("workspace_enabled", &workspace_enabled_);
+    OP_REQUIRES_OK(context,
+                   context->GetAttr("workspace_enabled", &workspace_enabled_));
   }
 
   void Compute(OpKernelContext* context) override {
@@ -535,7 +537,6 @@ class MklLRNGradOp : public OpKernel {
                                 Tensor* mkl_tmp_outimage_buf_tensor) {
       const Tensor& in_grads = MklGetInput(context, 0);
       const Tensor& in_image = MklGetInput(context, 1);
-      const Tensor& out_image = MklGetInput(context, 2);
       const Tensor& workspace = MklGetInput(
           context,
           3); /*Worskpsace is enabled, get the buffer to the workspace */
@@ -544,8 +545,6 @@ class MklLRNGradOp : public OpKernel {
           static_cast<const void*>(in_grads.flat<T>().data()));
       void* user_fwd_input = const_cast<void*>(
           static_cast<const void*>(in_image.flat<T>().data()));
-      void* user_fwd_output = const_cast<void*>(
-          static_cast<const void*>(out_image.flat<T>().data()));
       void* workspace_buffer = const_cast<void*>(
           static_cast<const void*>(workspace.flat<T>().data()));
 
diff --git a/tensorflow/core/kernels/mkl_matmul_op.cc b/tensorflow/core/kernels/mkl_matmul_op.cc
index 25ad8c94a78a82cc7e4a6f98903aecf1d5a0d1b4..dfa6cecc9bdc231ebf35e587183b5f84b17489e0 100644
--- a/tensorflow/core/kernels/mkl_matmul_op.cc
+++ b/tensorflow/core/kernels/mkl_matmul_op.cc
@@ -171,13 +171,13 @@ class MklMatMulOp : public OpKernel {
   // For detailed info about parameters, look at FP32 function description.
   void MklBlasGemm(bool transa, bool transb, const int m, const int n,
                    const int k, const complex64* a, const int lda,
-                   const complex64* b, const int ldb,
-                   complex64* c, int const ldc) {
+                   const complex64* b, const int ldb, complex64* c,
+                   int const ldc) {
     const MKL_Complex8 alpha = {1.0f, 0.0f};
     const MKL_Complex8 beta = {0.0f, 0.0f};
     cblas_cgemm(CblasRowMajor, transa ? CblasTrans : CblasNoTrans,
-                transb ? CblasTrans : CblasNoTrans,
-                m, n, k, &alpha, reinterpret_cast<const MKL_Complex8*>(a), lda,
+                transb ? CblasTrans : CblasNoTrans, m, n, k, &alpha,
+                reinterpret_cast<const MKL_Complex8*>(a), lda,
                 reinterpret_cast<const MKL_Complex8*>(b), ldb, &beta,
                 reinterpret_cast<MKL_Complex8*>(c), ldc);
   }
@@ -187,13 +187,13 @@ class MklMatMulOp : public OpKernel {
   // description.
   void MklBlasGemm(bool transa, bool transb, const int m, const int n,
                    const int k, const complex128* a, const int lda,
-                   const complex128* b, const int ldb,
-                   complex128* c, const int ldc) {
+                   const complex128* b, const int ldb, complex128* c,
+                   const int ldc) {
     const MKL_Complex16 alpha = {1.0, 0.0};
     const MKL_Complex16 beta = {0.0, 0.0};
     cblas_zgemm(CblasRowMajor, transa ? CblasTrans : CblasNoTrans,
-                transb ? CblasTrans : CblasNoTrans,
-                m, n, k, &alpha, reinterpret_cast<const MKL_Complex16*>(a), lda,
+                transb ? CblasTrans : CblasNoTrans, m, n, k, &alpha,
+                reinterpret_cast<const MKL_Complex16*>(a), lda,
                 reinterpret_cast<const MKL_Complex16*>(b), ldb, &beta,
                 reinterpret_cast<MKL_Complex16*>(c), ldc);
   }
diff --git a/tensorflow/core/kernels/mkl_maxpooling_op.cc b/tensorflow/core/kernels/mkl_maxpooling_op.cc
index 14607f26e0ccd1028dd62343000d90ac8451d7bb..ea537524b11ef1362ff08b79ae25ca6e7048a9cd 100644
--- a/tensorflow/core/kernels/mkl_maxpooling_op.cc
+++ b/tensorflow/core/kernels/mkl_maxpooling_op.cc
@@ -69,7 +69,8 @@ class MklMaxPoolingOp : public OpKernel {
     // We may not get this attribute for this node if it does not go through
     // graph rewrite pass. So we do not check for error while retrieving this
     // attribute value.
-    context->GetAttr("workspace_enabled", &workspace_enabled_);
+    OP_REQUIRES_OK(context,
+                   context->GetAttr("workspace_enabled", &workspace_enabled_));
   }
 
   void Compute(OpKernelContext* context) override {
@@ -118,7 +119,6 @@ class MklMaxPoolingOp : public OpKernel {
                               mkl_out_shape);
 
     Tensor* workspace_tensor;
-    void* workspace_buf = nullptr;
 
     TensorShape workspace_shape;
     mkl_workspace_shape.SetMklTensor(false);
@@ -226,7 +226,8 @@ class MklMaxPoolingGradOp : public OpKernel {
     // We may not get this attribute for this node if it does not go through
     // graph rewrite pass. So we do not check for error while retrieving this
     // attribute value.
-    context->GetAttr("workspace_enabled", &workspace_enabled_);
+    OP_REQUIRES_OK(context,
+                   context->GetAttr("workspace_enabled", &workspace_enabled_));
   }
 
   void Compute(OpKernelContext* context) override {
diff --git a/tensorflow/core/kernels/mkl_relu_op.cc b/tensorflow/core/kernels/mkl_relu_op.cc
index 51db3991e2a24f087771f571cd91fc9fbb26040b..267f4f8d12c171fe5592296a7459367290604bc5 100644
--- a/tensorflow/core/kernels/mkl_relu_op.cc
+++ b/tensorflow/core/kernels/mkl_relu_op.cc
@@ -25,7 +25,6 @@ limitations under the License.
 
 #include "mkl_dnn.h"
 #include "mkl_dnn_types.h"
-#include "tensorflow/core/platform/default/logging.h"
 #include "tensorflow/core/util/mkl_util.h"
 
 #ifndef INTEL_MKL_ML
@@ -368,8 +367,11 @@ void MklReluGradOp<Device, T>::Compute(OpKernelContext* context) {
   mkl_context.MklCleanup();
 }
 
+
+
 #else  // INTEL_MKL_ML
 
+
 template <typename Device, typename T, algorithm alg_kind>
 class MklReluOpBase : public OpKernel {
  public:
@@ -579,17 +581,26 @@ class MklReluGradOpBase : public OpKernel {
       // allocate diff_src tensor
       MklDnnShape dnn_shape_diff_src;
       TensorShape tf_shape_diff_src;
-      if (dnn_shape_src.IsMklTensor()) {
+      if (dnn_shape_src.IsMklTensor() ||
+              dnn_shape_diff_dst.IsMklTensor()) {
         dnn_shape_diff_src.SetMklTensor(true);
         auto diff_src_pd = relu_bwd_pd.diff_src_primitive_desc();
         dnn_shape_diff_src.SetMklLayout(&diff_src_pd);
         dnn_shape_diff_src.SetElemType(MklDnnType<T>());
-        dnn_shape_diff_src.SetTfLayout(dnn_shape_src.GetDimension(),
-                                       dnn_shape_src.GetSizesAsMklDnnDims(),
-                                       dnn_shape_src.GetTfDataFormat());
+        if (dnn_shape_src.IsMklTensor()) {
+          dnn_shape_diff_src.SetTfLayout(dnn_shape_src.GetDimension(),
+                                         dnn_shape_src.GetSizesAsMklDnnDims(),
+                                         dnn_shape_src.GetTfDataFormat());
+        } else {
+          dnn_shape_diff_src.SetTfLayout(dnn_shape_diff_dst.GetDimension(),
+                                 dnn_shape_diff_dst.GetSizesAsMklDnnDims(),
+                                 dnn_shape_diff_dst.GetTfDataFormat());
+        }
         tf_shape_diff_src.AddDim(diff_src_pd.get_size() / sizeof(T));
       } else {
         dnn_shape_diff_src.SetMklTensor(false);
+        // both src and diff_dst are in TensorFlow layout,
+        // so it is OK to use the TensorFlow shape.
         tf_shape_diff_src = src_tensor.shape();
       }
       AllocateOutputSetMklShape(context, diff_src_index, &diff_src_tensor,
diff --git a/tensorflow/core/kernels/mkl_transpose_op.cc b/tensorflow/core/kernels/mkl_transpose_op.cc
index b44b4d6f542ed0128a83d20eedf6629f67427867..3f07b317c4d915fd7d304dbbab966837da64757a 100644
--- a/tensorflow/core/kernels/mkl_transpose_op.cc
+++ b/tensorflow/core/kernels/mkl_transpose_op.cc
@@ -63,25 +63,31 @@ INSTANTIATE(double, d)
 #undef INSTANTIATE
 
 template <>
-Status MKLTranspose2D<complex64>(const char trans, const Tensor& in, Tensor* out) {
-    const MKL_Complex8 alpha = { 1.0f, 0.0f };
-    mkl_comatcopy('R', trans, in.dim_size(0), in.dim_size(1), alpha,
-                  reinterpret_cast<const MKL_Complex8*>(in.flat<complex64>().data()),
-                  in.dim_size(1),
-                  reinterpret_cast<MKL_Complex8*>(const_cast<complex64*>(out->flat<complex64>().data())),
-                  in.dim_size(0));
-    return Status::OK();
+Status MKLTranspose2D<complex64>(const char trans, const Tensor& in,
+                                 Tensor* out) {
+  const MKL_Complex8 alpha = {1.0f, 0.0f};
+  mkl_comatcopy(
+      'R', trans, in.dim_size(0), in.dim_size(1), alpha,
+      reinterpret_cast<const MKL_Complex8*>(in.flat<complex64>().data()),
+      in.dim_size(1),
+      reinterpret_cast<MKL_Complex8*>(
+          const_cast<complex64*>(out->flat<complex64>().data())),
+      in.dim_size(0));
+  return Status::OK();
 }
 
 template <>
-Status MKLTranspose2D<complex128>(const char trans, const Tensor& in, Tensor* out) {
-    const MKL_Complex16 alpha = { 1.0, 0.0 };
-    mkl_zomatcopy('R', trans, in.dim_size(0), in.dim_size(1), alpha,
-                  reinterpret_cast<const MKL_Complex16*>(in.flat<complex128>().data()),
-                  in.dim_size(1),
-                  reinterpret_cast<MKL_Complex16*>(const_cast<complex128*>(out->flat<complex128>().data())),
-                  in.dim_size(0));
-	return Status::OK();
+Status MKLTranspose2D<complex128>(const char trans, const Tensor& in,
+                                  Tensor* out) {
+  const MKL_Complex16 alpha = {1.0, 0.0};
+  mkl_zomatcopy(
+      'R', trans, in.dim_size(0), in.dim_size(1), alpha,
+      reinterpret_cast<const MKL_Complex16*>(in.flat<complex128>().data()),
+      in.dim_size(1),
+      reinterpret_cast<MKL_Complex16*>(
+          const_cast<complex128*>(out->flat<complex128>().data())),
+      in.dim_size(0));
+  return Status::OK();
 }
 
 static const char kMKLTranspose = 'T';
diff --git a/tensorflow/core/kernels/mutex_ops.cc b/tensorflow/core/kernels/mutex_ops.cc
new file mode 100644
index 0000000000000000000000000000000000000000..b02a584d73d767fa674e3d0ebe0b6c501249c16e
--- /dev/null
+++ b/tensorflow/core/kernels/mutex_ops.cc
@@ -0,0 +1,249 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#define EIGEN_USE_THREADS
+
+#include <deque>
+#include <utility>
+
+#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
+#include "tensorflow/core/framework/resource_mgr.h"
+#include "tensorflow/core/framework/variant.h"
+#include "tensorflow/core/framework/variant_encode_decode.h"
+#include "tensorflow/core/kernels/ops_util.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/threadpool.h"
+#include "tensorflow/core/platform/macros.h"
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+namespace {
+
+class Mutex : public ResourceBase {
+ public:
+  explicit Mutex(OpKernelContext* c, const string& name)
+      : locked_(false),
+        thread_pool_(new thread::ThreadPool(
+            c->env(), ThreadOptions(),
+            strings::StrCat("mutex_lock_thread_", SanitizeThreadSuffix(name)),
+            1 /* num_threads */, false /* low_latency_hint */)),
+        name_(name) {
+    VLOG(2) << "Creating mutex with name " << name << ": " << this;
+  }
+
+  string DebugString() override { return strings::StrCat("Mutex ", name_); }
+
+  class LockReleaser {
+   public:
+    explicit LockReleaser(Mutex* mutex) : mutex_(mutex) {}
+
+    LockReleaser(const LockReleaser&) = delete;
+    LockReleaser& operator=(const LockReleaser&) = delete;
+
+    virtual ~LockReleaser() {
+      VLOG(3) << "Destroying LockReleaser " << this << " for mutex: " << mutex_;
+      if (mutex_) {
+        mutex_lock lock(mutex_->mu_);
+        mutex_->locked_ = false;
+        mutex_->cv_.notify_all();
+        VLOG(3) << "Destroying LockReleaser " << this
+                << ": sent notifications.";
+      }
+    }
+
+   private:
+    Mutex* mutex_;
+  };
+
+  struct SharedLockReleaser {
+    std::shared_ptr<LockReleaser> shared_lock;
+
+    explicit SharedLockReleaser(std::shared_ptr<LockReleaser>&& lock)
+        : shared_lock(std::forward<decltype(lock)>(lock)) {
+      VLOG(3) << "Creating shared_ptr of " << shared_lock.get()
+              << " count is: " << shared_lock.use_count();
+    }
+
+    SharedLockReleaser(SharedLockReleaser&& rhs)
+        : shared_lock(std::move(rhs.shared_lock)) {
+      VLOG(3) << "Moving SharedLockReleaser of " << shared_lock.get()
+              << " count is: " << shared_lock.use_count();
+    }
+
+    SharedLockReleaser(const SharedLockReleaser& rhs)
+        : shared_lock(rhs.shared_lock) {
+      VLOG(3) << "Copying SharedLockReleaser of " << shared_lock.get()
+              << " count is: " << shared_lock.use_count();
+    }
+
+    ~SharedLockReleaser() {
+      VLOG(3) << "Destroying SharedLockReleaser of " << shared_lock.get()
+              << " count is: " << shared_lock.use_count();
+    }
+
+    void Encode(VariantTensorData*) const {
+      // Not supported.
+    }
+
+    bool Decode(const VariantTensorData&) {
+      return false;  // Not supported.
+    }
+  };
+
+  void AcquireAsync(
+      OpKernelContext* c,
+      std::function<void(const Status& s, SharedLockReleaser lock)> fn) {
+    CancellationManager* cm = c->cancellation_manager();
+    CancellationToken token{};
+    bool* cancelled = nullptr;
+    if (cm) {
+      cancelled = new bool(false);  // GUARDED_BY(mu_);
+      token = cm->get_cancellation_token();
+      const bool already_cancelled =
+          !cm->RegisterCallback(token, [this, cancelled]() {
+            mutex_lock lock(mu_);
+            *cancelled = true;
+            cv_.notify_all();
+          });
+      if (already_cancelled) {
+        delete cancelled;
+        fn(errors::Cancelled("Lock acquisition cancelled."),
+           SharedLockReleaser{nullptr});
+        return;
+      }
+    }
+    thread_pool_->Schedule(std::bind(
+        [this, c, cm, cancelled,
+         token](std::function<void(const Status& s, SharedLockReleaser&& lock)>
+                    fn_) {
+          bool local_locked;
+          {
+            mutex_lock lock(mu_);
+            while (locked_ && !(cancelled && *cancelled)) {
+              cv_.wait(lock);
+            }
+            local_locked = locked_ = !(cancelled && *cancelled);
+          }
+          if (cm) {
+            cm->DeregisterCallback(token);
+            delete cancelled;
+          }
+          if (local_locked) {  // Not cancelled.
+            fn_(Status::OK(),
+                SharedLockReleaser{std::make_shared<LockReleaser>(this)});
+          } else {
+            fn_(errors::Cancelled("Lock acquisition cancelled."),
+                SharedLockReleaser{nullptr});
+          }
+        },
+        std::move(fn)));
+  }
+
+ private:
+  mutex mu_;
+  condition_variable cv_ GUARDED_BY(mu_);
+  bool locked_ GUARDED_BY(mu_);
+  std::unique_ptr<thread::ThreadPool> thread_pool_;
+  string name_;
+};
+
+}  // namespace
+
+class MutexLockOp : public AsyncOpKernel {
+ public:
+  explicit MutexLockOp(OpKernelConstruction* c) : AsyncOpKernel(c) {}
+
+ public:
+  void ComputeAsync(OpKernelContext* c, DoneCallback done) override {
+    Mutex* mutex = nullptr;
+    OP_REQUIRES_OK_ASYNC(
+        c,
+        LookupOrCreateResource<Mutex>(c, HandleFromInput(c, 0), &mutex,
+                                      [this, c](Mutex** ptr) {
+                                        *ptr = new Mutex(
+                                            c, HandleFromInput(c, 0).name());
+                                        return Status::OK();
+                                      }),
+        done);
+
+    Tensor* variant;
+    OP_REQUIRES_OK_ASYNC(c, c->allocate_output(0, TensorShape({}), &variant),
+                         done);
+
+    mutex->AcquireAsync(
+        c, std::bind(
+               [this, c, variant, mutex](DoneCallback done_,
+                                         // End of bound arguments.
+                                         const Status& s,
+                                         Mutex::SharedLockReleaser&& lock) {
+                 VLOG(2) << "Finished locking mutex " << mutex
+                         << " with lock: " << lock.shared_lock.get()
+                         << " status: " << s.ToString();
+                 if (s.ok()) {
+                   variant->scalar<Variant>()() = std::move(lock);
+                 } else {
+                   c->SetStatus(s);
+                 }
+                 mutex->Unref();
+                 done_();
+               },
+               std::move(done), std::placeholders::_1, std::placeholders::_2));
+  }
+};
+
+class ConsumeMutexLockOp : public OpKernel {
+ public:
+  explicit ConsumeMutexLockOp(OpKernelConstruction* context)
+      : OpKernel(context) {}
+
+  void Compute(OpKernelContext* c) override {
+    VLOG(2) << "Executing ConsumeMutexLockOp";
+    const Tensor& lock_t = c->input(0);
+    OP_REQUIRES(
+        c, lock_t.dims() == 0,
+        errors::InvalidArgument("Expected input to be a scalar, saw shape: ",
+                                lock_t.shape().DebugString()));
+    OP_REQUIRES(
+        c, lock_t.dtype() == DT_VARIANT,
+        errors::InvalidArgument("Expected input to be a variant, saw type: ",
+                                DataTypeString(lock_t.dtype())));
+    const auto* lock =
+        lock_t.scalar<Variant>()().get<Mutex::SharedLockReleaser>();
+    OP_REQUIRES(c, lock,
+                errors::InvalidArgument(
+                    "Expected input to contain a SharedLockReleaser "
+                    "object, but saw variant: '",
+                    lock_t.scalar<Variant>()().DebugString(), "'"));
+    const int use_count = lock->shared_lock.use_count();
+    OP_REQUIRES(
+        c, use_count == 1,
+        errors::InvalidArgument("Expected use count of lock to be 1, but saw: ",
+                                use_count));
+  }
+
+  bool IsExpensive() override { return false; }
+};
+
+REGISTER_KERNEL_BUILDER(Name("MutexLock").Device(DEVICE_CPU), MutexLockOp);
+
+REGISTER_KERNEL_BUILDER(Name("MutexV2").Device(DEVICE_CPU),
+                        ResourceHandleOp<Mutex>);
+
+REGISTER_KERNEL_BUILDER(Name("ConsumeMutexLock").Device(DEVICE_CPU),
+                        ConsumeMutexLockOp);
+
+}  // namespace tensorflow
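
The lock-handle pattern used by `Mutex::LockReleaser` and `SharedLockReleaser` above can be sketched with standard-library primitives only. This is a simplified, synchronous illustration, not the kernel's implementation: `SimpleMutex`, `Releaser`, `Acquire`, and `locked` are hypothetical names, and the thread pool and cancellation handling are left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the Mutex resource; uses std:: primitives only.
class SimpleMutex {
 public:
  // Clears the locked flag when the last shared_ptr copy is destroyed.
  class Releaser {
   public:
    explicit Releaser(SimpleMutex* m) : m_(m) { assert(m != nullptr); }
    Releaser(const Releaser&) = delete;
    Releaser& operator=(const Releaser&) = delete;
    ~Releaser() {
      std::lock_guard<std::mutex> l(m_->mu_);
      m_->locked_ = false;
      m_->cv_.notify_all();  // Wake any thread blocked in Acquire().
    }

   private:
    SimpleMutex* m_;
  };

  // Blocks until the flag is clear, sets it, and returns a shared handle.
  std::shared_ptr<Releaser> Acquire() {
    {
      std::unique_lock<std::mutex> l(mu_);
      cv_.wait(l, [this] { return !locked_; });
      locked_ = true;
    }
    return std::make_shared<Releaser>(this);
  }

  bool locked() {
    std::lock_guard<std::mutex> l(mu_);
    return locked_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool locked_ = false;
};
```

Because the handle is a `std::shared_ptr`, copies of it keep the lock held until the last copy goes away, which is the property `ConsumeMutexLockOp` checks through `use_count()`.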
diff --git a/tensorflow/core/kernels/random_op.cc b/tensorflow/core/kernels/random_op.cc
index 78ff7948fbf1b6406b2faca1d94acd7ea3325437..e37232539f2f32cba74cde354dade0efe8bf719a 100644
--- a/tensorflow/core/kernels/random_op.cc
+++ b/tensorflow/core/kernels/random_op.cc
@@ -495,6 +495,7 @@ class RandomGammaOp : public OpKernel {
                           RandomUniformIntOp<CPUDevice, IntType>);
 
 TF_CALL_half(REGISTER);
+TF_CALL_bfloat16(REGISTER);
 TF_CALL_float(REGISTER);
 TF_CALL_double(REGISTER);
 TF_CALL_int32(REGISTER_INT);
diff --git a/tensorflow/core/kernels/reduction_gpu_kernels.cu.h b/tensorflow/core/kernels/reduction_gpu_kernels.cu.h
index 15ae4c1fc53b2b9bfe1d6085d2ecbc3659705b47..9237fa51d885c633675146191dc384dd87d8ab22 100644
--- a/tensorflow/core/kernels/reduction_gpu_kernels.cu.h
+++ b/tensorflow/core/kernels/reduction_gpu_kernels.cu.h
@@ -280,8 +280,8 @@ __global__ void ColumnReduceMax16ColumnsKernel(
   const int rows_in_this_warp = min(rows_per_warp, num_rows - start_row_warp);
   // not the most efficient way to do this sum
   for (int i = 1; i < rows_in_this_warp; ++i) {
-    value_type tmp =
-        cub::ShuffleIndex(sum, threadIdx.x + i * num_cols, 32, 0xffffffff);
+    value_type tmp = cub::ShuffleIndex<32, value_type>(
+        sum, static_cast<int>(threadIdx.x + i * num_cols), 0xffffffff);
     if (lane < num_cols) sum = op(sum, tmp);
   }
 
diff --git a/tensorflow/core/kernels/regex_replace_op.cc b/tensorflow/core/kernels/regex_replace_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..59ec854a79c90424966e4c7f19f8e5c10dfe17d4
--- /dev/null
+++ b/tensorflow/core/kernels/regex_replace_op.cc
@@ -0,0 +1,76 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <string>
+
+#include "re2/re2.h"
+#include "tensorflow/core/framework/op_kernel.h"
+#include "tensorflow/core/framework/tensor.h"
+#include "tensorflow/core/lib/core/errors.h"
+#include "tensorflow/core/lib/core/status.h"
+
+namespace tensorflow {
+
+class RegexReplaceOp : public OpKernel {
+ public:
+  explicit RegexReplaceOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
+    OP_REQUIRES_OK(ctx, ctx->GetAttr("replace_global", &replace_global_));
+  }
+
+  void Compute(OpKernelContext* ctx) override {
+    const Tensor* input_tensor;
+    OP_REQUIRES_OK(ctx, ctx->input("input", &input_tensor));
+    const auto& input_flat = input_tensor->flat<string>();
+
+    const Tensor* pattern_tensor;
+    OP_REQUIRES_OK(ctx, ctx->input("pattern", &pattern_tensor));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(pattern_tensor->shape()),
+                errors::InvalidArgument("Pattern must be scalar, but received ",
+                                        pattern_tensor->shape().DebugString()));
+    const string pattern = pattern_tensor->flat<string>()(0);
+    const RE2 match(pattern);
+    OP_REQUIRES(ctx, match.ok(),
+                errors::InvalidArgument("Invalid pattern: ", pattern,
+                                        ", error: ", match.error()));
+
+    const Tensor* rewrite_tensor;
+    OP_REQUIRES_OK(ctx, ctx->input("rewrite", &rewrite_tensor));
+    OP_REQUIRES(ctx, TensorShapeUtils::IsScalar(rewrite_tensor->shape()),
+                errors::InvalidArgument("Rewrite must be scalar, but received ",
+                                        rewrite_tensor->shape().DebugString()));
+    const string rewrite = rewrite_tensor->flat<string>()(0);
+
+    Tensor* output_tensor = nullptr;
+    OP_REQUIRES_OK(ctx, ctx->allocate_output("output", input_tensor->shape(),
+                                             &output_tensor));
+    auto output_flat = output_tensor->flat<string>();
+    for (size_t i = 0; i < input_flat.size(); ++i) {
+      output_flat(i) = input_flat(i);
+      if (replace_global_) {
+        RE2::GlobalReplace(&output_flat(i), match, rewrite);
+      } else {
+        RE2::Replace(&output_flat(i), match, rewrite);
+      }
+    }
+  }
+
+ private:
+  bool replace_global_;
+};
+
+REGISTER_KERNEL_BUILDER(Name("RegexReplace").Device(DEVICE_CPU),
+                        RegexReplaceOp);
+
+}  // namespace tensorflow
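
The per-element loop in `RegexReplaceOp::Compute` above can be approximated outside TensorFlow. The sketch below swaps RE2 for `std::regex` so that it stays self-contained; the function name `RegexReplaceAll` is hypothetical. Note that the two libraries use different rewrite syntax for capture groups (`\1` in RE2, `$1` in `std::regex`), so only plain-text rewrites carry over unchanged.

```cpp
#include <regex>
#include <string>
#include <vector>

// Applies one pattern and one rewrite string to every input element.
// replace_global selects between replacing all matches and only the first,
// mirroring the choice between RE2::GlobalReplace and RE2::Replace.
std::vector<std::string> RegexReplaceAll(const std::vector<std::string>& input,
                                         const std::string& pattern,
                                         const std::string& rewrite,
                                         bool replace_global) {
  const std::regex re(pattern);  // Throws std::regex_error if invalid.
  const auto flags = replace_global
                         ? std::regex_constants::format_default
                         : std::regex_constants::format_first_only;
  std::vector<std::string> output;
  output.reserve(input.size());
  for (const std::string& s : input) {
    output.push_back(std::regex_replace(s, re, rewrite, flags));
  }
  return output;
}
```

As in the kernel, the pattern is compiled once per call rather than once per element.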
diff --git a/tensorflow/core/kernels/relu_op_gpu.cu.cc b/tensorflow/core/kernels/relu_op_gpu.cu.cc
index ec09d8dfea519a70474dca7d3167ba20d3d16d69..6e46c979f33496f8da2c561683723728e28a610e 100644
--- a/tensorflow/core/kernels/relu_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/relu_op_gpu.cu.cc
@@ -19,15 +19,104 @@ limitations under the License.
 
 #include <stdio.h>
 
-#include "tensorflow/core/kernels/relu_op_functor.h"
-
+#include "third_party/eigen3/Eigen/Core"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/framework/tensor_types.h"
+#include "tensorflow/core/kernels/relu_op_functor.h"
+#include "tensorflow/core/util/cuda_kernel_helper.h"
+#include "tensorflow/core/util/cuda_launch_config.h"
 
 namespace tensorflow {
 
 typedef Eigen::GpuDevice GPUDevice;
 
+namespace functor {
+#ifdef TF_HAS_CUDA_FP16
+
+// This kernel computes ReluGrad by processing one half2, two fp16, at a time.
+// It effectively does: backprop = (feature > 0) ? gradient : 0
+// It also tries to use native half2 primitives as much as possible.
+__global__ void ReluGradHalfKernel(const Eigen::half* gradient,
+                                   const Eigen::half* feature,
+                                   Eigen::half* backprop, int32 count) {
+  int32 half2_count = count >> 1;
+  int32 index = blockIdx.x * blockDim.x + threadIdx.x;
+  const int32 total_device_threads = gridDim.x * blockDim.x;
+
+  while (index < half2_count) {
+    // The fast branch.
+    // One half2, two fp16, is fetched and processed at a time.
+    half2 gradient_h2 = reinterpret_cast<const half2*>(gradient)[index];
+    half2 feature_h2 = reinterpret_cast<const half2*>(feature)[index];
+    half2* p_backprop_h2 = reinterpret_cast<half2*>(backprop) + index;
+
+#if __CUDA_ARCH__ >= 530
+    // Fast path, when half2 primitives are available.
+    const half2 kZeroH2 = __float2half2_rn(0.f);
+    // mask = (feature > 0)
+    half2 mask_h2 = __hgt2(feature_h2, kZeroH2);
+    // backprop = mask * gradient
+    half2 backprop_h2 = __hmul2(mask_h2, gradient_h2);
+#else
+    // Fall back: convert half2 to float2 for processing.
+    float2 feature_f2 = __half22float2(feature_h2);
+    float2 gradient_f2 = __half22float2(gradient_h2);
+    float2 backprop_f2 = make_float2((feature_f2.x > 0) ? gradient_f2.x : 0,
+                                     (feature_f2.y > 0) ? gradient_f2.y : 0);
+    // Convert back to half2.
+    half2 backprop_h2 = __float22half2_rn(backprop_f2);
+#endif
+
+    // Write back the result.
+    *p_backprop_h2 = backprop_h2;
+
+    index += total_device_threads;
+  }
+
+  if ((count & 0x1) == 1 && index == half2_count) {
+    // If the total number of elements is odd, process the last element.
+    Eigen::half grad_h = gradient[count - 1];
+    Eigen::half feature_h = feature[count - 1];
+
+    float grad_f = static_cast<float>(grad_h);
+    float feature_f = static_cast<float>(feature_h);
+    float backprop_f = (feature_f > 0) ? grad_f : 0;
+
+    Eigen::half backprop_h(backprop_f);
+    backprop[count - 1] = backprop_h;
+  }
+}
+
+template <typename Device>
+struct ReluGrad<Device, Eigen::half> {
+  // Computes ReluGrad backprop.
+  //
+  // gradient: gradient backpropagated to the Relu op.
+  // feature: either the inputs that were passed to the Relu, or its outputs
+  //           (using either one yields the same result here).
+  // backprop: gradient to backpropagate to the Relu inputs.
+  void operator()(const Device& d,
+                  typename TTypes<Eigen::half>::ConstTensor gradient,
+                  typename TTypes<Eigen::half>::ConstTensor feature,
+                  typename TTypes<Eigen::half>::Tensor backprop) {
+    // NOTE: When the activation is exactly zero, we do not propagate the
+    // associated gradient value. This allows the output of the Relu to be used,
+    // as well as its input.
+    int32 count = gradient.size();
+    if (count == 0) return;
+    int32 half2_count = Eigen::divup(count, 2);
+    const int32 kThreadInBlock = 512;
+    CudaLaunchConfig config = GetCudaLaunchConfigFixedBlockSize(
+        half2_count, d, ReluGradHalfKernel, 0, kThreadInBlock);
+    ReluGradHalfKernel<<<config.block_count, config.thread_per_block, 0,
+                         d.stream()>>>(gradient.data(), feature.data(),
+                                       backprop.data(), count);
+  }
+};
+
+#endif  // TF_HAS_CUDA_FP16
+}  // namespace functor
+
 // Definition of the GPU implementations declared in relu_op.cc.
 #define DEFINE_GPU_KERNELS(T)                       \
   template struct functor::Relu<GPUDevice, T>;      \
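
The indexing scheme of `ReluGradHalfKernel` in the hunk above (elements processed two at a time, plus a tail element when the count is odd) can be checked against a plain CPU reference. The sketch below is an assumption-laden illustration in ordinary C++: it uses `float` instead of `Eigen::half`, a sequential loop instead of a grid-stride loop, and the hypothetical name `ReluGradReference`.

```cpp
#include <cassert>
#include <vector>

// CPU reference for backprop = (feature > 0) ? gradient : 0, walking the
// data in pairs the way the half2 kernel does.
std::vector<float> ReluGradReference(const std::vector<float>& gradient,
                                     const std::vector<float>& feature) {
  assert(gradient.size() == feature.size());
  const int count = static_cast<int>(gradient.size());
  const int pair_count = count >> 1;  // Number of complete pairs.
  std::vector<float> backprop(count);

  // Paired portion: each iteration covers elements 2*p and 2*p + 1.
  for (int p = 0; p < pair_count; ++p) {
    for (int k = 0; k < 2; ++k) {
      const int i = 2 * p + k;
      backprop[i] = (feature[i] > 0) ? gradient[i] : 0.0f;
    }
  }

  // Tail: when count is odd, the last element is not part of any pair.
  if ((count & 0x1) == 1) {
    const int i = count - 1;
    backprop[i] = (feature[i] > 0) ? gradient[i] : 0.0f;
  }
  return backprop;
}
```

A feature value of exactly zero yields a zero gradient, matching the NOTE in the functor about not propagating the gradient when the activation is zero.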
diff --git a/tensorflow/core/kernels/resource_variable_ops.cc b/tensorflow/core/kernels/resource_variable_ops.cc
index 702fb89aac9afe577cf7e4cd72616f7136a63b0b..2041fb90946860c5164da3cb448ff81d9f654e54 100644
--- a/tensorflow/core/kernels/resource_variable_ops.cc
+++ b/tensorflow/core/kernels/resource_variable_ops.cc
@@ -253,6 +253,7 @@ class AssignVariableOp : public OpKernel {
     std::unique_ptr<Tensor> input_alias =
         context->forward_input(1, dtype_, value.shape(), DEVICE_MEMORY, attr);
     mutex_lock ml(*variable->mu());
+    variable->is_initialized = true;
     if (input_alias) {
       *variable->tensor() = *input_alias;
       return;
@@ -363,7 +364,7 @@ class AssignVariableOp<Device, Variant> : public OpKernel {
                     DataTypeString(DT_VARIANT)));
 
     mutex_lock ml(*variable->mu());
-
+    variable->is_initialized = true;
     *variable->tensor() = Tensor(DT_VARIANT, value.shape());
     const auto elements_in = value.flat<Variant>();
     auto elements_out = variable->tensor()->flat<Variant>();
@@ -462,8 +463,29 @@ TF_CALL_int64(REGISTER_GPU_KERNELS);
 #undef REGISTER_GPU_KERNELS
 #endif  // GOOGLE_CUDA
 
+class VarIsInitializedOp : public OpKernel {
+ public:
+  explicit VarIsInitializedOp(OpKernelConstruction* c) : OpKernel(c) {}
+
+  void Compute(OpKernelContext* context) override {
+    Tensor* output = nullptr;
+    OP_REQUIRES_OK(context,
+                   context->allocate_output(0, TensorShape({}), &output));
+    auto output_tensor = output->tensor<bool, 0>();
+    Var* variable = nullptr;
+    Status s = LookupResource(context, HandleFromInput(context, 0), &variable);
+    if (!s.ok()) {
+      output_tensor() = false;
+      return;
+    }
+    core::ScopedUnref su(variable);
+    mutex_lock ml(*variable->mu());
+    output_tensor() = variable->is_initialized;
+  }
+};
+
 REGISTER_KERNEL_BUILDER(Name("VarIsInitializedOp").Device(DEVICE_CPU),
-                        IsResourceInitialized<Var>);
+                        VarIsInitializedOp);
 
 #if GOOGLE_CUDA
 REGISTER_KERNEL_BUILDER(Name("VarIsInitializedOp")
diff --git a/tensorflow/core/kernels/segment_reduction_ops.h b/tensorflow/core/kernels/segment_reduction_ops.h
index 51814273b305bfa35bca0ddce0376658064ea56a..fe0a2782f952386e673127776c8f20da3ab1e2d5 100644
--- a/tensorflow/core/kernels/segment_reduction_ops.h
+++ b/tensorflow/core/kernels/segment_reduction_ops.h
@@ -16,6 +16,14 @@ limitations under the License.
 #ifndef THIRD_PARTY_TENSORFLOW_CORE_KERNELS_SEGMENT_REDUCTION_OPS_H_
 #define THIRD_PARTY_TENSORFLOW_CORE_KERNELS_SEGMENT_REDUCTION_OPS_H_
 
+
+// This file requires the following include because it uses CudaAtomicMax:
+// #include "tensorflow/core/util/cuda_kernel_helper.h"
+
+// Unfortunately we can't add the #include, since it breaks compilation for
+// non-GPU targets. This only breaks in clang, because it's more strict for
+// template code and CudaAtomicMax is used in template context.
+
 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
 #include "tensorflow/core/framework/tensor.h"
 #include "tensorflow/core/framework/tensor_shape.h"
diff --git a/tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc b/tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc
index ba979e6bb216b649ff4fc3cefa7099ac9cbc1b91..3511c85f7174f8dab47ca3ba05f01d7c4f5110b8 100644
--- a/tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc
+++ b/tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc
@@ -17,10 +17,13 @@ limitations under the License.
 
 #define EIGEN_USE_GPU
 
+// We need to include cuda_kernel_helper.h before segment_reduction_ops.h
+// See comment in segment_reduction_ops.h for more details.
+#include "tensorflow/core/util/cuda_kernel_helper.h"
+
 #include "tensorflow/core/kernels/segment_reduction_ops.h"
 #include "tensorflow/core/framework/register_types.h"
 #include "tensorflow/core/util/cuda_device_functions.h"
-#include "tensorflow/core/util/cuda_kernel_helper.h"
 
 
 namespace tensorflow {
diff --git a/tensorflow/core/kernels/serialize_sparse_op.cc b/tensorflow/core/kernels/serialize_sparse_op.cc
index 799c574d1542c345c606c276b0cc24fe61a47bba..64e0a68c2c119394561e947c4cf37838defd2d39 100644
--- a/tensorflow/core/kernels/serialize_sparse_op.cc
+++ b/tensorflow/core/kernels/serialize_sparse_op.cc
@@ -44,6 +44,8 @@ class SerializeSparseOp : public OpKernel {
   explicit SerializeSparseOp(OpKernelConstruction* context)
       : OpKernel(context) {}
 
+  bool IsExpensive() override;
+
   Status Initialize(Tensor* result);
   Status Serialize(const Tensor& input, T* result);
 
@@ -82,6 +84,21 @@ class SerializeSparseOp : public OpKernel {
   }
 };
 
+// NOTE(mrry): We specialize the IsExpensive() method differently for
+// the string and variant cases, because (i) the string version
+// actually performs memory copies as part of its serialization (and
+// is hence potentially expensive), and (ii) the variant version
+// performs O(1) shallow copies (and hence is much cheaper than
+// dispatching to another thread would be).
+template <>
+bool SerializeSparseOp<string>::IsExpensive() {
+  return true;
+}
+template <>
+bool SerializeSparseOp<Variant>::IsExpensive() {
+  return false;
+}
+
 template <>
 Status SerializeSparseOp<string>::Initialize(Tensor* result) {
   *result = Tensor(DT_STRING, TensorShape({3}));
diff --git a/tensorflow/core/kernels/split_lib.h b/tensorflow/core/kernels/split_lib.h
index a08949e626cc8e5d4c3707b75a902d82b46c3376..bc1fa28f8f8f23085d89e5b98d57914de778ea0b 100644
--- a/tensorflow/core/kernels/split_lib.h
+++ b/tensorflow/core/kernels/split_lib.h
@@ -31,31 +31,31 @@ struct SplitCustom {
                   const Eigen::DSizes<Eigen::DenseIndex, 2>& slice_sizes);
 };
 
-template <typename Device, typename T>
+template <typename Device, typename T, int NDims>
 struct Split {
-  void operator()(const Device& d, typename TTypes<T, 3>::Tensor output,
-                  typename TTypes<T, 3>::ConstTensor input,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes);
+  void operator()(const Device& d, typename TTypes<T, NDims>::Tensor output,
+                  typename TTypes<T, NDims>::ConstTensor input,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes);
 };
 
-template <typename T>
-struct Split<Eigen::ThreadPoolDevice, T> {
+template <typename T, int NDims>
+struct Split<Eigen::ThreadPoolDevice, T, NDims> {
   void operator()(const Eigen::ThreadPoolDevice& d,
-                  typename TTypes<T, 3>::Tensor output,
-                  typename TTypes<T, 3>::ConstTensor input,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes);
+                  typename TTypes<T, NDims>::Tensor output,
+                  typename TTypes<T, NDims>::ConstTensor input,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes);
 };
 
 #ifdef TENSORFLOW_USE_SYCL
-template <typename T>
+template <typename T, int NDims>
-struct Split<Eigen::SyclDevice, T> {
+struct Split<Eigen::SyclDevice, T, NDims> {
   void operator()(const Eigen::SyclDevice& d,
-                  typename TTypes<T, 3>::Tensor output,
-                  typename TTypes<T, 3>::ConstTensor input,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-                  const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes);
+                  typename TTypes<T, NDims>::Tensor output,
+                  typename TTypes<T, NDims>::ConstTensor input,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+                  const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes);
 };
 #endif  // TENSORFLOW_USE_SYCL
 
diff --git a/tensorflow/core/kernels/split_lib_cpu.cc b/tensorflow/core/kernels/split_lib_cpu.cc
index 771c633b156edf7c7d9944fe95703a0e0cd9e981..a3060e4e90d8db6866bd0c56570beeef65ab58ce 100644
--- a/tensorflow/core/kernels/split_lib_cpu.cc
+++ b/tensorflow/core/kernels/split_lib_cpu.cc
@@ -24,12 +24,12 @@ limitations under the License.
 namespace tensorflow {
 namespace functor {
 
-template <typename T>
-void Split<Eigen::ThreadPoolDevice, T>::operator()(
-    const Eigen::ThreadPoolDevice& d, typename TTypes<T, 3>::Tensor output,
-    typename TTypes<T, 3>::ConstTensor input,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes) {
+template <typename T, int NDims>
+void Split<Eigen::ThreadPoolDevice, T, NDims>::operator()(
+    const Eigen::ThreadPoolDevice& d, typename TTypes<T, NDims>::Tensor output,
+    typename TTypes<T, NDims>::ConstTensor input,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes) {
   if (output.size() < 131072) {
     output = input.slice(slice_indices, slice_sizes);
   } else {
@@ -37,22 +37,26 @@ void Split<Eigen::ThreadPoolDevice, T>::operator()(
   }
 }
 
-#define DEFINE_CPU_KERNELS(T) template struct Split<Eigen::ThreadPoolDevice, T>;
+#define DEFINE_CPU_KERNELS(T)                           \
+  template struct Split<Eigen::ThreadPoolDevice, T, 2>; \
+  template struct Split<Eigen::ThreadPoolDevice, T, 3>;
 
 TF_CALL_ALL_TYPES(DEFINE_CPU_KERNELS)
 DEFINE_CPU_KERNELS(quint8)
 
 #ifdef TENSORFLOW_USE_SYCL
-template <typename T>
-void Split<Eigen::SyclDevice, T>::operator()(
-    const Eigen::SyclDevice& d, typename TTypes<T, 3>::Tensor output,
-    typename TTypes<T, 3>::ConstTensor input,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes) {
+template <typename T, int NDims>
+void Split<Eigen::SyclDevice, T, NDims>::operator()(
+    const Eigen::SyclDevice& d, typename TTypes<T, NDims>::Tensor output,
+    typename TTypes<T, NDims>::ConstTensor input,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes) {
   output.device(d) = input.slice(slice_indices, slice_sizes);
 }
 
-#define DEFINE_SYCL_KERNELS(T) template struct Split<Eigen::SyclDevice, T>;
+#define DEFINE_SYCL_KERNELS(T)                    \
+  template struct Split<Eigen::SyclDevice, T, 2>; \
+  template struct Split<Eigen::SyclDevice, T, 3>;
 
 TF_CALL_GPU_NUMBER_TYPES_NO_HALF(DEFINE_SYCL_KERNELS);
 #endif  // TENSORFLOW_USE_SYCL
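
The change above adds a rank parameter to `Split` and then explicitly instantiates only the ranks that are used (2 and 3). The same pattern can be shown in isolation. In the sketch below, `ElementCount` and `DEFINE_KERNELS` are hypothetical names chosen for illustration and have no counterpart in TensorFlow.

```cpp
#include <array>
#include <cstddef>

// Declaration, as it would appear in a header: the rank is a template
// parameter, so one declaration covers every rank.
template <typename T, int NDims>
struct ElementCount {
  std::size_t operator()(const std::array<std::size_t, NDims>& sizes) const;
};

// Definition, as it would appear in a .cc file.
template <typename T, int NDims>
std::size_t ElementCount<T, NDims>::operator()(
    const std::array<std::size_t, NDims>& sizes) const {
  std::size_t n = 1;
  for (std::size_t s : sizes) n *= s;
  return n;
}

// Explicit instantiation for the supported ranks only. Other ranks would
// fail to link if the definition were not visible to the caller.
#define DEFINE_KERNELS(T)             \
  template struct ElementCount<T, 2>; \
  template struct ElementCount<T, 3>;

DEFINE_KERNELS(float)
DEFINE_KERNELS(int)
#undef DEFINE_KERNELS
```

Listing the ranks in one macro keeps the set of instantiations in a single place, so adding a rank later is a one-line change per device.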
diff --git a/tensorflow/core/kernels/split_lib_gpu.cu.cc b/tensorflow/core/kernels/split_lib_gpu.cu.cc
index 9f234fc0935be0662b0d8df1a6bd1c109ab24fd9..393818730bb4fe7fc6bba7f66b2cc96b12cab390 100644
--- a/tensorflow/core/kernels/split_lib_gpu.cu.cc
+++ b/tensorflow/core/kernels/split_lib_gpu.cu.cc
@@ -29,12 +29,12 @@ limitations under the License.
 namespace tensorflow {
 namespace functor {
 
-template <typename Device, typename T>
-void Split<Device, T>::operator()(
-    const Device& d, typename TTypes<T, 3>::Tensor output,
-    typename TTypes<T, 3>::ConstTensor input,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_indices,
-    const Eigen::DSizes<Eigen::DenseIndex, 3>& slice_sizes) {
+template <typename Device, typename T, int NDims>
+void Split<Device, T, NDims>::operator()(
+    const Device& d, typename TTypes<T, NDims>::Tensor output,
+    typename TTypes<T, NDims>::ConstTensor input,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_indices,
+    const Eigen::DSizes<Eigen::DenseIndex, NDims>& slice_sizes) {
   To32Bit(output).device(d) = To32Bit(input).slice(slice_indices, slice_sizes);
 }
 
@@ -47,7 +47,9 @@ void SplitCustom<Device, T>::operator()(
   To32Bit(output).device(d) = To32Bit(input).slice(slice_indices, slice_sizes);
 }
 
-#define DEFINE_GPU_KERNELS(T) template struct Split<Eigen::GpuDevice, T>;
+#define DEFINE_GPU_KERNELS(T)                    \
+  template struct Split<Eigen::GpuDevice, T, 2>; \
+  template struct Split<Eigen::GpuDevice, T, 3>;
 
 TF_CALL_GPU_NUMBER_TYPES(DEFINE_GPU_KERNELS);
 TF_CALL_complex64(DEFINE_GPU_KERNELS);
diff --git a/tensorflow/core/kernels/split_op.cc b/tensorflow/core/kernels/split_op.cc
index 85f529326dbf5d9d5ae72156da05f08f805d1271..7cc3c532c95584a66cfab3f184ffef4028ee4bdb 100644
--- a/tensorflow/core/kernels/split_op.cc
+++ b/tensorflow/core/kernels/split_op.cc
@@ -121,6 +121,77 @@ class SplitOpBase : public OpKernel {
   }
 };
 
+template <typename T, typename InputReshapedType, int NDims>
+class SplitOpCPUImpl {
+ public:
+  template <typename MakeSizesType, typename ReshapeResultType>
+  void operator()(OpKernelContext* context,
+                  const InputReshapedType& input_reshaped,
+                  const TensorShape& input_shape, int32 split_dim,
+                  Eigen::DenseIndex prefix_dim_size,
+                  Eigen::DenseIndex split_dim_size,
+                  Eigen::DenseIndex suffix_dim_size,
+                  const MakeSizesType& make_sizes,
+                  const ReshapeResultType& reshape_result, int32 num_split,
+                  int64 split_dim_output_size) const {
+    const auto num_threads =
+        context->device()->tensorflow_cpu_worker_threads()->num_threads;
+    // TODO(jewillco): Tune heuristic further.
+    const auto input_element_count = input_shape.num_elements();
+    const bool use_parallelism_between_outputs =
+        (num_split >= 4 &&
+         input_element_count >= std::max(num_threads, num_split) * 4096 &&
+         input_element_count < num_split * 180 * 1024);
+    Eigen::DSizes<Eigen::DenseIndex, NDims> indices;
+    for (int i = 0; i < NDims; ++i) {
+      indices[i] = 0;
+    }
+    auto sizes = make_sizes(split_dim_output_size);
+    TensorShape output_shape(input_shape);
+    output_shape.set_dim(split_dim, split_dim_output_size);
+
+    auto range_output_func = [&indices, context, &output_shape, prefix_dim_size,
+                              split_dim_output_size, suffix_dim_size, &sizes,
+                              use_parallelism_between_outputs, &input_reshaped,
+                              &reshape_result](int64 start, int64 limit) {
+      for (int64 i = start; i < limit; ++i) {
+        Tensor* result = nullptr;
+        OP_REQUIRES_OK(context,
+                       context->allocate_output(i, output_shape, &result));
+        if (prefix_dim_size * split_dim_output_size * suffix_dim_size > 0) {
+          Eigen::DSizes<Eigen::DenseIndex, NDims> slice_indices;
+          Eigen::DSizes<Eigen::DenseIndex, NDims> slice_sizes;
+          for (int j = 0; j < NDims; ++j) {
+            slice_indices[j] =
+                (j == NDims - 2 ? i * split_dim_output_size : indices[j]);
+            slice_sizes[j] = sizes[j];
+          }
+
+          auto result_shaped = reshape_result(result, split_dim_output_size);
+
+          if (use_parallelism_between_outputs) {
+            // Use sequential implementation for single output.
+            result_shaped = input_reshaped.slice(slice_indices, slice_sizes);
+          } else {
+            // This implementation may be parallel internally.
+            functor::Split<CPUDevice, T, NDims>()(
+                context->eigen_device<CPUDevice>(), result_shaped,
+                input_reshaped, slice_indices, slice_sizes);
+          }
+        }
+      }
+    };
+    if (use_parallelism_between_outputs) {
+      // Run in parallel, disabling parallelism in functor.
+      context->device()->tensorflow_cpu_worker_threads()->workers->ParallelFor(
+          num_split, input_element_count / num_split, range_output_func);
+    } else {
+      // Run sequentially, but allow internal parallelism in functor.
+      range_output_func(0, num_split);
+    }
+  }
+};
+
 template <typename T>
 class SplitOpCPU : public SplitOpBase<CPUDevice, T> {
  public:
@@ -154,66 +225,37 @@ class SplitOpCPU : public SplitOpBase<CPUDevice, T> {
 
     std::tie(prefix_dim_size, split_dim_size, suffix_dim_size) =
         Base::template SetDims<Eigen::DenseIndex>(input_shape, split_dim);
-    auto input_reshaped =
-        input.shaped<T, 3>({prefix_dim_size, split_dim_size, suffix_dim_size});
 
     const int64 split_dim_output_size = split_dim_size / num_split;
-    TensorShape output_shape(input_shape);
-    output_shape.set_dim(split_dim, split_dim_output_size);
-
-    Eigen::DSizes<Eigen::DenseIndex, 3> indices{0, 0, 0};
-    const Eigen::DSizes<Eigen::DenseIndex, 3> sizes{
-        prefix_dim_size, split_dim_output_size, suffix_dim_size};
-
-    const auto num_threads =
-        context->device()->tensorflow_cpu_worker_threads()->num_threads;
-    // TODO(jewillco): Tune heuristic further.
-    const auto input_element_count = input_shape.num_elements();
-    const bool use_parallelism_between_outputs =
-        (num_split >= 4 &&
-         input_element_count >= std::max(num_threads, num_split) * 4096 &&
-         input_element_count < num_split * 180 * 1024);
-
-    auto range_output_func = [&indices, context, &output_shape, prefix_dim_size,
-                              split_dim_output_size, suffix_dim_size, &sizes,
-                              use_parallelism_between_outputs,
-                              &input_reshaped](int64 start, int64 limit) {
-      for (int64 i = start; i < limit; ++i) {
-        Tensor* result = nullptr;
-        OP_REQUIRES_OK(context,
-                       context->allocate_output(i, output_shape, &result));
-        if (prefix_dim_size * split_dim_output_size * suffix_dim_size > 0) {
-          Eigen::DSizes<Eigen::DenseIndex, 3> slice_indices;
-          Eigen::DSizes<Eigen::DenseIndex, 3> slice_sizes;
-          for (int j = 0; j < 3; ++j) {
-            slice_indices[j] =
-                (j == 1 ? i * split_dim_output_size : indices[j]);
-            slice_sizes[j] = sizes[j];
-          }
-
-          auto result_shaped = result->shaped<T, 3>(
-              {prefix_dim_size, split_dim_output_size, suffix_dim_size});
 
-          if (use_parallelism_between_outputs) {
-            // Use sequential implementation for single output.
-            result_shaped = input_reshaped.slice(slice_indices, slice_sizes);
-          } else {
-            // This implementation may be parallel internally.
-            functor::Split<CPUDevice, T>()(context->eigen_device<CPUDevice>(),
-                                           result_shaped, input_reshaped,
-                                           slice_indices, slice_sizes);
-          }
-        }
-      }
-    };
-    if (use_parallelism_between_outputs) {
-      // Run in parallel, disabling parallelism in functor.
-      Shard(num_split,
-            context->device()->tensorflow_cpu_worker_threads()->workers,
-            num_split, input_element_count / num_split, range_output_func);
+    if (prefix_dim_size == 1) {
+      auto input_reshaped =
+          input.shaped<T, 2>({split_dim_size, suffix_dim_size});
+      auto make_sizes = [&](Eigen::DenseIndex split_size) {
+        return Eigen::DSizes<Eigen::DenseIndex, 2>{split_size, suffix_dim_size};
+      };
+      auto reshape_result = [&](Tensor* result, Eigen::DenseIndex split_size) {
+        return result->shaped<T, 2>({split_size, suffix_dim_size});
+      };
+      SplitOpCPUImpl<T, decltype(input_reshaped), 2>{}(
+          context, input_reshaped, input_shape, split_dim, prefix_dim_size,
+          split_dim_size, suffix_dim_size, make_sizes, reshape_result,
+          num_split, split_dim_output_size);
     } else {
-      // Run sequentially, but allow internal parallelism in functor.
-      range_output_func(0, num_split);
+      auto input_reshaped = input.shaped<T, 3>(
+          {prefix_dim_size, split_dim_size, suffix_dim_size});
+      auto make_sizes = [&](Eigen::DenseIndex split_size) {
+        return Eigen::DSizes<Eigen::DenseIndex, 3>{prefix_dim_size, split_size,
+                                                   suffix_dim_size};
+      };
+      auto reshape_result = [&](Tensor* result, Eigen::DenseIndex split_size) {
+        return result->shaped<T, 3>(
+            {prefix_dim_size, split_size, suffix_dim_size});
+      };
+      SplitOpCPUImpl<T, decltype(input_reshaped), 3>{}(
+          context, input_reshaped, input_shape, split_dim, prefix_dim_size,
+          split_dim_size, suffix_dim_size, make_sizes, reshape_result,
+          num_split, split_dim_output_size);
     }
   }
 };
diff --git a/tensorflow/core/kernels/split_v_op.cc b/tensorflow/core/kernels/split_v_op.cc
index 7ff5df47d70fa8e47aabfb24e82874c146708ef1..0ce0b552e630220bc9d8c1cb77ef3695cf72dc93 100644
--- a/tensorflow/core/kernels/split_v_op.cc
+++ b/tensorflow/core/kernels/split_v_op.cc
@@ -55,8 +55,13 @@ class SplitVOpBase : public OpKernel {
     const Tensor& input = context->input(0);
     const TensorShape& input_shape = input.shape();
     const Tensor& split_tensor = context->input(1);
+    const Tensor& split_dim_tensor = context->input(2);
 
-    const int32 split_dim_orig = context->input(2).flat<int32>()(0);
+    OP_REQUIRES(context, split_dim_tensor.NumElements() == 1,
+                errors::InvalidArgument("split_dim_tensor must have "
+                                        "exactly one element."));
+
+    const int32 split_dim_orig = split_dim_tensor.flat<int32>()(0);
     const int32 split_dim =
         split_dim_orig < 0 ? split_dim_orig + input.dims() : split_dim_orig;
 
@@ -175,6 +180,77 @@ class SplitVOpBase : public OpKernel {
   }
 };
 
+template <typename T, typename Tlen, typename InputReshapedType, int NDims>
+class SplitVOpCPUImpl {
+ public:
+  template <typename MakeSizesType, typename ReshapeResultType>
+  void operator()(OpKernelContext* context,
+                  const InputReshapedType& input_reshaped,
+                  const std::vector<int64>& split_start_points,
+                  const TensorShape& input_shape, int32 split_dim,
+                  Eigen::DenseIndex prefix_dim_size,
+                  Eigen::DenseIndex split_dim_size,
+                  Eigen::DenseIndex suffix_dim_size,
+                  std::vector<Tlen>& split_sizes_vec,
+                  const MakeSizesType& make_sizes,
+                  const ReshapeResultType& reshape_result) const {
+    Eigen::DSizes<Eigen::DenseIndex, NDims> indices;
+    for (int i = 0; i < NDims; ++i) {
+      indices[i] = 0;
+    }
+    const auto num_threads =
+        context->device()->tensorflow_cpu_worker_threads()->num_threads;
+    // TODO(jewillco): Tune heuristic further.
+    const auto input_element_count = input_shape.num_elements();
+    const int num_split = split_start_points.size();
+    const bool use_parallelism_between_outputs =
+        (num_split >= 4 &&
+         input_element_count >= std::max(num_threads, num_split) * 4096 &&
+         input_element_count < num_split * 180 * 1024);
+
+    auto range_output_func = [&indices, context, &input_shape, prefix_dim_size,
+                              split_dim, &split_sizes_vec, &split_start_points,
+                              suffix_dim_size, use_parallelism_between_outputs,
+                              &input_reshaped, &make_sizes,
+                              &reshape_result](int64 start, int64 limit) {
+      for (int64 i = start; i < limit; ++i) {
+        TensorShape output_shape(input_shape);
+        output_shape.set_dim(split_dim, split_sizes_vec[i]);
+        Tensor* result = nullptr;
+        OP_REQUIRES_OK(context,
+                       context->allocate_output(i, output_shape, &result));
+
+        const auto sizes = make_sizes(split_sizes_vec[i]);
+
+        if (sizes.TotalSize() > 0) {
+          auto result_shaped = reshape_result(result, split_sizes_vec[i]);
+
+          auto current_indices = indices;
+          current_indices[NDims - 2] = split_start_points[i];
+          if (use_parallelism_between_outputs) {
+            // Use sequential implementation for single output.
+            result_shaped = input_reshaped.slice(current_indices, sizes);
+          } else {
+            // This implementation may be parallel internally.
+            functor::Split<CPUDevice, T, NDims>()(
+                context->eigen_device<CPUDevice>(), result_shaped,
+                input_reshaped, current_indices, sizes);
+          }
+        }
+      }
+    };
+    if (use_parallelism_between_outputs) {
+      // Run in parallel, disabling parallelism in functor.
+      Shard(num_split,
+            context->device()->tensorflow_cpu_worker_threads()->workers,
+            num_split, input_element_count / num_split, range_output_func);
+    } else {
+      // Run sequentially, but allow internal parallelism in functor.
+      range_output_func(0, num_split);
+    }
+  }
+};
+
 template <typename T, typename Tlen>
 class SplitVOpCPU : public SplitVOpBase<CPUDevice, T, Tlen> {
  public:
@@ -209,10 +285,6 @@ class SplitVOpCPU : public SplitVOpBase<CPUDevice, T, Tlen> {
 
     std::tie(prefix_dim_size, split_dim_size, suffix_dim_size) =
         Base::template SetDims<Eigen::DenseIndex>(input_shape, split_dim);
-    auto input_reshaped =
-        input.shaped<T, 3>({prefix_dim_size, split_dim_size, suffix_dim_size});
-
-    Eigen::DSizes<Eigen::DenseIndex, 3> indices{0, 0, 0};
     std::vector<int64> split_start_points(num_split);
     for (int i = 0; i < num_split; ++i) {
       if (i == 0) {
@@ -223,55 +295,34 @@ class SplitVOpCPU : public SplitVOpBase<CPUDevice, T, Tlen> {
       }
     }
 
-    const auto num_threads =
-        context->device()->tensorflow_cpu_worker_threads()->num_threads;
-    // TODO(jewillco): Tune heuristic further.
-    const auto input_element_count = input_shape.num_elements();
-    const bool use_parallelism_between_outputs =
-        (num_split >= 4 &&
-         input_element_count >= std::max(num_threads, num_split) * 4096 &&
-         input_element_count < num_split * 180 * 1024);
-
-    auto range_output_func = [&indices, context, &input_shape, prefix_dim_size,
-                              split_dim, &split_sizes_vec, &split_start_points,
-                              suffix_dim_size, use_parallelism_between_outputs,
-                              &input_reshaped](int64 start, int64 limit) {
-      for (int64 i = start; i < limit; ++i) {
-        TensorShape output_shape(input_shape);
-        output_shape.set_dim(split_dim, split_sizes_vec[i]);
-        Tensor* result = nullptr;
-        OP_REQUIRES_OK(context,
-                       context->allocate_output(i, output_shape, &result));
-
-        Eigen::DSizes<Eigen::DenseIndex, 3> sizes{
-            prefix_dim_size, split_sizes_vec[i], suffix_dim_size};
-
-        if (sizes.TotalSize() > 0) {
-          auto result_shaped = result->shaped<T, 3>(
-              {prefix_dim_size, split_sizes_vec[i], suffix_dim_size});
-
-          auto current_indices = indices;
-          current_indices[1] = split_start_points[i];
-          if (use_parallelism_between_outputs) {
-            // Use sequential implementation for single output.
-            result_shaped = input_reshaped.slice(current_indices, sizes);
-          } else {
-            // This implementation may be parallel internally.
-            functor::Split<CPUDevice, T>()(context->eigen_device<CPUDevice>(),
-                                           result_shaped, input_reshaped,
-                                           current_indices, sizes);
-          }
-        }
-      }
-    };
-    if (use_parallelism_between_outputs) {
-      // Run in parallel, disabling parallelism in functor.
-      Shard(num_split,
-            context->device()->tensorflow_cpu_worker_threads()->workers,
-            num_split, input_element_count / num_split, range_output_func);
+    if (prefix_dim_size == 1) {
+      auto input_reshaped =
+          input.shaped<T, 2>({split_dim_size, suffix_dim_size});
+      auto make_sizes = [&](Eigen::DenseIndex split_size) {
+        return Eigen::DSizes<Eigen::DenseIndex, 2>{split_size, suffix_dim_size};
+      };
+      auto reshape_result = [&](Tensor* result, Tlen split_size) {
+        return result->shaped<T, 2>({split_size, suffix_dim_size});
+      };
+      SplitVOpCPUImpl<T, Tlen, decltype(input_reshaped), 2>{}(
+          context, input_reshaped, split_start_points, input_shape, split_dim,
+          prefix_dim_size, split_dim_size, suffix_dim_size, split_sizes_vec,
+          make_sizes, reshape_result);
     } else {
-      // Run sequentially, but allow internal parallelism in functor.
-      range_output_func(0, num_split);
+      auto input_reshaped = input.shaped<T, 3>(
+          {prefix_dim_size, split_dim_size, suffix_dim_size});
+      auto make_sizes = [&](Eigen::DenseIndex split_size) {
+        return Eigen::DSizes<Eigen::DenseIndex, 3>{prefix_dim_size, split_size,
+                                                   suffix_dim_size};
+      };
+      auto reshape_result = [&](Tensor* result, Tlen split_size) {
+        return result->shaped<T, 3>(
+            {prefix_dim_size, split_size, suffix_dim_size});
+      };
+      SplitVOpCPUImpl<T, Tlen, decltype(input_reshaped), 3>{}(
+          context, input_reshaped, split_start_points, input_shape, split_dim,
+          prefix_dim_size, split_dim_size, suffix_dim_size, split_sizes_vec,
+          make_sizes, reshape_result);
     }
   }
 };
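Note on the heuristic shared by SplitOpCPUImpl and SplitVOpCPUImpl: the sketch below restates the `use_parallelism_between_outputs` condition from the hunks above as a standalone predicate so the thresholds can be checked in isolation. The function name and the plain `std::int64_t` parameters are assumptions made for illustration; the kernel itself reads these values from the OpKernelContext and the input shape.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical standalone restatement of the condition in the diff.
// Outputs are processed in parallel only when there are at least 4 of them,
// the input is large enough to amortize scheduling, and each output is small
// enough (under 180 * 1024 elements on average) that per-output internal
// parallelism is not expected to pay off.
bool UseParallelismBetweenOutputs(std::int64_t num_split,
                                  std::int64_t num_threads,
                                  std::int64_t input_element_count) {
  assert(num_threads > 0);
  return num_split >= 4 &&
         input_element_count >= std::max(num_threads, num_split) * 4096 &&
         input_element_count < num_split * 180 * 1024;
}
```

With 4 outputs and 8 worker threads, for example, the lower bound is 8 * 4096 = 32768 elements and the upper bound is 4 * 180 * 1024 = 737280 elements, so only inputs in that half-open range take the parallel-between-outputs path.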
diff --git a/tensorflow/core/kernels/strided_slice_op.cc b/tensorflow/core/kernels/strided_slice_op.cc
index 7745effe2abe94ba73a2f0d761210e07c62e499c..1e3e92a68a05123bafad77348e6811a14c303301 100644
--- a/tensorflow/core/kernels/strided_slice_op.cc
+++ b/tensorflow/core/kernels/strided_slice_op.cc
@@ -109,17 +109,27 @@ class StridedSliceOp : public OpKernel {
     if (is_identity) {
       VLOG(1) << "Strided slice identity ";
       Tensor tmp;
-      CHECK(tmp.CopyFrom(input, final_shape));
+      OP_REQUIRES(context, tmp.CopyFrom(input, final_shape),
+                  errors::Internal("Copy failed"));
       context->set_output(0, tmp);
       return;
     }
 
     // Optimization #2, slice is memory contiguous (only occurs in dim 0)
     if (slice_dim0 && IsDim0SliceAligned<T>(input.shape(), begin[0], end[0])) {
-      CHECK_GE(input.dims(), 1);  // Otherwise, is_identity should be true.
+      OP_REQUIRES(context, input.dims() >= 1,
+                  errors::InvalidArgument(
+                      "Input must have rank at least 1, got: ", input.dims()));
+      // Otherwise, is_identity should be true.
       VLOG(1) << "Strided slice dim 0: " << input.shape().DebugString();
+      OP_REQUIRES(
+          context, begin[0] <= end[0],
+          errors::InvalidArgument("begin[0] (", begin[0],
+                                  ") must be less than or equal to end[0] (", end[0], ")"));
+      Tensor slice = input.Slice(begin[0], end[0]);
       Tensor tmp;
-      CHECK(tmp.CopyFrom(input.Slice(begin[0], end[0]), final_shape));
+      OP_REQUIRES(context, tmp.CopyFrom(slice, final_shape),
+                  errors::Internal("Copy failed"));
       context->set_output(0, tmp);
       return;
     }
@@ -238,7 +248,8 @@ class StridedSliceGradOp : public OpKernel {
 
     if (processing_shape.dims() == 0) {
       auto in = context->input(4);
-      CHECK(result->CopyFrom(in, processing_shape));
+      OP_REQUIRES(context, result->CopyFrom(in, processing_shape),
+                  errors::Internal("Copy failed"));
       return;
     }
 
diff --git a/tensorflow/core/kernels/tensor_array_ops.cc b/tensorflow/core/kernels/tensor_array_ops.cc
index af93d814ec06ff86c6c7eb3312d97224dee485f2..7ec26d95e6886d639d2dde5a61456898529be524 100644
--- a/tensorflow/core/kernels/tensor_array_ops.cc
+++ b/tensorflow/core/kernels/tensor_array_ops.cc
@@ -1104,9 +1104,9 @@ class TensorArrayUnpackOrScatterOp : public OpKernel {
       indices[1] = i;
 
       if (element_shape.num_elements() > 0) {
-        functor::Split<Device, T>()(ctx->eigen_device<Device>(),
-                                    tensor_value_i_t, tensor_value_t, indices,
-                                    sizes);
+        functor::Split<Device, T, 3>()(ctx->eigen_device<Device>(),
+                                       tensor_value_i_t, tensor_value_t,
+                                       indices, sizes);
       }
 
       write_values.push_back(persistent_tensor);
@@ -1295,9 +1295,9 @@ class TensorArraySplitOp : public OpKernel {
         auto tensor_value_i_t = tensor_value_i->shaped<T, 3>(
             {1, tensor_lengths_t(i), elements_per_row});
 
-        functor::Split<Device, T>()(ctx->eigen_device<Device>(),
-                                    tensor_value_i_t, tensor_value_t, indices,
-                                    sizes);
+        functor::Split<Device, T, 3>()(ctx->eigen_device<Device>(),
+                                       tensor_value_i_t, tensor_value_t,
+                                       indices, sizes);
       }
 
       write_values.push_back(persistent_tensor);
diff --git a/tensorflow/core/kernels/training_ops.cc b/tensorflow/core/kernels/training_ops.cc
index 07befa27bc54631d30e413a15972c560655418e0..f53c567c4da19e18dc2832fbc47ee27ee1c928d1 100644
--- a/tensorflow/core/kernels/training_ops.cc
+++ b/tensorflow/core/kernels/training_ops.cc
@@ -15,6 +15,8 @@ limitations under the License.
 
 #define EIGEN_USE_THREADS
 
+#include "tensorflow/core/lib/bfloat16/bfloat16.h"
+
 #include <algorithm>
 
 #include "tensorflow/core/framework/op_kernel.h"
@@ -494,6 +496,7 @@ class ApplyGradientDescentOp<SYCLDevice, T> : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -647,6 +650,7 @@ class ApplyAdadeltaOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -822,6 +826,7 @@ class SparseApplyAdadeltaOp : public OpKernel {
   REGISTER_KERNELS(T, int64);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -1107,6 +1112,7 @@ class ApplyAdagradOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -1228,11 +1234,8 @@ inline T FtrlCompute(const T& accum, const T& linear, const T& lr, const T& l1,
     quadratic =
         Eigen::numext::pow(accum, -lr_power) / lr + static_cast<T>(2) * l2;
   }
-  if (Eigen::numext::abs(linear) > l1) {
-    return (l1 * sgn(linear) - linear) / quadratic;
-  } else {
-    return static_cast<T>(0.0);
-  }
+  auto l1_reg_adjust = std::max(std::min(linear, l1), -l1);
+  return (l1_reg_adjust - linear) / quadratic;
 }
 }  // namespace
 
@@ -1363,6 +1366,7 @@ class SparseApplyAdagradOp : public OpKernel {
   REGISTER_KERNELS(T, int64);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -1964,6 +1968,7 @@ class ApplyFtrlOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -1985,6 +1990,7 @@ TF_CALL_double(REGISTER_CPU_KERNELS);
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2233,6 +2239,7 @@ class SparseApplyFtrlOp : public OpKernel {
   REGISTER_KERNELS(T, int64);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2257,6 +2264,7 @@ TF_CALL_double(REGISTER_CPU_KERNELS);
   REGISTER_KERNELS(T, int64);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2335,6 +2343,7 @@ class ApplyMomentumOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2474,6 +2483,7 @@ class SparseApplyMomentumOp : public OpKernel {
   REGISTER_KERNELS(T, int64);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2701,6 +2711,7 @@ class ApplyAdamOp<SYCLDevice, T> : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -2940,6 +2951,7 @@ class ApplyCenteredRMSPropOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -3355,6 +3367,7 @@ class ApplyAddSignOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
@@ -3460,6 +3473,7 @@ class ApplyPowerSignOp : public OpKernel {
 #define REGISTER_CPU_KERNELS(T) REGISTER_KERNELS(CPU, T);
 
 TF_CALL_half(REGISTER_CPU_KERNELS);
+TF_CALL_bfloat16(REGISTER_CPU_KERNELS);
 TF_CALL_float(REGISTER_CPU_KERNELS);
 TF_CALL_double(REGISTER_CPU_KERNELS);
 
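Note on the FtrlCompute change in this file: the branch on `abs(linear) > l1` is replaced by clamping `linear` to `[-l1, l1]`. The sketch below compares only the numerators of the two forms, assuming `l1 >= 0`; the function names are hypothetical and the division by `quadratic` is left out because it is identical in both versions. `sgn` is written out inline since its definition is not part of this hunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Branching form, mirroring the lines removed by the hunk.
template <typename T>
T FtrlNumeratorBranch(T linear, T l1) {
  assert(l1 >= T(0));
  if (std::abs(linear) > l1) {
    const T sign = linear > T(0) ? T(1) : T(-1);
    return l1 * sign - linear;
  }
  return T(0);
}

// Clamped form, mirroring the lines added by the hunk.
template <typename T>
T FtrlNumeratorClamp(T linear, T l1) {
  assert(l1 >= T(0));
  const T l1_reg_adjust = std::max(std::min(linear, l1), -l1);
  return l1_reg_adjust - linear;
}
```

When `|linear| <= l1` the clamp returns `linear` itself, so the numerator is zero, which matches the old `else` branch. Outside that interval the clamp saturates at `l1` or `-l1`, which matches `l1 * sgn(linear)`.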
diff --git a/tensorflow/core/kernels/unique_op.cc b/tensorflow/core/kernels/unique_op.cc
index 0ef8724b10e492373c7663a58420bfe236be7df7..31388e42904608f20edd48152330f9ad2fb7d0ca 100644
--- a/tensorflow/core/kernels/unique_op.cc
+++ b/tensorflow/core/kernels/unique_op.cc
@@ -223,6 +223,16 @@ class UniqueOp : public OpKernel {
                               .Device(DEVICE_CPU)                \
                               .TypeConstraint<type>("T")         \
                               .TypeConstraint<int64>("out_idx"), \
+                          UniqueOp<type, int64>);                \
+  REGISTER_KERNEL_BUILDER(Name("UniqueWithCountsV2")             \
+                              .Device(DEVICE_CPU)                \
+                              .TypeConstraint<type>("T")         \
+                              .TypeConstraint<int32>("out_idx"), \
+                          UniqueOp<type, int32>)                 \
+  REGISTER_KERNEL_BUILDER(Name("UniqueWithCountsV2")             \
+                              .Device(DEVICE_CPU)                \
+                              .TypeConstraint<type>("T")         \
+                              .TypeConstraint<int64>("out_idx"), \
                           UniqueOp<type, int64>)
 TF_CALL_REAL_NUMBER_TYPES(REGISTER_UNIQUE);
 REGISTER_UNIQUE(string)
diff --git a/tensorflow/core/kernels/unpack_op.cc b/tensorflow/core/kernels/unpack_op.cc
index 764b6a252adf09c13511a01f95332857f46eee96..1e1647db5c1c41d6242cab87b0d8a8cf66d32a28 100644
--- a/tensorflow/core/kernels/unpack_op.cc
+++ b/tensorflow/core/kernels/unpack_op.cc
@@ -90,21 +90,21 @@ class UnpackOp : public OpKernel {
     }
 #endif  // TENSORFLOW_USE_SYCL
 
-    int64 before_dim = 1;
+    Eigen::DenseIndex before_dim = 1;
     for (int i = 0; i < axis; ++i) {
       before_dim *= input_shape.dim_size(i);
     }
 
-    int64 after_dim = 1;
+    Eigen::DenseIndex after_dim = 1;
     for (int i = axis + 1; i < input_shape.dims(); ++i) {
       after_dim *= input_shape.dim_size(i);
     }
-    const int64 axis_dim = input_shape.dim_size(axis);
+    const Eigen::DenseIndex axis_dim = input_shape.dim_size(axis);
 
     // Except for shape, unpack is a special case of split, so we reuse the
     // same computational kernels.
     auto input_reshaped =
-        input.shaped<T, 3>({1, before_dim, axis_dim * after_dim});
+        input.shaped<T, 2>({before_dim, axis_dim * after_dim});
 
     for (int i = 0; i < num; ++i) {
       Tensor* output;
@@ -112,12 +112,12 @@ class UnpackOp : public OpKernel {
                      context->allocate_output(i, output_shape, &output));
 
       if (output_shape.num_elements() > 0) {
-        auto output_shaped = output->shaped<T, 3>({1, before_dim, after_dim});
-        Eigen::DSizes<Eigen::DenseIndex, 3> indices{0, 0, i * after_dim};
-        Eigen::DSizes<Eigen::DenseIndex, 3> sizes{1, before_dim, after_dim};
-        functor::Split<Device, T>()(context->eigen_device<Device>(),
-                                    output_shaped, input_reshaped, indices,
-                                    sizes);
+        auto output_shaped = output->shaped<T, 2>({before_dim, after_dim});
+        Eigen::DSizes<Eigen::DenseIndex, 2> indices{0, i * after_dim};
+        Eigen::DSizes<Eigen::DenseIndex, 2> sizes{before_dim, after_dim};
+        functor::Split<Device, T, 2>()(context->eigen_device<Device>(),
+                                       output_shaped, input_reshaped, indices,
+                                       sizes);
       }
     }
   }
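Note on the unpack_op.cc change: the input is now viewed as a 2-D `[before_dim, axis_dim * after_dim]` array, and output `i` is the block of `after_dim` columns starting at column `i * after_dim`. The sketch below reproduces that indexing on a flat row-major `std::vector` without Eigen; `UnpackOne` is a hypothetical helper for illustration, not part of the kernel.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Copies slice i along the unpacked axis out of a row-major buffer that is
// viewed as [before_dim, axis_dim * after_dim]. The result is laid out as
// [before_dim, after_dim], matching the `indices` and `sizes` in the hunk.
template <typename T>
std::vector<T> UnpackOne(const std::vector<T>& input, std::int64_t before_dim,
                         std::int64_t axis_dim, std::int64_t after_dim,
                         std::int64_t i) {
  assert(i >= 0 && i < axis_dim);
  assert(static_cast<std::int64_t>(input.size()) ==
         before_dim * axis_dim * after_dim);
  const std::int64_t row_stride = axis_dim * after_dim;
  std::vector<T> output(before_dim * after_dim);
  for (std::int64_t r = 0; r < before_dim; ++r) {
    for (std::int64_t c = 0; c < after_dim; ++c) {
      output[r * after_dim + c] = input[r * row_stride + i * after_dim + c];
    }
  }
  return output;
}
```

For an input of shape `[2, 3, 2]` holding the values 0 through 11 and unpacked along axis 1, slice 1 is `{2, 3, 8, 9}`.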
diff --git a/tensorflow/core/kernels/variable_ops.h b/tensorflow/core/kernels/variable_ops.h
index 83134bad378bfef18c3e93be5cc3c6b70ab4f523..8b406e5311cc33db943c1875a940fb886174cf28 100644
--- a/tensorflow/core/kernels/variable_ops.h
+++ b/tensorflow/core/kernels/variable_ops.h
@@ -45,6 +45,14 @@ class Var : public ResourceBase {
                            tensor_.shape().DebugString());
   }
 
+  // Only used in the resource variable path. In resource variables,
+  // tensor.IsInitialized() can be true (i.e. have memory allocated to it) while
+  // there is not a good value there due to a race condition, and it's possible
+  // to stumble upon this during variable.initialized_value(). So it's best to
+  // just store directly whether the variable is initialized.
+  bool is_initialized = false;  // GUARDED_BY(mu_) but annotalysis doesn't like
+                                // it.
+
  private:
   mutex mu_;
   Tensor tensor_;
diff --git a/tensorflow/core/kernels/xsmm_conv2d.cc b/tensorflow/core/kernels/xsmm_conv2d.cc
index ba03357cc6ac22d42f9f1cceab6875ef7e49b4c2..f8c06988cbac021d1f0924ca274c8bee5e9272a5 100644
--- a/tensorflow/core/kernels/xsmm_conv2d.cc
+++ b/tensorflow/core/kernels/xsmm_conv2d.cc
@@ -16,7 +16,7 @@ limitations under the License.
 // Make this file empty (or nearly empty) so that it can be compiled even when
 // libxsmm is not available.
 
-#ifndef TENSORFLOW_USE_LIBXSMM
+#ifndef TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS
 void dummy_xsmm_conv2d_ensure_file_is_not_empty();
 #else
 
@@ -32,9 +32,9 @@ void dummy_xsmm_conv2d_ensure_file_is_not_empty();
 #include "tensorflow/core/lib/core/blocking_counter.h"
 #include "tensorflow/core/lib/core/threadpool.h"
 
-#include "libxsmm_main.h"  // TODO(bsteiner): API to avoid incl. header from src/
 #include "include/libxsmm_cpuid.h"
 #include "include/libxsmm_malloc.h"
+#include "third_party/libxsmm/src/libxsmm_main.h"  // TODO(bsteiner): API to avoid incl. header from src/
 
 namespace tensorflow {
 
@@ -173,8 +173,16 @@ static bool CallLibxsmmConvGeneric(OpKernelContext* ctx,
                                    InputPtr input, FilterPtr filter,
                                    OutputPtr output) {
 #if defined(LIBXSMM_DETAILED_TIMING)
-  unsigned long long l_tick1, l_tick2, l_tick3, l_tick4, l_tick5, l_tick6,
-      l_tick7, l_tick8, l_tick9, l_tick10;
+  uint64 l_tick1;
+  uint64 l_tick2;
+  uint64 l_tick3;
+  uint64 l_tick4;
+  uint64 l_tick5;
+  uint64 l_tick6;
+  uint64 l_tick7;
+  uint64 l_tick8;
+  uint64 l_tick9;
+  uint64 l_tick10;
   l_tick1 = libxsmm_timer_tick();
 #endif
   // setup scoped allocator, which adopts the allocator from the context
@@ -453,6 +461,7 @@ static bool CallLibxsmmConvGeneric(OpKernelContext* ctx,
   return true;  // Succeeded
 }
 
+#ifdef TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS
 template <typename T>
 struct XsmmFwdConv2D<CPUDevice, T> {
   bool operator()(OpKernelContext* ctx, const libxsmm_dnn_conv_desc& desc,
@@ -461,7 +470,9 @@ struct XsmmFwdConv2D<CPUDevice, T> {
                                   input, filter, output);
   }
 };
+#endif
 
+#ifdef TENSORFLOW_USE_LIBXSMM_BACKWARD_CONVOLUTIONS
 template <typename T>
 struct XsmmBkwInputConv2D<CPUDevice, T> {
   bool operator()(OpKernelContext* ctx, const libxsmm_dnn_conv_desc& desc,
@@ -479,6 +490,7 @@ struct XsmmBkwFilterConv2D<CPUDevice, T> {
                                   input, filter, output);
   }
 };
+#endif
 
 }  // namespace functor
 
@@ -488,4 +500,4 @@ template struct functor::XsmmBkwFilterConv2D<CPUDevice, float>;
 
 }  // namespace tensorflow
 
-#endif  // TENSORFLOW_USE_LIBXSMM
+#endif  // TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS
diff --git a/tensorflow/core/lib/bfloat16/bfloat16.h b/tensorflow/core/lib/bfloat16/bfloat16.h
index f9cca0ef2ab90c677e47d979a4636b3fc25ec919..de8f92d1eb929593a5b35262042dcb8f3992384c 100644
--- a/tensorflow/core/lib/bfloat16/bfloat16.h
+++ b/tensorflow/core/lib/bfloat16/bfloat16.h
@@ -16,6 +16,7 @@ limitations under the License.
 #ifndef TENSORFLOW_CORE_LIB_BFLOAT16_BFLOAT16_H_
 #define TENSORFLOW_CORE_LIB_BFLOAT16_BFLOAT16_H_
 
+#include <cmath>
 #include <complex>
 
 #ifdef __CUDACC__
@@ -271,6 +272,35 @@ struct hash<tensorflow::bfloat16> {
     return hash<float>()(static_cast<float>(v));
   }
 };
+
+using tensorflow::bfloat16;
+inline bool isinf(const bfloat16& a) { return std::isinf(float(a)); }
+inline bool isnan(const bfloat16& a) { return std::isnan(float(a)); }
+inline bool isfinite(const bfloat16& a) { return std::isfinite(float(a)); }
+inline bfloat16 abs(const bfloat16& a) { return bfloat16(std::abs(float(a))); }
+inline bfloat16 exp(const bfloat16& a) { return bfloat16(std::exp(float(a))); }
+inline bfloat16 log(const bfloat16& a) { return bfloat16(std::log(float(a))); }
+inline bfloat16 log10(const bfloat16& a) {
+  return bfloat16(std::log10(float(a)));
+}
+inline bfloat16 sqrt(const bfloat16& a) {
+  return bfloat16(std::sqrt(float(a)));
+}
+inline bfloat16 pow(const bfloat16& a, const bfloat16& b) {
+  return bfloat16(std::pow(float(a), float(b)));
+}
+inline bfloat16 sin(const bfloat16& a) { return bfloat16(std::sin(float(a))); }
+inline bfloat16 cos(const bfloat16& a) { return bfloat16(std::cos(float(a))); }
+inline bfloat16 tan(const bfloat16& a) { return bfloat16(std::tan(float(a))); }
+inline bfloat16 tanh(const bfloat16& a) {
+  return bfloat16(std::tanh(float(a)));
+}
+inline bfloat16 floor(const bfloat16& a) {
+  return bfloat16(std::floor(float(a)));
+}
+inline bfloat16 ceil(const bfloat16& a) {
+  return bfloat16(std::ceil(float(a)));
+}
 }  // namespace std
 
 #endif  // TENSORFLOW_CORE_LIB_BFLOAT16_BFLOAT16_H_
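Note on the bfloat16.h additions: every new overload follows the same pattern of widening to `float`, calling the `<cmath>` function, and narrowing the result back. The sketch below shows that pattern on a deliberately simplified stand-in type. `sketch::Bf16` is hypothetical: it keeps the upper 16 bits of an IEEE-754 `float` by truncation, which may differ from the conversion `tensorflow::bfloat16` actually performs, and the overloads live in their own namespace rather than in `std`.

```cpp
#include <cassert>  // used by the checks described in the note below
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

namespace sketch {

// Hypothetical stand-in: upper 16 bits of a 32-bit float, truncated.
struct Bf16 {
  std::uint16_t bits;

  explicit Bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    bits = static_cast<std::uint16_t>(u >> 16);
  }

  explicit operator float() const {
    const std::uint32_t u = static_cast<std::uint32_t>(bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
  }
};

// Same shape as the overloads in the diff: widen, compute in float, narrow.
inline bool isnan(const Bf16& a) { return std::isnan(float(a)); }
inline Bf16 sqrt(const Bf16& a) { return Bf16(std::sqrt(float(a))); }
inline Bf16 floor(const Bf16& a) { return Bf16(std::floor(float(a))); }

}  // namespace sketch
```

Values such as 4.0f, 2.5f and 2.0f are exactly representable in 16 bits of this layout, so `sqrt(Bf16(4.0f))` converts back to exactly 2.0f and `floor(Bf16(2.5f))` to exactly 2.0f. Inputs that need more mantissa bits lose precision on the narrowing step.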
diff --git a/tensorflow/core/lib/core/threadpool_test.cc b/tensorflow/core/lib/core/threadpool_test.cc
index 627ef5a892a35ec43d0c31220dcf046b4b8eda55..320f3ebb8328b23c5e0b10ae2effe1de2528246b 100644
--- a/tensorflow/core/lib/core/threadpool_test.cc
+++ b/tensorflow/core/lib/core/threadpool_test.cc
@@ -17,6 +17,7 @@ limitations under the License.
 
 #include <atomic>
 
+#include "tensorflow/core/platform/context.h"
 #include "tensorflow/core/platform/env.h"
 #include "tensorflow/core/platform/mutex.h"
 #include "tensorflow/core/platform/test.h"
@@ -35,6 +36,7 @@ TEST(ThreadPool, Empty) {
 }
 
 TEST(ThreadPool, DoWork) {
+  Context outer_context(ContextKind::kThread);
   for (int num_threads = 1; num_threads < kNumThreads; num_threads++) {
     fprintf(stderr, "Testing with %d threads\n", num_threads);
     const int kWorkItems = 15;
@@ -45,7 +47,9 @@ TEST(ThreadPool, DoWork) {
     {
       ThreadPool pool(Env::Default(), "test", num_threads);
       for (int i = 0; i < kWorkItems; i++) {
-        pool.Schedule([&work, i]() {
+        pool.Schedule([&outer_context, &work, i]() {
+          Context inner_context(ContextKind::kThread);
+          ASSERT_EQ(outer_context, inner_context);
           ASSERT_FALSE(work[i]);
           work[i] = true;
         });
@@ -58,6 +62,7 @@ TEST(ThreadPool, DoWork) {
 }
 
 TEST(ThreadPool, ParallelFor) {
+  Context outer_context(ContextKind::kThread);
   // Make ParallelFor use as many threads as possible.
   int64 kHugeCost = 1 << 30;
   for (int num_threads = 1; num_threads < kNumThreads; num_threads++) {
@@ -68,12 +73,15 @@ TEST(ThreadPool, ParallelFor) {
     for (int i = 0; i < kWorkItems; i++) {
       work[i] = false;
     }
-    pool.ParallelFor(kWorkItems, kHugeCost, [&work](int64 begin, int64 end) {
-      for (int64 i = begin; i < end; ++i) {
-        ASSERT_FALSE(work[i]);
-        work[i] = true;
-      }
-    });
+    pool.ParallelFor(kWorkItems, kHugeCost,
+                     [&outer_context, &work](int64 begin, int64 end) {
+                       Context inner_context(ContextKind::kThread);
+                       ASSERT_EQ(outer_context, inner_context);
+                       for (int64 i = begin; i < end; ++i) {
+                         ASSERT_FALSE(work[i]);
+                         work[i] = true;
+                       }
+                     });
     for (int i = 0; i < kWorkItems; i++) {
       ASSERT_TRUE(work[i]);
     }
@@ -167,5 +175,40 @@ static void BM_Parallel(int iters) {
 }
 BENCHMARK(BM_Parallel);
 
+static void BM_ParallelFor(int iters, int total, int cost_per_unit) {
+  ThreadPool pool(Env::Default(), "test", kNumThreads);
+  // Decrement count concurrently until 0.
+  std::atomic_int_fast32_t count(iters);
+  mutex done_lock;
+  condition_variable done;
+  bool done_flag = false;
+  for (int i = 0; i < iters; ++i) {
+    pool.ParallelFor(
+        total, cost_per_unit,
+        [&count, &done_lock, &done, &done_flag](int64 begin, int64 end) {
+          for (int64 i = begin; i < end; ++i) {
+            if (count.fetch_sub(1) == 1) {
+              mutex_lock l(done_lock);
+              done_flag = true;
+              done.notify_all();
+            }
+          }
+        });
+  }
+  mutex_lock l(done_lock);
+  while (!done_flag) {
+    done.wait(l);
+  }
+}
+BENCHMARK(BM_ParallelFor)
+    ->ArgPair(1 << 10, 1)
+    ->ArgPair(1 << 20, 1)
+    ->ArgPair(1 << 10, 1 << 10)
+    ->ArgPair(1 << 20, 1 << 10)
+    ->ArgPair(1 << 10, 1 << 20)
+    ->ArgPair(1 << 20, 1 << 20)
+    ->ArgPair(1 << 10, 1 << 30)
+    ->ArgPair(1 << 20, 1 << 30);
+
 }  // namespace thread
 }  // namespace tensorflow
diff --git a/tensorflow/core/lib/io/record_writer.cc b/tensorflow/core/lib/io/record_writer.cc
index 3657243c5d38a2076c1ca2c2e5f31b488b5a281b..ebc56482699948974ad434b6ea76fe26e1a4a5c5 100644
--- a/tensorflow/core/lib/io/record_writer.cc
+++ b/tensorflow/core/lib/io/record_writer.cc
@@ -49,7 +49,7 @@ RecordWriterOptions RecordWriterOptions::CreateRecordWriterOptions(
 #endif  // IS_SLIM_BUILD
   } else if (compression_type != compression::kNone) {
     LOG(ERROR) << "Unsupported compression_type:" << compression_type
-               << ". No comprression will be used.";
+               << ". No compression will be used.";
   }
   return options;
 }
diff --git a/tensorflow/core/lib/random/random_distributions.h b/tensorflow/core/lib/random/random_distributions.h
index 3fe1f9bc6cf06158df4811eaa177988b60890006..2ebe608fc915e78974f9a9c0aedacb8eb5b37859 100644
--- a/tensorflow/core/lib/random/random_distributions.h
+++ b/tensorflow/core/lib/random/random_distributions.h
@@ -32,6 +32,8 @@ namespace random {
 
 // Helper function to convert a 16-bit integer to a half between [0..1).
 PHILOX_DEVICE_INLINE Eigen::half Uint16ToHalf(uint16 x);
+// Helper function to convert a 16-bit integer to a bfloat16 between [0..1).
+PHILOX_DEVICE_INLINE bfloat16 Uint16ToGfloat16(uint16 x);
 // Helper function to convert a 32-bit integer to a float between [0..1).
 PHILOX_DEVICE_INLINE float Uint32ToFloat(uint32 x);
 // Helper function to convert two 32-bit integers to a double between [0..1).
@@ -75,6 +77,30 @@ class UniformDistribution<Generator, Eigen::half> {
   }
 };
 
+template <class Generator>
+class UniformDistribution<Generator, bfloat16> {
+ public:
+  // The number of elements that will be returned.
+  static const int kResultElementCount = Generator::kResultElementCount;
+  // Cost of generation of a single element (in cycles).
+  static const int kElementCost = 3;
+  // Indicate that this distribution may take variable number of samples
+  // during the runtime.
+  static const bool kVariableSamplesPerOutput = false;
+  typedef Array<bfloat16, kResultElementCount> ResultType;
+  typedef bfloat16 ResultElementType;
+
+  PHILOX_DEVICE_INLINE
+  ResultType operator()(Generator* gen) {
+    typename Generator::ResultType sample = (*gen)();
+    ResultType result;
+    for (int i = 0; i < kResultElementCount; ++i) {
+      result[i] = Uint16ToGfloat16(sample[i]);
+    }
+    return result;
+  }
+};
+
 template <class Generator>
 class UniformDistribution<Generator, float> {
  public:
@@ -305,6 +331,36 @@ class NormalDistribution<Generator, Eigen::half> {
   }
 };
 
+template <class Generator>
+class NormalDistribution<Generator, bfloat16> {
+ public:
+  // The number of elements that will be returned.
+  static const int kResultElementCount = Generator::kResultElementCount;
+  // Cost of generation of a single element (in cycles).
+  static const int kElementCost = 70;
+  // Indicate that this distribution may take variable number of samples
+  // during the runtime.
+  static const bool kVariableSamplesPerOutput = false;
+  typedef Array<bfloat16, kResultElementCount> ResultType;
+  typedef bfloat16 ResultElementType;
+
+  PHILOX_DEVICE_INLINE
+  ResultType operator()(Generator* gen) {
+    typename Generator::ResultType sample = (*gen)();
+    ResultType result;
+    static_assert(kResultElementCount % 2 == 0,
+                  "kResultElementCount should be an even number");
+    for (int i = 0; i < kResultElementCount; i += 2) {
+      float f[2];
+      // Box-Muller transform requires processing 2 elements at a time.
+      BoxMullerFloat(sample[i], sample[i + 1], &f[0], &f[1]);
+      result[i] = bfloat16(f[0]);
+      result[i + 1] = bfloat16(f[1]);
+    }
+    return result;
+  }
+};
+
 template <class Generator>
 class NormalDistribution<Generator, float> {
  public:
@@ -414,6 +470,48 @@ class TruncatedNormalDistribution<SingleSampleGenerator, Eigen::half> {
   }
 };
 
+template <class SingleSampleGenerator>
+class TruncatedNormalDistribution<SingleSampleGenerator, bfloat16> {
+ public:
+  // The number of elements that will be returned.
+  static const int kResultElementCount =
+      SingleSampleGenerator::kNativeElementCount;
+  // Cost of generation of a single element (in cycles).
+  static const int kElementCost = 90;
+  // Indicate that this distribution may take variable number of samples
+  // during the runtime.
+  static const bool kVariableSamplesPerOutput = true;
+  // The threshold where the normal distribution is truncated.
+  const float kTruncateValue = 2.0f;
+
+  typedef Array<bfloat16, kResultElementCount> ResultType;
+  typedef bfloat16 ResultElementType;
+
+  PHILOX_DEVICE_INLINE
+  ResultType operator()(SingleSampleGenerator* gen) {
+    ResultType results;
+    int index = 0;
+    while (true) {
+      // Repeatedly take samples from the normal distribution, until we have
+      // the desired number of elements that fall within the pre-defined cutoff
+      // threshold.
+      const uint32 x0 = (*gen)();
+      const uint32 x1 = (*gen)();
+      float f[2];
+      BoxMullerFloat(x0, x1, &f[0], &f[1]);
+
+      for (int i = 0; i < 2; ++i) {
+        if (Eigen::numext::abs(f[i]) < kTruncateValue) {
+          results[index++] = bfloat16(f[i]);
+          if (index >= kResultElementCount) {
+            return results;
+          }
+        }
+      }
+    }
+  }
+};
+
 // Partial specialization for float.
 template <class SingleSampleGenerator>
 class TruncatedNormalDistribution<SingleSampleGenerator, float> {
@@ -567,6 +665,27 @@ PHILOX_DEVICE_INLINE Eigen::half Uint16ToHalf(uint16 x) {
   return result - Eigen::half(1.0);
 }
 
+// Helper function to convert a 16-bit integer to a bfloat16 between [0..1).
+// This can create a uniform distribution of values between [0..1).
+PHILOX_DEVICE_INLINE bfloat16 Uint16ToGfloat16(uint16 x) {
+  // bfloat16 values are formatted as follows (MSB first):
+  //    sign(1) exponent(8) mantissa(7)
+  // Conceptually construct the following:
+  //    sign == 0
+  //    exponent == 127  -- an excess 127 representation of a zero exponent
+  //    mantissa == 7 random bits
+  const uint16 man = x & 0x7fu;  // 7 bit mantissa
+  const uint16 exp = static_cast<uint16>(127);
+  const uint16 val = (exp << 7) | man;
+
+  bfloat16 result;
+  memcpy(&result, &val, sizeof(val));
+  // The mantissa has an implicit leading 1, so the above code creates a value
+  // in [1, 2). The minus will not cause a rounding that makes the result 1.
+  // Instead it will just be close to 1.
+  return result - bfloat16(1.0);
+}
+
 // Helper function to convert an 32-bit integer to a float between [0..1).
 PHILOX_DEVICE_INLINE float Uint32ToFloat(uint32 x) {
   // IEEE754 floats are formatted as follows (MSB first):
diff --git a/tensorflow/core/lib/random/random_distributions_test.cc b/tensorflow/core/lib/random/random_distributions_test.cc
index 85d68f456e1e27b7a62315f2b0a962843da87d52..8868672a10ae027415d81f76ef146d1a5f28bddd 100644
--- a/tensorflow/core/lib/random/random_distributions_test.cc
+++ b/tensorflow/core/lib/random/random_distributions_test.cc
@@ -37,6 +37,10 @@ namespace {
 // unit normal distribution, it should almost definitely never exceed 6.
 static constexpr float kZLimit = 6.0;
 
+// As bfloat16 has much less precision, the largest z-value should be
+// larger than for float32.
+static constexpr float kZLimitBfloat16 = 20.0;
+
 // A utility function to fill the given array with samples from the given
 // distribution, using the single adapter of the underlying generator
 template <class Distribution>
@@ -93,7 +97,7 @@ bool CheckSamplesMoments(const std::vector<T>& samples,
       // mode, given the large number of samples.
       moments_data[i] += moment;
       ++moments_sample_count_data[i];
-      moment *= samples_data[index];
+      moment *= static_cast<double>(samples_data[index]);
     }
   }
 
@@ -125,7 +129,7 @@ bool CheckSamplesMoments(const std::vector<T>& samples,
     const double z_test =
         fabs((moments[i] - moments_i_mean) / sqrt(total_variance));
 
-    if (z_test > z_limit) {
+    if (z_test > static_cast<double>(z_limit)) {
       LOG(ERROR) << "failing z_test:"
                  << " moment: " << i << " stride: " << stride
                  << " z_test: " << z_test << " z_limit: " << z_limit
@@ -252,6 +256,22 @@ void RandomParametersMomentsTest(int count, int max_moments,
   }
 }
 
+TEST(PhiloxRandomTest, UniformBfloat16MomentsTest) {
+  const std::vector<int> strides = {0, 1, 4, 17};
+  UniformMomentsTest<bfloat16>(1 << 20, 40, strides, bfloat16(kZLimitBfloat16));
+}
+
+TEST(PhiloxRandomTest, NormalBfloat16MomentsTest) {
+  const std::vector<int> strides = {0, 1, 4, 17};
+  NormalMomentsTest<bfloat16>(8 << 20, 25, strides, bfloat16(kZLimitBfloat16));
+}
+
+TEST(PhiloxRandomTest, RandomParametersBfloat16MomentsTest) {
+  const std::vector<int> strides = {0, 1, 4, 17};
+  RandomParametersMomentsTest<bfloat16>(1 << 20, 40, strides,
+                                        bfloat16(kZLimitBfloat16));
+}
+
 TEST(PhiloxRandomTest, UniformFloatMomentsTest) {
   const std::vector<int> strides = {0, 1, 4, 17};
   UniformMomentsTest<float>(1 << 20, 40, strides, kZLimit);
diff --git a/tensorflow/core/ops/array_ops.cc b/tensorflow/core/ops/array_ops.cc
index 267ce88440080399aae783903503f0bbd025d8b4..2fab62ea5cae6280554d2106f8f77d46017180e7 100644
--- a/tensorflow/core/ops/array_ops.cc
+++ b/tensorflow/core/ops/array_ops.cc
@@ -1201,6 +1201,23 @@ REGISTER_OP("UniqueWithCounts")
       return Status::OK();
     });
 
+REGISTER_OP("UniqueWithCountsV2")
+    .Input("x: T")
+    .Input("axis: Taxis")
+    .Output("y: T")
+    .Output("idx: out_idx")
+    .Output("count: out_idx")
+    .Attr("T: type")
+    .Attr("Taxis: {int32,int64} = DT_INT64")
+    .Attr("out_idx: {int32, int64} = DT_INT32")
+    .SetShapeFn([](InferenceContext* c) {
+      auto uniq = c->Vector(InferenceContext::kUnknownDim);
+      c->set_output(0, uniq);
+      c->set_output(1, c->input(0));
+      c->set_output(2, uniq);
+      return Status::OK();
+    });
+
 namespace {
 
 Status ShapeShapeFn(InferenceContext* c) {
diff --git a/tensorflow/core/ops/compat/ops_history.v1.pbtxt b/tensorflow/core/ops/compat/ops_history.v1.pbtxt
index fc9e5b02a2253621203a47c5f7d1b7d311c82a97..35c49658b3cc554a48e9c75b4f6c926ee42a8135 100644
--- a/tensorflow/core/ops/compat/ops_history.v1.pbtxt
+++ b/tensorflow/core/ops/compat/ops_history.v1.pbtxt
@@ -11460,6 +11460,14 @@ op {
     type: "type"
   }
 }
+op {
+  name: "ConsumeMutexLock"
+  input_arg {
+    name: "mutex_lock"
+    type: DT_VARIANT
+  }
+  is_stateful: true
+}
 op {
   name: "ControlTrigger"
 }
@@ -12814,28 +12822,6 @@ op {
     }
   }
 }
-op {
-  name: "CriticalSectionOp"
-  output_arg {
-    name: "resource"
-    type: DT_RESOURCE
-  }
-  attr {
-    name: "container"
-    type: "string"
-    default_value {
-      s: ""
-    }
-  }
-  attr {
-    name: "shared_name"
-    type: "string"
-    default_value {
-      s: ""
-    }
-  }
-  is_stateful: true
-}
 op {
   name: "CropAndResize"
   input_arg {
@@ -17433,78 +17419,6 @@ op {
     }
   }
 }
-op {
-  name: "ExecuteInCriticalSection"
-  input_arg {
-    name: "critical_section"
-    type: DT_RESOURCE
-  }
-  input_arg {
-    name: "arguments"
-    type_list_attr: "Targuments"
-  }
-  output_arg {
-    name: "outputs"
-    type_list_attr: "output_types"
-  }
-  attr {
-    name: "f"
-    type: "func"
-  }
-  attr {
-    name: "Targuments"
-    type: "list(type)"
-    has_minimum: true
-  }
-  attr {
-    name: "output_types"
-    type: "list(type)"
-    has_minimum: true
-    minimum: 1
-  }
-  attr {
-    name: "output_shapes"
-    type: "list(shape)"
-    has_minimum: true
-    minimum: 1
-  }
-  is_stateful: true
-}
-op {
-  name: "ExecuteInCriticalSection"
-  input_arg {
-    name: "critical_section"
-    type: DT_RESOURCE
-  }
-  input_arg {
-    name: "arguments"
-    type_list_attr: "Targuments"
-  }
-  output_arg {
-    name: "outputs"
-    type_list_attr: "output_types"
-  }
-  attr {
-    name: "f"
-    type: "func"
-  }
-  attr {
-    name: "Targuments"
-    type: "list(type)"
-    has_minimum: true
-  }
-  attr {
-    name: "output_types"
-    type: "list(type)"
-    has_minimum: true
-  }
-  attr {
-    name: "output_shapes"
-    type: "list(shape)"
-    has_minimum: true
-  }
-  is_stateful: true
-}
 op {
   name: "Exit"
   input_arg {
@@ -20556,6 +20470,65 @@ op {
     minimum: -1
   }
 }
+op {
+  name: "GeneratorDataset"
+  input_arg {
+    name: "init_func_other_args"
+    type_list_attr: "Tinit_func_args"
+  }
+  input_arg {
+    name: "next_func_other_args"
+    type_list_attr: "Tnext_func_args"
+  }
+  input_arg {
+    name: "finalize_func_other_args"
+    type_list_attr: "Tfinalize_func_args"
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "init_func"
+    type: "func"
+  }
+  attr {
+    name: "next_func"
+    type: "func"
+  }
+  attr {
+    name: "finalize_func"
+    type: "func"
+  }
+  attr {
+    name: "Tinit_func_args"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "Tnext_func_args"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "Tfinalize_func_args"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+  is_stateful: true
+}
 op {
   name: "GetSessionHandle"
   input_arg {
@@ -30112,6 +30085,40 @@ op {
   }
   is_stateful: true
 }
+op {
+  name: "MutexLock"
+  input_arg {
+    name: "mutex"
+    type: DT_RESOURCE
+  }
+  output_arg {
+    name: "mutex_lock"
+    type: DT_VARIANT
+  }
+  is_stateful: true
+}
+op {
+  name: "MutexV2"
+  output_arg {
+    name: "resource"
+    type: DT_RESOURCE
+  }
+  attr {
+    name: "container"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  attr {
+    name: "shared_name"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  is_stateful: true
+}
 op {
   name: "Neg"
   input_arg {
@@ -37659,6 +37666,32 @@ op {
   }
   allows_uninitialized_input: true
 }
+op {
+  name: "RegexReplace"
+  input_arg {
+    name: "input"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "pattern"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "rewrite"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "output"
+    type: DT_STRING
+  }
+  attr {
+    name: "replace_global"
+    type: "bool"
+    default_value {
+      b: true
+    }
+  }
+}
 op {
   name: "Relu"
   input_arg {
@@ -64366,6 +64399,14 @@ op {
     version: 3
   }
 }
+op {
+  name: "Timestamp"
+  output_arg {
+    name: "ts"
+    type: DT_DOUBLE
+  }
+  is_stateful: true
+}
 op {
   name: "TopK"
   input_arg {
@@ -65218,29 +65259,6 @@ op {
     }
   }
 }
-op {
-  name: "UniqueDataset"
-  input_arg {
-    name: "input_dataset"
-    type: DT_VARIANT
-  }
-  output_arg {
-    name: "handle"
-    type: DT_VARIANT
-  }
-  attr {
-    name: "output_types"
-    type: "list(type)"
-    has_minimum: true
-    minimum: 1
-  }
-  attr {
-    name: "output_shapes"
-    type: "list(shape)"
-    has_minimum: true
-    minimum: 1
-  }
-}
 op {
   name: "UniqueV2"
   input_arg {
diff --git a/tensorflow/core/ops/dataset_ops.cc b/tensorflow/core/ops/dataset_ops.cc
index 9e98f56c745a2b0b16531e2785e43ba8464d42b8..bdbbf6d7c32014678d8ad171df03c29a4a44f422 100644
--- a/tensorflow/core/ops/dataset_ops.cc
+++ b/tensorflow/core/ops/dataset_ops.cc
@@ -66,6 +66,23 @@ REGISTER_OP("SparseTensorSliceDataset")
                       // stateful to inhibit constant folding.
     .SetShapeFn(shape_inference::ScalarShape);
 
+REGISTER_OP("GeneratorDataset")
+    .Input("init_func_other_args: Tinit_func_args")
+    .Input("next_func_other_args: Tnext_func_args")
+    .Input("finalize_func_other_args: Tfinalize_func_args")
+    .Output("handle: variant")
+    .Attr("init_func: func")
+    .Attr("next_func: func")
+    .Attr("finalize_func: func")
+    .Attr("Tinit_func_args: list(type) >= 0")
+    .Attr("Tnext_func_args: list(type) >= 0")
+    .Attr("Tfinalize_func_args: list(type) >= 0")
+    .Attr("output_types: list(type) >= 1")
+    .Attr("output_shapes: list(shape) >= 1")
+    .SetIsStateful()  // TODO(b/65524810): Source dataset ops must be marked
+                      // stateful to inhibit constant folding.
+    .SetShapeFn(shape_inference::ScalarShape);
+
 REGISTER_OP("ZipDataset")
     .Input("input_datasets: N * variant")
     .Output("handle: variant")
@@ -329,13 +346,6 @@ REGISTER_OP("CacheDataset")
     .Attr("output_shapes: list(shape) >= 1")
     .SetShapeFn(shape_inference::ScalarShape);
 
-REGISTER_OP("UniqueDataset")
-    .Input("input_dataset: variant")
-    .Output("handle: variant")
-    .Attr("output_types: list(type) >= 1")
-    .Attr("output_shapes: list(shape) >= 1")
-    .SetShapeFn(shape_inference::ScalarShape);
-
 REGISTER_OP("TextLineDataset")
     .Input("filenames: string")
     .Input("compression_type: string")
diff --git a/tensorflow/core/ops/function_ops.cc b/tensorflow/core/ops/function_ops.cc
index ada96fa1d2ddf79b2669fa3fc437ce7b872a2eb1..a6914d9383d2f5c623b17fb0b918c4907ed84175 100644
--- a/tensorflow/core/ops/function_ops.cc
+++ b/tensorflow/core/ops/function_ops.cc
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include "tensorflow/core/framework/common_shape_fns.h"
 #include "tensorflow/core/framework/op.h"
 #include "tensorflow/core/framework/shape_inference.h"
 
@@ -55,6 +56,7 @@ REGISTER_OP("_ListToArray")
     .Attr("Tin: list(type)")
     .Attr("T: type")
     .Attr("N: int >= 1")
+    .SetShapeFn(shape_inference::UnknownShape)
     .Doc(R"doc(
 Converts a list of tensors to an array of tensors.
 )doc");
@@ -65,6 +67,7 @@ REGISTER_OP("_ArrayToList")
     .Attr("T: type")
     .Attr("N: int >= 1")
     .Attr("out_types: list(type)")
+    .SetShapeFn(shape_inference::UnknownShape)
     .Doc(R"doc(
 Converts an array of tensors to a list of tensors.
 )doc");
diff --git a/tensorflow/core/ops/logging_ops.cc b/tensorflow/core/ops/logging_ops.cc
index d263dc25b29d5c867a10ef20ea1b39fa9b9662f1..fbde692e959769fca53c91fef649b18c248526a6 100644
--- a/tensorflow/core/ops/logging_ops.cc
+++ b/tensorflow/core/ops/logging_ops.cc
@@ -111,4 +111,9 @@ REGISTER_OP("MergeSummary")
     .Attr("N : int >= 1")
     .SetShapeFn(shape_inference::ScalarShape);
 
+REGISTER_OP("Timestamp")
+    .Output("ts: float64")
+    .SetIsStateful()
+    .SetShapeFn(shape_inference::ScalarShape);
+
 }  // end namespace tensorflow
diff --git a/tensorflow/core/ops/nn_ops.cc b/tensorflow/core/ops/nn_ops.cc
index 67481fd202b3c3b35033b72e4c1c5fd294d98696..910fbaca9e72d4352bf671fe5a07be3e761517b2 100644
--- a/tensorflow/core/ops/nn_ops.cc
+++ b/tensorflow/core/ops/nn_ops.cc
@@ -2007,10 +2007,10 @@ REGISTER_OP("_MklFusedBatchNorm")
       TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 4, &x));
 
       bool is_training;
-      c->GetAttr("is_training", &is_training);
+      TF_RETURN_IF_ERROR(c->GetAttr("is_training", &is_training));
       int number_inputs = (is_training) ? 3 : 5;
       string data_format;
-      c->GetAttr("data_format", &data_format);
+      TF_RETURN_IF_ERROR(c->GetAttr("data_format", &data_format));
       DimensionHandle channel_dim =
           (data_format == "NHWC") ? c->Dim(x, 3) : c->Dim(x, 1);
 
@@ -2076,8 +2076,8 @@ REGISTER_OP("_MklFusedBatchNormGrad")
 
       bool is_training;
       string data_format;
-      c->GetAttr("is_training", &is_training);
-      c->GetAttr("data_format", &data_format);
+      TF_RETURN_IF_ERROR(c->GetAttr("is_training", &is_training));
+      TF_RETURN_IF_ERROR(c->GetAttr("data_format", &data_format));
       DimensionHandle channel_dim = (data_format == "NHWC")
                                         ? c->Dim(y_backprop, 3)
                                         : c->Dim(y_backprop, 1);
diff --git a/tensorflow/core/ops/ops.pbtxt b/tensorflow/core/ops/ops.pbtxt
index 45ff08f38b134f963460d15f949411a7f1619d0c..bf7682712cd6b9555b8d7727fc689dba9dcc262d 100644
--- a/tensorflow/core/ops/ops.pbtxt
+++ b/tensorflow/core/ops/ops.pbtxt
@@ -4773,6 +4773,14 @@ op {
     type: "type"
   }
 }
+op {
+  name: "ConsumeMutexLock"
+  input_arg {
+    name: "mutex_lock"
+    type: DT_VARIANT
+  }
+  is_stateful: true
+}
 op {
   name: "ControlTrigger"
 }
@@ -5465,28 +5473,6 @@ op {
     }
   }
 }
-op {
-  name: "CriticalSectionOp"
-  output_arg {
-    name: "resource"
-    type: DT_RESOURCE
-  }
-  attr {
-    name: "container"
-    type: "string"
-    default_value {
-      s: ""
-    }
-  }
-  attr {
-    name: "shared_name"
-    type: "string"
-    default_value {
-      s: ""
-    }
-  }
-  is_stateful: true
-}
 op {
   name: "CropAndResize"
   input_arg {
@@ -7788,41 +7774,6 @@ op {
     }
   }
 }
-op {
-  name: "ExecuteInCriticalSection"
-  input_arg {
-    name: "critical_section"
-    type: DT_RESOURCE
-  }
-  input_arg {
-    name: "arguments"
-    type_list_attr: "Targuments"
-  }
-  output_arg {
-    name: "outputs"
-    type_list_attr: "output_types"
-  }
-  attr {
-    name: "f"
-    type: "func"
-  }
-  attr {
-    name: "Targuments"
-    type: "list(type)"
-    has_minimum: true
-  }
-  attr {
-    name: "output_types"
-    type: "list(type)"
-    has_minimum: true
-  }
-  attr {
-    name: "output_shapes"
-    type: "list(shape)"
-    has_minimum: true
-  }
-  is_stateful: true
-}
 op {
   name: "Exit"
   input_arg {
@@ -9656,6 +9607,65 @@ op {
     minimum: -1
   }
 }
+op {
+  name: "GeneratorDataset"
+  input_arg {
+    name: "init_func_other_args"
+    type_list_attr: "Tinit_func_args"
+  }
+  input_arg {
+    name: "next_func_other_args"
+    type_list_attr: "Tnext_func_args"
+  }
+  input_arg {
+    name: "finalize_func_other_args"
+    type_list_attr: "Tfinalize_func_args"
+  }
+  output_arg {
+    name: "handle"
+    type: DT_VARIANT
+  }
+  attr {
+    name: "init_func"
+    type: "func"
+  }
+  attr {
+    name: "next_func"
+    type: "func"
+  }
+  attr {
+    name: "finalize_func"
+    type: "func"
+  }
+  attr {
+    name: "Tinit_func_args"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "Tnext_func_args"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "Tfinalize_func_args"
+    type: "list(type)"
+    has_minimum: true
+  }
+  attr {
+    name: "output_types"
+    type: "list(type)"
+    has_minimum: true
+    minimum: 1
+  }
+  attr {
+    name: "output_shapes"
+    type: "list(shape)"
+    has_minimum: true
+    minimum: 1
+  }
+  is_stateful: true
+}
 op {
   name: "GetSessionHandle"
   input_arg {
@@ -14308,6 +14318,40 @@ op {
   }
   is_stateful: true
 }
+op {
+  name: "MutexLock"
+  input_arg {
+    name: "mutex"
+    type: DT_RESOURCE
+  }
+  output_arg {
+    name: "mutex_lock"
+    type: DT_VARIANT
+  }
+  is_stateful: true
+}
+op {
+  name: "MutexV2"
+  output_arg {
+    name: "resource"
+    type: DT_RESOURCE
+  }
+  attr {
+    name: "container"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  attr {
+    name: "shared_name"
+    type: "string"
+    default_value {
+      s: ""
+    }
+  }
+  is_stateful: true
+}
 op {
   name: "Neg"
   input_arg {
@@ -19309,6 +19353,32 @@ op {
   }
   allows_uninitialized_input: true
 }
+op {
+  name: "RegexReplace"
+  input_arg {
+    name: "input"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "pattern"
+    type: DT_STRING
+  }
+  input_arg {
+    name: "rewrite"
+    type: DT_STRING
+  }
+  output_arg {
+    name: "output"
+    type: DT_STRING
+  }
+  attr {
+    name: "replace_global"
+    type: "bool"
+    default_value {
+      b: true
+    }
+  }
+}
 op {
   name: "Relu"
   input_arg {
@@ -30368,6 +30438,14 @@ op {
     explanation: "TileGrad has been replaced with reduce_sum"
   }
 }
+op {
+  name: "Timestamp"
+  output_arg {
+    name: "ts"
+    type: DT_DOUBLE
+  }
+  is_stateful: true
+}
 op {
   name: "TopK"
   input_arg {
@@ -30778,29 +30856,6 @@ op {
     }
   }
 }
-op {
-  name: "UniqueDataset"
-  input_arg {
-    name: "input_dataset"
-    type: DT_VARIANT
-  }
-  output_arg {
-    name: "handle"
-    type: DT_VARIANT
-  }
-  attr {
-    name: "output_types"
-    type: "list(type)"
-    has_minimum: true
-    minimum: 1
-  }
-  attr {
-    name: "output_shapes"
-    type: "list(shape)"
-    has_minimum: true
-    minimum: 1
-  }
-}
 op {
   name: "UniqueV2"
   input_arg {
diff --git a/tensorflow/core/ops/resource_variable_ops.cc b/tensorflow/core/ops/resource_variable_ops.cc
index 8dae7e1ff5f872c33dd56509c0349180cec78593..0d8cf78cc2a196cde4a77f53ce912c437648786a 100644
--- a/tensorflow/core/ops/resource_variable_ops.cc
+++ b/tensorflow/core/ops/resource_variable_ops.cc
@@ -211,7 +211,7 @@ REGISTER_OP("ResourceScatterUpdate")
       return Status::OK();
     });
 
-REGISTER_OP("CriticalSectionOp")
+REGISTER_OP("MutexV2")
     .Attr("container: string = ''")
     .Attr("shared_name: string = ''")
     .Output("resource: resource")
@@ -221,24 +221,18 @@ REGISTER_OP("CriticalSectionOp")
       return Status::OK();
     });
 
-REGISTER_OP("ExecuteInCriticalSection")
-    .Input("critical_section: resource")
-    .Input("arguments: Targuments")
-    .Output("outputs: output_types")
-    .Attr("f: func")
-    .Attr("Targuments: list(type) >= 0")
-    .Attr("output_types: list(type) >= 0")
-    .Attr("output_shapes: list(shape) >= 0")
+REGISTER_OP("MutexLock")
+    .Input("mutex: resource")
+    .Output("mutex_lock: variant")
+    .SetIsStateful()
     .SetShapeFn([](InferenceContext* c) {
-      std::vector<PartialTensorShape> output_shapes;
-      TF_RETURN_IF_ERROR(c->GetAttr("output_shapes", &output_shapes));
-      for (int i = 0; i < output_shapes.size(); ++i) {
-        ShapeHandle s;
-        TF_RETURN_IF_ERROR(
-            c->MakeShapeFromPartialTensorShape(output_shapes[i], &s));
-        c->set_output(i, s);
-      }
+      c->set_output(0, c->Scalar());
       return Status::OK();
     });
 
+REGISTER_OP("ConsumeMutexLock")
+    .Input("mutex_lock: variant")
+    .SetIsStateful()
+    .SetShapeFn([](InferenceContext* c) { return Status::OK(); });
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/ops/shape_function_test.cc b/tensorflow/core/ops/shape_function_test.cc
new file mode 100644
index 0000000000000000000000000000000000000000..120995f3aac7da4111d0404a64f322a50d30a491
--- /dev/null
+++ b/tensorflow/core/ops/shape_function_test.cc
@@ -0,0 +1,34 @@
+/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include "tensorflow/core/framework/op.h"
+#include "tensorflow/core/platform/test.h"
+
+// Test to ensure that all core ops have shape functions defined. This is done
+// by looking at all ops registered in the test binary.
+
+namespace tensorflow {
+
+TEST(ShapeFunctionTest, RegisteredOpsHaveShapeFns) {
+  OpRegistry* op_registry = OpRegistry::Global();
+  std::vector<OpRegistrationData> op_data;
+  op_registry->GetOpRegistrationData(&op_data);
+  for (const OpRegistrationData& op_reg_data : op_data) {
+    EXPECT_TRUE(op_reg_data.shape_inference_fn != nullptr)
+        << op_reg_data.op_def.name();
+  }
+}
+
+}  // namespace tensorflow
diff --git a/tensorflow/core/ops/spectral_ops.cc b/tensorflow/core/ops/spectral_ops.cc
index 508cea3495a9e811d4d12bf022b0ddfdcb33d718..2790aee37e93d3915ff9cba80af2e7ddccf4774e 100644
--- a/tensorflow/core/ops/spectral_ops.cc
+++ b/tensorflow/core/ops/spectral_ops.cc
@@ -142,26 +142,32 @@ REGISTER_OP("IRFFT3D")
 REGISTER_OP("BatchFFT")
     .Input("input: complex64")
     .Output("output: complex64")
+    .SetShapeFn(shape_inference::UnknownShape)
     .Deprecated(15, "Use FFT");
 REGISTER_OP("BatchIFFT")
     .Input("input: complex64")
     .Output("output: complex64")
+    .SetShapeFn(shape_inference::UnknownShape)
     .Deprecated(15, "Use IFFT");
 REGISTER_OP("BatchFFT2D")
     .Input("input: complex64")
     .Output("output: complex64")
+    .SetShapeFn(shape_inference::UnknownShape)
     .Deprecated(15, "Use FFT2D");
 REGISTER_OP("BatchIFFT2D")
     .Input("input: complex64")
     .Output("output: complex64")
+    .SetShapeFn(shape_inference::UnknownShape)
     .Deprecated(15, "Use IFFT2D");
 REGISTER_OP("BatchFFT3D")
     .Input("input: complex64")
     .Output("output: complex64")
+    .SetShapeFn(shape_inference::UnknownShape)
     .Deprecated(15, "Use FFT3D");
 REGISTER_OP("BatchIFFT3D")
     .Input("input: complex64")
     .Output("output: complex64")
+    .SetShapeFn(shape_inference::UnknownShape)
     .Deprecated(15, "Use IFFT3D");
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/ops/string_ops.cc b/tensorflow/core/ops/string_ops.cc
index e4c5bcfb540660a609aca013b795d566e69f54a8..05f216a83e21030443379876ddd160f2ceba6d39 100644
--- a/tensorflow/core/ops/string_ops.cc
+++ b/tensorflow/core/ops/string_ops.cc
@@ -23,6 +23,20 @@ using shape_inference::DimensionHandle;
 using shape_inference::InferenceContext;
 using shape_inference::ShapeHandle;
 
+REGISTER_OP("RegexReplace")
+    .Input("input: string")
+    .Input("pattern: string")
+    .Input("rewrite: string")
+    .Output("output: string")
+    .Attr("replace_global: bool = true")
+    .SetShapeFn([](InferenceContext* c) {
+      ShapeHandle unused;
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(1), 0, &unused));
+      TF_RETURN_IF_ERROR(c->WithRank(c->input(2), 0, &unused));
+      c->set_output(0, c->input(0));
+      return Status::OK();
+    });
+
 REGISTER_OP("StringToHashBucketFast")
     .Input("input: string")
     .Output("output: int64")
diff --git a/tensorflow/core/ops/word2vec_ops.cc b/tensorflow/core/ops/word2vec_ops.cc
index ed685dcf0ae9a3c61a1db491751f7de4e981300d..e469771103925e107d2f8aeced6df9dfb56cbe24 100644
--- a/tensorflow/core/ops/word2vec_ops.cc
+++ b/tensorflow/core/ops/word2vec_ops.cc
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
+#include "tensorflow/core/framework/common_shape_fns.h"
 #include "tensorflow/core/framework/op.h"
 
 namespace tensorflow {
@@ -33,7 +34,8 @@ REGISTER_OP("Skipgram")
     .Attr("batch_size: int")
     .Attr("window_size: int = 5")
     .Attr("min_count: int = 5")
-    .Attr("subsample: float = 1e-3");
+    .Attr("subsample: float = 1e-3")
+    .SetShapeFn(shape_inference::UnknownShape);
 
 REGISTER_OP("NegTrain")
     .Deprecated(19,
@@ -46,6 +48,7 @@ REGISTER_OP("NegTrain")
     .Input("lr: float")
     .SetIsStateful()
     .Attr("vocab_count: list(int)")
-    .Attr("num_negative_samples: int");
+    .Attr("num_negative_samples: int")
+    .SetShapeFn(shape_inference::UnknownShape);
 
 }  // end namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/BUILD b/tensorflow/core/platform/cloud/BUILD
index 9ba25dea4fb278cbfaf4080e21beef8a3e9de769..0a17a419d3ee386879d2e89ff07c45d804d9ffc5 100644
--- a/tensorflow/core/platform/cloud/BUILD
+++ b/tensorflow/core/platform/cloud/BUILD
@@ -38,13 +38,24 @@ cc_library(
 
 cc_library(
     name = "file_block_cache",
-    srcs = ["file_block_cache.cc"],
     hdrs = ["file_block_cache.h"],
     copts = tf_copts(),
     visibility = ["//tensorflow:__subpackages__"],
     deps = ["//tensorflow/core:lib"],
 )
 
+cc_library(
+    name = "ram_file_block_cache",
+    srcs = ["ram_file_block_cache.cc"],
+    hdrs = ["ram_file_block_cache.h"],
+    copts = tf_copts(),
+    visibility = ["//tensorflow:__subpackages__"],
+    deps = [
+        ":file_block_cache",
+        "//tensorflow/core:lib",
+    ],
+)
+
 cc_library(
     name = "gcs_dns_cache",
     srcs = ["gcs_dns_cache.cc"],
@@ -83,6 +94,7 @@ cc_library(
         ":gcs_throttle",
         ":google_auth_provider",
         ":http_request",
+        ":ram_file_block_cache",
         ":retrying_file_system",
         ":retrying_utils",
         ":time_util",
@@ -245,12 +257,12 @@ tf_cc_test(
 )
 
 tf_cc_test(
-    name = "file_block_cache_test",
+    name = "ram_file_block_cache_test",
     size = "small",
-    srcs = ["file_block_cache_test.cc"],
+    srcs = ["ram_file_block_cache_test.cc"],
     deps = [
-        ":file_block_cache",
         ":now_seconds_env",
+        ":ram_file_block_cache",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:test",
diff --git a/tensorflow/core/platform/cloud/curl_http_request.cc b/tensorflow/core/platform/cloud/curl_http_request.cc
index 88a5d1e96dc2fcb7d12e2c0891d2f04d64bac594..9bc06d56ae84dd264f1f57517d1accfa45de65af 100644
--- a/tensorflow/core/platform/cloud/curl_http_request.cc
+++ b/tensorflow/core/platform/cloud/curl_http_request.cc
@@ -21,6 +21,7 @@ limitations under the License.
 #include "tensorflow/core/lib/gtl/map_util.h"
 #include "tensorflow/core/lib/strings/scanner.h"
 #include "tensorflow/core/lib/strings/str_util.h"
+#include "tensorflow/core/platform/macros.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/core/public/version.h"
 
@@ -129,20 +130,34 @@ CurlHttpRequest::CurlHttpRequest(LibCurl* libcurl, Env* env)
   //       default in //third_party:curl.BUILD and can be customized via an
   //       environment variable.
 
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_VERBOSE, kVerboseOutput);
-  libcurl_->curl_easy_setopt(
-      curl_, CURLOPT_USERAGENT,
-      strings::StrCat("TensorFlow/", TF_VERSION_STRING).c_str());
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_VERBOSE, kVerboseOutput),
+      "Setting verbose output");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(
+          curl_, CURLOPT_USERAGENT,
+          strings::StrCat("TensorFlow/", TF_VERSION_STRING).c_str()),
+      "Setting user agent");
   // Do not use signals for timeouts - does not work in multi-threaded programs.
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_NOSIGNAL, 1L);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_HTTP_VERSION,
-                             CURL_HTTP_VERSION_2_0);
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_NOSIGNAL, 1L),
+      "Disabling signals");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_HTTP_VERSION,
+                                 CURL_HTTP_VERSION_2_0),
+      "Setting HTTP version");
 
   // Set up the progress meter.
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_NOPROGRESS, 0ULL);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_XFERINFODATA, this);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_XFERINFOFUNCTION,
-                             &CurlHttpRequest::ProgressCallback);
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_NOPROGRESS, 0ULL),
+      "Enabling progress meter");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_XFERINFODATA, this),
+      "Setting custom pointer to the progress callback");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_XFERINFOFUNCTION,
+                                 &CurlHttpRequest::ProgressCallback),
+      "Setting the progress callback");
 
   // If response buffer is not set, libcurl will print results to stdout,
   // so we always set it.
@@ -175,13 +190,17 @@ void CurlHttpRequest::SetUri(const string& uri) {
   CheckNotSent();
   is_uri_set_ = true;
   uri_ = uri;
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_URL, uri.c_str());
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_URL, uri.c_str()),
+      "Setting URL");
 }
 
 void CurlHttpRequest::SetRange(uint64 start, uint64 end) {
   CheckNotSent();
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_RANGE,
-                             strings::StrCat(start, "-", end).c_str());
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_RANGE,
+                                 strings::StrCat(start, "-", end).c_str()),
+      "Setting range");
 }
 
 void CurlHttpRequest::AddHeader(const string& name, const string& value) {
@@ -210,7 +229,9 @@ void CurlHttpRequest::SetDeleteRequest() {
   CheckNotSent();
   CheckMethodNotSet();
   is_method_set_ = true;
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_CUSTOMREQUEST, "DELETE");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_CUSTOMREQUEST, "DELETE"),
+      "Setting delete request");
 }
 
 Status CurlHttpRequest::SetPutFromFile(const string& body_filepath,
@@ -232,9 +253,12 @@ Status CurlHttpRequest::SetPutFromFile(const string& body_filepath,
 
   curl_headers_ = libcurl_->curl_slist_append(
       curl_headers_, strings::StrCat("Content-Length: ", size).c_str());
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_PUT, 1);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
-                             reinterpret_cast<void*>(put_body_));
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_PUT, 1), "Setting PUT request");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
+                                 reinterpret_cast<void*>(put_body_)),
+      "Setting read data");
   // Using the default CURLOPT_READFUNCTION, which is doing an fread() on the
   // FILE * userdata set with CURLOPT_READDATA.
   return Status::OK();
@@ -244,13 +268,18 @@ void CurlHttpRequest::SetPutEmptyBody() {
   CheckNotSent();
   CheckMethodNotSet();
   is_method_set_ = true;
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_PUT, 1);
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_PUT, 1), "Setting PUT request");
   curl_headers_ =
       libcurl_->curl_slist_append(curl_headers_, "Content-Length: 0");
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
-                             &CurlHttpRequest::ReadCallback);
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
+                                 reinterpret_cast<void*>(this)),
+      "Setting read data");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
+                                 &CurlHttpRequest::ReadCallback),
+      "Setting read callback");
 }
 
 void CurlHttpRequest::SetPostFromBuffer(const char* buffer, size_t size) {
@@ -259,11 +288,17 @@ void CurlHttpRequest::SetPostFromBuffer(const char* buffer, size_t size) {
   is_method_set_ = true;
   curl_headers_ = libcurl_->curl_slist_append(
       curl_headers_, strings::StrCat("Content-Length: ", size).c_str());
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_POST, 1);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
-                             &CurlHttpRequest::ReadCallback);
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_POST, 1),
+      "Setting POST request");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
+                                 reinterpret_cast<void*>(this)),
+      "Setting read data");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
+                                 &CurlHttpRequest::ReadCallback),
+      "Setting read callback");
   post_body_buffer_ = StringPiece(buffer, size);
 }
 
@@ -271,13 +306,19 @@ void CurlHttpRequest::SetPostEmptyBody() {
   CheckNotSent();
   CheckMethodNotSet();
   is_method_set_ = true;
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_POST, 1);
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_POST, 1),
+      "Setting POST request");
   curl_headers_ =
       libcurl_->curl_slist_append(curl_headers_, "Content-Length: 0");
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
-                             &CurlHttpRequest::ReadCallback);
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_READDATA,
+                                 reinterpret_cast<void*>(this)),
+      "Setting read data");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_READFUNCTION,
+                                 &CurlHttpRequest::ReadCallback),
+      "Setting read callback");
 }
 
 void CurlHttpRequest::SetResultBuffer(std::vector<char>* out_buffer) {
@@ -287,10 +328,14 @@ void CurlHttpRequest::SetResultBuffer(std::vector<char>* out_buffer) {
   out_buffer->clear();
   response_buffer_ = out_buffer;
 
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEFUNCTION,
-                             &CurlHttpRequest::WriteCallback);
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEDATA,
+                                 reinterpret_cast<void*>(this)),
+      "Setting write data");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEFUNCTION,
+                                 &CurlHttpRequest::WriteCallback),
+      "Setting write callback");
 }
 
 void CurlHttpRequest::SetResultBufferDirect(char* buffer, size_t size) {
@@ -299,10 +344,14 @@ void CurlHttpRequest::SetResultBufferDirect(char* buffer, size_t size) {
 
   direct_response_ = DirectResponseState{buffer, size, 0};
 
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEFUNCTION,
-                             &CurlHttpRequest::WriteCallbackDirect);
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEDATA,
+                                 reinterpret_cast<void*>(this)),
+      "Setting write data");
+  TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_WRITEFUNCTION,
+                                 &CurlHttpRequest::WriteCallbackDirect),
+      "Setting write callback");
 }
 
 bool CurlHttpRequest::IsDirectResponse() const {
@@ -399,6 +448,24 @@ size_t CurlHttpRequest::HeaderCallback(const void* ptr, size_t size,
   return size * nmemb;
 }
 
+// This is pulled out as a separate function so that it's only computed when
+// an error occurs.
+string response_to_error_message(uint64 response_code, StringPiece response,
+                                 size_t response_to_error_limit,
+                                 CURLcode curl_result,
+                                 StringPiece error_buffer) {
+  string error_message = strings::StrCat(
+      "Error executing an HTTP request (HTTP response code ", response_code,
+      ", error code ", curl_result, ", error message '", error_buffer, "')");
+  if (!response.empty()) {
+    return strings::StrCat(
+        error_message, ", response '",
+        response.substr(0, std::min(response.size(), response_to_error_limit)),
+        "'");
+  }
+  return error_message;
+}
+
 Status CurlHttpRequest::Send() {
   CheckNotSent();
   CHECK(is_uri_set_) << "URI has not been set.";
@@ -406,37 +473,52 @@ Status CurlHttpRequest::Send() {
   is_sent_ = true;
 
   if (curl_headers_) {
-    libcurl_->curl_easy_setopt(curl_, CURLOPT_HTTPHEADER, curl_headers_);
+    TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
+        libcurl_->curl_easy_setopt(curl_, CURLOPT_HTTPHEADER, curl_headers_),
+        "Setting HTTP header");
   }
   if (resolve_list_) {
-    libcurl_->curl_easy_setopt(curl_, CURLOPT_RESOLVE, resolve_list_);
+    TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
+        libcurl_->curl_easy_setopt(curl_, CURLOPT_RESOLVE, resolve_list_),
+        "Setting custom resolves");
   }
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_HEADERDATA,
-                             reinterpret_cast<void*>(this));
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_HEADERFUNCTION,
-                             &CurlHttpRequest::HeaderCallback);
-
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_TIMEOUT, request_timeout_secs_);
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_CONNECTTIMEOUT,
-                             connect_timeout_secs_);
+  TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_HEADERDATA,
+                                 reinterpret_cast<void*>(this)),
+      "Setting header data");
+  TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_HEADERFUNCTION,
+                                 &CurlHttpRequest::HeaderCallback),
+      "Setting header function");
+
+  TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_TIMEOUT, request_timeout_secs_),
+      "Setting request timeout");
+  TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_CONNECTTIMEOUT,
+                                 connect_timeout_secs_),
+      "Setting connection timeout");
 
   char error_buffer[CURL_ERROR_SIZE] = {0};
-  libcurl_->curl_easy_setopt(curl_, CURLOPT_ERRORBUFFER, error_buffer);
+  TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_setopt(curl_, CURLOPT_ERRORBUFFER, error_buffer),
+      "Setting error buffer");
 
-  const auto curl_result = libcurl_->curl_easy_perform(curl_);
+  const CURLcode curl_result = libcurl_->curl_easy_perform(curl_);
+  TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
+      curl_result, "Performing request. Detailed error: ", error_buffer);
 
   double written_size = 0;
-  libcurl_->curl_easy_getinfo(curl_, CURLINFO_SIZE_DOWNLOAD, &written_size);
-
-  libcurl_->curl_easy_getinfo(curl_, CURLINFO_RESPONSE_CODE, &response_code_);
+  TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_getinfo(curl_, CURLINFO_SIZE_DOWNLOAD, &written_size),
+      "Fetching written size");
 
-  const auto& error_message = strings::StrCat(
-      "Error executing an HTTP request (HTTP response code ", response_code_,
-      ", error code ", curl_result, ", error message '", error_buffer, "')");
+  TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(
+      libcurl_->curl_easy_getinfo(curl_, CURLINFO_RESPONSE_CODE,
+                                  &response_code_),
+      "Fetching response code");
 
   Status result;
-  StringPiece response = GetResponse();
-  string extended_error_message;
   switch (response_code_) {
     // The group of response codes indicating that the request achieved
     // the expected goal.
@@ -447,7 +529,9 @@ Status CurlHttpRequest::Send() {
       if (curl_result != CURLE_OK) {
         // This means the server executed the request successfully, but then
         // something went wrong during the transmission of the response.
-        result = errors::Unavailable(error_message);
+        result = errors::Unavailable(response_to_error_message(
+            response_code_, GetResponse(), response_to_error_limit_,
+            curl_result, error_buffer));
       } else {
         result = Status::OK();
       }
@@ -463,27 +547,25 @@ Status CurlHttpRequest::Send() {
     // INVALID_ARGUMENT indicates a problem with how the request is constructed.
     case 400:  // Bad Request
     case 411:  // Length Required
-      result = errors::InvalidArgument(error_message);
+      result = errors::InvalidArgument(response_to_error_message(
+          response_code_, GetResponse(), response_to_error_limit_, curl_result,
+          error_buffer));
       break;
 
     // PERMISSION_DENIED indicates an authentication or an authorization issue.
     case 401:  // Unauthorized
     case 403:  // Forbidden
-      if (!response.empty()) {
-        extended_error_message = strings::StrCat(
-            error_message, ", response ",
-            response.substr(
-                0, std::min(response.size(), response_to_error_limit_)));
-        result = errors::PermissionDenied(extended_error_message);
-      } else {
-        result = errors::PermissionDenied(error_message);
-      }
+      result = errors::PermissionDenied(response_to_error_message(
+          response_code_, GetResponse(), response_to_error_limit_, curl_result,
+          error_buffer));
       break;
 
     // NOT_FOUND indicates that the requested resource does not exist.
     case 404:  // Not found
     case 410:  // Gone
-      result = errors::NotFound(error_message);
+      result = errors::NotFound(response_to_error_message(
+          response_code_, GetResponse(), response_to_error_limit_, curl_result,
+          error_buffer));
       break;
 
     // FAILED_PRECONDITION indicates that the request failed because some
@@ -493,21 +575,29 @@ Status CurlHttpRequest::Send() {
     case 303:  // See Other
     case 304:  // Not Modified
     case 307:  // Temporary Redirect
-    case 308:  // Resume Incomplete
     case 412:  // Precondition Failed
     case 413:  // Payload Too Large
-      result = errors::FailedPrecondition(error_message);
+      result = errors::FailedPrecondition(response_to_error_message(
+          response_code_, GetResponse(), response_to_error_limit_, curl_result,
+          error_buffer));
       break;
 
     // UNAVAILABLE indicates a problem that can go away if the request
-    // is just retried without any modification.
+    // is just retried without any modification. 308 return codes are intended
+    // for write requests that can be retried. See the documentation and the
+    // official library:
+    // https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload
+    // https://github.com/google/apitools/blob/master/apitools/base/py/transfer.py
+    case 308:  // Resume Incomplete
     case 409:  // Conflict
     case 429:  // Too Many Requests
     case 500:  // Internal Server Error
     case 502:  // Bad Gateway
     case 503:  // Service Unavailable
     default:   // All other HTTP response codes also should be retried.
-      result = errors::Unavailable(error_message);
+      result = errors::Unavailable(response_to_error_message(
+          response_code_, GetResponse(), response_to_error_limit_, curl_result,
+          error_buffer));
       break;
   }
   if (!result.ok()) {
@@ -596,4 +686,12 @@ int CurlHttpRequest::ProgressCallback(void* this_object, curl_off_t dltotal,
   return 0;
 }
 
+Status CURLcodeToStatus(CURLcode code) {
+  // Return Unavailable to retry by default. We should probably distinguish
+  // between permanent and temporary failures.
+  return errors::Unavailable("Error executing an HTTP request (error code ",
+                             code, ", error message '",
+                             curl_easy_strerror(code), "')");
+}
+
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/curl_http_request.h b/tensorflow/core/platform/cloud/curl_http_request.h
index cfa26f2b795a6cc33aba308597c77088362f1e1b..c9f60cb5fc2497051c66d22d069c0ce50202f864 100644
--- a/tensorflow/core/platform/cloud/curl_http_request.h
+++ b/tensorflow/core/platform/cloud/curl_http_request.h
@@ -229,26 +229,28 @@ class LibCurl {
 
   virtual CURL* curl_easy_init() = 0;
   virtual CURLcode curl_easy_setopt(CURL* curl, CURLoption option,
-                                    uint64 param) = 0;
+                                    uint64 param) TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_setopt(CURL* curl, CURLoption option,
-                                    const char* param) = 0;
+                                    const char* param) TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_setopt(CURL* curl, CURLoption option,
-                                    void* param) = 0;
-  virtual CURLcode curl_easy_setopt(CURL* curl, CURLoption option,
-                                    size_t (*param)(void*, size_t, size_t,
-                                                    FILE*)) = 0;
+                                    void* param) TF_MUST_USE_RESULT = 0;
+  virtual CURLcode curl_easy_setopt(
+      CURL* curl, CURLoption option,
+      size_t (*param)(void*, size_t, size_t, FILE*)) TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_setopt(CURL* curl, CURLoption option,
                                     size_t (*param)(const void*, size_t, size_t,
-                                                    void*)) = 0;
+                                                    void*))
+      TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_setopt(
       CURL* curl, CURLoption option,
       int (*param)(void* clientp, curl_off_t dltotal, curl_off_t dlnow,
-                   curl_off_t ultotal, curl_off_t ulnow)) = 0;
-  virtual CURLcode curl_easy_perform(CURL* curl) = 0;
+                   curl_off_t ultotal,
+                   curl_off_t ulnow)) TF_MUST_USE_RESULT = 0;
+  virtual CURLcode curl_easy_perform(CURL* curl) TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_getinfo(CURL* curl, CURLINFO info,
-                                     uint64* value) = 0;
+                                     uint64* value) TF_MUST_USE_RESULT = 0;
   virtual CURLcode curl_easy_getinfo(CURL* curl, CURLINFO info,
-                                     double* value) = 0;
+                                     double* value) TF_MUST_USE_RESULT = 0;
   virtual void curl_easy_cleanup(CURL* curl) = 0;
   virtual curl_slist* curl_slist_append(curl_slist* list, const char* str) = 0;
   virtual void curl_slist_free_all(curl_slist* list) = 0;
@@ -258,6 +260,26 @@ class LibCurl {
   virtual const char* curl_easy_strerror(CURLcode errornum) = 0;
 };
 
+Status CURLcodeToStatus(CURLcode code);
+
+#define TF_CURL_RETURN_WITH_CONTEXT_IF_ERROR(_code, ...)                    \
+  do {                                                                      \
+    if (_code != CURLE_OK) {                                                \
+      ::tensorflow::Status _status = ::tensorflow::CURLcodeToStatus(_code); \
+      ::tensorflow::errors::AppendToMessage(&_status, __VA_ARGS__);         \
+      return _status;                                                       \
+    }                                                                       \
+  } while (0)
+
+#define TF_CURL_LOG_WITH_CONTEXT_IF_ERROR(_code, ...)                       \
+  do {                                                                      \
+    if (_code != CURLE_OK) {                                                \
+      ::tensorflow::Status _status = ::tensorflow::CURLcodeToStatus(_code); \
+      ::tensorflow::errors::AppendToMessage(&_status, __VA_ARGS__);         \
+      LOG(ERROR) << "curl error: " << _status.error_message();              \
+    }                                                                       \
+  } while (0)
+
 }  // namespace tensorflow
 
 #endif  // TENSORFLOW_CORE_PLATFORM_CLOUD_CURL_HTTP_REQUEST_H_
diff --git a/tensorflow/core/platform/cloud/curl_http_request_test.cc b/tensorflow/core/platform/cloud/curl_http_request_test.cc
index 86d26a028733c303b85390b0be8fb8808c6e082a..4cded9b81b0065b6ceb60a5819caceb92b635ec9 100644
--- a/tensorflow/core/platform/cloud/curl_http_request_test.cc
+++ b/tensorflow/core/platform/cloud/curl_http_request_test.cc
@@ -346,7 +346,6 @@ TEST(CurlHttpRequestTest, GetRequest_Empty) {
 
 TEST(CurlHttpRequestTest, GetRequest_RangeOutOfBound) {
   FakeLibCurl libcurl("get response", 416);
-  libcurl.curl_easy_perform_result_ = CURLE_WRITE_ERROR;
   CurlHttpRequest http_request(&libcurl);
 
   std::vector<char> scratch;
@@ -377,10 +376,10 @@ TEST(CurlHttpRequestTest, GetRequest_503) {
   const auto& status = http_request.Send();
   EXPECT_EQ(error::UNAVAILABLE, status.code());
   EXPECT_EQ(
-      "Error executing an HTTP request (HTTP response code 503, "
-      "error code 23, error message '')",
+      "Error executing an HTTP request (error code 23, error message 'Failed "
+      "writing received data to disk/application')\n\tPerforming request. "
+      "Detailed error: ",
       status.error_message());
-  EXPECT_EQ(503, http_request.GetResponseCode());
 }
 
 TEST(CurlHttpRequestTest, GetRequest_HttpCode0) {
@@ -396,8 +395,9 @@ TEST(CurlHttpRequestTest, GetRequest_HttpCode0) {
   const auto& status = http_request.Send();
   EXPECT_EQ(error::UNAVAILABLE, status.code());
   EXPECT_EQ(
-      "Error executing an HTTP request (HTTP response code 0, "
-      "error code 28, error message 'Operation timed out')",
+      "Error executing an HTTP request (error code 28, error message 'Timeout "
+      "was reached')\n\tPerforming request. Detailed error: Operation timed "
+      "out",
       status.error_message());
   EXPECT_EQ(0, http_request.GetResponseCode());
 }
@@ -628,8 +628,9 @@ TEST(CurlHttpRequestTest, ProgressIsStuck) {
   auto status = http_request.Send();
   EXPECT_EQ(error::UNAVAILABLE, status.code());
   EXPECT_EQ(
-      "Error executing an HTTP request (HTTP response code 200, "
-      "error code 42, error message '')",
+      "Error executing an HTTP request (error code 42, error message "
+      "'Operation was aborted by an application callback')\n\tPerforming "
+      "request. Detailed error: ",
       status.error_message());
 }
 
diff --git a/tensorflow/core/platform/cloud/file_block_cache.h b/tensorflow/core/platform/cloud/file_block_cache.h
index 5c180e2332042af3ae938c2685ac416952b00187..da167882470bfa3d833faeb7f031fdd7064aba35 100644
--- a/tensorflow/core/platform/cloud/file_block_cache.h
+++ b/tensorflow/core/platform/cloud/file_block_cache.h
@@ -32,7 +32,7 @@ limitations under the License.
 
 namespace tensorflow {
 
-/// \brief An LRU block cache of file contents, keyed by {filename, offset}.
+/// \brief A block cache of file contents, keyed by {filename, offset}.
 ///
 /// This class should be shared by read-only random access files on a remote
 /// filesystem (e.g. GCS).
@@ -48,27 +48,7 @@ class FileBlockCache {
                                size_t* bytes_transferred)>
       BlockFetcher;
 
-  FileBlockCache(size_t block_size, size_t max_bytes, uint64 max_staleness,
-                 BlockFetcher block_fetcher, Env* env = Env::Default())
-      : block_size_(block_size),
-        max_bytes_(max_bytes),
-        max_staleness_(max_staleness),
-        block_fetcher_(block_fetcher),
-        env_(env) {
-    if (max_staleness_ > 0) {
-      pruning_thread_.reset(env_->StartThread(ThreadOptions(), "TF_prune_FBC",
-                                              [this] { Prune(); }));
-    }
-  }
-
-  ~FileBlockCache() {
-    if (pruning_thread_) {
-      stop_pruning_thread_.Notify();
-      // Destroying pruning_thread_ will block until Prune() receives the above
-      // notification and returns.
-      pruning_thread_.reset();
-    }
-  }
+  virtual ~FileBlockCache() {}
 
   /// Read `n` bytes from `filename` starting at `offset` into `out`. This
   /// method will return:
@@ -84,143 +64,22 @@ class FileBlockCache {
   ///    placed in `out`.
   /// 4) OK otherwise (i.e. the read succeeded, and at least one byte was placed
   ///    in `out`).
-  Status Read(const string& filename, size_t offset, size_t n, char* buffer,
-              size_t* bytes_transferred);
+  virtual Status Read(const string& filename, size_t offset, size_t n,
+                      char* buffer, size_t* bytes_transferred) = 0;
 
   /// Remove all cached blocks for `filename`.
-  void RemoveFile(const string& filename) LOCKS_EXCLUDED(mu_);
+  virtual void RemoveFile(const string& filename) = 0;
 
   /// Remove all cached data.
-  void Flush() LOCKS_EXCLUDED(mu_);
+  virtual void Flush() = 0;
 
   /// Accessors for cache parameters.
-  size_t block_size() const { return block_size_; }
-  size_t max_bytes() const { return max_bytes_; }
-  uint64 max_staleness() const { return max_staleness_; }
+  virtual size_t block_size() const = 0;
+  virtual size_t max_bytes() const = 0;
+  virtual uint64 max_staleness() const = 0;
 
   /// The current size (in bytes) of the cache.
-  size_t CacheSize() const LOCKS_EXCLUDED(mu_);
-
- private:
-  /// The size of the blocks stored in the LRU cache, as well as the size of the
-  /// reads from the underlying filesystem.
-  const size_t block_size_;
-  /// The maximum number of bytes (sum of block sizes) allowed in the LRU cache.
-  const size_t max_bytes_;
-  /// The maximum staleness of any block in the LRU cache, in seconds.
-  const uint64 max_staleness_;
-  /// The callback to read a block from the underlying filesystem.
-  const BlockFetcher block_fetcher_;
-  /// The Env from which we read timestamps.
-  Env* const env_;  // not owned
-
-  /// \brief The key type for the file block cache.
-  ///
-  /// The file block cache key is a {filename, offset} pair.
-  typedef std::pair<string, size_t> Key;
-
-  /// \brief The state of a block.
-  ///
-  /// A block begins in the CREATED stage. The first thread will attempt to read
-  /// the block from the filesystem, transitioning the state of the block to
-  /// FETCHING. After completing, if the read was successful the state should
-  /// be FINISHED. Otherwise the state should be ERROR. A subsequent read can
-  /// re-fetch the block if the state is ERROR.
-  enum class FetchState {
-    CREATED,
-    FETCHING,
-    FINISHED,
-    ERROR,
-  };
-
-  /// \brief A block of a file.
-  ///
-  /// A file block consists of the block data, the block's current position in
-  /// the LRU cache, the timestamp (seconds since epoch) at which the block
-  /// was cached, a coordination lock, and state & condition variables.
-  ///
-  /// Thread safety:
-  /// The iterator and timestamp fields should only be accessed while holding
-  /// the block-cache-wide mu_ instance variable. The state variable should only
-  /// be accessed while holding the Block's mu lock. The data vector should only
-  /// be accessed after state == FINISHED, and it should never be modified.
-  ///
-  /// In order to prevent deadlocks, never grab the block-cache-wide mu_ lock
-  /// AFTER grabbing any block's mu lock. It is safe to grab mu without locking
-  /// mu_.
-  struct Block {
-    /// The block data.
-    std::vector<char> data;
-    /// A list iterator pointing to the block's position in the LRU list.
-    std::list<Key>::iterator lru_iterator;
-    /// A list iterator pointing to the block's position in the LRA list.
-    std::list<Key>::iterator lra_iterator;
-    /// The timestamp (seconds since epoch) at which the block was cached.
-    uint64 timestamp;
-    /// Mutex to guard state variable
-    mutex mu;
-    /// The state of the block.
-    FetchState state GUARDED_BY(mu) = FetchState::CREATED;
-    /// Wait on cond_var if state is FETCHING.
-    condition_variable cond_var;
-  };
-
-  /// \brief The block map type for the file block cache.
-  ///
-  /// The block map is an ordered map from Key to Block.
-  typedef std::map<Key, std::shared_ptr<Block>> BlockMap;
-
-  /// Prune the cache by removing files with expired blocks.
-  void Prune() LOCKS_EXCLUDED(mu_);
-
-  bool BlockNotStale(const std::shared_ptr<Block>& block)
-      EXCLUSIVE_LOCKS_REQUIRED(mu_);
-
-  /// Look up a Key in the block cache.
-  std::shared_ptr<Block> Lookup(const Key& key) LOCKS_EXCLUDED(mu_);
-
-  Status MaybeFetch(const Key& key, const std::shared_ptr<Block>& block)
-      LOCKS_EXCLUDED(mu_);
-
-  /// Trim the block cache to make room for another entry.
-  void Trim() EXCLUSIVE_LOCKS_REQUIRED(mu_);
-
-  /// Update the LRU iterator for the block at `key`.
-  Status UpdateLRU(const Key& key, const std::shared_ptr<Block>& block)
-      LOCKS_EXCLUDED(mu_);
-
-  /// Remove all blocks of a file, with mu_ already held.
-  void RemoveFile_Locked(const string& filename) EXCLUSIVE_LOCKS_REQUIRED(mu_);
-
-  /// Remove the block `entry` from the block map and LRU list, and update the
-  /// cache size accordingly.
-  void RemoveBlock(BlockMap::iterator entry) EXCLUSIVE_LOCKS_REQUIRED(mu_);
-
-  /// The cache pruning thread that removes files with expired blocks.
-  std::unique_ptr<Thread> pruning_thread_;
-
-  /// Notification for stopping the cache pruning thread.
-  Notification stop_pruning_thread_;
-
-  /// Guards access to the block map, LRU list, and cached byte count.
-  mutable mutex mu_;
-
-  /// The block map (map from Key to Block).
-  BlockMap block_map_ GUARDED_BY(mu_);
-
-  /// The LRU list of block keys. The front of the list identifies the most
-  /// recently accessed block.
-  std::list<Key> lru_list_ GUARDED_BY(mu_);
-
-  /// The LRA (least recently added) list of block keys. The front of the list
-  /// identifies the most recently added block.
-  ///
-  /// Note: blocks are added to lra_list_ only after they have successfully been
-  /// fetched from the underlying block store.
-  std::list<Key> lra_list_ GUARDED_BY(mu_);
-
-  /// The combined number of bytes in all of the cached blocks.
-  size_t cache_size_ GUARDED_BY(mu_) = 0;
+  virtual size_t CacheSize() const = 0;
 };
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/gcs_file_system.cc b/tensorflow/core/platform/cloud/gcs_file_system.cc
index 01ca0d76bab2720513775ef33ff8670bd148c241..84b65cec4fa97ddc3400a8567c1dc097a0f2f56c 100644
--- a/tensorflow/core/platform/cloud/gcs_file_system.cc
+++ b/tensorflow/core/platform/cloud/gcs_file_system.cc
@@ -36,6 +36,7 @@ limitations under the License.
 #include "tensorflow/core/platform/cloud/curl_http_request.h"
 #include "tensorflow/core/platform/cloud/file_block_cache.h"
 #include "tensorflow/core/platform/cloud/google_auth_provider.h"
+#include "tensorflow/core/platform/cloud/ram_file_block_cache.h"
 #include "tensorflow/core/platform/cloud/retrying_utils.h"
 #include "tensorflow/core/platform/cloud/time_util.h"
 #include "tensorflow/core/platform/env.h"
@@ -783,13 +784,13 @@ Status GcsFileSystem::NewRandomAccessFile(
 // A helper function to build a FileBlockCache for GcsFileSystem.
 std::unique_ptr<FileBlockCache> GcsFileSystem::MakeFileBlockCache(
     size_t block_size, size_t max_bytes, uint64 max_staleness) {
-  std::unique_ptr<FileBlockCache> file_block_cache(
-      new FileBlockCache(block_size, max_bytes, max_staleness,
-                         [this](const string& filename, size_t offset, size_t n,
-                                char* buffer, size_t* bytes_transferred) {
-                           return LoadBufferFromGCS(filename, offset, n, buffer,
-                                                    bytes_transferred);
-                         }));
+  std::unique_ptr<FileBlockCache> file_block_cache(new RamFileBlockCache(
+      block_size, max_bytes, max_staleness,
+      [this](const string& filename, size_t offset, size_t n, char* buffer,
+             size_t* bytes_transferred) {
+        return LoadBufferFromGCS(filename, offset, n, buffer,
+                                 bytes_transferred);
+      }));
   return file_block_cache;
 }
 
diff --git a/tensorflow/core/platform/cloud/gcs_file_system_test.cc b/tensorflow/core/platform/cloud/gcs_file_system_test.cc
index d452074ce312f98abe6b058ea56d2e0ce4cf047a..cd9fd3adea090b2c33db70b09e83cf5cc220a5e7 100644
--- a/tensorflow/core/platform/cloud/gcs_file_system_test.cc
+++ b/tensorflow/core/platform/cloud/gcs_file_system_test.cc
@@ -393,7 +393,7 @@ TEST(GcsFileSystemTest, NewWritableFile_ResumeUploadSucceeds) {
                            "Timeouts: 5 1 10\n"
                            "Header Content-Range: bytes */17\n"
                            "Put: yes\n",
-                           "", errors::FailedPrecondition("308"), nullptr,
+                           "", errors::Unavailable("308"), nullptr,
                            {{"Range", "0-10"}}, 308),
        new FakeHttpRequest("Uri: https://custom/upload/location\n"
                            "Auth Token: fake_token\n"
@@ -406,13 +406,26 @@ TEST(GcsFileSystemTest, NewWritableFile_ResumeUploadSucceeds) {
                            "Timeouts: 5 1 10\n"
                            "Header Content-Range: bytes */17\n"
                            "Put: yes\n",
-                           "", errors::FailedPrecondition("308"), nullptr,
+                           "", errors::Unavailable("308"), nullptr,
                            {{"Range", "bytes=0-12"}}, 308),
        new FakeHttpRequest("Uri: https://custom/upload/location\n"
                            "Auth Token: fake_token\n"
                            "Header Content-Range: bytes 13-16/17\n"
                            "Timeouts: 5 1 30\n"
                            "Put body: ent2\n",
+                           "", errors::Unavailable("308"), 308),
+       new FakeHttpRequest("Uri: https://custom/upload/location\n"
+                           "Auth Token: fake_token\n"
+                           "Timeouts: 5 1 10\n"
+                           "Header Content-Range: bytes */17\n"
+                           "Put: yes\n",
+                           "", errors::Unavailable("308"), nullptr,
+                           {{"Range", "bytes=0-14"}}, 308),
+       new FakeHttpRequest("Uri: https://custom/upload/location\n"
+                           "Auth Token: fake_token\n"
+                           "Header Content-Range: bytes 15-16/17\n"
+                           "Timeouts: 5 1 30\n"
+                           "Put body: t2\n",
                            "")});
   GcsFileSystem fs(std::unique_ptr<AuthProvider>(new FakeAuthProvider),
                    std::unique_ptr<HttpRequest::Factory>(
@@ -521,14 +534,14 @@ TEST(GcsFileSystemTest, NewWritableFile_ResumeUploadAllAttemptsFail) {
                            "Put body: content1,content2\n",
                            "", errors::Unavailable("503"), 503)});
   for (int i = 0; i < 10; i++) {
-    requests.emplace_back(new FakeHttpRequest(
-        "Uri: https://custom/upload/location\n"
-        "Auth Token: fake_token\n"
-        "Timeouts: 5 1 10\n"
-        "Header Content-Range: bytes */17\n"
-        "Put: yes\n",
-        "", errors::FailedPrecondition("important HTTP error 308"), nullptr,
-        {{"Range", "0-10"}}, 308));
+    requests.emplace_back(
+        new FakeHttpRequest("Uri: https://custom/upload/location\n"
+                            "Auth Token: fake_token\n"
+                            "Timeouts: 5 1 10\n"
+                            "Header Content-Range: bytes */17\n"
+                            "Put: yes\n",
+                            "", errors::Unavailable("important HTTP error 308"),
+                            nullptr, {{"Range", "0-10"}}, 308));
     requests.emplace_back(new FakeHttpRequest(
         "Uri: https://custom/upload/location\n"
         "Auth Token: fake_token\n"
diff --git a/tensorflow/core/platform/cloud/gcs_throttle.cc b/tensorflow/core/platform/cloud/gcs_throttle.cc
index eb5f8958a37f45aeac1a836ca037f91931bb34a6..27dd06a6250ad457d0ec142c07d29a2358dddaee 100644
--- a/tensorflow/core/platform/cloud/gcs_throttle.cc
+++ b/tensorflow/core/platform/cloud/gcs_throttle.cc
@@ -26,10 +26,9 @@ GcsThrottle::GcsThrottle(EnvTime* env_time)
 
 bool GcsThrottle::AdmitRequest() {
   mutex_lock l(mu_);
-  if (!config_.enabled) return true;
   UpdateState();
   if (available_tokens_ < config_.tokens_per_request) {
-    return false;
+    return !config_.enabled;  // Out of tokens: admit only when disabled.
   }
   available_tokens_ -= config_.tokens_per_request;
   return true;
@@ -37,7 +36,6 @@ bool GcsThrottle::AdmitRequest() {
 
 void GcsThrottle::RecordResponse(size_t num_bytes) {
   mutex_lock l(mu_);
-  if (!config_.enabled) return;
   UpdateState();
   available_tokens_ -= request_bytes_to_tokens(num_bytes);
 }
diff --git a/tensorflow/core/platform/cloud/gcs_throttle.h b/tensorflow/core/platform/cloud/gcs_throttle.h
index 1a89daef084e921f1ad8bd856cefcc62d0d7aa1c..6d5eed7338f14d4c9258dbe950af9b789cb157cf 100644
--- a/tensorflow/core/platform/cloud/gcs_throttle.h
+++ b/tensorflow/core/platform/cloud/gcs_throttle.h
@@ -109,13 +109,22 @@ class GcsThrottle {
    * purpose of this function is to make available to monitoring or other
    * instrumentation the number of available tokens in the pool.
    */
-  inline int64 available_tokens() {
+  inline int64 available_tokens() LOCKS_EXCLUDED(mu_) {
     mutex_lock l(mu_);
-    if (!config_.enabled) return 0;
     UpdateState();
     return available_tokens_;
   }
 
+  /**
+   * is_enabled reports whether the throttle is enabled.
+   *
+   * If !is_enabled(), AdmitRequest() will always return true.
+   */
+  bool is_enabled() LOCKS_EXCLUDED(mu_) {
+    mutex_lock l(mu_);
+    return config_.enabled;
+  }
+
  private:
   /**
    * UpdateState updates the available_tokens_ and last_updated_secs_ variables.
diff --git a/tensorflow/core/platform/cloud/gcs_throttle_test.cc b/tensorflow/core/platform/cloud/gcs_throttle_test.cc
index 694756022e37263a07f8215bf7496c9ca130fd58..57193ac4057550463b6bea29089bdd545f2f0a33 100644
--- a/tensorflow/core/platform/cloud/gcs_throttle_test.cc
+++ b/tensorflow/core/platform/cloud/gcs_throttle_test.cc
@@ -96,6 +96,24 @@ TEST_F(GcsThrottleTest, ReverseTime) {
   EXPECT_EQ(200000, throttle_.available_tokens());
 }
 
+TEST(GcsThrottleDisabledTest, Disabled) {
+  TestTime time;
+  GcsThrottle throttle(&time);
+  ASSERT_FALSE(throttle.is_enabled());  // Verify throttle is disabled.
+
+  EXPECT_EQ(0, throttle.available_tokens());
+  time.AdvanceSeconds(1);
+  EXPECT_EQ(100000, throttle.available_tokens());
+  EXPECT_TRUE(throttle.AdmitRequest());
+  EXPECT_EQ(99900, throttle.available_tokens());
+  time.AdvanceSeconds(1);
+  EXPECT_EQ(199900, throttle.available_tokens());
+  throttle.RecordResponse(128000000);  // 128 MB response.
+  EXPECT_LT(0, throttle.available_tokens());
+  // Admit request even without available tokens
+  EXPECT_TRUE(throttle.AdmitRequest());
+}
+
 }  // namespace
 
 }  // namespace tensorflow
diff --git a/tensorflow/core/platform/cloud/file_block_cache.cc b/tensorflow/core/platform/cloud/ram_file_block_cache.cc
similarity index 89%
rename from tensorflow/core/platform/cloud/file_block_cache.cc
rename to tensorflow/core/platform/cloud/ram_file_block_cache.cc
index 6add1142a15fb69044828bd82a6d6e838959de08..55a5657a503a334866cad737bb11fe505e59699a 100644
--- a/tensorflow/core/platform/cloud/file_block_cache.cc
+++ b/tensorflow/core/platform/cloud/ram_file_block_cache.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/core/platform/cloud/file_block_cache.h"
+#include "tensorflow/core/platform/cloud/ram_file_block_cache.h"
 #include <cstring>
 #include <memory>
 #include "tensorflow/core/lib/gtl/cleanup.h"
@@ -21,7 +21,7 @@ limitations under the License.
 
 namespace tensorflow {
 
-bool FileBlockCache::BlockNotStale(const std::shared_ptr<Block>& block) {
+bool RamFileBlockCache::BlockNotStale(const std::shared_ptr<Block>& block) {
   mutex_lock l(block->mu);
   if (block->state != FetchState::FINISHED) {
     return true;  // No need to check for staleness.
@@ -30,7 +30,8 @@ bool FileBlockCache::BlockNotStale(const std::shared_ptr<Block>& block) {
   return env_->NowSeconds() - block->timestamp <= max_staleness_;
 }
 
-std::shared_ptr<FileBlockCache::Block> FileBlockCache::Lookup(const Key& key) {
+std::shared_ptr<RamFileBlockCache::Block> RamFileBlockCache::Lookup(
+    const Key& key) {
   mutex_lock lock(mu_);
   auto entry = block_map_.find(key);
   if (entry != block_map_.end()) {
@@ -55,15 +56,15 @@ std::shared_ptr<FileBlockCache::Block> FileBlockCache::Lookup(const Key& key) {
 }
 
 // Remove blocks from the cache until we do not exceed our maximum size.
-void FileBlockCache::Trim() {
+void RamFileBlockCache::Trim() {
   while (!lru_list_.empty() && cache_size_ > max_bytes_) {
     RemoveBlock(block_map_.find(lru_list_.back()));
   }
 }
 
 /// Move the block to the front of the LRU list if it isn't already there.
-Status FileBlockCache::UpdateLRU(const Key& key,
-                                 const std::shared_ptr<Block>& block) {
+Status RamFileBlockCache::UpdateLRU(const Key& key,
+                                    const std::shared_ptr<Block>& block) {
   mutex_lock lock(mu_);
   if (block->timestamp == 0) {
     // The block was evicted from another thread. Allow it to remain evicted.
@@ -92,8 +93,8 @@ Status FileBlockCache::UpdateLRU(const Key& key,
   return Status::OK();
 }
 
-Status FileBlockCache::MaybeFetch(const Key& key,
-                                  const std::shared_ptr<Block>& block) {
+Status RamFileBlockCache::MaybeFetch(const Key& key,
+                                     const std::shared_ptr<Block>& block) {
   bool downloaded_block = false;
   auto reconcile_state =
       gtl::MakeCleanup([this, &downloaded_block, &key, &block] {
@@ -151,11 +152,11 @@ Status FileBlockCache::MaybeFetch(const Key& key,
     }
   }
   return errors::Internal(
-      "Control flow should never reach the end of FileBlockCache::Fetch.");
+      "Control flow should never reach the end of RamFileBlockCache::Fetch.");
 }
 
-Status FileBlockCache::Read(const string& filename, size_t offset, size_t n,
-                            char* buffer, size_t* bytes_transferred) {
+Status RamFileBlockCache::Read(const string& filename, size_t offset, size_t n,
+                               char* buffer, size_t* bytes_transferred) {
   *bytes_transferred = 0;
   if (n == 0) {
     return Status::OK();
@@ -216,12 +217,12 @@ Status FileBlockCache::Read(const string& filename, size_t offset, size_t n,
   return Status::OK();
 }
 
-size_t FileBlockCache::CacheSize() const {
+size_t RamFileBlockCache::CacheSize() const {
   mutex_lock lock(mu_);
   return cache_size_;
 }
 
-void FileBlockCache::Prune() {
+void RamFileBlockCache::Prune() {
   while (!WaitForNotificationWithTimeout(&stop_pruning_thread_, 1000000)) {
     mutex_lock lock(mu_);
     uint64 now = env_->NowSeconds();
@@ -238,7 +239,7 @@ void FileBlockCache::Prune() {
   }
 }
 
-void FileBlockCache::Flush() {
+void RamFileBlockCache::Flush() {
   mutex_lock lock(mu_);
   block_map_.clear();
   lru_list_.clear();
@@ -246,12 +247,12 @@ void FileBlockCache::Flush() {
   cache_size_ = 0;
 }
 
-void FileBlockCache::RemoveFile(const string& filename) {
+void RamFileBlockCache::RemoveFile(const string& filename) {
   mutex_lock lock(mu_);
   RemoveFile_Locked(filename);
 }
 
-void FileBlockCache::RemoveFile_Locked(const string& filename) {
+void RamFileBlockCache::RemoveFile_Locked(const string& filename) {
   Key begin = std::make_pair(filename, 0);
   auto it = block_map_.lower_bound(begin);
   while (it != block_map_.end() && it->first.first == filename) {
@@ -261,7 +262,7 @@ void FileBlockCache::RemoveFile_Locked(const string& filename) {
   }
 }
 
-void FileBlockCache::RemoveBlock(BlockMap::iterator entry) {
+void RamFileBlockCache::RemoveBlock(BlockMap::iterator entry) {
   // This signals that the block is removed, and should not be inadvertently
   // reinserted into the cache in UpdateLRU.
   entry->second->timestamp = 0;
diff --git a/tensorflow/core/platform/cloud/ram_file_block_cache.h b/tensorflow/core/platform/cloud/ram_file_block_cache.h
new file mode 100644
index 0000000000000000000000000000000000000000..7fdd7b2e0294e1cf289a77464fb60e08bdb28da7
--- /dev/null
+++ b/tensorflow/core/platform/cloud/ram_file_block_cache.h
@@ -0,0 +1,229 @@
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef TENSORFLOW_CORE_PLATFORM_CLOUD_RAM_FILE_BLOCK_CACHE_H_
+#define TENSORFLOW_CORE_PLATFORM_CLOUD_RAM_FILE_BLOCK_CACHE_H_
+
+#include <functional>
+#include <list>
+#include <map>
+#include <memory>
+#include <string>
+#include <vector>
+#include "tensorflow/core/lib/core/status.h"
+#include "tensorflow/core/lib/core/stringpiece.h"
+#include "tensorflow/core/platform/cloud/file_block_cache.h"
+#include "tensorflow/core/platform/env.h"
+#include "tensorflow/core/platform/mutex.h"
+#include "tensorflow/core/platform/notification.h"
+#include "tensorflow/core/platform/thread_annotations.h"
+#include "tensorflow/core/platform/types.h"
+
+namespace tensorflow {
+
+/// \brief An LRU block cache of file contents, keyed by {filename, offset}.
+///
+/// This class should be shared by read-only random access files on a remote
+/// filesystem (e.g. GCS).
+class RamFileBlockCache : public FileBlockCache {
+ public:
+  /// The callback executed when a block is not found in the cache, and needs to
+  /// be fetched from the backing filesystem. This callback is provided when the
+  /// cache is constructed. The returned Status should be OK as long as the
+  /// read from the remote filesystem succeeded (similar to the semantics of the
+  /// read(2) system call).
+  typedef std::function<Status(const string& filename, size_t offset,
+                               size_t buffer_size, char* buffer,
+                               size_t* bytes_transferred)>
+      BlockFetcher;
+
+  RamFileBlockCache(size_t block_size, size_t max_bytes, uint64 max_staleness,
+                    BlockFetcher block_fetcher, Env* env = Env::Default())
+      : block_size_(block_size),
+        max_bytes_(max_bytes),
+        max_staleness_(max_staleness),
+        block_fetcher_(block_fetcher),
+        env_(env) {
+    if (max_staleness_ > 0) {
+      pruning_thread_.reset(env_->StartThread(ThreadOptions(), "TF_prune_FBC",
+                                              [this] { Prune(); }));
+    }
+  }
+
+  ~RamFileBlockCache() override {
+    if (pruning_thread_) {
+      stop_pruning_thread_.Notify();
+      // Destroying pruning_thread_ will block until Prune() receives the above
+      // notification and returns.
+      pruning_thread_.reset();
+    }
+  }
+
+  /// Read `n` bytes from `filename` starting at `offset` into `buffer`. This
+  /// method will return:
+  ///
+  /// 1) The error from the remote filesystem, if the read from the remote
+  ///    filesystem failed.
+  /// 2) INTERNAL if the read from the remote filesystem succeeded,
+  ///    but the read returned a partial block, and the LRU cache contained a
+  ///    block at a higher offset (indicating that the partial block should have
+  ///    been a full block).
+  /// 3) OUT_OF_RANGE if the read from the remote filesystem succeeded, but
+  ///    the file contents do not extend past `offset` and thus nothing was
+  ///    placed in `buffer`.
+  /// 4) OK otherwise (i.e. the read succeeded, and at least one byte was placed
+  ///    in `buffer`).
+  Status Read(const string& filename, size_t offset, size_t n, char* buffer,
+              size_t* bytes_transferred) override;
+
+  /// Remove all cached blocks for `filename`.
+  void RemoveFile(const string& filename) override LOCKS_EXCLUDED(mu_);
+
+  /// Remove all cached data.
+  void Flush() override LOCKS_EXCLUDED(mu_);
+
+  /// Accessors for cache parameters.
+  size_t block_size() const override { return block_size_; }
+  size_t max_bytes() const override { return max_bytes_; }
+  uint64 max_staleness() const override { return max_staleness_; }
+
+  /// The current size (in bytes) of the cache.
+  size_t CacheSize() const override LOCKS_EXCLUDED(mu_);
+
+ private:
+  /// The size of the blocks stored in the LRU cache, as well as the size of the
+  /// reads from the underlying filesystem.
+  const size_t block_size_;
+  /// The maximum number of bytes (sum of block sizes) allowed in the LRU cache.
+  const size_t max_bytes_;
+  /// The maximum staleness of any block in the LRU cache, in seconds.
+  const uint64 max_staleness_;
+  /// The callback to read a block from the underlying filesystem.
+  const BlockFetcher block_fetcher_;
+  /// The Env from which we read timestamps.
+  Env* const env_;  // not owned
+
+  /// \brief The key type for the file block cache.
+  ///
+  /// The file block cache key is a {filename, offset} pair.
+  typedef std::pair<string, size_t> Key;
+
+  /// \brief The state of a block.
+  ///
+  /// A block begins in the CREATED stage. The first thread will attempt to read
+  /// the block from the filesystem, transitioning the state of the block to
+  /// FETCHING. After completing, if the read was successful the state should
+  /// be FINISHED. Otherwise the state should be ERROR. A subsequent read can
+  /// re-fetch the block if the state is ERROR.
+  enum class FetchState {
+    CREATED,
+    FETCHING,
+    FINISHED,
+    ERROR,
+  };
+
+  /// \brief A block of a file.
+  ///
+  /// A file block consists of the block data, the block's current position in
+  /// the LRU cache, the timestamp (seconds since epoch) at which the block
+  /// was cached, a coordination lock, and state & condition variables.
+  ///
+  /// Thread safety:
+  /// The iterator and timestamp fields should only be accessed while holding
+  /// the block-cache-wide mu_ instance variable. The state variable should only
+  /// be accessed while holding the Block's mu lock. The data vector should only
+  /// be accessed after state == FINISHED, and it should never be modified.
+  ///
+  /// In order to prevent deadlocks, never grab the block-cache-wide mu_ lock
+  /// AFTER grabbing any block's mu lock. It is safe to grab mu without locking
+  /// mu_.
+  struct Block {
+    /// The block data.
+    std::vector<char> data;
+    /// A list iterator pointing to the block's position in the LRU list.
+    std::list<Key>::iterator lru_iterator;
+    /// A list iterator pointing to the block's position in the LRA list.
+    std::list<Key>::iterator lra_iterator;
+    /// The timestamp (seconds since epoch) at which the block was cached.
+    uint64 timestamp;
+    /// Mutex to guard the state variable.
+    mutex mu;
+    /// The state of the block.
+    FetchState state GUARDED_BY(mu) = FetchState::CREATED;
+    /// Wait on cond_var if state is FETCHING.
+    condition_variable cond_var;
+  };
+
+  /// \brief The block map type for the file block cache.
+  ///
+  /// The block map is an ordered map from Key to Block.
+  typedef std::map<Key, std::shared_ptr<Block>> BlockMap;
+
+  /// Prune the cache by removing files with expired blocks.
+  void Prune() LOCKS_EXCLUDED(mu_);
+
+  bool BlockNotStale(const std::shared_ptr<Block>& block)
+      EXCLUSIVE_LOCKS_REQUIRED(mu_);
+
+  /// Look up a Key in the block cache.
+  std::shared_ptr<Block> Lookup(const Key& key) LOCKS_EXCLUDED(mu_);
+
+  Status MaybeFetch(const Key& key, const std::shared_ptr<Block>& block)
+      LOCKS_EXCLUDED(mu_);
+
+  /// Trim the block cache to make room for another entry.
+  void Trim() EXCLUSIVE_LOCKS_REQUIRED(mu_);
+
+  /// Update the LRU iterator for the block at `key`.
+  Status UpdateLRU(const Key& key, const std::shared_ptr<Block>& block)
+      LOCKS_EXCLUDED(mu_);
+
+  /// Remove all blocks of a file, with mu_ already held.
+  void RemoveFile_Locked(const string& filename) EXCLUSIVE_LOCKS_REQUIRED(mu_);
+
+  /// Remove the block `entry` from the block map and LRU list, and update the
+  /// cache size accordingly.
+  void RemoveBlock(BlockMap::iterator entry) EXCLUSIVE_LOCKS_REQUIRED(mu_);
+
+  /// The cache pruning thread that removes files with expired blocks.
+  std::unique_ptr<Thread> pruning_thread_;
+
+  /// Notification for stopping the cache pruning thread.
+  Notification stop_pruning_thread_;
+
+  /// Guards access to the block map, LRU list, and cached byte count.
+  mutable mutex mu_;
+
+  /// The block map (map from Key to Block).
+  BlockMap block_map_ GUARDED_BY(mu_);
+
+  /// The LRU list of block keys. The front of the list identifies the most
+  /// recently accessed block.
+  std::list<Key> lru_list_ GUARDED_BY(mu_);
+
+  /// The LRA (least recently added) list of block keys. The front of the list
+  /// identifies the most recently added block.
+  ///
+  /// Note: blocks are added to lra_list_ only after they have successfully been
+  /// fetched from the underlying block store.
+  std::list<Key> lra_list_ GUARDED_BY(mu_);
+
+  /// The combined number of bytes in all of the cached blocks.
+  size_t cache_size_ GUARDED_BY(mu_) = 0;
+};
+
+}  // namespace tensorflow
+
+#endif  // TENSORFLOW_CORE_PLATFORM_CLOUD_RAM_FILE_BLOCK_CACHE_H_
diff --git a/tensorflow/core/platform/cloud/file_block_cache_test.cc b/tensorflow/core/platform/cloud/ram_file_block_cache_test.cc
similarity index 92%
rename from tensorflow/core/platform/cloud/file_block_cache_test.cc
rename to tensorflow/core/platform/cloud/ram_file_block_cache_test.cc
index 596fdbf19eb03a70c5659d392db368b3cdb791fe..d555b682a624309172588c9279d650d436f5d5cd 100644
--- a/tensorflow/core/platform/cloud/file_block_cache_test.cc
+++ b/tensorflow/core/platform/cloud/ram_file_block_cache_test.cc
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/core/platform/cloud/file_block_cache.h"
+#include "tensorflow/core/platform/cloud/ram_file_block_cache.h"
 #include <cstring>
 #include "tensorflow/core/lib/core/blocking_counter.h"
 #include "tensorflow/core/lib/core/status_test_util.h"
@@ -25,8 +25,8 @@ limitations under the License.
 namespace tensorflow {
 namespace {
 
-Status ReadCache(FileBlockCache* cache, const string& filename, size_t offset,
-                 size_t n, std::vector<char>* out) {
+Status ReadCache(RamFileBlockCache* cache, const string& filename,
+                 size_t offset, size_t n, std::vector<char>* out) {
   out->clear();
   out->resize(n, 0);
   size_t bytes_transferred = 0;
@@ -37,7 +37,7 @@ Status ReadCache(FileBlockCache* cache, const string& filename, size_t offset,
   return status;
 }
 
-TEST(FileBlockCacheTest, PassThrough) {
+TEST(RamFileBlockCacheTest, PassThrough) {
   const string want_filename = "foo/bar";
   const size_t want_offset = 42;
   const size_t want_n = 1024;
@@ -54,9 +54,9 @@ TEST(FileBlockCacheTest, PassThrough) {
     return Status::OK();
   };
   // If block_size, max_bytes, or both are zero, the cache is a pass-through.
-  FileBlockCache cache1(1, 0, 0, fetcher);
-  FileBlockCache cache2(0, 1, 0, fetcher);
-  FileBlockCache cache3(0, 0, 0, fetcher);
+  RamFileBlockCache cache1(1, 0, 0, fetcher);
+  RamFileBlockCache cache2(0, 1, 0, fetcher);
+  RamFileBlockCache cache3(0, 0, 0, fetcher);
   std::vector<char> out;
   TF_EXPECT_OK(ReadCache(&cache1, want_filename, want_offset, want_n, &out));
   EXPECT_EQ(calls, 1);
@@ -66,7 +66,7 @@ TEST(FileBlockCacheTest, PassThrough) {
   EXPECT_EQ(calls, 3);
 }
 
-TEST(FileBlockCacheTest, BlockAlignment) {
+TEST(RamFileBlockCacheTest, BlockAlignment) {
   // Initialize a 256-byte buffer.  This is the file underlying the reads we'll
   // do in this test.
   const size_t size = 256;
@@ -89,7 +89,7 @@ TEST(FileBlockCacheTest, BlockAlignment) {
   for (size_t block_size = 2; block_size <= 4; block_size++) {
     // Make a cache of N-byte block size (1 block) and verify that reads of
     // varying offsets and lengths return correct data.
-    FileBlockCache cache(block_size, block_size, 0, fetcher);
+    RamFileBlockCache cache(block_size, block_size, 0, fetcher);
     for (size_t offset = 0; offset < 10; offset++) {
       for (size_t n = block_size - 2; n <= block_size + 2; n++) {
         std::vector<char> got;
@@ -117,7 +117,7 @@ TEST(FileBlockCacheTest, BlockAlignment) {
   }
 }
 
-TEST(FileBlockCacheTest, CacheHits) {
+TEST(RamFileBlockCacheTest, CacheHits) {
   const size_t block_size = 16;
   std::set<size_t> calls;
   auto fetcher = [&calls, block_size](const string& filename, size_t offset,
@@ -132,7 +132,7 @@ TEST(FileBlockCacheTest, CacheHits) {
     return Status::OK();
   };
   const uint32 block_count = 256;
-  FileBlockCache cache(block_size, block_count * block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, block_count * block_size, 0, fetcher);
   std::vector<char> out;
   out.resize(block_count, 0);
   // The cache has space for `block_count` blocks. The loop with i = 0 should
@@ -146,7 +146,7 @@ TEST(FileBlockCacheTest, CacheHits) {
   }
 }
 
-TEST(FileBlockCacheTest, OutOfRange) {
+TEST(RamFileBlockCacheTest, OutOfRange) {
   // Tests reads of a 24-byte file with block size 16.
   const size_t block_size = 16;
   const size_t file_size = 24;
@@ -172,7 +172,7 @@ TEST(FileBlockCacheTest, OutOfRange) {
     *bytes_transferred = bytes_to_copy;
     return Status::OK();
   };
-  FileBlockCache cache(block_size, block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, block_size, 0, fetcher);
   std::vector<char> out;
   // Reading the first 16 bytes should be fine.
   TF_EXPECT_OK(ReadCache(&cache, "", 0, block_size, &out));
@@ -191,7 +191,7 @@ TEST(FileBlockCacheTest, OutOfRange) {
   EXPECT_EQ(out.size(), file_size - block_size);
 }
 
-TEST(FileBlockCacheTest, Inconsistent) {
+TEST(RamFileBlockCacheTest, Inconsistent) {
   // Tests the detection of interrupted reads leading to partially filled blocks
   // where we expected complete blocks.
   const size_t block_size = 16;
@@ -205,7 +205,7 @@ TEST(FileBlockCacheTest, Inconsistent) {
     *bytes_transferred = 1;
     return Status::OK();
   };
-  FileBlockCache cache(block_size, 2 * block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, 2 * block_size, 0, fetcher);
   std::vector<char> out;
   // Read the second block; this should yield an OK status and a single byte.
   TF_EXPECT_OK(ReadCache(&cache, "", block_size, block_size, &out));
@@ -216,7 +216,7 @@ TEST(FileBlockCacheTest, Inconsistent) {
   EXPECT_EQ(status.code(), error::INTERNAL);
 }
 
-TEST(FileBlockCacheTest, LRU) {
+TEST(RamFileBlockCacheTest, LRU) {
   const size_t block_size = 16;
   std::list<size_t> calls;
   auto fetcher = [&calls, block_size](const string& filename, size_t offset,
@@ -233,7 +233,7 @@ TEST(FileBlockCacheTest, LRU) {
     return Status::OK();
   };
   const uint32 block_count = 2;
-  FileBlockCache cache(block_size, block_count * block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, block_count * block_size, 0, fetcher);
   std::vector<char> out;
   // Read blocks from the cache, and verify the LRU behavior based on the
   // fetcher calls that the cache makes.
@@ -265,7 +265,7 @@ TEST(FileBlockCacheTest, LRU) {
   TF_EXPECT_OK(ReadCache(&cache, "", 0, 1, &out));
 }
 
-TEST(FileBlockCacheTest, MaxStaleness) {
+TEST(RamFileBlockCacheTest, MaxStaleness) {
   int calls = 0;
   auto fetcher = [&calls](const string& filename, size_t offset, size_t n,
                           char* buffer, size_t* bytes_transferred) {
@@ -278,7 +278,7 @@ TEST(FileBlockCacheTest, MaxStaleness) {
   std::unique_ptr<NowSecondsEnv> env(new NowSecondsEnv);
   // Create a cache with max staleness of 2 seconds, and verify that it works as
   // expected.
-  FileBlockCache cache1(8, 16, 2 /* max staleness */, fetcher, env.get());
+  RamFileBlockCache cache1(8, 16, 2 /* max staleness */, fetcher, env.get());
   // Execute the first read to load the block.
   TF_EXPECT_OK(ReadCache(&cache1, "", 0, 1, &out));
   EXPECT_EQ(calls, 1);
@@ -294,7 +294,7 @@ TEST(FileBlockCacheTest, MaxStaleness) {
   // as expected.
   calls = 0;
   env->SetNowSeconds(0);
-  FileBlockCache cache2(8, 16, 0 /* max staleness */, fetcher, env.get());
+  RamFileBlockCache cache2(8, 16, 0 /* max staleness */, fetcher, env.get());
   // Execute the first read to load the block.
   TF_EXPECT_OK(ReadCache(&cache2, "", 0, 1, &out));
   EXPECT_EQ(calls, 1);
@@ -305,7 +305,7 @@ TEST(FileBlockCacheTest, MaxStaleness) {
   EXPECT_EQ(calls, 1);
 }
 
-TEST(FileBlockCacheTest, RemoveFile) {
+TEST(RamFileBlockCacheTest, RemoveFile) {
   int calls = 0;
   auto fetcher = [&calls](const string& filename, size_t offset, size_t n,
                           char* buffer, size_t* bytes_transferred) {
@@ -321,7 +321,7 @@ TEST(FileBlockCacheTest, RemoveFile) {
   };
   // This cache has space for 4 blocks; we'll read from two files.
   const size_t n = 3;
-  FileBlockCache cache(8, 32, 0, fetcher);
+  RamFileBlockCache cache(8, 32, 0, fetcher);
   std::vector<char> out;
   std::vector<char> a(n, 'a');
   std::vector<char> b(n, 'b');
@@ -367,7 +367,7 @@ TEST(FileBlockCacheTest, RemoveFile) {
   EXPECT_EQ(calls, 6);
 }
 
-TEST(FileBlockCacheTest, Prune) {
+TEST(RamFileBlockCacheTest, Prune) {
   int calls = 0;
   auto fetcher = [&calls](const string& filename, size_t offset, size_t n,
                           char* buffer, size_t* bytes_transferred) {
@@ -381,7 +381,7 @@ TEST(FileBlockCacheTest, Prune) {
   std::unique_ptr<NowSecondsEnv> env(new NowSecondsEnv);
   uint64 now = Env::Default()->NowSeconds();
   env->SetNowSeconds(now);
-  FileBlockCache cache(8, 32, 1 /* max staleness */, fetcher, env.get());
+  RamFileBlockCache cache(8, 32, 1 /* max staleness */, fetcher, env.get());
   // Read three blocks into the cache, and advance the timestamp by one second
   // with each read. Start with a block of "a" at the current timestamp `now`.
   TF_EXPECT_OK(ReadCache(&cache, "a", 0, 1, &out));
@@ -426,7 +426,7 @@ TEST(FileBlockCacheTest, Prune) {
   EXPECT_EQ(cache.CacheSize(), 0);
 }
 
-TEST(FileBlockCacheTest, ParallelReads) {
+TEST(RamFileBlockCacheTest, ParallelReads) {
   // This fetcher won't respond until either `callers` threads are calling it
   // concurrently (at which point it will respond with success to all callers),
   // or 10 seconds have elapsed (at which point it will respond with an error).
@@ -444,7 +444,7 @@ TEST(FileBlockCacheTest, ParallelReads) {
     return Status::OK();
   };
   const int block_size = 8;
-  FileBlockCache cache(block_size, 2 * callers * block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, 2 * callers * block_size, 0, fetcher);
   std::vector<std::unique_ptr<Thread>> threads;
   for (int i = 0; i < callers; i++) {
     threads.emplace_back(
@@ -461,7 +461,7 @@ TEST(FileBlockCacheTest, ParallelReads) {
   // executed, or 10 seconds have passed).
 }
 
-TEST(FileBlockCacheTest, CoalesceConcurrentReads) {
+TEST(RamFileBlockCacheTest, CoalesceConcurrentReads) {
   // Concurrent reads to the same file blocks should be de-duplicated.
   const size_t block_size = 16;
   int num_requests = 0;
@@ -479,7 +479,7 @@ TEST(FileBlockCacheTest, CoalesceConcurrentReads) {
     Env::Default()->SleepForMicroseconds(100000);  // 0.1 secs
     return Status::OK();
   };
-  FileBlockCache cache(block_size, block_size, 0, fetcher);
+  RamFileBlockCache cache(block_size, block_size, 0, fetcher);
   // Fork off thread for parallel read.
   std::unique_ptr<Thread> concurrent(
       Env::Default()->StartThread({}, "concurrent", [&cache, block_size] {
@@ -496,7 +496,7 @@ TEST(FileBlockCacheTest, CoalesceConcurrentReads) {
   EXPECT_EQ(1, num_requests);
 }
 
-TEST(FileBlockCacheTest, Flush) {
+TEST(RamFileBlockCacheTest, Flush) {
   int calls = 0;
   auto fetcher = [&calls](const string& filename, size_t offset, size_t n,
                           char* buffer, size_t* bytes_transferred) {
@@ -505,7 +505,7 @@ TEST(FileBlockCacheTest, Flush) {
     *bytes_transferred = n;
     return Status::OK();
   };
-  FileBlockCache cache(16, 32, 0, fetcher);
+  RamFileBlockCache cache(16, 32, 0, fetcher);
   std::vector<char> out;
   TF_EXPECT_OK(ReadCache(&cache, "", 0, 16, &out));
   TF_EXPECT_OK(ReadCache(&cache, "", 0, 16, &out));
diff --git a/tensorflow/core/platform/default/context.h b/tensorflow/core/platform/default/context.h
index d8afeb47a9ca06e61a8c02962bc98d7797a279f7..682f64c26d7e3c4306df4139f0f48297e5c01a03 100644
--- a/tensorflow/core/platform/default/context.h
+++ b/tensorflow/core/platform/default/context.h
@@ -22,6 +22,8 @@ class Context {
  public:
   Context() {}
   Context(const ContextKind kind) {}
+
+  bool operator==(const Context& other) const { return true; }
 };
 
 class WithContext {
diff --git a/tensorflow/core/platform/denormal.cc b/tensorflow/core/platform/denormal.cc
index e00dbdb4ae5ef682369b345353e236a6084460ef..3631d9ddf99430372c11403dba56c14331a3db24 100644
--- a/tensorflow/core/platform/denormal.cc
+++ b/tensorflow/core/platform/denormal.cc
@@ -40,36 +40,51 @@ limitations under the License.
 namespace tensorflow {
 namespace port {
 
-ScopedFlushDenormal::ScopedFlushDenormal() {
+static void SetDenormalState(bool flush_zero_mode, bool denormals_zero_mode) {
   // For now, we flush denormals only on SSE 3.  Other architectures such as ARM
   // can be added as needed.
 
 #ifdef DENORM_USE_INTRINSICS
   if (TestCPUFeature(SSE3)) {
-    // Save existing flags
-    flush_zero_mode_ = _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON;
-    denormals_zero_mode_ =
-        _MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON;
-
-    // Flush denormals to zero (the FTZ flag).
-    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
-
-    // Interpret denormal inputs as zero (the DAZ flag).
-    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
+    // Set the FTZ and DAZ flags to the requested state.
+    _MM_SET_FLUSH_ZERO_MODE(flush_zero_mode ? _MM_FLUSH_ZERO_ON
+                                            : _MM_FLUSH_ZERO_OFF);
+    _MM_SET_DENORMALS_ZERO_MODE(denormals_zero_mode ? _MM_DENORMALS_ZERO_ON
+                                                    : _MM_DENORMALS_ZERO_OFF);
   }
 #endif
 }
 
-ScopedFlushDenormal::~ScopedFlushDenormal() {
+static std::pair<bool, bool> GetDenormalState() {
+  // For now, we flush denormals only on SSE 3.  Other architectures such as ARM
+  // can be added as needed.
+
 #ifdef DENORM_USE_INTRINSICS
   if (TestCPUFeature(SSE3)) {
-    // Restore flags
-    _MM_SET_FLUSH_ZERO_MODE(flush_zero_mode_ ? _MM_FLUSH_ZERO_ON
-                                             : _MM_FLUSH_ZERO_OFF);
-    _MM_SET_DENORMALS_ZERO_MODE(denormals_zero_mode_ ? _MM_DENORMALS_ZERO_ON
-                                                     : _MM_DENORMALS_ZERO_OFF);
+    // Save existing flags
+    bool flush_zero_mode = _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON;
+    bool denormals_zero_mode =
+        _MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON;
+    return {flush_zero_mode, denormals_zero_mode};
   }
 #endif
+  return {false, false};
+}
+
+ScopedRestoreFlushDenormalState::ScopedRestoreFlushDenormalState() {
+  std::tie(flush_zero_mode_, denormals_zero_mode_) = GetDenormalState();
+}
+
+ScopedRestoreFlushDenormalState::~ScopedRestoreFlushDenormalState() {
+  SetDenormalState(flush_zero_mode_, denormals_zero_mode_);
+}
+
+ScopedFlushDenormal::ScopedFlushDenormal() {
+  SetDenormalState(/*flush_zero_mode=*/true, /*denormals_zero_mode=*/true);
+}
+
+ScopedDontFlushDenormal::ScopedDontFlushDenormal() {
+  SetDenormalState(/*flush_zero_mode=*/false, /*denormals_zero_mode=*/false);
 }
 
 }  // namespace port
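
The refactoring above splits "remember and restore the state" from "change the state", so that `ScopedFlushDenormal` and `ScopedDontFlushDenormal` can share one restore helper. A minimal, platform-independent sketch of that pattern follows. It is an illustration under assumptions, not the TensorFlow implementation: `FakeDenormalFlags`, `g_flags`, `GetState`, `SetState`, and the `Scoped*` class names here are hypothetical stand-ins for the SSE control-register intrinsics and the real classes.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for the FTZ/DAZ control bits, so the sketch runs on
// any platform. The real code reads and writes these bits with the
// _MM_GET_* / _MM_SET_* intrinsics.
struct FakeDenormalFlags {
  bool flush_zero = false;
  bool denormals_zero = false;
};
static FakeDenormalFlags g_flags;

static std::pair<bool, bool> GetState() {
  return {g_flags.flush_zero, g_flags.denormals_zero};
}

static void SetState(bool flush_zero, bool denormals_zero) {
  g_flags.flush_zero = flush_zero;
  g_flags.denormals_zero = denormals_zero;
}

// Remembers the state on construction and restores it on destruction.
class ScopedRestoreState {
 public:
  ScopedRestoreState() : saved_(GetState()) {}
  ~ScopedRestoreState() { SetState(saved_.first, saved_.second); }
  ScopedRestoreState(const ScopedRestoreState&) = delete;
  ScopedRestoreState& operator=(const ScopedRestoreState&) = delete;

 private:
  std::pair<bool, bool> saved_;
};

// Turns flushing on for its lifetime. The restore_ member is initialized
// before the constructor body runs, so the old state is captured before it
// is overwritten.
class ScopedFlush {
 public:
  ScopedFlush() { SetState(true, true); }

 private:
  ScopedRestoreState restore_;
};

// Turns flushing off for its lifetime, using the same restore helper.
class ScopedDontFlush {
 public:
  ScopedDontFlush() { SetState(false, false); }

 private:
  ScopedRestoreState restore_;
};

// Nested scopes unwind in reverse order, each restoring the state it found.
inline void RunNestedScopesExample() {
  SetState(false, false);
  {
    ScopedFlush flush;
    assert(g_flags.flush_zero && g_flags.denormals_zero);
    {
      ScopedDontFlush dont_flush;
      assert(!g_flags.flush_zero && !g_flags.denormals_zero);
    }
    assert(g_flags.flush_zero && g_flags.denormals_zero);
  }
  assert(!g_flags.flush_zero && !g_flags.denormals_zero);
}
```

Because the restore helper is a data member, neither wrapper class needs its own destructor, which matches the removal of `~ScopedFlushDenormal()` in this change.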
diff --git a/tensorflow/core/platform/denormal.h b/tensorflow/core/platform/denormal.h
index 5e34131a3b8d8ec5b74bf66add1567e4f5207a02..09bb0352a2f375fac73054ca516cee79905795c1 100644
--- a/tensorflow/core/platform/denormal.h
+++ b/tensorflow/core/platform/denormal.h
@@ -21,19 +21,41 @@ limitations under the License.
 namespace tensorflow {
 namespace port {
 
+// Remembers the flush denormal state on construction and restores that same
+// state on destruction.
+class ScopedRestoreFlushDenormalState {
+ public:
+  ScopedRestoreFlushDenormalState();
+  ~ScopedRestoreFlushDenormalState();
+
+ private:
+  bool flush_zero_mode_;
+  bool denormals_zero_mode_;
+  TF_DISALLOW_COPY_AND_ASSIGN(ScopedRestoreFlushDenormalState);
+};
+
 // While this class is active, denormal floating point numbers are flushed
 // to zero.  The destructor restores the original flags.
 class ScopedFlushDenormal {
  public:
   ScopedFlushDenormal();
-  ~ScopedFlushDenormal();
 
  private:
-  bool flush_zero_mode_;
-  bool denormals_zero_mode_;
+  ScopedRestoreFlushDenormalState restore_;
   TF_DISALLOW_COPY_AND_ASSIGN(ScopedFlushDenormal);
 };
 
+// While this class is active, denormal floating point numbers are not flushed
+// to zero.  The destructor restores the original flags.
+class ScopedDontFlushDenormal {
+ public:
+  ScopedDontFlushDenormal();
+
+ private:
+  ScopedRestoreFlushDenormalState restore_;
+  TF_DISALLOW_COPY_AND_ASSIGN(ScopedDontFlushDenormal);
+};
+
 }  // namespace port
 }  // namespace tensorflow
 
diff --git a/tensorflow/core/platform/windows/port.cc b/tensorflow/core/platform/windows/port.cc
index 582b232054b850a2ef5ab8f47c089eb35a7bb3cf..f3b27ea394d04770b612752328d5d571e6521cc6 100644
--- a/tensorflow/core/platform/windows/port.cc
+++ b/tensorflow/core/platform/windows/port.cc
@@ -25,6 +25,7 @@ limitations under the License.
 #endif
 
 #include <Windows.h>
+#include <shlwapi.h>
 
 #include "tensorflow/core/platform/cpu_info.h"
 #include "tensorflow/core/platform/demangle.h"
@@ -149,11 +150,16 @@ bool Snappy_Uncompress(const char* input, size_t length, char* output) {
 string Demangle(const char* mangled) { return mangled; }
 
 double NominalCPUFrequency() {
-#ifdef TENSORFLOW_USE_ABSL
-  return absl::base_internal::NominalCPUFrequency();
-#else
+  DWORD data;
+  DWORD data_size = sizeof(data);
+  #pragma comment(lib, "shlwapi.lib")  // For SHGetValue().
+  // SHGetValueA() returns ERROR_SUCCESS (0) on success, not an HRESULT.
+  if (SHGetValueA(HKEY_LOCAL_MACHINE,
+                  "HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0",
+                  "~MHz", nullptr, &data, &data_size) == ERROR_SUCCESS) {
+    return data * 1e6;  // Value is MHz.
+  }
   return 1.0;
-#endif
 }
 
 int64 AvailableRam() {
diff --git a/tensorflow/core/protobuf/control_flow.proto b/tensorflow/core/protobuf/control_flow.proto
index 2c9476a08ad946e7f019475055397fcd6cfbbc5a..3c05b4f0e22e5ce2104980ad4fa52c8d8ad57070 100644
--- a/tensorflow/core/protobuf/control_flow.proto
+++ b/tensorflow/core/protobuf/control_flow.proto
@@ -17,6 +17,15 @@ message ValuesDef {
   map<string, string> external_values = 2;
 }
 
+// Container for any kind of control flow context. Any other control flow
+// contexts that are added below should also be added here.
+message ControlFlowContextDef {
+  oneof ctxt {
+    CondContextDef cond_ctxt = 1;
+    WhileContextDef while_ctxt = 2;
+  }
+}
+
 // Protocol buffer representing a CondContext object.
 message CondContextDef {
   // Name of the context.
@@ -33,6 +42,9 @@ message CondContextDef {
 
   // Values and external values in control flow context.
   ValuesDef values_def = 5;
+
+  // Contexts contained inside this context (e.g. nested conds).
+  repeated ControlFlowContextDef nested_contexts = 6;
 }
 
 // Protocol buffer representing a WhileContext object.
@@ -70,5 +82,8 @@ message WhileContextDef {
   // Optional name of the maximum_iterations tensor.
   string maximum_iterations_name = 11;
 
-  // Next available id: 12.
+  // Contexts contained inside this context (e.g. nested whiles).
+  repeated ControlFlowContextDef nested_contexts = 12;
+
+  // Next available id: 13.
 }
diff --git a/tensorflow/core/protobuf/rewriter_config.proto b/tensorflow/core/protobuf/rewriter_config.proto
index 0e9e202bc9a2d2368772c7fede9eb877d9d99023..0ccf2149f2cd4865627f7ab42441b8d94dc5fff4 100644
--- a/tensorflow/core/protobuf/rewriter_config.proto
+++ b/tensorflow/core/protobuf/rewriter_config.proto
@@ -30,31 +30,46 @@ message RewriterConfig {
   }
 
   // Optimize tensor layouts (default is ON)
+  // e.g. this will try to use the NCHW layout on GPU, which is faster.
   Toggle layout_optimizer = 1;
   // Fold constants (default is ON)
+  // Statically infer the value of tensors when possible, and materialize the
+  // result using constants.
   Toggle constant_folding = 3;
   // Arithmetic optimizations (default is ON)
+  // e.g. Simplify arithmetic ops; merge ops with same value (like constants).
   Toggle arithmetic_optimization = 7;
   // Control dependency optimizations (default is ON).
+  // Remove redundant control dependencies, which may enable other
+  // optimizations.
   Toggle dependency_optimization = 8;
   // Loop optimizations (default is OFF).
   Toggle loop_optimization = 9;
+  // Function optimizations (default is OFF).
+  Toggle function_optimization = 10;
   // If true, don't remove unnecessary ops from the graph
   bool disable_model_pruning = 2;
 
   enum MemOptType {
-    // The default setting (SCHEDULING_HEURISTICS only)
+    // The default setting (SCHEDULING and SWAPPING HEURISTICS only)
     DEFAULT_MEM_OPT = 0;
     // Disabled in the meta-optimizer.
     NO_MEM_OPT = 1;
     // Driven by manual op-level annotations.
     MANUAL = 2;
+
     // Driven by heuristics. The behavior of these heuristics is subject to
     // change. Currently includes an experimental recomputation and swapping
     // heuristics. Manual annotations are respected, but additional nodes are
     // selected automatically.
+
+    // Swapping heuristic will move a tensor from the GPU to the CPU and move
+    // it back when needed to reduce peak memory usage.
     SWAPPING_HEURISTICS = 4;
+    // Recomputation heuristics will recompute ops (such as Relu activation)
+    // during backprop instead of storing them, reducing peak memory usage.
     RECOMPUTATION_HEURISTICS = 5;
+    // Scheduling will split big ops such as AddN and try to enforce a schedule
+    // of the new computations that decreases peak memory usage.
     SCHEDULING_HEURISTICS = 6;
     // Use any combination of swapping and recomputation heuristics.
     HEURISTICS = 3;
@@ -63,16 +78,15 @@ message RewriterConfig {
   // effect on manually requested memory optimization passes in the optimizers
   // field.
   MemOptType memory_optimization = 4;
-  // The prefix for nodes which are valid outputs of recomputations. Inputs to
-  // nodes with this name prefix may be recomputed (subject either to manual
-  // annotation of those input nodes or to manual annotation and heuristics
-  // depending on memory_optimization), but the prefixed nodes themselves will
-  // not be recomputed. Typically this will be "gradients/", indicating that
-  // activations from the forward pass of a graph may be recomputed as inputs to
-  // gradients, but may be adjusted if gradients are inside a name scope or if
-  // inputs to non-gradients should be recomputed. Defaults to "gradients/" if
-  // empty or not set.
-  string memory_optimizer_target_node_name_prefix = 6;
+  // A node name scope for node names which are valid outputs of recomputations.
+  // Inputs to nodes that match this scope may be recomputed (subject either to
+  // manual annotation of those input nodes or to manual annotation and
+  // heuristics depending on memory_optimization), but the nodes themselves will
+  // not be recomputed. This matches any sub-scopes as well, meaning the scope
+  // can appear not just as a top-level scope. For example, if the value is
+  // "gradients/", the default, it will match node names "gradients/foo" and
+  // "foo/gradients/bar", but not "foo_gradients/".
+  string memory_optimizer_target_node_name_scope = 6;
 
   // Configures AutoParallel optimization passes either through the
   // meta-optimizer or when manually specified through the optimizers field.
@@ -87,5 +101,8 @@ message RewriterConfig {
   // ("autoparallel"). Memory optimization passes ("memory") invoked here are
   // not configurable (in contrast to memory optimization passes through the
   // meta-optimizer) and act only on manual op annotations.
+  //
+  // Custom registered optimizers will be run after the base optimizers, in
+  // the order that they are specified.
   repeated string optimizers = 100;
 }
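
The comment on `memory_optimizer_target_node_name_scope` describes a matching rule: the scope matches at the start of a node name or after any `/`, but not as a mere substring. The sketch below shows one way such a rule could be written. The memory optimizer's actual matching code is not part of this diff, so the function name `MatchesNameScope` and its logic are assumptions for illustration only.

```cpp
#include <string>

// Hypothetical helper: returns true if `scope` (e.g. "gradients/") appears in
// `node_name` either as the top-level scope or as a nested sub-scope.
inline bool MatchesNameScope(const std::string& node_name,
                             const std::string& scope) {
  // Top-level scope: the node name starts with the scope string.
  if (node_name.compare(0, scope.size(), scope) == 0) return true;
  // Sub-scope: the scope string appears immediately after a '/' separator.
  return node_name.find("/" + scope) != std::string::npos;
}
```

With the default value `"gradients/"`, this sketch accepts `"gradients/foo"` and `"foo/gradients/bar"` and rejects `"foo_gradients/"`, which agrees with the examples given in the proto comment.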
diff --git a/tensorflow/core/user_ops/fact.cc b/tensorflow/core/user_ops/fact.cc
index 3a4fc8115a7f91badfeda369a599b3dba3057c63..2e8b22a49b620d08aa4f13da35e847b362dd2b3a 100644
--- a/tensorflow/core/user_ops/fact.cc
+++ b/tensorflow/core/user_ops/fact.cc
@@ -15,10 +15,13 @@ limitations under the License.
 
 // An example Op.
 
+#include "tensorflow/core/framework/common_shape_fns.h"
 #include "tensorflow/core/framework/op.h"
 #include "tensorflow/core/framework/op_kernel.h"
 
-REGISTER_OP("Fact").Output("fact: string");
+REGISTER_OP("Fact")
+    .Output("fact: string")
+    .SetShapeFn(tensorflow::shape_inference::UnknownShape);
 
 class FactOp : public tensorflow::OpKernel {
  public:
diff --git a/tensorflow/core/util/cuda_kernel_helper.h b/tensorflow/core/util/cuda_kernel_helper.h
index 3c59524cb6f85911544b8f2d7d3339e19af7f5b4..0ab875625ff617028c4bc53fa8ccba0488c3d0d1 100644
--- a/tensorflow/core/util/cuda_kernel_helper.h
+++ b/tensorflow/core/util/cuda_kernel_helper.h
@@ -21,6 +21,11 @@ limitations under the License.
 #include "tensorflow/core/util/cuda_device_functions.h"
 #include "tensorflow/core/util/cuda_launch_config.h"
 
+#if CUDA_VERSION >= 7050
+#include "cuda/include/cuda_fp16.h"
+#define TF_HAS_CUDA_FP16
+#endif
+
 // Deprecated, use 'for(int i : CudaGridRangeX(n))' instead.
 #define CUDA_1D_KERNEL_LOOP(i, n) \
   for (int i : ::tensorflow::CudaGridRangeX<int>(n))
diff --git a/tensorflow/core/util/cuda_launch_config.h b/tensorflow/core/util/cuda_launch_config.h
index 3ea33ee6cf2195cc0192c59d694672f0d4c69a56..81df7a51d703986b040b5d15e128139ae56c24fb 100644
--- a/tensorflow/core/util/cuda_launch_config.h
+++ b/tensorflow/core/util/cuda_launch_config.h
@@ -169,6 +169,30 @@ inline CudaLaunchConfig GetCudaLaunchConfig(int work_element_count,
   return config;
 }
 
+// Calculate the Cuda launch config we should use for a kernel launch. This
+// variant takes the resource limits of func into account to maximize occupancy.
+// The returned launch config has thread_per_block set to fixed_block_size.
+// REQUIRES: work_element_count > 0.
+template <typename DeviceFunc>
+inline CudaLaunchConfig GetCudaLaunchConfigFixedBlockSize(
+    int work_element_count, const Eigen::GpuDevice& d, DeviceFunc func,
+    size_t dynamic_shared_memory_size, int fixed_block_size) {
+  CHECK_GT(work_element_count, 0);
+  CudaLaunchConfig config;
+  int block_count = 0;
+
+  cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
+      &block_count, func, fixed_block_size, dynamic_shared_memory_size);
+  CHECK_EQ(err, cudaSuccess);
+  block_count = std::min(block_count * d.getNumCudaMultiProcessors(),
+                         DivUp(work_element_count, fixed_block_size));
+
+  config.virtual_thread_count = work_element_count;
+  config.thread_per_block = fixed_block_size;
+  config.block_count = block_count;
+  return config;
+}
+
 struct Cuda2DLaunchConfig {
   dim3 virtual_thread_count = dim3(0, 0, 0);
   dim3 thread_per_block = dim3(0, 0, 0);
@@ -236,20 +260,18 @@ inline Cuda3DLaunchConfig GetCuda3DLaunchConfig(
       block_size_limit);
   CHECK_EQ(err, cudaSuccess);
 
-  auto min3 = [](int a, int b, int c) { return std::min(a, std::min(b, c)); };
-
-  int threadsx = min3(xdim, thread_per_block, xthreadlimit);
+  int threadsx = std::min({xdim, thread_per_block, xthreadlimit});
   int threadsy =
-      min3(ydim, std::max(thread_per_block / threadsx, 1), ythreadlimit);
+      std::min({ydim, std::max(thread_per_block / threadsx, 1), ythreadlimit});
   int threadsz =
-      min3(zdim, std::max(thread_per_block / (threadsx * threadsy), 1),
-           zthreadlimit);
-
-  int blocksx = min3(block_count, DivUp(xdim, threadsx), xgridlimit);
-  int blocksy =
-      min3(DivUp(block_count, blocksx), DivUp(ydim, threadsy), ygridlimit);
-  int blocksz = min3(DivUp(block_count, (blocksx * blocksy)),
-                     DivUp(zdim, threadsz), zgridlimit);
+      std::min({zdim, std::max(thread_per_block / (threadsx * threadsy), 1),
+                zthreadlimit});
+
+  int blocksx = std::min({block_count, DivUp(xdim, threadsx), xgridlimit});
+  int blocksy = std::min(
+      {DivUp(block_count, blocksx), DivUp(ydim, threadsy), ygridlimit});
+  int blocksz = std::min({DivUp(block_count, (blocksx * blocksy)),
+                          DivUp(zdim, threadsz), zgridlimit});
 
   config.virtual_thread_count = dim3(xdim, ydim, zdim);
   config.thread_per_block = dim3(threadsx, threadsy, threadsz);
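
The arithmetic in `GetCudaLaunchConfigFixedBlockSize` and the `std::min({...})` rewrite can be checked on the host without a GPU. The sketch below is an illustration under assumptions: `max_active_blocks_per_sm` stands in for the value that `cudaOccupancyMaxActiveBlocksPerMultiprocessor` would report and `num_sms` for `d.getNumCudaMultiProcessors()`; both are plain inputs here, and the function names `FixedBlockSizeBlockCount` and `Min3` are hypothetical.

```cpp
#include <algorithm>

// Integer division rounding up; mirrors the DivUp helper used in the diff.
inline int DivUp(int a, int b) { return (a + b - 1) / b; }

// Block count for a fixed block size: never launch more blocks than can be
// resident on the device at once, and never more than the work requires.
inline int FixedBlockSizeBlockCount(int work_element_count,
                                    int fixed_block_size,
                                    int max_active_blocks_per_sm,
                                    int num_sms) {
  return std::min(max_active_blocks_per_sm * num_sms,
                  DivUp(work_element_count, fixed_block_size));
}

// The initializer-list overload of std::min replaces the hand-written
// three-argument lambda removed in this change.
inline int Min3(int a, int b, int c) { return std::min({a, b, c}); }
```

For a small input the work bound wins: 1000 elements at 256 threads per block need only `DivUp(1000, 256)` blocks. For a large input the occupancy bound wins, and each thread is expected to loop over several elements.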
diff --git a/tensorflow/core/util/events_writer.cc b/tensorflow/core/util/events_writer.cc
index 49507616ed8c6461f8d59d8899d93abb4ba58cd2..c50e329bda4b44cb5390081d889d81f231b031a5 100644
--- a/tensorflow/core/util/events_writer.cc
+++ b/tensorflow/core/util/events_writer.cc
@@ -122,9 +122,11 @@ Status EventsWriter::Flush() {
   CHECK(recordio_file_ != nullptr) << "Unexpected NULL file";
 
   TF_RETURN_WITH_CONTEXT_IF_ERROR(recordio_writer_->Flush(), "Failed to flush ",
-                                  num_outstanding_events_, " to ", filename_);
+                                  num_outstanding_events_, " events to ",
+                                  filename_);
   TF_RETURN_WITH_CONTEXT_IF_ERROR(recordio_file_->Sync(), "Failed to sync ",
-                                  num_outstanding_events_, " to ", filename_);
+                                  num_outstanding_events_, " events to ",
+                                  filename_);
 
   // The FileStillExists() condition is necessary because
   // recordio_writer_->Sync() can return OK even if the underlying
@@ -135,7 +137,8 @@ Status EventsWriter::Flush() {
   // disappearing file, in case for some file system File::Exists() is
   // false after File::Open() but before File::Sync().
   TF_RETURN_WITH_CONTEXT_IF_ERROR(FileStillExists(), "Failed to flush ",
-                                  num_outstanding_events_, " to ", filename_);
+                                  num_outstanding_events_, " events to ",
+                                  filename_);
   VLOG(1) << "Wrote " << num_outstanding_events_ << " events to disk.";
   num_outstanding_events_ = 0;
   return Status::OK();
diff --git a/tensorflow/core/util/mkl_util.h b/tensorflow/core/util/mkl_util.h
index eda966bc3342912334f90a7beddf8ccd3aefa68d..34db96075d45f690cffad44bcc08cdf17d6e68dc 100644
--- a/tensorflow/core/util/mkl_util.h
+++ b/tensorflow/core/util/mkl_util.h
@@ -1112,8 +1112,10 @@ inline void ForwardMklTensorInToOutWithMklShape(OpKernelContext* context,
 // Forward the MKL shape ONLY (used in elementwise and other ops where
 // we call the eigen implementation and MKL shape is not used)
 inline void ForwardMklMetaDataInToOut(OpKernelContext* context,
-                                      uint32 idx_data_in, uint32_t idx_data_out) {
-  uint32 idx_meta_in = GetTensorMetaDataIndex(idx_data_in, context->num_inputs());
+                                      uint32 idx_data_in,
+                                      uint32_t idx_data_out) {
+  uint32 idx_meta_in =
+      GetTensorMetaDataIndex(idx_data_in, context->num_inputs());
   uint32 idx_meta_out =
       GetTensorMetaDataIndex(idx_data_out, context->num_outputs());
 
diff --git a/tensorflow/docs_src/about/index.md b/tensorflow/docs_src/about/index.md
index 5326b1e11012618261b85d13770c793bc05736bf..dc1e9af8763e0b55bbee936ec491fba75c6507fd 100644
--- a/tensorflow/docs_src/about/index.md
+++ b/tensorflow/docs_src/about/index.md
@@ -3,7 +3,6 @@
 This section provides a few documents about TensorFlow itself,
 including the following:
 
-  * @{$roadmap$Roadmap}, which summarizes upcoming additions to TensorFlow.
   * @{$uses$TensorFlow in Use}, which provides a link to our model zoo and
     lists some popular ways that TensorFlow is being used.
   * @{$bib$TensorFlow White Papers}, which provides abstracts of white papers
diff --git a/tensorflow/docs_src/about/leftnav_files b/tensorflow/docs_src/about/leftnav_files
index 28f039e9b5f8c948dae16f6bcb74d03b3a7804e7..63763b9d9c9d5d1c604035678e855f29925b408e 100644
--- a/tensorflow/docs_src/about/leftnav_files
+++ b/tensorflow/docs_src/about/leftnav_files
@@ -1,5 +1,4 @@
 index.md
-roadmap.md
 uses.md
 bib.md
 attribution.md
diff --git a/tensorflow/docs_src/about/uses.md b/tensorflow/docs_src/about/uses.md
index 8818177a288ef16ac1907a20ab563ee3d871f7fd..d646880bd350c42e463680a5c7eb0903f2c0a497 100644
--- a/tensorflow/docs_src/about/uses.md
+++ b/tensorflow/docs_src/about/uses.md
@@ -22,6 +22,14 @@ This section describes some of the current uses of the TensorFlow system.
 > TensorFlow, or even better, send us a pull request to add an entry to this
 > file.
 
+* **Deep Speech**
+<ul>
+   <li>**Organization**: Mozilla</li>
+   <li> **Domain**: Speech Recognition</li>
+   <li> **Description**: A TensorFlow implementation motivated by Baidu's Deep Speech architecture.</li>
+   <li> **More info**: [GitHub Repo](https://github.com/mozilla/deepspeech)</li>
+</ul>
+
 * **RankBrain**
 <ul>
    <li>**Organization**: Google</li>
diff --git a/tensorflow/docs_src/community/index.md b/tensorflow/docs_src/community/index.md
index 8e67022648d4c7161b02072446371e6d7e7168e2..b706d9b2047a4ff9707772edb30bfd036bbffc24 100644
--- a/tensorflow/docs_src/community/index.md
+++ b/tensorflow/docs_src/community/index.md
@@ -5,6 +5,7 @@ This section contains the following documents:
   * @{$welcome$Welcome to the TensorFlow Community}, which explains how
     you can get involved, where to report issues, and where to join
     like-minded TensorFlow enthusiasts online.
+  * @{$roadmap$Roadmap}, which summarizes upcoming additions to TensorFlow.
   * @{$documentation$Writing TensorFlow Documentation}, which explains
     TensorFlow's documentation conventions.  If you are modifying
     TensorFlow source code or documentation, please read this guide.
diff --git a/tensorflow/docs_src/community/leftnav_files b/tensorflow/docs_src/community/leftnav_files
index c1595d3c955bb87120fe6a6c9185c58e9db1097e..fab35024ad63e09adba1298eab52f7904eca1007 100644
--- a/tensorflow/docs_src/community/leftnav_files
+++ b/tensorflow/docs_src/community/leftnav_files
@@ -1,5 +1,6 @@
 index.md
 welcome.md
+roadmap.md
 documentation.md
 style_guide.md
 benchmarks.md
diff --git a/tensorflow/docs_src/about/roadmap.md b/tensorflow/docs_src/community/roadmap.md
similarity index 98%
rename from tensorflow/docs_src/about/roadmap.md
rename to tensorflow/docs_src/community/roadmap.md
index 1f934acab69276d4c32393bb73632d978e0d15c3..a3170a10f2d12ed272ee1d32da679f25916994c6 100644
--- a/tensorflow/docs_src/about/roadmap.md
+++ b/tensorflow/docs_src/community/roadmap.md
@@ -75,8 +75,7 @@ across image recognition, speech, object detection, and
 ### Community and Partner Engagement
 #### Special Interest Groups: 
 * Mobilizing the community to work together in focused domains
-* [tf-distribute](https://groups.google.com/a/tensorflow.org/forum/#!forum/tf-distribute)
-: build and packaging of TensorFlow
+* [tf-distribute](https://groups.google.com/a/tensorflow.org/forum/#!forum/tf-distribute): build and packaging of TensorFlow
 * More to be identified and launched
 
 #### Community:
diff --git a/tensorflow/docs_src/get_started/checkpoints.md b/tensorflow/docs_src/get_started/checkpoints.md
index dfa2110e691167f54e6ea8b7a832f0a88d0ec41a..4aa07c7f2a0b56aa6de6f42e30c364c348753a39 100644
--- a/tensorflow/docs_src/get_started/checkpoints.md
+++ b/tensorflow/docs_src/get_started/checkpoints.md
@@ -154,7 +154,7 @@ classifier = tf.estimator.DNNClassifier(
 
 The first time you call an Estimator's `train` method, TensorFlow saves a
 checkpoint to the `model_dir`. Each subsequent call to the Estimator's
-`train`, `eval`, or `predict` method causes the following:
+`train`, `evaluate`, or `predict` method causes the following:
 
 1.  The Estimator builds the model's
     [graph](https://developers.google.com/machine-learning/glossary/#graph)
@@ -222,7 +222,7 @@ does not match the shape stored in checkpoint: [20]
 
 To run experiments in which you train and compare slightly different
 versions of a model, save a copy of the code that created each
-`model-dir`, possibly by creating a separate git branch for each version.
+`model_dir`, possibly by creating a separate git branch for each version.
 This separation will keep your checkpoints recoverable.
 
 ## Summary
diff --git a/tensorflow/docs_src/get_started/custom_estimators.md b/tensorflow/docs_src/get_started/custom_estimators.md
index 42a246678a054d637fea5a82a03ecb84ff412bd9..ae89b639b422f4bd9e36302cbe78c445d497aa10 100644
--- a/tensorflow/docs_src/get_started/custom_estimators.md
+++ b/tensorflow/docs_src/get_started/custom_estimators.md
@@ -213,7 +213,7 @@ is connected to every node in the preceding layer.  Here's the relevant code:
 ```
 
 * The `units` parameter defines the number of output neurons in a given layer.
-* The `activation` parameter defines the [activation function](https://developers.google.com/machine-learning/glossary/#a) —
+* The `activation` parameter defines the [activation function](https://developers.google.com/machine-learning/glossary/#activation_function) —
   [Relu](https://developers.google.com/machine-learning/glossary/#ReLU) in this
   case.
 
diff --git a/tensorflow/docs_src/get_started/datasets_quickstart.md b/tensorflow/docs_src/get_started/datasets_quickstart.md
index bc69773d2138f5bf280015b61f1f243fd874bdac..c972e5e555eea1fab5a67fdecf13264897785519 100644
--- a/tensorflow/docs_src/get_started/datasets_quickstart.md
+++ b/tensorflow/docs_src/get_started/datasets_quickstart.md
@@ -265,9 +265,6 @@ ds = tf.data.TextLineDataset(train_path).skip(1)
 
 ### Build a csv line parser
 
-Ultimately we will need to parse each of the lines in the dataset, to
-produce the necessary `(features, label)` pairs.
-
 We will start by building a function to parse a single line.
 
 The following `iris_data.parse_line` function accomplishes this task using the
diff --git a/tensorflow/docs_src/get_started/feature_columns.md b/tensorflow/docs_src/get_started/feature_columns.md
index ad3e1fe3e3a4e3f5278e76bcaa0fc8eee2faf374..d8e4bec86357aabd2065be50d1197122c407c9d7 100644
--- a/tensorflow/docs_src/get_started/feature_columns.md
+++ b/tensorflow/docs_src/get_started/feature_columns.md
@@ -146,10 +146,10 @@ single input number into a four-element vector. Therefore, the model now can
 learn _four individual weights_ rather than just one; four weights creates a
 richer model than one weight. More importantly, bucketizing enables the model
 to clearly distinguish between different year categories since only one of the
-elements is set (1) and the other three elements are cleared (0). When we just
-use a single number (a year) as input, the model can only learn a linear
-relationship. So, bucketing provides the model with additional flexibility that
-the model can use to learn.
+elements is set (1) and the other three elements are cleared (0). For example,
+when we just use a single number (a year) as input, a linear model can only
+learn a linear relationship. So, bucketing provides the model with additional
+flexibility that the model can use to learn.
 
 The following code demonstrates how to create a bucketized feature:
 
@@ -242,7 +242,7 @@ on an explicit vocabulary list. For example:
 # the elements in the vocabulary list.
 vocabulary_feature_column =
     tf.feature_column.categorical_column_with_vocabulary_list(
-        key="a feature returned by input_fn()",
+        key=feature_name_from_input_fn,
         vocabulary_list=["kitchenware", "electronics", "sports"])
 ```
 
@@ -259,7 +259,7 @@ you place the vocabulary words in a separate file. For example:
 # the elements in the vocabulary file
 vocabulary_feature_column =
     tf.feature_column.categorical_column_with_vocabulary_file(
-        key="a feature returned by input_fn()",
+        key=feature_name_from_input_fn,
         vocabulary_file="product_class.txt",
         vocabulary_size=3)
 ```
diff --git a/tensorflow/docs_src/get_started/get_started_for_beginners.md b/tensorflow/docs_src/get_started/get_started_for_beginners.md
index 367c187e35ac5182b89e9a11cf8aec05e5250d57..b88483be699630d2275850cbc7c461eeb90f5943 100644
--- a/tensorflow/docs_src/get_started/get_started_for_beginners.md
+++ b/tensorflow/docs_src/get_started/get_started_for_beginners.md
@@ -91,11 +91,10 @@ a number.  Here's the representation scheme:
 
 A **model** is the relationship between features
 and the label.  For the Iris problem, the model defines the relationship
-between the sepal and petal measurements and the Iris species.
-Some simple models can be described with a few lines of algebra;
-more complex machine learning models
-contain such a large number of interlacing mathematical functions and
-parameters that they become hard to summarize mathematically.
+between the sepal and petal measurements and the predicted Iris species. Some
+simple models can be described with a few lines of algebra, but complex machine
+learning models have a large number of parameters that are difficult to
+summarize.
 
 Could you determine the relationship between the four features and the
 Iris species *without* using machine learning?  That is, could you use
diff --git a/tensorflow/docs_src/install/index.md b/tensorflow/docs_src/install/index.md
index 3c8488643f071c147dfbc4e0b4b4760b0a817718..4f85383925bbb8a03372b020e448a0e604f3b999 100644
--- a/tensorflow/docs_src/install/index.md
+++ b/tensorflow/docs_src/install/index.md
@@ -3,7 +3,7 @@
 We've built and tested TensorFlow on the following 64-bit laptop/desktop
 operating systems:
 
-  * MacOS X 10.11 (El Capitan) or later.
+  * macOS 10.12.6 (Sierra) or later.
   * Ubuntu 16.04 or later
   * Windows 7 or later.
 
diff --git a/tensorflow/docs_src/install/install_c.md b/tensorflow/docs_src/install/install_c.md
index f3620cf687359ebc4abfc3365beb3da694ec7baf..818798555aec3a52bd5feb0c0e67d878a6dc41e4 100644
--- a/tensorflow/docs_src/install/install_c.md
+++ b/tensorflow/docs_src/install/install_c.md
@@ -15,7 +15,7 @@ instructions might also work on other variants, we have only tested
 following requirements:
 
   * Linux, 64-bit, x86
-  * macOS X, Version 10.11 (El Capitan) or higher
+  * macOS, Version 10.12.6 (Sierra) or higher
 
 
 ## Installation
diff --git a/tensorflow/docs_src/install/install_go.md b/tensorflow/docs_src/install/install_go.md
index 4bf4bacaecb88c9335cbe5ccbd7e6557cd21aca6..4c6dfa8dafe2042ea7b80498ca35a359f84ce854 100644
--- a/tensorflow/docs_src/install/install_go.md
+++ b/tensorflow/docs_src/install/install_go.md
@@ -17,7 +17,7 @@ instructions might also work on other variants, we have only tested
 following requirements:
 
   * Linux, 64-bit, x86
-  * macOS X, 10.11 (El Capitan) or higher
+  * macOS, 10.12.6 (Sierra) or higher
 
 
 ## Installation
diff --git a/tensorflow/docs_src/install/install_java.md b/tensorflow/docs_src/install/install_java.md
index 1905f9729e8b9d23c50bddc3de8e174c8f3f6e2b..527884863ea5104e60569008ea067b407e74d29b 100644
--- a/tensorflow/docs_src/install/install_java.md
+++ b/tensorflow/docs_src/install/install_java.md
@@ -18,7 +18,7 @@ instructions might also work on other variants, we have only tested
 following requirements:
 
   * Ubuntu 16.04 or higher; 64-bit, x86
-  * macOS X 10.11 (El Capitan) or higher
+  * macOS 10.12.6 (Sierra) or higher
   * Windows 7 or higher; 64-bit, x86
 
 The installation instructions for Android are in a separate
@@ -243,7 +243,7 @@ and macOS X:
 
 And the following command line executes the `HelloTF` program on Windows:
 
-<pre><b>java -cp libtensorflow-1.6.0-rc1.jar;. -Djava.library.path=jni HelloTF</b></pre>
+<pre><b>java -cp libtensorflow-1.6.0-rc1.jar;. -Djava.library.path=jni HelloTF</b></pre>
 
 If the program prints <tt>Hello from <i>version</i></tt>, you've successfully
 installed TensorFlow for Java and are ready to use the API.  If the program
diff --git a/tensorflow/docs_src/install/install_linux.md b/tensorflow/docs_src/install/install_linux.md
index 62bd45650ad33355c27f053bdc60a4471ba78dfe..e3e115d9f618265864363810acf96033882ad89d 100644
--- a/tensorflow/docs_src/install/install_linux.md
+++ b/tensorflow/docs_src/install/install_linux.md
@@ -482,7 +482,6 @@ Take the following steps to install TensorFlow in an Anaconda environment:
      (tensorflow)$ <b>pip install --ignore-installed --upgrade \
      https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0rc1-cp34-cp34m-linux_x86_64.whl</b></pre>
 
-
 <a name="ValidateYourInstallation"></a>
 ## Validate your installation
 
diff --git a/tensorflow/docs_src/install/install_mac.md b/tensorflow/docs_src/install/install_mac.md
index e3832a7a2a857aee5819b02f733793f33ea1fb52..623ca6bb7919bf74fa9bcaad3184cdf0bcd9ccff 100644
--- a/tensorflow/docs_src/install/install_mac.md
+++ b/tensorflow/docs_src/install/install_mac.md
@@ -5,7 +5,11 @@ instructions might also work on other macOS variants, we have only
 tested (and we only support) these instructions on machines meeting the
 following requirements:
 
-  * macOS X 10.11 (El Capitan) or higher
+  * macOS 10.12.6 (Sierra) or higher
+
+Note: There are known, accuracy-affecting numerical issues before macOS 10.12.6
+(Sierra) that are described in
+[GitHub#15933](https://github.com/tensorflow/tensorflow/issues/15933#issuecomment-366331383).
 
 Note: As of version 1.2, TensorFlow no longer provides GPU support on macOS.
 
@@ -114,8 +118,8 @@ Take the following steps to install TensorFlow with Virtualenv:
      Python 2.7, the command to install
      TensorFlow in the active Virtualenv is as follows:
 
-     <pre> $ <b>pip3 install --upgrade \
-     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py3-none-any.whl</b></pre>
+     <pre> $ <b>pip install --upgrade \
+     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-any.whl</b></pre>
 
 If you encounter installation problems, see
 [Common Installation Problems](#common-installation-problems).
@@ -237,8 +241,8 @@ take the following steps:
      you are installing TensorFlow for Mac OS and Python 2.7
      issue the following command:
 
-     <pre> $ <b>sudo pip3 install --upgrade \
-     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py3-none-any.whl</b> </pre>
+     <pre> $ <b>sudo pip install --upgrade \
+     https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.6.0rc1-py2-none-any.whl</b> </pre>
 
      If the preceding command fails, see
      [installation problems](#common-installation-problems).
diff --git a/tensorflow/docs_src/install/install_sources.md b/tensorflow/docs_src/install/install_sources.md
index 8d83e9f1190ed307ca99d81168df7dfab51e4507..acf0af0d9d558d58e625fdd315db859a5bd08121 100644
--- a/tensorflow/docs_src/install/install_sources.md
+++ b/tensorflow/docs_src/install/install_sources.md
@@ -393,8 +393,7 @@ TensorFlow programs:
 
 <pre>Hello, TensorFlow!</pre>
 
-If you are new to TensorFlow, see @{$get_started/premade_estimators$Getting Started with
-TensorFlow}.
+If you are new to TensorFlow, see @{$get_started/premade_estimators$Getting Started with TensorFlow}.
 
 If the system outputs an error message instead of a greeting, see [Common
 installation problems](#common_installation_problems).
diff --git a/tensorflow/docs_src/install/install_windows.md b/tensorflow/docs_src/install/install_windows.md
index dedf485f93d6fd6a8ce7b4465548cc998d307daa..f0a30ee39448c09d0125f17cc2eaaaee9ab6c1bb 100644
--- a/tensorflow/docs_src/install/install_windows.md
+++ b/tensorflow/docs_src/install/install_windows.md
@@ -153,8 +153,7 @@ TensorFlow programs:
 
 <pre>Hello, TensorFlow!</pre>
 
-If you are new to TensorFlow, see @{$get_started/premade_estimators$Getting Started with
-TensorFlow}.
+If you are new to TensorFlow, see @{$get_started/premade_estimators$Getting Started with TensorFlow}.
 
 If the system outputs an error message instead of a greeting, see [Common
 installation problems](#common_installation_problems).
diff --git a/tensorflow/docs_src/mobile/leftnav_files b/tensorflow/docs_src/mobile/leftnav_files
index ac50f528ba468d8a830c059539d3399f413f39c8..4cf134cc3c2c323405d769a5ced5d5a68f188203 100644
--- a/tensorflow/docs_src/mobile/leftnav_files
+++ b/tensorflow/docs_src/mobile/leftnav_files
@@ -2,6 +2,7 @@ index.md
 ### TensorFlow Lite
 tflite/index.md
 tflite/demo_android.md
+tflite/demo_ios.md
 >>>
 ### TensorFlow Mobile
 mobile_intro.md
diff --git a/tensorflow/docs_src/mobile/mobile_intro.md b/tensorflow/docs_src/mobile/mobile_intro.md
index 17dbf1c3e6ad89768529864ba884274a51b3dfb2..69b63ae7d22ced9fd0299f17d1ae2d614c9a6be7 100644
--- a/tensorflow/docs_src/mobile/mobile_intro.md
+++ b/tensorflow/docs_src/mobile/mobile_intro.md
@@ -235,7 +235,7 @@ TensorFlow [on Github](https://github.com/tensorflow/models) that you can look
 through. Lean towards the simplest model you can find, and try to get started as
 soon as you have even a small amount of labelled data, since you’ll get the best
 results when you’re able to iterate quickly. The shorter the time it takes to
-try training a model and running it in s real application, the better overall
+try training a model and running it in a real application, the better overall
 results you’ll see. It’s common for an algorithm to get great training accuracy
 numbers but then fail to be useful within a real application because there’s a
 mismatch between the dataset and real usage. Prototype end-to-end usage as soon
diff --git a/tensorflow/docs_src/mobile/tflite/demo_ios.md b/tensorflow/docs_src/mobile/tflite/demo_ios.md
new file mode 100644
index 0000000000000000000000000000000000000000..3ee9b1cbca6cfef98616bd33bbf91b756b4efa15
--- /dev/null
+++ b/tensorflow/docs_src/mobile/tflite/demo_ios.md
@@ -0,0 +1,68 @@
+# TensorFlow Lite Demo for iOS
+
+The TensorFlow Lite demo is a camera app that continuously classifies whatever
+it sees from your device's back camera, using a quantized MobileNet model. These
+instructions walk you through building and running the demo on an iOS device.
+
+## Prerequisites
+
+* You must have [Xcode](https://developer.apple.com/xcode/) installed, a valid
+  Apple Developer ID, and an iOS device set up and linked to your developer
+  account with all of the appropriate certificates. For these instructions, we
+  assume that you have already been able to build and deploy an app to an iOS
+  device with your current developer environment.
+
+* The demo app requires a camera and must be executed on a real iOS device. You
+  can build and run it with the iPhone Simulator, but it won't have any camera
+  input to classify.
+
+* You don't need to build the entire TensorFlow library to run the demo, but you
+  will need to clone the TensorFlow repository if you haven't already:
+
+        git clone https://github.com/tensorflow/tensorflow
+
+* You'll also need the Xcode command-line tools:
+
+        xcode-select --install
+
+    If this is a new install, you will need to run the Xcode application once to
+    agree to the license before continuing.
+
+## Building the iOS Demo App
+
+1. Install CocoaPods if you don't have it:
+
+        sudo gem install cocoapods
+
+2. Download the model files used by the demo app (run this from the root of
+   the cloned repository):
+
+        sh tensorflow/contrib/lite/examples/ios/download_models.sh
+
+3. Install the pod to generate the workspace file:
+
+        cd tensorflow/contrib/lite/examples/ios/camera
+        pod install
+
+    If you have installed this pod before and that command doesn't work, try
+
+        pod update
+
+    At the end of this step you should have a file called
+    `tflite_camera_example.xcworkspace`.
+
+4. Open the project in Xcode by typing this on the command line:
+
+        open tflite_camera_example.xcworkspace
+
+    This launches Xcode if it isn't open already and opens the
+    `tflite_camera_example` project.
+
+5. Build and run the app in Xcode.
+
+    Note that as mentioned earlier, you must already have a device set up and
+    linked to your Apple Developer account in order to deploy the app on a
+    device.
+
+You'll have to grant permissions for the app to use the device's camera. Point
+the camera at various objects and enjoy seeing how the model classifies things!
diff --git a/tensorflow/docs_src/performance/datasets_performance.md b/tensorflow/docs_src/performance/datasets_performance.md
index 4f95e17c3598c23645fad07441c267266e5ef34e..46b43b7673c561679e89fff0ae738b0e751fcff5 100644
--- a/tensorflow/docs_src/performance/datasets_performance.md
+++ b/tensorflow/docs_src/performance/datasets_performance.md
@@ -92,11 +92,11 @@ transform the data.
 
 Without pipelining, the CPU and the GPU/TPU sit idle much of the time:
 
-![without pipelining](https://www.tensorflow.org/images/datasets_without_pipelining.png)
+![without pipelining](/images/datasets_without_pipelining.png)
 
 With pipelining, idle time diminishes significantly:
 
-![with pipelining](https://www.tensorflow.org/images/datasets_with_pipelining.png)
+![with pipelining](/images/datasets_with_pipelining.png)
 
 The `tf.data` API provides a software pipelining mechanism through the
 @{tf.data.Dataset.prefetch} transformation, which can be used to decouple the
@@ -139,7 +139,7 @@ multiple CPU cores. To make this possible, the `map` transformation provides the
 the following diagram illustrates the effect of setting `num_parallel_calls=2`
 to the `map` transformation:
 
-![parallel map](https://www.tensorflow.org/images/datasets_parallel_map.png)
+![parallel map](/images/datasets_parallel_map.png)
 
 Choosing the best value for the `num_parallel_calls` argument depends on your
 hardware, characteristics of your training data (such as its size and shape),
@@ -213,7 +213,7 @@ number of datasets to overlap can be specified by the `cycle_length` argument.
 The following diagram illustrates the effect of supplying `cycle_length=2` to
 the `parallel_interleave` transformation:
 
-![parallel io](https://www.tensorflow.org/images/datasets_parallel_io.png)
+![parallel io](/images/datasets_parallel_io.png)
 
 To apply this change to our running example, change:
 
diff --git a/tensorflow/docs_src/performance/leftnav_files b/tensorflow/docs_src/performance/leftnav_files
index 316f023f43dcfe781c7819d1681335267ddd5f76..d11a7e5d07c3e6cfa092e7ac11189ce6c272c1ad 100644
--- a/tensorflow/docs_src/performance/leftnav_files
+++ b/tensorflow/docs_src/performance/leftnav_files
@@ -2,6 +2,7 @@ performance_guide.md
 datasets_performance.md
 performance_models.md
 benchmarks.md
+quantization.md
 
 ### XLA
 xla/index.md
@@ -11,6 +12,3 @@ xla/jit.md
 xla/operation_semantics.md
 xla/shapes.md
 xla/tfcompile.md
-
-### Quantization
-quantization.md
diff --git a/tensorflow/docs_src/performance/quantization.md b/tensorflow/docs_src/performance/quantization.md
index 544274cab68934419e8601a4d9714d80335fca28..411889cb1c616130f809e6228cc692ba3f951d48 100644
--- a/tensorflow/docs_src/performance/quantization.md
+++ b/tensorflow/docs_src/performance/quantization.md
@@ -1,226 +1,253 @@
-# How to Quantize Neural Networks with TensorFlow
-
-When modern neural networks were being developed, the biggest challenge was
-getting them to work at all! That meant that accuracy and speed during training
-were the top priorities. Using floating point arithmetic was the easiest way to
-preserve accuracy, and GPUs were well-equipped to accelerate those calculations,
-so it's natural that not much attention was paid to other numerical formats.
-
-These days, we actually have a lot of models being deployed in commercial
-applications. The computation demands of training grow with the number of
-researchers, but the cycles needed for inference expand in proportion to users.
-That means pure inference efficiency has become a burning issue for a lot of
-teams.
-
-That is where quantization comes in. It's an umbrella term that covers a lot of
-different techniques to store numbers and perform calculations on them in more
-compact formats than 32-bit floating point. I am going to focus on eight-bit
-fixed point, for reasons I'll go into more detail on later.
-
-[TOC]
-
-## Why does Quantization Work?
-
-Training neural networks is done by applying many tiny nudges to the weights,
-and these small increments typically need floating point precision to work
-(though there are research efforts to use quantized representations here too).
-
-Taking a pre-trained model and running inference is very different. One of the
-magical qualities of deep networks is that they tend to cope very well with high
-levels of noise in their inputs. If you think about recognizing an object in a
-photo you've just taken, the network has to ignore all the CCD noise, lighting
-changes, and other non-essential differences between it and the training
-examples it's seen before, and focus on the important similarities instead. This
-ability means that they seem to treat low-precision calculations as just another
-source of noise, and still produce accurate results even with numerical formats
-that hold less information.
-
-## Why Quantize?
-
-Neural network models can take up a lot of space on disk, with the original
-AlexNet being over 200 MB in float format for example. Almost all of that size
-is taken up with the weights for the neural connections, since there are often
-many millions of these in a single model. Because they're all slightly different
-floating point numbers, simple compression formats like zip don't compress them
-well. They are arranged in large layers though, and within each layer the
-weights tend to be normally distributed within a certain range, for example -3.0
-to 6.0.
-
-The simplest motivation for quantization is to shrink file sizes by storing the
-min and max for each layer, and then compressing each float value to an
-eight-bit integer representing the closest real number in a linear set of 256
-within the range. For example with the -3.0 to 6.0 range, a 0 byte would
-represent -3.0, a 255 would stand for 6.0, and 128 would represent about 1.5.
-I'll go into the exact calculations later, since there's some subtleties, but
-this means you can get the benefit of a file on disk that's shrunk by 75%, and
-then convert back to float after loading so that your existing floating-point
-code can work without any changes.
-
-Another reason to quantize is to reduce the computational resources you need to
-do the inference calculations, by running them entirely with eight-bit inputs
-and outputs. This is a lot more difficult since it requires changes everywhere
-you do calculations, but offers a lot of potential rewards. Fetching eight-bit
-values only requires 25% of the memory bandwidth of floats, so you'll make much
-better use of caches and avoid bottlenecking on RAM access. You can also
-typically use SIMD operations that do many more operations per clock cycle. In
-some case you'll have a DSP chip available that can accelerate eight-bit
-calculations too, which can offer a lot of advantages.
-
-Moving calculations over to eight bit will help you run your models faster, and
-use less power (which is especially important on mobile devices). It also opens
-the door to a lot of embedded systems that can't run floating point code
-efficiently, so it can enable a lot of applications in the IoT world.
-
-## Why Not Train in Lower Precision Directly?
-
-There have been some experiments training at lower bit depths, but the results
-seem to indicate that you need higher than eight bit to handle the back
-propagation and gradients. That makes implementing the training more
-complicated, and so starting with inference made sense. We also already have a
-lot of float models already that we use and know well, so being able to convert
-them directly is very convenient.
-
-## How Can You Quantize Your Models?
-
-TensorFlow has production-grade support for eight-bit calculations built in. It
-also has a process for converting many models trained in floating-point over to
-equivalent graphs using quantized calculations for inference. For example,
-here's how you can translate the latest GoogLeNet model into a version that uses
-eight-bit computations:
-
-```sh
-curl -L "https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz" |
-  tar -C tensorflow/examples/label_image/data -xz
-bazel build tensorflow/tools/graph_transforms:transform_graph
-bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
-  --in_graph=tensorflow/examples/label_image/data/inception_v3_2016_08_28_frozen.pb \
-  --out_graph=/tmp/quantized_graph.pb \
-  --inputs=input \
-  --outputs=InceptionV3/Predictions/Reshape_1 \
-  --transforms='add_default_attributes strip_unused_nodes(type=float, shape="1,299,299,3")
-    remove_nodes(op=Identity, op=CheckNumerics) fold_constants(ignore_errors=true)
-    fold_batch_norms fold_old_batch_norms quantize_weights quantize_nodes
-    strip_unused_nodes sort_by_execution_order'
+# Fixed Point Quantization
+
+Quantization techniques store and calculate numbers in more compact formats.
+[TensorFlow Lite](/mobile/tflite/) adds quantization that uses an 8-bit fixed
+point representation.
+
+Since a challenge for modern neural networks is optimizing for high accuracy, the
+priority has been improving accuracy and speed during training. Using floating
+point arithmetic is an easy way to preserve accuracy, and GPUs are designed to
+accelerate these calculations.
+
+However, as more machine learning models are deployed to mobile devices,
+inference efficiency has become a critical issue. Where the computational demand
+for *training* grows with the number of models trained on different
+architectures, the computational demand for *inference* grows in proportion to
+the number of users.
+
+## Quantization benefits
+
+
+Using 8-bit calculations helps your models run faster and use less power. This is
+especially important for mobile devices and embedded applications that can't run
+floating point code efficiently, for example, Internet of Things (IoT) and
+robotics devices. There are additional opportunities to extend this support to
+more backends and research lower precision networks.
+
+### Smaller file sizes {: .hide-from-toc}
+
+Neural network models require a lot of space on disk. For example, the original
+AlexNet requires over 200 MB in float format, almost all of it for the
+model's millions of weights. Because the weights are all slightly different
+floating point numbers, simple compression formats like zip perform poorly.
+
+Weights are arranged in large layers of numerical values. Within each layer,
+weights tend to be normally distributed within a range. Quantization can shrink
+file sizes by storing the minimum and maximum weight for each layer, then
+compressing each weight's float value to an 8-bit integer representing the
+closest real number in a linear set of 256 values within the range.
+
+### Faster inference {: .hide-from-toc}
+
+Since calculations are run entirely on 8-bit inputs and outputs, quantization
+reduces the computational resources needed for inference calculations. This is
+more involved than shrinking file sizes, since it requires changes to all
+floating point calculations, but it results in a large inference speedup.
+
+### Memory efficiency {: .hide-from-toc}
+
+Since fetching 8-bit values only requires 25% of the memory bandwidth of floats,
+caches are used more efficiently and RAM access is less of a bottleneck. In many
+cases, memory access dominates the power consumption of running a neural
+network, so the savings from 8-bit weights and activations are significant.
+
+Typically, SIMD operations are available that run more operations per clock
+cycle. In some cases, a DSP chip is available that accelerates 8-bit
+calculations, resulting in a massive speedup.
+
+## Fixed point quantization techniques
+
+The goal is to use the same precision for weights and activations during both
+training and inference. But an important difference is that training consists of
+a forward pass and a backward pass, while inference only uses a forward pass.
+When we train the model with quantization in the loop, we ensure that the forward
+pass matches precision for both training and inference.
+
+To minimize the loss in accuracy for fully fixed point models (weights and
+activations), train the model with quantization in the loop. This simulates
+quantization in the forward pass of a model so weights tend towards values that
+perform better during quantized inference. The backward pass uses quantized
+weights and activations and models quantization as a straight-through estimator
+(see [Bengio et al., 2013](https://arxiv.org/abs/1308.3432)).
+
+Additionally, the minimum and maximum values for activations are determined
+during training. This allows a model trained with quantization in the loop to be
+converted to a fixed point inference model with little effort, eliminating the
+need for a separate calibration step.
+
+## Quantization training with TensorFlow
+
+TensorFlow can train models with quantization in the loop. Because training
+requires small gradient adjustments, floating point values are still used. To
+keep models as floating point while adding the quantization error in the training
+loop, @{$array_ops#Fake_quantization$fake quantization} nodes simulate the
+effect of quantization in the forward and backward passes.
+
+Since it's difficult to add these fake quantization operations to all the
+required locations in the model, there's a function available that rewrites the
+training graph. To create a fake quantized training graph:
+
+```
+# Build forward pass of model.
+loss = tf.losses.get_total_loss()
+
+# Call the training rewrite which rewrites the graph in-place with
+# FakeQuantization nodes and folds batchnorm for training. It is
+# often necessary to fine-tune a floating point model for quantization
+# with this training tool. When training from scratch, quant_delay
+# can be used to activate quantization only after the float graph has
+# had time to converge, effectively fine-tuning the model.
+tf.contrib.quantize.create_training_graph(quant_delay=2000000)
+
+# Call backward pass optimizer as usual.
+optimizer = tf.train.GradientDescentOptimizer(learning_rate)
+optimizer.minimize(loss)
 ```
 
-This will produce a new model that runs the same operations as the original, but
-with eight bit calculations internally, and all weights quantized as well. If
-you look at the file size, you'll see it's about a quarter of the original (23MB
-versus 91MB). You can still run this model using exactly the same inputs and
-outputs though, and you should get equivalent results. Here's an example:
+The rewritten *eval graph* is non-trivially different from the *training graph*
+since the quantization ops affect the batch normalization step. Because of this,
+we've added a separate rewrite for the *eval graph*:
 
-```sh
-bazel build tensorflow/examples/label_image:label_image
-bazel-bin/tensorflow/examples/label_image/label_image \
---graph=/tmp/quantized_graph.pb \
+```
+# Build eval model
+logits = tf.nn.softmax_cross_entropy_with_logits(...)
+
+# Call the eval rewrite which rewrites the graph in-place with
+# FakeQuantization nodes and folds batchnorm for eval.
+tf.contrib.quantize.create_eval_graph()
+
+# Save the checkpoint and eval graph proto to disk for freezing
+# and providing to TFLite.
+with open(eval_graph_file, 'w') as f:
+  f.write(str(tf.get_default_graph().as_graph_def()))
+saver = tf.train.Saver()
+saver.save(sess, checkpoint_name)
+```
+
+Methods to rewrite the training and eval graphs are an active area of research
+and experimentation. Although rewrites and quantized training might not work or
+improve performance for all models, we are working to generalize these
+techniques.
+
+## Generating fully quantized models
+
+The previously demonstrated after-rewrite eval graph only *simulates*
+quantization. To generate real fixed point computations from a trained
+quantization model, convert it to a fixed point kernel. TensorFlow Lite supports
+this conversion from the graph resulting from `create_eval_graph`.
+
+First, create a frozen graph that will be the input for the TensorFlow Lite
+toolchain:
+
+```
+bazel build tensorflow/python/tools:freeze_graph && \
+  bazel-bin/tensorflow/python/tools/freeze_graph \
+  --input_graph=eval_graph_def.pb \
+  --input_checkpoint=checkpoint \
+  --output_graph=frozen_eval_graph.pb --output_node_names=outputs
 ```
 
-You'll see that this runs the newly-quantized graph, and outputs a very similar
-answer to the original.
-
-You can run the same process on your own models saved out as GraphDefs, with the
-input and output names adapted to those your network requires. I recommend that
-you run them through the freeze_graph script first, to convert checkpoints into
-constants stored in the file.
-
-## How Does the Quantization Process Work?
-
-We've implemented quantization by writing equivalent eight-bit versions of
-operations that are commonly used during inference. These include convolution,
-matrix multiplication, activation functions, pooling operations and
-concatenation. The conversion script first replaces all the individual ops it
-knows about with quantized equivalents. These are small sub-graphs that have
-conversion functions before and after to move the data between float and
-eight-bit. Below is an example of what they look like. First here's the original
-Relu operation, with float inputs and outputs:
-
-![Relu Diagram](https://www.tensorflow.org/images/quantization0.png)
-
-Then, this is the equivalent converted subgraph, still with float inputs and
-outputs, but with internal conversions so the calculations are done in eight
-bit.
-
-![Converted Diagram](https://www.tensorflow.org/images/quantization1.png)
-
-The min and max operations actually look at the values in the input float
-tensor, and then feeds them into the Dequantize operation that converts the
-tensor into eight-bits. There are more details on how the quantized representation
-works later on.
-
-Once the individual operations have been converted, the next stage is to remove
-unnecessary conversions to and from float. If there are consecutive sequences of
-operations that all have float equivalents, then there will be a lot of adjacent
-Dequantize/Quantize ops. This stage spots that pattern, recognizes that they
-cancel each other out, and removes them, like this:
-
-![Stripping Diagram](https://www.tensorflow.org/images/quantization2.png)
-
-Applied on a large scale to models where all of the operations have quantized
-equivalents, this gives a graph where all of the tensor calculations are done in
-eight bit, without having to convert to float.
-
-## What Representation is Used for Quantized Tensors?
-
-We approach converting floating-point arrays of numbers into eight-bit
-representations as a compression problem. We know that the weights and
-activation tensors in trained neural network models tend to have values that are
-distributed across comparatively small ranges (for example you might have -15 to
-+15 for weights, -500 to 1000 for activations on an image model, though the
-exact numbers will vary). We also know from experiment that neural nets tend to
-be very robust in the face of noise, and so the noise-like error produced by
-quantizing down to a small set of values will not hurt the precision of the
-overall results very much. We also want to pick a representation that's easy to
-perform calculations on, especially the large matrix multiplications that form
-the bulk of the work that's needed to run a model.
-
-These led us to pick a representation that has two floats to store the overall
-minimum and maximum values that are represented by the lowest and highest
-quantized value. Each entry in the quantized array represents a float value in
-that range, distributed linearly between the minimum and maximum. For example,
-if we have minimum = -10.0, and maximum = 30.0f, and an eight-bit array, here's
-what the quantized values represent:
+Provide this to the TensorFlow Lite Optimizing Converter (TOCO) to get a fully
+quantized TensorFlow Lite model:
 
 ```
-Quantized | Float
---------- | -----
-0         | -10.0
-255       | 30.0
-128       | 10.0
+bazel build tensorflow/contrib/lite/toco:toco && \
+  bazel-bin/tensorflow/contrib/lite/toco/toco \
+  --input_file=frozen_eval_graph.pb \
+  --output_file=tflite_model.tflite \
+  --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE \
+  --inference_type=QUANTIZED_UINT8 \
+  --input_shape="1,224,224,3" \
+  --input_array=input \
+  --output_array=outputs \
+  --std_value=127.5 --mean_value=127.5
 ```
 
-The advantages of this format are that it can represent arbitrary magnitudes of
-ranges, they don't have to be symmetrical, it can represent signed and unsigned
-values, and the linear spread makes doing multiplications straightforward. There
-are alternatives like [Song Han's code books](http://arxiv.org/pdf/1510.00149.pdf)
-that can use lower bit depths by non-linearly distributing the float values
-across the representation, but these tend to be more expensive to calculate on.
-
-The advantage of having a strong and clear definition of the quantized format is
-that it's always possible to convert back and forth from float for operations
-that aren't quantization-ready, or to inspect the tensors for debugging
-purposes. One implementation detail in TensorFlow that we're hoping to improve
-in the future is that the minimum and maximum float values need to be passed as
-separate tensors to the one holding the quantized values, so graphs can get a
-bit dense!
-
-The nice thing about the minimum and maximum ranges is that they can often be
-pre-calculated. Weight parameters are constants known at load time, so their
-ranges can also be stored as constants. We often know the ranges for inputs (for
-examples images are usually RGB values in the range 0.0 to 255.0), and many
-activation functions have known ranges too. This can avoid having to analyze the
-outputs of an operation to determine the range, which we need to do for math ops
-like convolution or matrix multiplication which produce 32-bit accumulated
-results from 8-bit inputs.
-
-## What's Next?
-
-We've found that we can get extremely good performance on mobile and embedded
-devices by using eight-bit arithmetic rather than floating-point. You can see
-the framework we use to optimize matrix multiplications at
-[gemmlowp](https://github.com/google/gemmlowp). We still need to apply all the
-lessons we've learned to the TensorFlow ops to get maximum performance on
-mobile, but we're actively working on that. Right now, this quantized
-implementation is a reasonably fast and accurate reference implementation that
-we're hoping will enable wider support for our eight-bit models on a wider
-variety of devices. We also hope that this demonstration will encourage the
-community to explore what's possible with low-precision neural networks.
+See the documentation for @{tf.contrib.quantize} and
+[TensorFlow Lite](/mobile/tflite/).
+
+## Quantized accuracy
+
+Fixed point [MobileNet](https://arxiv.org/abs/1704.04861) models are released with
+8-bit weights and activations. Using the rewriters, these models achieve the
+Top-1 accuracies listed in Table 1. For comparison, the floating point accuracies
+are listed for the same models. The code used to generate these models
+[is available](https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md)
+along with links to all of the pretrained mobilenet_v1 models.
+
+<figure>
+  <table>
+    <tr>
+      <th>Image Size</th>
+      <th>Depth</th>
+      <th>Top-1 Accuracy:<br>Floating point</th>
+      <th>Top-1 Accuracy:<br>Fixed point: 8 bit weights and activations</th>
+    </tr>
+    <tr><td>128</td><td>0.25</td><td>0.415</td><td>0.399</td></tr>
+    <tr><td>128</td><td>0.5</td><td>0.563</td><td>0.549</td></tr>
+    <tr><td>128</td><td>0.75</td><td>0.621</td><td>0.598</td></tr>
+    <tr><td>128</td><td>1</td><td>0.652</td><td>0.64</td></tr>
+    <tr><td>160</td><td>0.25</td><td>0.455</td><td>0.435</td></tr>
+    <tr><td>160</td><td>0.5</td><td>0.591</td><td>0.577</td></tr>
+    <tr><td>160</td><td>0.75</td><td>0.653</td><td>0.639</td></tr>
+    <tr><td>160</td><td>1</td><td>0.68</td><td>0.673</td></tr>
+    <tr><td>192</td><td>0.25</td><td>0.477</td><td>0.458</td></tr>
+    <tr><td>192</td><td>0.5</td><td>0.617</td><td>0.604</td></tr>
+    <tr><td>192</td><td>0.75</td><td>0.672</td><td>0.662</td></tr>
+    <tr><td>192</td><td>1</td><td>0.7</td><td>0.69</td></tr>
+    <tr><td>224</td><td>0.25</td><td>0.498</td><td>0.482</td></tr>
+    <tr><td>224</td><td>0.5</td><td>0.633</td><td>0.622</td></tr>
+    <tr><td>224</td><td>0.75</td><td>0.684</td><td>0.679</td></tr>
+    <tr><td>224</td><td>1</td><td>0.709</td><td>0.697</td></tr>
+  </table>
+  <figcaption>
+    <b>Table 1</b>: MobileNet Top-1 accuracy on Imagenet Validation dataset.
+  </figcaption>
+</figure>
+
+## Representation for quantized tensors
+
+TensorFlow approaches the conversion of floating-point arrays of numbers into
+8-bit representations as a compression problem. The weights and activation
+tensors in trained neural network models tend to have values that are distributed
+across comparatively small ranges (for example, -15 to +15 for weights or -500 to
+1000 for image model activations). Because neural nets also tend to be robust to
+noise, the error introduced by quantizing to a small set of values keeps the
+precision of the overall results within an acceptable threshold. The chosen
+representation must also support fast calculations, especially the large matrix
+multiplications that make up the bulk of the computation when running a model.
+
+The representation uses two floats that store the overall minimum and maximum
+values corresponding to the lowest and highest quantized value. Each entry in the
+quantized array represents a float value in that range, distributed linearly
+between the minimum and maximum. For example, with a minimum of -10.0, a maximum
+of 30.0, and an 8-bit array, the quantized values represent the following:
+
+<figure>
+  <table>
+    <tr><th>Quantized</th><th>Float</th></tr>
+    <tr><td>0</td><td>-10.0</td></tr>
+    <tr><td>255</td><td>30.0</td></tr>
+    <tr><td>128</td><td>10.0</td></tr>
+  </table>
+  <figcaption>
+    <b>Table 2</b>: Example quantized value range
+  </figcaption>
+</figure>
+
+The advantages of this representation format are:
+
+* It efficiently represents ranges of arbitrary magnitude.
+* The values don't have to be symmetrical.
+* The format represents both signed and unsigned values.
+* The linear spread makes multiplications straightforward.
+
+Alternative techniques use lower bit depths by non-linearly distributing the
+float values across the representation, but currently are more expensive in terms
+of computation time. (See Han et al.,
+[2016](https://arxiv.org/abs/1510.00149).)
+
+The advantage of having a clear definition of the quantized format is that it's
+always possible to convert back and forth from fixed-point to floating-point for
+operations that aren't quantization-ready, or to inspect the tensors for
+debugging.
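
The linear mapping described above can be sketched in a few lines of plain
Python. This is only an illustration of the min/max representation, not the
TensorFlow kernel code; the function names `quantize` and `dequantize` and the
round-to-nearest choice are assumptions made for this example.

```python
def quantize(value, min_val, max_val, num_bits=8):
    """Map a float in [min_val, max_val] to the nearest integer code."""
    levels = (1 << num_bits) - 1          # 255 for an 8-bit array
    scale = (max_val - min_val) / levels  # float width of one quantized step
    clamped = min(max(value, min_val), max_val)
    return int(round((clamped - min_val) / scale))


def dequantize(code, min_val, max_val, num_bits=8):
    """Map an integer code back to the float value it represents."""
    levels = (1 << num_bits) - 1
    scale = (max_val - min_val) / levels
    return min_val + code * scale


# Decode the three codes from Table 2 for the range [-10.0, 30.0].
for code in (0, 128, 255):
    print(code, round(dequantize(code, -10.0, 30.0), 3))
```

With this mapping, code 128 decodes to roughly 10.08, which Table 2 rounds to
10.0. A round trip through `quantize` and `dequantize` changes a value by at
most half a quantization step.
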
diff --git a/tensorflow/docs_src/performance/xla/operation_semantics.md b/tensorflow/docs_src/performance/xla/operation_semantics.md
index 5431572db83a84c034c56656928bdc927e708dc9..712d425331b9e75d4a1fe030a65ddff6be3d5292 100644
--- a/tensorflow/docs_src/performance/xla/operation_semantics.md
+++ b/tensorflow/docs_src/performance/xla/operation_semantics.md
@@ -45,27 +45,30 @@ feature dimension in `operand`), the operation calculates the gradients with
 respect to `operand`, `offset` and `scale` across all the other dimensions. The
 `feature_index` must be a valid index for the feature dimension in `operand`.
 
-The three gradients are defined by the following formulas:
+The three gradients are defined by the following formulas (assuming `operand`
+is a 4-dimensional tensor and \\(l\\) is the index for the feature dimension):
 
-\\( \nabla x = \nabla y * \gamma * \sqrt{\sigma^2+\epsilon} \\)
+\\( coef_l = \frac{1}{mwh}\sum_{i=1}^m\sum_{j=1}^w\sum_{k=1}^h (\nabla y_{ijkl} * (x_{ijkl} - \mu_l) / (\sigma^2_{l}+\epsilon)) \\)
 
-\\( \nabla \gamma = sum(\nabla y * (x - \mu) * \sqrt{\sigma^2 + \epsilon}) \\)
+\\( \nabla x_{ijkl} = \gamma_{l} * (1/\sqrt{\sigma^2_{l}+\epsilon}) * [\nabla y_{ijkl} - mean(\nabla y) - (x_{ijkl} - \mu_{l}) * coef_l] \\)
 
-\\( \nabla \beta = sum(\nabla y) \\)
+\\( \nabla \beta_l = \sum_{i=1}^m\sum_{j=1}^w\sum_{k=1}^h \nabla y_{ijkl} \\)
+
+\\( \nabla \gamma_l = \sum_{i=1}^m\sum_{j=1}^w\sum_{k=1}^h \nabla y_{ijkl} * ((x_{ijkl} - \mu_l) / \sqrt{\sigma^2_{l}+\epsilon}) \\)
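
As a rough sketch, the formulas above can be written out in plain Python for a
single feature `l`, with the batch and spatial dimensions flattened into one
list. The function name `batch_norm_grad_one_feature` is hypothetical, and the
mean and variance are recomputed from `x` here for convenience, whereas the XLA
operation receives them as inputs.

```python
import math


def batch_norm_grad_one_feature(x, grad_y, gamma, epsilon):
    """Gradients for one feature l, with batch and spatial dims flattened.

    x and grad_y are equal-length lists holding x_{ijkl} and the incoming
    gradient for a fixed l. Returns (grad_x, grad_gamma, grad_beta).
    """
    n = len(x)                                 # n = m * w * h
    mu = sum(x) / n                            # batch mean (assumed recomputed)
    var = sum((v - mu) ** 2 for v in x) / n    # biased batch variance
    inv_std = 1.0 / math.sqrt(var + epsilon)

    mean_grad_y = sum(grad_y) / n
    # coef_l from the first formula.
    coef = sum(g * (v - mu) for g, v in zip(grad_y, x)) / n / (var + epsilon)

    grad_x = [gamma * inv_std * (g - mean_grad_y - (v - mu) * coef)
              for g, v in zip(grad_y, x)]
    grad_beta = sum(grad_y)
    grad_gamma = sum(g * (v - mu) * inv_std for g, v in zip(grad_y, x))
    return grad_x, grad_gamma, grad_beta
```

One quick sanity check on such a sketch is that the entries of `grad_x` sum to
approximately zero, since shifting every input by the same constant leaves the
normalized output unchanged.
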
 
 The inputs `mean` and `variance` represents moments value
 across batch and spatial dimensions.
 
 The output type is a tuple of three handles:
 
-|Outputs       | Type                    | Semantics                           |
-|------------- | ----------------------- | ------------------------------------|
-|`grad_operand`| `ComputationDataHandle` | gradient with respect to input      |
-:              :                         : `operand`                           :
-|`grad_scale`  | `ComputationDataHandle` | gradient with respect to input      |
-:              :                         : `scale`                             :
-|`grad_offset` | `ComputationDataHandle` | gradient with respect to input      |
-:              :                         : `offset`                            :
+|Outputs       | Type                    | Semantics                            |
+|------------- | ----------------------- | ------------------------------------ |
+|`grad_operand`| `ComputationDataHandle` | gradient with respect to input       |
+:              :                         : `operand` (\\( \nabla x\\))          :
+|`grad_scale`  | `ComputationDataHandle` | gradient with respect to input       |
+:              :                         : `scale` (\\( \nabla \gamma\\))       :
+|`grad_offset` | `ComputationDataHandle` | gradient with respect to input       |
+:              :                         : `offset` (\\( \nabla \beta\\))       :
 
 
 ## BatchNormInference
@@ -119,11 +122,11 @@ Normalizes an array across batch and spatial dimensions.
 | Arguments       | Type                    | Semantics                        |
 | --------------- | ----------------------- | -------------------------------- |
 | `operand`       | `ComputationDataHandle` | n dimensional array to be        |
-:                 :                         : normalized                       :
+:                 :                         : normalized (x)                   :
 | `scale`         | `ComputationDataHandle` | 1 dimensional array              |
 :                 :                         : (\\(\gamma\\))                   :
 | `offset`        | `ComputationDataHandle` | 1 dimensional array              |
-:                 :                         : (\\(\beta\\ )                    :
+:                 :                         : (\\(\beta\\))                    :
 | `epsilon`       | `float`                 | Epsilon value (\\(\epsilon\\))   |
 | `feature_index` | `int64`                 | Index to feature dimension       |
 :                 :                         : in `operand`                     :
@@ -135,8 +138,8 @@ element in `operand`. The `feature_index` must be a valid index for the feature
 dimension in `operand`.
 
 The algorithm goes as follows for each batch in `operand` \\(x\\) that
-contains `m` elements with `w` and `h` as the size of spatial dimensions (
-assuming `operand` is an 4 dimensional array):
+contains `m` elements with `w` and `h` as the size of spatial dimensions
+(assuming `operand` is a 4-dimensional array):
 
 - Calculates batch mean \\(\mu_l\\) for each feature `l` in feature dimension:
 \\(\mu_l=\frac{1}{mwh}\sum_{i=1}^m\sum_{j=1}^w\sum_{k=1}^h x_{ijkl}\\)
@@ -170,7 +173,7 @@ Similar to a `tf.bitcast` in TensorFlow, performs an element-wise bitcast
 operation from a data shape to a target shape. The dimensions must match, and
 the conversion is an element-wise one; e.g. `s32` elements become `f32` elements
 via bitcast routine. Bitcast is implemented as a low-level cast, so machines
-with different floating point representations will give different results.
+with different floating-point representations will give different results.
 
 <b> `BitcastConvertType(operand, new_element_type)` </b>
 
@@ -351,7 +354,7 @@ each other) and contains the arguments in the order that they were specified.
 :             :                         : concatenated between the `operands`. :
 
 With the exception of `dimension` all dimensions must be the same. This is
-because XLA does not support "ragged" arrays Also note that rank-0 values
+because XLA does not support "ragged" arrays. Also note that rank-0 values
 cannot be concatenated (as it's impossible to name the dimension along which the
 concatenation occurs).
 
@@ -440,11 +443,13 @@ area and a computation is performed for each possible position of the window.
 | `lhs`            | `ComputationDataHandle` | rank n+2 array of inputs      |
 | `rhs`            | `ComputationDataHandle` | rank n+2 array of kernel      |
 :                  :                         : weights                       :
-| `window_strides` | `ArraySlice<int64>`     | n-d array of kernel strides   |
-| `padding`        | `ArraySlice<pair<int64, | n-d array of (low, high)      |
+| `window_strides` | `ArraySlice<int64>`     | size n array of kernel strides|
+| `padding`        | `ArraySlice<pair<int64, | size n array of (low, high)   |
 :                  : int64>>`                : padding                       :
-| `lhs_dilation`   | `ArraySlice<int64>`     | n-d lhs dilation factor array |
-| `rhs_dilation`   | `ArraySlice<int64>`     | n-d rhs dilation factor array |
+| `lhs_dilation`   | `ArraySlice<int64>`     | size n lhs dilation factor    |
+:                  :                         : array                         :
+| `rhs_dilation`   | `ArraySlice<int64>`     | size n rhs dilation factor    |
+:                  :                         : array                         :
 
 Let n be the number of spatial dimensions. The `lhs` argument is a rank n+2
 array describing the base area. This is called the input, even though of course
@@ -468,7 +473,7 @@ filter/kernel/window. The dimensions are, in this order:
     window that moves across the base area.
 
 The `window_strides` argument specifies the stride of the convolutional window
-in the spatial dimensions. For example, if the stride in a the first spatial
+in the spatial dimensions. For example, if the stride in the first spatial
 dimension is 3, then the window can only be placed at coordinates where the
 first spatial index is divisible by 3.
 
@@ -942,7 +947,7 @@ expand the rank of the lower-rank operand up to the rank of the higher-rank
 operand. `broadcast_dimensions` maps the dimensions of the lower-rank shape to
 the dimensions of the higher-rank shape. The unmapped dimensions of the expanded
 shape are filled with dimensions of size one. Degenerate-dimension broadcasting
-then broadcasts the shapes along these degenerate dimension to equalize the
+then broadcasts the shapes along these degenerate dimensions to equalize the
 shapes of both operands. The semantics are described in detail on the
 @{$broadcasting$broadcasting page}.
 
@@ -1027,6 +1032,213 @@ Arguments | Type                    | Semantics
 The function is applied to each element in the `operand` array, resulting in an
 array with the same shape. It is allowed for `operand` to be a scalar (rank 0).
 
+## Gather
+
+The XLA gather operation stitches together several slices (each slice at a
+potentially different runtime offset) of an input tensor into an output tensor.
+
+### General Semantics
+
+See also
+[`ComputationBuilder::Gather`](https://www.tensorflow.org/code/tensorflow/compiler/xla/client/computation_builder.h).
+For a more intuitive description, see the "Informal Description" section below.
+
+<b> `gather(operand, gather_indices, output_window_dims, elided_window_dims, window_bounds, gather_dims_to_operand_dims)` </b>
+
+|Arguments         | Type                    | Semantics                       |
+|----------------- | ----------------------- | --------------------------------|
+|`operand`         | `ComputationDataHandle` | The tensor we’re gathering      |
+:                  :                         : from.                           :
+|`gather_indices`  | `ComputationDataHandle` | Tensor containing the starting  |
+:                  :                         : indices of the slices we're     :
+:                  :                         : stitching together into         :
+:                  :                         : the output tensor.              :
+|`index_vector_dim`  | `int64`               | The dimension in                |
+:                  :                         : `gather_indices` that contains  :
+:                  :                         : the starting indices.           :
+|`output_window_dims` | `ArraySlice<int64>`  | The set of dimensions in the    |
+:                  :                         : output shape that are _window   :
+:                  :                         : dimensions_ (defined below).    :
+:                  :                         : Not all window dimensions may   :
+:                  :                         : be present in the output shape. :
+|`elided_window_dims` | `ArraySlice<int64>`  | The set of _window dimensions_  |
+:                  :            : that are not present in the output shape.    :
+:                  :            : `window_bounds[i]` must be `1` for all `i`   :
+:                  :            : in `elided_window_dims`.                     :
+|`window_bounds`   | `ArraySlice<int64>`    | `window_bounds[i]` is the bounds |
+:                  :            : for  window dimension `i`. This includes     :
+:                  :            : both the window dimensions that are          :
+:                  :            : explicitly part of the output shape (via     :
+:                  :            : `output_window_dims`) and the window         :
+:                  :            : dimensions that are elided (via              :
+:                  :            : `elided_window_dims`).                       :
+|`gather_dims_to_operand_dims` | `ArraySlice<int64>` | A dimension map (the    |
+:                  :            : array is interpreted as mapping `i` to       :
+:                  :            : `gather_dims_to_operand_dims[i]`)  from      :
+:                  :            : the gather indices in `gather_indices` to    :
+:                  :            : the operand index space.  It has to be       :
+:                  :            : one-to-one and total.                        :
+
+For every index `Out` in the output tensor, we compute two things (more
+precisely described later):
+
+  - An index into `gather_indices.rank` - `1` dimensions of `gather_indices`,
+    which gives us a starting index of a slice, _operand slice_, in the operand
+    tensor.  These `gather_indices.rank` - `1` dimensions are all the dimensions
+    in `gather_indices` except `index_vector_dim`.
+
+  - A _window index_ that has the same rank as the operand.  This index is
+    composed of the values in `Out` at dimensions `output_window_dims`, embedded
+    with zeroes according to `elided_window_dims`.
+
+The _window index_ is the relative index of the element in _operand slice_ that
+should be present in the output at index `Out`.
+
+The output is a tensor of rank `output_window_dims.size` + `gather_indices.rank`
+- `1`.  Additionally, as a shorthand, we define `output_gather_dims` of type
+`ArraySlice<int64>` as the set of dimensions in the output shape but not in
+`output_window_dims`, in ascending order.  E.g. if the output tensor has rank
+`5`, `output_window_dims` is {`2`, `4`} then `output_gather_dims` is {`0`, `1`,
+`3`}.
+
+If `index_vector_dim` is equal to `gather_indices.rank` we implicitly
+consider `gather_indices` to have a trailing `1` dimension (i.e. if
+`gather_indices` was of shape `[6,7]` and `index_vector_dim` is `2` then
+we implicitly consider the shape of `gather_indices` to be `[6,7,1]`).
+
+The bound for the output tensor along dimension `i` is computed as follows:
+
+  1. If `i` is present in `output_gather_dims` (i.e. is equal to
+     `output_gather_dims[k]` for some `k`) then we pick the corresponding
+     dimension bounds out of `gather_indices.shape`, skipping
+     `index_vector_dim` (i.e. pick `gather_indices.shape.dims`[`k`] if `k`
+     < `index_vector_dim` and `gather_indices.shape.dims`[`k`+`1`]
+     otherwise).
+  2. If `i` is present in `output_window_dims` (i.e. equal to
+     `output_window_dims`[`k`] for some `k`) then we pick the corresponding
+     bound out of `window_bounds` after accounting for `elided_window_dims`
+     (i.e. we pick `adjusted_window_bounds`[`k`] where `adjusted_window_bounds`
+     is `window_bounds` with the bounds at indices `elided_window_dims`
+     removed).
+
+The operand index `In` corresponding to an output index `Out` is computed as
+follows:
+
+  1. Let `G` = { `Out`[`k`] for `k` in `output_gather_dims` }.  Use `G` to slice
+     out vector `S` such that `S`[`i`] = `gather_indices`[Combine(`G`, `i`)]
+     where Combine(A, b) inserts b at position `index_vector_dim` into A.
+     Note that this is well defined even if `G` is empty -- if `G` is empty then
+     `S` = `gather_indices`.
+  2. Create an index, `S`<sub>`in`</sub>, into `operand` using `S` by
+     scattering `S` using the `gather_dims_to_operand_dims` map
+     (`S`<sub>`in`</sub> is the starting indices for _operand slice_ mentioned
+     above).  More precisely:
+       1. `S`<sub>`in`</sub>[`gather_dims_to_operand_dims`[`k`]] = `S`[`k`] if `k` <
+          `gather_dims_to_operand_dims.size`.
+       2. `S`<sub>`in`</sub>[`_`] = `0` otherwise.
+  3. Create an index `W`<sub>`in`</sub> into `operand` by scattering the indices
+     at the output window dimensions in `Out` according to
+     the `elided_window_dims` set (`W`<sub>`in`</sub> is the _window index_
+     mentioned above).  More precisely:
+       1. `W`<sub>`in`</sub>[`window_dims_to_operand_dims`(`k`)] = `Out`[`k`] if
+          `k` < `output_window_dims.size` (`window_dims_to_operand_dims` is
+          defined below).
+       2. `W`<sub>`in`</sub>[`_`] = `0` otherwise.
+  4. `In` is `W`<sub>`in`</sub> + `S`<sub>`in`</sub> where + is element-wise
+     addition.
+
+`window_dims_to_operand_dims` is the monotonic function with domain [`0`,
+`output_window_dims.size`) and range [`0`, `operand.rank`) \
+`elided_window_dims`.  So if, e.g., `output_window_dims.size` is `4`,
+`operand.rank` is `6` and `elided_window_dims` is {`0`, `2`} then
+`window_dims_to_operand_dims` is {`0`→`1`, `1`→`3`, `2`→`4`, `3`→`5`}.
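
The index computation described above can be sketched as a small reference
implementation in plain Python. This is an illustrative reading of the steps,
not the XLA implementation: tensors are rectangular nested lists, the result is
returned as a dictionary keyed by output index, `index_vector_dim` is assumed to
be smaller than `gather_indices.rank`, and all computed indices are assumed to
be in bounds. The helpers `shape_of` and `at` are hypothetical names.

```python
import itertools


def shape_of(t):
    """Shape of a rectangular nested list."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return dims


def at(t, index):
    """Element of nested list `t` at a multi-dimensional index."""
    for i in index:
        t = t[i]
    return t


def gather(operand, gather_indices, index_vector_dim, output_window_dims,
           elided_window_dims, window_bounds, gather_dims_to_operand_dims):
    """Returns {output index tuple: value} following the numbered steps."""
    ivd = index_vector_dim
    operand_rank = len(shape_of(operand))
    gi_shape = shape_of(gather_indices)
    out_rank = len(output_window_dims) + len(gi_shape) - 1
    output_gather_dims = [d for d in range(out_rank)
                          if d not in output_window_dims]
    # Monotonic map from window dimension k to an operand dimension.
    window_dims_to_operand_dims = [d for d in range(operand_rank)
                                   if d not in elided_window_dims]
    adjusted_window_bounds = [window_bounds[d]
                              for d in window_dims_to_operand_dims]
    gather_bounds = gi_shape[:ivd] + gi_shape[ivd + 1:]

    out_shape = [0] * out_rank
    for k, d in enumerate(output_gather_dims):
        out_shape[d] = gather_bounds[k]
    for k, d in enumerate(output_window_dims):
        out_shape[d] = adjusted_window_bounds[k]

    result = {}
    for out in itertools.product(*(range(b) for b in out_shape)):
        g = [out[d] for d in output_gather_dims]
        # Step 1: slice the starting-index vector S out of gather_indices.
        s = [at(gather_indices, g[:ivd] + [i] + g[ivd:])
             for i in range(gi_shape[ivd])]
        # Step 2: scatter S into the operand index space.
        s_in = [0] * operand_rank
        for k, d in enumerate(gather_dims_to_operand_dims):
            s_in[d] = s[k]
        # Step 3: scatter the window part of the output index.
        w_in = [0] * operand_rank
        for k, d in enumerate(output_window_dims):
            w_in[window_dims_to_operand_dims[k]] = out[d]
        # Step 4: the operand index is the element-wise sum.
        result[out] = at(operand, [a + b for a, b in zip(s_in, w_in)])
    return result


# Pick rows 2 and 0 of a [3,4] operand (a tf.gather_nd-style row gather).
operand = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
picked = gather(operand, [[2], [0]], index_vector_dim=1,
                output_window_dims=[1], elided_window_dims=[0],
                window_bounds=[1, 4], gather_dims_to_operand_dims=[0])
rows = [[picked[(g, w)] for w in range(4)] for g in range(2)]
# rows == [[20, 21, 22, 23], [0, 1, 2, 3]]
```

Returning a dictionary keeps the sketch short; a real implementation would
write into a dense output buffer of the computed output shape.
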
+
+### Informal Description and Examples
+
+`index_vector_dim` is set to `gather_indices.rank` - `1` in all of the
+examples that follow.  More interesting values for `index_vector_dim`
+do not change the operation fundamentally, but make the visual representation
+more cumbersome.
+
+To get an intuition on how all of the above fits together, let's look at an
+example that gathers 5 slices of shape `[8,6]` from a `[16,11]` tensor.  The
+position of a slice into the `[16,11]` tensor can be represented as an index
+vector of shape `S64[2]`, so the set of 5 positions can be represented as a
+`S64[5,2]` tensor.
+
+The behavior of the gather operation can then be depicted as an index
+transformation that takes [`G`,`W`<sub>`0`</sub>,`W`<sub>`1`</sub>], an index in
+the output shape, and maps it to an element in the input tensor in the following
+way:
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+  <img style="width:100%" src="../../images/ops_xla_gather_0.svg">
+</div>
+
+We first select an (`X`,`Y`) vector from the gather indices tensor using `G`.
+The element in the output tensor at index
+[`G`,`W`<sub>`0`</sub>,`W`<sub>`1`</sub>] is then the element in the input
+tensor at index [`X`+`W`<sub>`0`</sub>,`Y`+`W`<sub>`1`</sub>].
+
+`window_bounds` is `[8,6]`, which decides the range of W<sub>`0`</sub> and
+W<sub>`1`</sub>, and this in turn decides the bounds of the slice.
+
+This gather operation acts as a batch dynamic slice with `G` as the batch
+dimension.
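
A minimal sketch of this batch dynamic slice in plain Python, assuming the
operand is a nested list and every slice lies fully inside it. The function name
`batch_dynamic_slice` and the sample start positions are made up for
illustration.

```python
def batch_dynamic_slice(operand, starts, window_bounds):
    """out[g][w0][w1] = operand[X + w0][Y + w1], where (X, Y) = starts[g]."""
    rows, cols = window_bounds
    return [[[operand[x + w0][y + w1] for w1 in range(cols)]
             for w0 in range(rows)]
            for x, y in starts]


# A [16,11] operand whose element at [r][c] is r * 100 + c, so each value
# reveals the position it was read from.
operand = [[r * 100 + c for c in range(11)] for r in range(16)]
# Five (X, Y) start positions, i.e. a gather indices tensor of shape [5,2].
starts = [[0, 0], [8, 5], [3, 2], [1, 4], [7, 0]]
out = batch_dynamic_slice(operand, starts, window_bounds=(8, 6))
# out has shape [5,8,6]; for example out[2][1][3] is operand[3 + 1][2 + 3].
```
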
+
+The gather indices may be multidimensional.  For instance, a more general
+version of the example above using a "gather indices" tensor of shape `[4,5,2]`
+would translate indices like this:
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+  <img style="width:100%" src="../../images/ops_xla_gather_1.svg">
+</div>
+
+Again, this acts as a batch dynamic slice with `G`<sub>`0`</sub> and
+`G`<sub>`1`</sub> as the batch dimensions.  The window bounds are still `[8,6]`.
+
+The gather operation in XLA generalizes the informal semantics outlined above in
+the following ways:
+
+ 1. We can configure which dimensions in the output shape are the window
+    dimensions (dimensions containing `W`<sub>`0`</sub>, `W`<sub>`1`</sub> in
+    the last example).  The output gather dimensions (dimensions containing
+    `G`<sub>`0`</sub>, `G`<sub>`1`</sub> in the last example) are defined to be
+    the output dimensions that are not window dimensions.
+
+ 2. The number of output window dimensions explicitly present in the output
+    shape may be smaller than the input rank.  These "missing" dimensions, which
+    are listed explicitly as `elided_window_dims`, must have a window bound of
+    `1`.  Since they have a window bound of `1` the only valid index for them is
+    `0` and eliding them does not introduce ambiguity.
+
+ 3. The slice extracted from the "Gather Indices" tensor ((`X`, `Y`) in the last
+    example) may have fewer elements than the input tensor rank, and an explicit
+    mapping dictates how the index should be expanded to have the same rank as
+    the input.
+
+As a final example, we use (2) and (3) to implement `tf.gather_nd`:
+
+<div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
+  <img style="width:100%" src="../../images/ops_xla_gather_2.svg">
+</div>
+
+`G`<sub>`0`</sub> and `G`<sub>`1`</sub> are used to slice out a starting index
+from the gather indices tensor as usual, except the starting index has only one
+element, `X`.  Similarly, there is only one output window index with the value
+`W`<sub>`0`</sub>.  However, before being used as indices into the input tensor,
+these are expanded in accordance with "Gather Index Mapping"
+(`gather_dims_to_operand_dims` in the formal description) and "Window Mapping"
+(`window_dims_to_operand_dims` in the formal description) into
+[`X`,`0`] and [`0`,`W`<sub>`0`</sub>] respectively, adding up to
+[`X`,`W`<sub>`0`</sub>].  In other words, the output index
+[`G`<sub>`0`</sub>,`G`<sub>`1`</sub>,`W`<sub>`0`</sub>] maps to the input index
+[`GatherIndices`[`G`<sub>`0`</sub>,`G`<sub>`1`</sub>,`0`],`W`<sub>`0`</sub>] which gives us
+the semantics for `tf.gather_nd`.
+
+`window_bounds` for this case is `[1,11]`.  Intuitively this means that every
+index `X` in the gather indices tensor picks an entire row and the result is the
+concatenation of all these rows.
 
 ## GetTupleElement
 
@@ -1081,7 +1293,7 @@ result2 = while (condition, init = result1) {
 ```
 
 Nested tuple shapes are not supported. For an empty tuple shape, the Infeed
-operation is effectively a nop and proceeds without reading any data from the
+operation is effectively a no-op and proceeds without reading any data from the
 Infeed of the device.
 
 > Note: We plan to allow multiple Infeed operations without a total order, in
@@ -1144,7 +1356,7 @@ dimension.
 
 `PaddingConfig` is a repeated field of `PaddingConfigDimension`, which contains
 three fields for each dimension: `edge_padding_low`, `edge_padding_high`, and
-`interior_padding`. `edge_padding_low` and `edge_padding_high` specifies the
+`interior_padding`. `edge_padding_low` and `edge_padding_high` specify the
 amount of padding added at the low-end (next to index 0) and the high-end (next
 to the highest index) of each dimension respectively. The amount of edge padding
 can be negative -- the absolute value of negative padding indicates the number
@@ -1153,8 +1365,8 @@ the amount of padding added between any two elements in each dimension. Interior
 padding occurs logically before edge padding, so in the case of negative edge
 padding elements are removed from the interior-padded operand. This operation is
 a no-op if the edge padding pairs are all (0, 0) and the interior padding values
-are all 0. Figure below shows examples of different `edge_padding` and
-`interior_padding` values for a two dimensional array.
+are all 0. The figure below shows examples of different `edge_padding` and
+`interior_padding` values for a two-dimensional array.
 
 <div style="width:95%; margin:auto; margin-bottom:10px; margin-top:20px;">
   <img style="width:100%" src="https://www.tensorflow.org/images/ops_pad.png">
diff --git a/tensorflow/docs_src/programmers_guide/datasets.md b/tensorflow/docs_src/programmers_guide/datasets.md
index d19200e80cdfe6620789ddd273647660c10b2a60..d38fbddfa1cfad305b0549bd4a8ffda371c978b6 100644
--- a/tensorflow/docs_src/programmers_guide/datasets.md
+++ b/tensorflow/docs_src/programmers_guide/datasets.md
@@ -327,6 +327,35 @@ same op/node (created by `Iterator.get_next()`). Therefore,  evaluating *any* of
 these tensors will advance the iterator for all components. A typical consumer
 of an iterator will include all components in a single expression.
 
+### Saving iterator state
+
+The @{tf.contrib.data.make_saveable_from_iterator} function creates a
+`SaveableObject` from an iterator, which can be used to save and
+restore the current state of the iterator (and, effectively, the whole input
+pipeline). A saveable object thus created can be added to the @{tf.train.Saver}
+variables list or the `tf.GraphKeys.SAVEABLE_OBJECTS` collection for saving and
+restoring in the same manner as a @{tf.Variable}. Refer to
+@{$saved_model$Saving and Restoring} for details on how to save and restore
+variables.
+
+```python
+# Create saveable object from iterator.
+saveable = tf.contrib.data.make_saveable_from_iterator(iterator)
+
+# Save the iterator state by adding it to the saveable objects collection.
+tf.add_to_collection(tf.GraphKeys.SAVEABLE_OBJECTS, saveable)
+saver = tf.train.Saver()
+
+with tf.Session() as sess:
+
+  if should_checkpoint:
+    saver.save(sess, path_to_checkpoint)
+
+# Restore the iterator state.
+with tf.Session() as sess:
+  saver.restore(sess, path_to_checkpoint)
+```
+
 ## Reading input data
 
 ### Consuming NumPy arrays
diff --git a/tensorflow/docs_src/programmers_guide/saved_model.md b/tensorflow/docs_src/programmers_guide/saved_model.md
index f27a658342b8d33407e1c6ed5799a10c2305a74c..c54c278584ec6265f0da1453fc266aeec7cb6f30 100644
--- a/tensorflow/docs_src/programmers_guide/saved_model.md
+++ b/tensorflow/docs_src/programmers_guide/saved_model.md
@@ -3,6 +3,9 @@
 This document explains how to save and restore
 @{$variables$variables} and models.
 
+Important: TensorFlow model files are code. Be careful with untrusted code.
+See [Using TensorFlow Securely](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/SECURITY.md)
+for details.
 
 ## Saving and restoring variables
 
@@ -694,15 +697,15 @@ executing the computation graph later. For example:
 $ saved_model_cli show --dir \
 /tmp/saved_model_dir --tag_set serve --signature_def serving_default
 The given SavedModel SignatureDef contains the following input(s):
-inputs['x'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x:0
+  inputs['x'] tensor_info:
+      dtype: DT_FLOAT
+      shape: (-1, 1)
+      name: x:0
 The given SavedModel SignatureDef contains the following output(s):
-outputs['y'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y:0
+  outputs['y'] tensor_info:
+      dtype: DT_FLOAT
+      shape: (-1, 1)
+      name: y:0
 Method name is: tensorflow/serving/predict
 ```
 
@@ -714,32 +717,32 @@ $ saved_model_cli show --dir /tmp/saved_model_dir --all
 MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
 
 signature_def['classify_x2_to_y3']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x2:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['scores'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y3:0
-Method name is: tensorflow/serving/classify
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: x2:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['scores'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y3:0
+  Method name is: tensorflow/serving/classify
 
 ...
 
 signature_def['serving_default']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['x'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['y'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y:0
-Method name is: tensorflow/serving/predict
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['x'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: x:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['y'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y:0
+  Method name is: tensorflow/serving/predict
 ```
 
 
diff --git a/tensorflow/docs_src/programmers_guide/variables.md b/tensorflow/docs_src/programmers_guide/variables.md
index 64250738056043e236b5eb236bcbf29375655260..e8cf7711552f4c83ed1e03e0753b580cc7505ddc 100644
--- a/tensorflow/docs_src/programmers_guide/variables.md
+++ b/tensorflow/docs_src/programmers_guide/variables.md
@@ -62,9 +62,10 @@ them. For this reason TensorFlow provides **collections**, which are named lists
 of tensors or other objects, such as `tf.Variable` instances.
 
 By default every `tf.Variable` gets placed in the following two collections:
+
  * `tf.GraphKeys.GLOBAL_VARIABLES` --- variables that can be shared across
-multiple devices,
- * `tf.GraphKeys.TRAINABLE_VARIABLES`--- variables for which TensorFlow will
+   multiple devices,
+ * `tf.GraphKeys.TRAINABLE_VARIABLES` --- variables for which TensorFlow will
    calculate gradients.
 
 If you don't want a variable to be trainable, add it to the
diff --git a/tensorflow/docs_src/programmers_guide/version_compat.md b/tensorflow/docs_src/programmers_guide/version_compat.md
index a28f1385c87c7a083ee96977c5ab268c6977e17e..e6613cc69f8aedf344fa25b6564889e34cd9bf53 100644
--- a/tensorflow/docs_src/programmers_guide/version_compat.md
+++ b/tensorflow/docs_src/programmers_guide/version_compat.md
@@ -60,7 +60,8 @@ patch versions.  The public APIs consist of
     * [`tensor_shape`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/tensor_shape.proto)
     * [`types`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/types.proto)
 
-## What is *not* covered {not_covered}
+<a name="not_covered"></a>
+## What is *not* covered
 
 Some API functions are explicitly marked as "experimental" and can change in
 backward incompatible ways between minor releases. These include:
diff --git a/tensorflow/docs_src/tutorials/image_retraining.md b/tensorflow/docs_src/tutorials/image_retraining.md
index df15bc0a9c3763aa51c2fc8cf36ce9fc3544ae68..246a420400a706387fc5d4a78672351f8fa48647 100644
--- a/tensorflow/docs_src/tutorials/image_retraining.md
+++ b/tensorflow/docs_src/tutorials/image_retraining.md
@@ -349,31 +349,32 @@ results, but if you intend to deploy your model on mobile devices or other
 resource-constrained environments you may want to trade off a little accuracy
 for much smaller file sizes or faster speeds. To help with that, the
 [retrain.py script](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py)
-supports 32 different variations on the [Mobilenet architecture](https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html).
+supports different variations on the [Mobilenet architecture](https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html).
 
 These are a little less precise than Inception v3, but can result in far
-smaller file sizes (down to less than a megabyte) and can be many times faster
+smaller file sizes (a few megabytes) and can be many times faster
 to run. To train with one of these models, pass in the `--architecture` flag,
 for example:
 
 ```
 python tensorflow/examples/image_retraining/retrain.py \
-    --image_dir ~/flower_photos --architecture mobilenet_0.25_128_quantized
+    --image_dir ~/flower_photos --architecture mobilenet_0.25_128
 ```
 
-This will create a 941KB model file in `/tmp/output_graph.pb`, with 25% of the
-parameters of the full Mobilenet, taking 128x128 sized input images, and with
-its weights quantized down to eight bits on disk. You can choose '1.0', '0.75',
-'0.50', or '0.25' to control the number of weight parameters, and so the file
-size (and to some extent the speed), '224', '192', '160', or '128' for the input
-image size, with smaller sizes giving faster speeds, and an optional
-'_quantized' at the end to indicate whether the file should contain 8-bit or
-32-bit float weights.
+This will create a 1.9MB model file in `/tmp/output_graph.pb`, with only 25% of
+the number of neurons of the full Mobilenet, and trained to take 128x128 sized
+input images.
+
+You can choose '1.0', '0.75', '0.5', or '0.25' to control the number of
+neurons (activations of hidden layers); the number of weights (and hence to
+some extent the file size and speed) shrinks like the square of that fraction.
+You can choose '224', '192', '160', or '128' for the input image size,
+with smaller sizes giving faster speeds.
 
 The speed and size advantages come at a loss to accuracy of course, but for many
 purposes this isn't critical. They can also be somewhat offset with improved
 training data. For example, training with distortions allows me to get above 80%
-accuracy on the flower data set even with the 0.25/128/quantized graph above.
+accuracy on the flower data set even with the 0.25/128 graph above.
 
 If you're going to be using the Mobilenet models in label_image or your own
 programs, you'll need to feed in an image of the specified size converted to a
@@ -395,3 +396,9 @@ python tensorflow/examples/label_image/label_image.py \
 --input_mean=128 --input_std=128 \
 --image=$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg
 ```
+
+For more information on deploying the retrained model to a mobile device, see
+the [codelab version](https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/#0)
+of this tutorial, especially [part 2](https://codelabs.developers.google.com/codelabs/tensorflow-for-poets-2-tflite/#0), which describes
+[TensorFlow Lite](/mobile/tflite/) and the additional optimizations it offers
+(including quantization of model weights).
diff --git a/tensorflow/examples/android/src/org/tensorflow/demo/SpeechActivity.java b/tensorflow/examples/android/src/org/tensorflow/demo/SpeechActivity.java
index 8a1d86d9eedf3a1e1aa80e998ff150ad0c2447a1..1cddf3dc5568babb8c08c690fad143299f5ccca5 100644
--- a/tensorflow/examples/android/src/org/tensorflow/demo/SpeechActivity.java
+++ b/tensorflow/examples/android/src/org/tensorflow/demo/SpeechActivity.java
@@ -332,8 +332,10 @@ public class SpeechActivity extends Activity {
                 }
                 final View labelView = labelsListView.getChildAt(labelIndex - 2);
 
-                AnimatorSet colorAnimation = (AnimatorSet) AnimatorInflater.loadAnimator(
-                    SpeechActivity.this, R.animator.color_animation);
+                AnimatorSet colorAnimation =
+                    (AnimatorSet)
+                        AnimatorInflater.loadAnimator(
+                            SpeechActivity.this, R.animator.color_animation);
                 colorAnimation.setTarget(labelView);
                 colorAnimation.start();
               }
diff --git a/tensorflow/examples/get_started/regression/imports85.py b/tensorflow/examples/get_started/regression/imports85.py
index a8e4c782b3f7b5d01a91f38a48e5edb2202108de..4fdaceea9afee74550196031fe590c3a2abd20ed 100644
--- a/tensorflow/examples/get_started/regression/imports85.py
+++ b/tensorflow/examples/get_started/regression/imports85.py
@@ -131,11 +131,12 @@ def dataset(y_name="price", train_fraction=0.7):
     # booleans but we are dealing with symbolic tensors.
     return ~in_training_set(line)
 
-  base_dataset = (tf.data
-                  # Get the lines from the file.
-                  .TextLineDataset(path)
-                  # drop lines with question marks.
-                  .filter(has_no_question_marks))
+  base_dataset = (
+      tf.data
+      # Get the lines from the file.
+      .TextLineDataset(path)
+      # drop lines with question marks.
+      .filter(has_no_question_marks))
 
   train = (base_dataset
            # Take only the training-set lines.
diff --git a/tensorflow/examples/how_tos/reading_data/fully_connected_reader.py b/tensorflow/examples/how_tos/reading_data/fully_connected_reader.py
index 461fb1c5173f66278eb585d30bd8749a58fb6245..307eede5c03780e9244b035f020fc7846290d4d9 100644
--- a/tensorflow/examples/how_tos/reading_data/fully_connected_reader.py
+++ b/tensorflow/examples/how_tos/reading_data/fully_connected_reader.py
@@ -1,4 +1,4 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -45,6 +45,7 @@ VALIDATION_FILE = 'validation.tfrecords'
 
 
 def decode(serialized_example):
+  """Parses an image and label from the given `serialized_example`."""
   features = tf.parse_single_example(
       serialized_example,
       # Defaults are not specified since both keys are required.
@@ -66,6 +67,7 @@ def decode(serialized_example):
 
 
 def augment(image, label):
+  """Placeholder for data augmentation."""
   # OPTIONAL: Could reshape into a 28x28 image and apply distortions
   # here.  Since we are not applying any distortions in this
   # example, and the next step expects the image to be flattened
@@ -74,9 +76,8 @@ def augment(image, label):
 
 
 def normalize(image, label):
-  # Convert from [0, 255] -> [-0.5, 0.5] floats.
+  """Converts `image` from [0, 255] to [-0.5, 0.5] floats."""
   image = tf.cast(image, tf.float32) * (1. / 255) - 0.5
-
   return image, label
 
 
@@ -106,18 +107,23 @@ def inputs(train, batch_size, num_epochs):
                           if train else VALIDATION_FILE)
 
   with tf.name_scope('input'):
-    # TFRecordDataset opens a protobuf and reads entries line by line
-    # could also be [list, of, filenames]
+    # TFRecordDataset opens a binary file and reads one record at a time.
+    # `filename` could also be a list of filenames, which will be read in order.
     dataset = tf.data.TFRecordDataset(filename)
-    dataset = dataset.repeat(num_epochs)
 
-    # map takes a python function and applies it to every sample
+    # The map transformation takes a function and applies it to every element
+    # of the dataset.
     dataset = dataset.map(decode)
     dataset = dataset.map(augment)
     dataset = dataset.map(normalize)
 
-    #the parameter is the queue size
+    # The shuffle transformation uses a finite-sized buffer to shuffle elements
+    # in memory. The parameter is the number of elements in the buffer. For
+    # completely uniform shuffling, set the parameter to be the same as the
+    # number of elements in the dataset.
     dataset = dataset.shuffle(1000 + 3 * batch_size)
+
+    dataset = dataset.repeat(num_epochs)
     dataset = dataset.batch(batch_size)
 
     iterator = dataset.make_one_shot_iterator()
@@ -153,7 +159,7 @@ def run_training():
       sess.run(init_op)
       try:
         step = 0
-        while True:  #train until OutOfRangeError
+        while True:  # Train until OutOfRangeError
           start_time = time.time()
 
           # Run one step of the model.  The return values are
diff --git a/tensorflow/examples/image_retraining/retrain.py b/tensorflow/examples/image_retraining/retrain.py
index c49e7e7ee2e397e353b468c727263ff3eb931401..99a71206acbd533ec8bc5a9644435eacad564cd4 100644
--- a/tensorflow/examples/image_retraining/retrain.py
+++ b/tensorflow/examples/image_retraining/retrain.py
@@ -75,13 +75,16 @@ python tensorflow/examples/image_retraining/retrain.py \
     --image_dir ~/flower_photos --architecture mobilenet_1.0_224
 ```
 
-Run quantized version of mobilenet:
+Run mobilenet, instrumented for quantization:
 
 ```bash
 python tensorflow/examples/image_retraining/retrain.py \
-    --image_dir ~/flower_photos/   --architecture mobilenet_1.0_224_quantized
+    --image_dir ~/flower_photos/   --architecture mobilenet_1.0_224_quant
 ```
 
+These instrumented models can be converted to fully quantized mobile models via
+TensorFlow Lite.
+
 There are 32 different Mobilenet models to choose from, with a variety of file
 size and latency options. The first number can be '1.0', '0.75', '0.50', or
 '0.25' to control the size, and the second controls the input image size, either
@@ -121,7 +124,6 @@ import numpy as np
 from six.moves import urllib
 import tensorflow as tf
 
-from tensorflow.contrib.quantize.python import quant_ops
 from tensorflow.python.framework import graph_util
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.platform import gfile
@@ -135,6 +137,9 @@ FLAGS = None
 # need to update these to reflect the values in the network you're using.
 MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1  # ~134M
 
+# The location where variable checkpoints will be stored.
+CHECKPOINT_NAME = '/tmp/_retrain_checkpoint'
+
 
 def create_image_lists(image_dir, testing_percentage, validation_percentage):
   """Builds a list of training images from the file system.
@@ -351,8 +356,8 @@ def maybe_download_and_extract(data_url):
     filepath, _ = urllib.request.urlretrieve(data_url, filepath, _progress)
     print()
     statinfo = os.stat(filepath)
-    tf.logging.info('Successfully downloaded %s %d bytes.',
-                    filename, statinfo.st_size)
+    tf.logging.info('Successfully downloaded %s %d bytes.', filename,
+                    statinfo.st_size)
     print('Extracting file from ', filepath)
     tarfile.open(filepath, 'r:gz').extractall(dest_directory)
   else:
@@ -745,9 +750,9 @@ def variable_summaries(var):
     tf.summary.histogram('histogram', var)
 
 
-def add_final_training_ops(class_count, final_tensor_name, bottleneck_tensor,
-                           bottleneck_tensor_size, quantize_layer):
-  """Adds a new softmax and fully-connected layer for training.
+def add_final_retrain_ops(class_count, final_tensor_name, bottleneck_tensor,
+                          bottleneck_tensor_size, quantize_layer, is_training):
+  """Adds a new softmax and fully-connected layer for training and eval.
 
   We need to retrain the top layer to identify our new classes, so this function
   adds the right operations to the graph, along with some variables to hold the
@@ -763,7 +768,9 @@ def add_final_training_ops(class_count, final_tensor_name, bottleneck_tensor,
     bottleneck_tensor: The output of the main CNN graph.
     bottleneck_tensor_size: How many entries in the bottleneck vector.
     quantize_layer: Boolean, specifying whether the newly added layer should be
-        quantized.
+        instrumented for quantization.
+    is_training: Boolean, specifying whether the newly added layer is for
+        training or eval.
 
   Returns:
     The tensors for the training and cross entropy results, and tensors for the
@@ -778,50 +785,41 @@ def add_final_training_ops(class_count, final_tensor_name, bottleneck_tensor,
     ground_truth_input = tf.placeholder(
         tf.int64, [None], name='GroundTruthInput')
 
-  # Organizing the following ops as `final_training_ops` so they're easier
-  # to see in TensorBoard
-  layer_name = 'final_training_ops'
+  # Organizing the following ops so they are easier to see in TensorBoard.
+  layer_name = 'final_retrain_ops'
   with tf.name_scope(layer_name):
     with tf.name_scope('weights'):
       initial_value = tf.truncated_normal(
           [bottleneck_tensor_size, class_count], stddev=0.001)
       layer_weights = tf.Variable(initial_value, name='final_weights')
-      if quantize_layer:
-        quantized_layer_weights = quant_ops.MovingAvgQuantize(
-            layer_weights, is_training=True)
-        variable_summaries(quantized_layer_weights)
-
       variable_summaries(layer_weights)
+
     with tf.name_scope('biases'):
       layer_biases = tf.Variable(tf.zeros([class_count]), name='final_biases')
-      if quantize_layer:
-        quantized_layer_biases = quant_ops.MovingAvgQuantize(
-            layer_biases, is_training=True)
-        variable_summaries(quantized_layer_biases)
-
       variable_summaries(layer_biases)
 
     with tf.name_scope('Wx_plus_b'):
-      if quantize_layer:
-        logits = tf.matmul(bottleneck_input,
-                           quantized_layer_weights) + quantized_layer_biases
-        logits = quant_ops.MovingAvgQuantize(
-            logits,
-            init_min=-32.0,
-            init_max=32.0,
-            is_training=True,
-            num_bits=8,
-            narrow_range=False,
-            ema_decay=0.5)
-        tf.summary.histogram('pre_activations', logits)
-      else:
-        logits = tf.matmul(bottleneck_input, layer_weights) + layer_biases
-        tf.summary.histogram('pre_activations', logits)
+      logits = tf.matmul(bottleneck_input, layer_weights) + layer_biases
+      tf.summary.histogram('pre_activations', logits)
 
   final_tensor = tf.nn.softmax(logits, name=final_tensor_name)
 
+  # The tf.contrib.quantize functions rewrite the graph in place for
+  # quantization. The imported model graph has already been rewritten, so upon
+  # calling these rewrites, only the newly added final layer will be
+  # transformed.
+  if quantize_layer:
+    if is_training:
+      tf.contrib.quantize.create_training_graph()
+    else:
+      tf.contrib.quantize.create_eval_graph()
+
   tf.summary.histogram('activations', final_tensor)
 
+  # If this is an eval graph, we don't need to add loss ops or an optimizer.
+  if not is_training:
+    return None, None, bottleneck_input, ground_truth_input, final_tensor
+
   with tf.name_scope('cross_entropy'):
     cross_entropy_mean = tf.losses.sparse_softmax_cross_entropy(
         labels=ground_truth_input, logits=logits)
@@ -857,13 +855,91 @@ def add_evaluation_step(result_tensor, ground_truth_tensor):
   return evaluation_step, prediction
 
 
-def save_graph_to_file(sess, graph, graph_file_name):
+def run_final_eval(sess, model_info, class_count, image_lists, jpeg_data_tensor,
+                   decoded_image_tensor, resized_image_tensor,
+                   bottleneck_tensor):
+  """Runs a final evaluation on an eval graph using the test data set.
+
+  Args:
+    sess: Session for the train graph.
+    model_info: Model info dictionary from create_model_info()
+    class_count: Number of classes
+    image_lists: Dictionary of training images for each label.
+    jpeg_data_tensor: The layer to feed jpeg image data into.
+    decoded_image_tensor: The output of decoding and resizing the image.
+    resized_image_tensor: The input node of the recognition graph.
+    bottleneck_tensor: The bottleneck output layer of the CNN graph.
+  """
+  (eval_sess, bottleneck_input, ground_truth_input, evaluation_step,
+   prediction) = build_eval_session(model_info, class_count)
+
+  test_bottlenecks, test_ground_truth, test_filenames = (
+      get_random_cached_bottlenecks(sess, image_lists, FLAGS.test_batch_size,
+                                    'testing', FLAGS.bottleneck_dir,
+                                    FLAGS.image_dir, jpeg_data_tensor,
+                                    decoded_image_tensor, resized_image_tensor,
+                                    bottleneck_tensor, FLAGS.architecture))
+  test_accuracy, predictions = eval_sess.run(
+      [evaluation_step, prediction],
+      feed_dict={
+          bottleneck_input: test_bottlenecks,
+          ground_truth_input: test_ground_truth
+      })
+  tf.logging.info('Final test accuracy = %.1f%% (N=%d)' %
+                  (test_accuracy * 100, len(test_bottlenecks)))
+
+  if FLAGS.print_misclassified_test_images:
+    tf.logging.info('=== MISCLASSIFIED TEST IMAGES ===')
+    for i, test_filename in enumerate(test_filenames):
+      if predictions[i] != test_ground_truth[i]:
+        tf.logging.info('%70s  %s' % (test_filename,
+                                      list(image_lists.keys())[predictions[i]]))
+
+
+def build_eval_session(model_info, class_count):
+  """Builds a restored eval session without train operations for exporting.
+
+  Args:
+    model_info: Model info dictionary from create_model_info()
+    class_count: Number of classes
+
+  Returns:
+    Eval session containing the restored eval graph.
+    The bottleneck input, ground truth, eval step, and prediction tensors.
+  """
+  # If quantized, we need to create the correct eval graph for exporting.
+  eval_graph, bottleneck_tensor, _ = create_model_graph(model_info)
+
+  eval_sess = tf.Session(graph=eval_graph)
+  with eval_graph.as_default():
+    # Add the new layer for exporting.
+    (_, _, bottleneck_input,
+     ground_truth_input, final_tensor) = add_final_retrain_ops(
+         class_count, FLAGS.final_tensor_name, bottleneck_tensor,
+         model_info['bottleneck_tensor_size'], model_info['quantize_layer'],
+         False)
+
+    # Now we need to restore the values from the training graph to the eval
+    # graph.
+    tf.train.Saver().restore(eval_sess, CHECKPOINT_NAME)
+
+    evaluation_step, prediction = add_evaluation_step(final_tensor,
+                                                      ground_truth_input)
+
+  return (eval_sess, bottleneck_input, ground_truth_input, evaluation_step,
+          prediction)
+
+
+def save_graph_to_file(graph, graph_file_name, model_info, class_count):
+  """Saves a graph to file, creating a valid quantized one if necessary."""
+  sess, _, _, _, _ = build_eval_session(model_info, class_count)
+  graph = sess.graph
+
   output_graph_def = graph_util.convert_variables_to_constants(
       sess, graph.as_graph_def(), [FLAGS.final_tensor_name])
 
   with gfile.FastGFile(graph_file_name, 'wb') as f:
     f.write(output_graph_def.SerializeToString())
-  return
 
 
 def prepare_file_system():
@@ -916,11 +992,10 @@ def create_model_info(architecture):
       return None
     version_string = parts[1]
     if (version_string != '1.0' and version_string != '0.75' and
-        version_string != '0.50' and version_string != '0.25'):
+        version_string != '0.5' and version_string != '0.25'):
       tf.logging.error(
-          """"The Mobilenet version should be '1.0', '0.75', '0.50', or '0.25',
-  but found '%s' for architecture '%s'""",
-          version_string, architecture)
+          """The Mobilenet version should be '1.0', '0.75', '0.5', or '0.25',
+  but found '%s' for architecture '%s'""", version_string, architecture)
       return None
     size_string = parts[2]
     if (size_string != '224' and size_string != '192' and
@@ -933,35 +1008,26 @@ def create_model_info(architecture):
     if len(parts) == 3:
       is_quantized = False
     else:
-      if parts[3] != 'quantized':
+      if parts[3] != 'quant':
         tf.logging.error(
             "Couldn't understand architecture suffix '%s' for '%s'", parts[3],
             architecture)
         return None
       is_quantized = True
 
+    data_url = 'http://download.tensorflow.org/models/mobilenet_v1_2018_02_22/'
+    model_name = 'mobilenet_v1_' + version_string + '_' + size_string
     if is_quantized:
-      data_url = 'http://download.tensorflow.org/models/mobilenet_v1_'
-      data_url += version_string + '_' + size_string + '_quantized_frozen.tgz'
-      bottleneck_tensor_name = 'MobilenetV1/Predictions/Reshape:0'
-      resized_input_tensor_name = 'Placeholder:0'
-      model_dir_name = ('mobilenet_v1_' + version_string + '_' + size_string +
-                        '_quantized_frozen')
-      model_base_name = 'quantized_frozen_graph.pb'
-
-    else:
-      data_url = 'http://download.tensorflow.org/models/mobilenet_v1_'
-      data_url += version_string + '_' + size_string + '_frozen.tgz'
-      bottleneck_tensor_name = 'MobilenetV1/Predictions/Reshape:0'
-      resized_input_tensor_name = 'input:0'
-      model_dir_name = 'mobilenet_v1_' + version_string + '_' + size_string
-      model_base_name = 'frozen_graph.pb'
+      model_name += '_quant'
+    data_url += model_name + '.tgz'
+    bottleneck_tensor_name = 'MobilenetV1/Predictions/Reshape:0'
+    resized_input_tensor_name = 'input:0'
+    model_file_name = model_name + '_frozen.pb'
 
     bottleneck_tensor_size = 1001
     input_width = int(size_string)
     input_height = int(size_string)
     input_depth = 3
-    model_file_name = os.path.join(model_dir_name, model_base_name)
     input_mean = 127.5
     input_std = 127.5
   else:
@@ -1011,44 +1077,45 @@ def add_jpeg_decoding(input_width, input_height, input_depth, input_mean,
   return jpeg_data, mul_image
 
 
-def export_model(sess, architecture, saved_model_dir):
+def export_model(model_info, class_count, saved_model_dir):
   """Exports model for serving.
 
   Args:
-    sess: Current active TensorFlow Session.
-    architecture: Model architecture.
+    model_info: The model info dictionary for the current model.
+    class_count: The number of classes.
     saved_model_dir: Directory in which to save exported model and variables.
   """
-  if architecture == 'inception_v3':
-    input_tensor = 'DecodeJpeg/contents:0'
-  elif architecture.startswith('mobilenet_'):
-    input_tensor = 'input:0'
-  else:
-    raise ValueError('Unknown architecture', architecture)
-  in_image = sess.graph.get_tensor_by_name(input_tensor)
-  inputs = {'image': tf.saved_model.utils.build_tensor_info(in_image)}
-
-  out_classes = sess.graph.get_tensor_by_name('final_result:0')
-  outputs = {'prediction': 
-             tf.saved_model.utils.build_tensor_info(out_classes)}
+  # The SavedModel should hold the eval graph.
+  sess, _, _, _, _ = build_eval_session(model_info, class_count)
+  graph = sess.graph
+  with graph.as_default():
+    input_tensor = model_info['resized_input_tensor_name']
+    in_image = sess.graph.get_tensor_by_name(input_tensor)
+    inputs = {'image': tf.saved_model.utils.build_tensor_info(in_image)}
+
+    out_classes = sess.graph.get_tensor_by_name('final_result:0')
+    outputs = {
+        'prediction': tf.saved_model.utils.build_tensor_info(out_classes)
+    }
 
-  signature = tf.saved_model.signature_def_utils.build_signature_def(
-      inputs=inputs,
-      outputs=outputs,
-      method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
+    signature = tf.saved_model.signature_def_utils.build_signature_def(
+        inputs=inputs,
+        outputs=outputs,
+        method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
 
-  legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
+    legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
 
-  # Save out the SavedModel.
-  builder = tf.saved_model.builder.SavedModelBuilder(saved_model_dir)
-  builder.add_meta_graph_and_variables(
-      sess, [tf.saved_model.tag_constants.SERVING],
-      signature_def_map={
-          tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: 
-          signature
-      },
-      legacy_init_op=legacy_init_op)
-  builder.save()
+    # Save out the SavedModel.
+    builder = tf.saved_model.builder.SavedModelBuilder(saved_model_dir)
+    builder.add_meta_graph_and_variables(
+        sess, [tf.saved_model.tag_constants.SERVING],
+        signature_def_map={
+            tf.saved_model.signature_constants.
+            DEFAULT_SERVING_SIGNATURE_DEF_KEY:
+                signature
+        },
+        legacy_init_op=legacy_init_op)
+    builder.save()
 
 
 def main(_):
@@ -1065,11 +1132,6 @@ def main(_):
     tf.logging.error('Did not recognize architecture flag')
     return -1
 
-  # Set up the pre-trained graph.
-  maybe_download_and_extract(model_info['data_url'])
-  graph, bottleneck_tensor, resized_image_tensor = (
-      create_model_graph(model_info))
-
   # Look at the folder structure, and create lists of all the images.
   image_lists = create_image_lists(FLAGS.image_dir, FLAGS.testing_percentage,
                                    FLAGS.validation_percentage)
@@ -1088,6 +1150,19 @@ def main(_):
       FLAGS.flip_left_right, FLAGS.random_crop, FLAGS.random_scale,
       FLAGS.random_brightness)
 
+  # Set up the pre-trained graph.
+  maybe_download_and_extract(model_info['data_url'])
+  graph, bottleneck_tensor, resized_image_tensor = (
+      create_model_graph(model_info))
+
+  # Add the new layer that we'll be training.
+  with graph.as_default():
+    (train_step, cross_entropy, bottleneck_input,
+     ground_truth_input, final_tensor) = add_final_retrain_ops(
+         class_count, FLAGS.final_tensor_name, bottleneck_tensor,
+         model_info['bottleneck_tensor_size'], model_info['quantize_layer'],
+         True)
+
   with tf.Session(graph=graph) as sess:
     # Set up the image decoding sub-graph.
     jpeg_data_tensor, decoded_image_tensor = add_jpeg_decoding(
@@ -1111,15 +1186,8 @@ def main(_):
                         decoded_image_tensor, resized_image_tensor,
                         bottleneck_tensor, FLAGS.architecture)
 
-    # Add the new layer that we'll be training.
-    (train_step, cross_entropy, bottleneck_input, ground_truth_input,
-     final_tensor) = add_final_training_ops(
-         len(image_lists.keys()), FLAGS.final_tensor_name, bottleneck_tensor,
-         model_info['bottleneck_tensor_size'], model_info['quantize_layer'])
-
     # Create the operations we need to evaluate the accuracy of our new layer.
-    evaluation_step, prediction = add_evaluation_step(
-        final_tensor, ground_truth_input)
+    evaluation_step, _ = add_evaluation_step(final_tensor, ground_truth_input)
 
     # Merge all the summaries and write them out to the summaries_dir
     merged = tf.summary.merge_all()
@@ -1129,6 +1197,10 @@ def main(_):
     validation_writer = tf.summary.FileWriter(
         FLAGS.summaries_dir + '/validation')
 
+    # Create a train saver that is used to restore values into an eval graph
+    # when exporting models.
+    train_saver = tf.train.Saver()
+
     # Set up all our weights to their initial default values.
     init = tf.global_variables_initializer()
     sess.run(init)
@@ -1169,6 +1241,9 @@ def main(_):
                         (datetime.now(), i, train_accuracy * 100))
         tf.logging.info('%s: Step %d: Cross entropy = %f' %
                         (datetime.now(), i, cross_entropy_value))
+        # TODO(suharshs): Make this use an eval graph, to avoid quantization
+        # moving averages being updated by the validation set, though in
+        # practice this makes a negligible difference.
         validation_bottlenecks, validation_ground_truth, _ = (
             get_random_cached_bottlenecks(
                 sess, image_lists, FLAGS.validation_batch_size, 'validation',
@@ -1191,42 +1266,32 @@ def main(_):
 
       if (intermediate_frequency > 0 and (i % intermediate_frequency == 0)
           and i > 0):
+        # If we want to do an intermediate save, save a checkpoint of the train
+        # graph, to restore into the eval graph.
+        train_saver.save(sess, CHECKPOINT_NAME)
         intermediate_file_name = (FLAGS.intermediate_output_graphs_dir +
                                   'intermediate_' + str(i) + '.pb')
         tf.logging.info('Save intermediate result to : ' +
                         intermediate_file_name)
-        save_graph_to_file(sess, graph, intermediate_file_name)
+        save_graph_to_file(graph, intermediate_file_name, model_info,
+                           class_count)
+
+    # After training is complete, force one last save of the train checkpoint.
+    train_saver.save(sess, CHECKPOINT_NAME)
 
     # We've completed all our training, so run a final test evaluation on
     # some new images we haven't used before.
-    test_bottlenecks, test_ground_truth, test_filenames = (
-        get_random_cached_bottlenecks(
-            sess, image_lists, FLAGS.test_batch_size, 'testing',
-            FLAGS.bottleneck_dir, FLAGS.image_dir, jpeg_data_tensor,
-            decoded_image_tensor, resized_image_tensor, bottleneck_tensor,
-            FLAGS.architecture))
-    test_accuracy, predictions = sess.run(
-        [evaluation_step, prediction],
-        feed_dict={bottleneck_input: test_bottlenecks,
-                   ground_truth_input: test_ground_truth})
-    tf.logging.info('Final test accuracy = %.1f%% (N=%d)' %
-                    (test_accuracy * 100, len(test_bottlenecks)))
-
-    if FLAGS.print_misclassified_test_images:
-      tf.logging.info('=== MISCLASSIFIED TEST IMAGES ===')
-      for i, test_filename in enumerate(test_filenames):
-        if predictions[i] != test_ground_truth[i]:
-          tf.logging.info('%70s  %s' %
-                          (test_filename,
-                           list(image_lists.keys())[predictions[i]]))
+    run_final_eval(sess, model_info, class_count, image_lists, jpeg_data_tensor,
+                   decoded_image_tensor, resized_image_tensor,
+                   bottleneck_tensor)
 
     # Write out the trained graph and labels with the weights stored as
     # constants.
-    save_graph_to_file(sess, graph, FLAGS.output_graph)
+    save_graph_to_file(graph, FLAGS.output_graph, model_info, class_count)
     with gfile.FastGFile(FLAGS.output_labels, 'w') as f:
       f.write('\n'.join(image_lists.keys()) + '\n')
 
-    export_model(sess, FLAGS.architecture, FLAGS.saved_model_dir)
+    export_model(model_info, class_count, FLAGS.saved_model_dir)
 
 
 if __name__ == '__main__':
@@ -1407,15 +1472,15 @@ if __name__ == '__main__':
       form 'mobilenet_<parameter size>_<input_size>[_quantized]'. For example,
       'mobilenet_1.0_224' will pick a model that is 17 MB in size and takes 224
       pixel input images, while 'mobilenet_0.25_128_quantized' will choose a much
-      less accurate, but smaller and faster network that's 920 KB on disk and
-      takes 128x128 images. See https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html
+      smaller and less accurate model, taking 128x128 images, and instrumented
+      for eventual quantization via TensorFlow Lite.
+      See https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html
       for more information on Mobilenet.\
       """)
   parser.add_argument(
       '--saved_model_dir',
       type=str,
       default='/tmp/saved_models/1/',
-      help='Where to save the exported graph.'
-  )
+      help='Where to save the exported graph.')
   FLAGS, unparsed = parser.parse_known_args()
   tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
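Reviewer note on the `_quantized` architectures described in the help text above: these graphs are instrumented with fake-quantization ops. The sketch below illustrates the basic arithmetic such an op performs on a single float (clamp to `[min; max]`, snap to an integer level, map back to float). It is a simplified illustration, not TensorFlow's kernel: the function name `fake_quantize` and its parameters are hypothetical, it ignores any adjustment TensorFlow applies to the `min`/`max` range, and it uses Python's built-in `round`, whose tie-breaking may differ from TensorFlow's.

```python
def fake_quantize(x, min_val, max_val, num_bits=8, narrow_range=False):
  """Clamps x to [min_val, max_val], snaps it to a level, and maps it back.

  Hypothetical helper for illustration only; not part of retrain.py.
  """
  # Quantization range: [0; 2^num_bits - 1], or [1; 2^num_bits - 1] when
  # narrow_range is set.
  quant_min = 1 if narrow_range else 0
  quant_max = 2 ** num_bits - 1
  # Width of one quantization step in float space.
  scale = (max_val - min_val) / float(quant_max - quant_min)
  clamped = min(max(x, min_val), max_val)
  # Number of whole steps above min_val, rounded to the nearest level.
  level = round((clamped - min_val) / scale)
  return min_val + level * scale


if __name__ == '__main__':
  for value in (-0.5, 0.4, 0.9, 2.0):
    print(value, '->', fake_quantize(value, 0.0, 1.0, num_bits=2))
```

With `num_bits=2` there are only four representable levels between `min_val` and `max_val`, which makes the snapping easy to see; values outside the range collapse onto the nearest endpoint.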
diff --git a/tensorflow/examples/image_retraining/retrain_test.py b/tensorflow/examples/image_retraining/retrain_test.py
index 8b8dd45fd72e3d29bdb7f6291cc53b912adf3644..fb7324c58ac1be60baad840207f31a61ec6182be 100644
--- a/tensorflow/examples/image_retraining/retrain_test.py
+++ b/tensorflow/examples/image_retraining/retrain_test.py
@@ -67,22 +67,52 @@ class ImageRetrainingTest(test_util.TensorFlowTestCase):
         self.assertIsNotNone(sess.graph.get_tensor_by_name('DistortResult:0'))
 
   @tf.test.mock.patch.object(retrain, 'FLAGS', learning_rate=0.01)
-  def testAddFinalTrainingOps(self, flags_mock):
+  def testAddFinalRetrainOps(self, flags_mock):
     with tf.Graph().as_default():
       with tf.Session() as sess:
         bottleneck = tf.placeholder(tf.float32, [1, 1024], name='bottleneck')
-        # Test creating final training op with quantization
-        retrain.add_final_training_ops(5, 'final', bottleneck, 1024, False)
+        # Test creating final training op without quantization.
+        retrain.add_final_retrain_ops(5, 'final', bottleneck, 1024, False,
+                                      False)
         self.assertIsNotNone(sess.graph.get_tensor_by_name('final:0'))
 
   @tf.test.mock.patch.object(retrain, 'FLAGS', learning_rate=0.01)
-  def testAddFinalTrainingOpsQuantized(self, flags_mock):
-    with tf.Graph().as_default():
+  def testAddFinalRetrainOpsQuantized(self, flags_mock):
+    # Ensure that the training and eval graph for quantized models are correctly
+    # created.
+    with tf.Graph().as_default() as g:
+      with tf.Session() as sess:
+        bottleneck = tf.placeholder(tf.float32, [1, 1024], name='bottleneck')
+        # Test creating final training op with quantization, set is_training to
+        # true.
+        retrain.add_final_retrain_ops(5, 'final', bottleneck, 1024, True, True)
+        self.assertIsNotNone(sess.graph.get_tensor_by_name('final:0'))
+        found_fake_quant = 0
+        for op in g.get_operations():
+          if op.type == 'FakeQuantWithMinMaxVars':
+            found_fake_quant += 1
+            # Ensure that each FakeQuant operation has 2 Assign operations among
+            # its inputs in the training graph (Assign[Min,Max]Last,
+            # Assign[Min,Max]Ema).
+            self.assertEqual(2,
+                             len([i for i in op.inputs if 'Assign' in i.name]))
+        self.assertEqual(found_fake_quant, 2)
+    with tf.Graph().as_default() as g:
       with tf.Session() as sess:
         bottleneck = tf.placeholder(tf.float32, [1, 1024], name='bottleneck')
-        # Test creating final training op with quantization
-        retrain.add_final_training_ops(5, 'final', bottleneck, 1024, True)
+        # Test creating final training op with quantization, set is_training to
+        # false.
+        retrain.add_final_retrain_ops(5, 'final', bottleneck, 1024, True, False)
         self.assertIsNotNone(sess.graph.get_tensor_by_name('final:0'))
+        found_fake_quant = 0
+        for op in g.get_operations():
+          if op.type == 'FakeQuantWithMinMaxVars':
+            found_fake_quant += 1
+            for i in op.inputs:
+              # Ensure that no input is an Assign operation, since this is the
+              # evaluation graph.
+              self.assertTrue('Assign' not in i.name)
+        self.assertEqual(found_fake_quant, 2)
 
   def testAddEvaluationStep(self):
     with tf.Graph().as_default():
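Reviewer note on `testAddFinalRetrainOpsQuantized` above: the test distinguishes the training graph from the eval graph by counting how many inputs of each `FakeQuantWithMinMaxVars` op have `Assign` in their name (2 when training, 0 when evaluating). The same check can be sketched without TensorFlow. Everything below is a stand-in for illustration: `FakeTensor`, `FakeOp`, `assign_input_counts`, and all tensor names are hypothetical and only model the attributes the test reads (`type`, `inputs`, `name`).

```python
import collections

# Hypothetical stand-ins for graph operations and their input tensors.
FakeTensor = collections.namedtuple('FakeTensor', ['name'])
FakeOp = collections.namedtuple('FakeOp', ['type', 'inputs'])


def assign_input_counts(ops, op_type='FakeQuantWithMinMaxVars'):
  """For each op of op_type, counts inputs whose name contains 'Assign'."""
  return [
      len([i for i in op.inputs if 'Assign' in i.name])
      for op in ops if op.type == op_type
  ]


# Illustrative graphs; the tensor names are invented for this sketch.
train_ops = [
    FakeOp('FakeQuantWithMinMaxVars', [
        FakeTensor('weights:0'),
        FakeTensor('AssignMinEma:0'),
        FakeTensor('AssignMaxEma:0'),
    ]),
    FakeOp('MatMul', [FakeTensor('bottleneck:0')]),
]
eval_ops = [
    FakeOp('FakeQuantWithMinMaxVars', [
        FakeTensor('weights:0'),
        FakeTensor('min/read:0'),
        FakeTensor('max/read:0'),
    ]),
]

if __name__ == '__main__':
  print('train:', assign_input_counts(train_ops))
  print('eval:', assign_input_counts(eval_ops))
```

Matching on the substring `'Assign'` mirrors the test's approach; it depends on op naming conventions, so a rename in the quantization rewriter would require updating the check.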
diff --git a/tensorflow/examples/speech_commands/label_wav_dir.py b/tensorflow/examples/speech_commands/label_wav_dir.py
index 2f305359e380e7192795851112c8261ea896c290..a34db512dda86be138e07a4ffaa1963fe00a5cea 100644
--- a/tensorflow/examples/speech_commands/label_wav_dir.py
+++ b/tensorflow/examples/speech_commands/label_wav_dir.py
@@ -32,8 +32,8 @@ from __future__ import division
 from __future__ import print_function
 
 import argparse
-import sys
 import glob
+import sys
 
 import tensorflow as tf
 
@@ -65,7 +65,7 @@ def run_graph(wav_dir, labels, input_layer_name, output_layer_name,
     #   predictions  will contain a two-dimensional array, where one
     #   dimension represents the input image count, and the other has
     #   predictions per class
-    for wav_path in glob.glob(wav_dir + "/*.wav"):
+    for wav_path in glob.glob(wav_dir + '/*.wav'):
       if not wav_path or not tf.gfile.Exists(wav_path):
         tf.logging.fatal('Audio file does not exist %s', wav_path)
 
diff --git a/tensorflow/go/op/wrappers.go b/tensorflow/go/op/wrappers.go
index 13f38dfb32a476477d306093bad6b56e1744a640..336df7c2f72ab1bbdc92ac00d097e04fdacc9fd5 100644
--- a/tensorflow/go/op/wrappers.go
+++ b/tensorflow/go/op/wrappers.go
@@ -278,174 +278,94 @@ func FakeQuantWithMinMaxVarsPerChannelGradient(scope *Scope, gradients tf.Output
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// Partitions `data` into `num_partitions` tensors using indices from `partitions`.
-//
-// For each index tuple `js` of size `partitions.ndim`, the slice `data[js, ...]`
-// becomes part of `outputs[partitions[js]]`.  The slices with `partitions[js] = i`
-// are placed in `outputs[i]` in lexicographic order of `js`, and the first
-// dimension of `outputs[i]` is the number of entries in `partitions` equal to `i`.
-// In detail,
-//
-// ```python
-//     outputs[i].shape = [sum(partitions == i)] + data.shape[partitions.ndim:]
-//
-//     outputs[i] = pack([data[js, ...] for js if partitions[js] == i])
-// ```
-//
-// `data.shape` must start with `partitions.shape`.
-//
-// For example:
-//
-// ```python
-//     # Scalar partitions.
-//     partitions = 1
-//     num_partitions = 2
-//     data = [10, 20]
-//     outputs[0] = []  # Empty with shape [0, 2]
-//     outputs[1] = [[10, 20]]
-//
-//     # Vector partitions.
-//     partitions = [0, 0, 1, 1, 0]
-//     num_partitions = 2
-//     data = [10, 20, 30, 40, 50]
-//     outputs[0] = [10, 20, 50]
-//     outputs[1] = [30, 40]
-// ```
-//
-// See `dynamic_stitch` for an example on how to merge partitions back.
-//
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicPartition.png" alt>
-// </div>
-//
-// Arguments:
-//
-//	partitions: Any shape.  Indices in the range `[0, num_partitions)`.
-//	num_partitions: The number of partitions to output.
-func DynamicPartition(scope *Scope, data tf.Output, partitions tf.Output, num_partitions int64) (outputs []tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"num_partitions": num_partitions}
-	opspec := tf.OpSpec{
-		Type: "DynamicPartition",
-		Input: []tf.Input{
-			data, partitions,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if outputs, idx, err = makeOutputList(op, idx, "outputs"); err != nil {
-		scope.UpdateErr("DynamicPartition", err)
-		return
-	}
-	return outputs
-}
-
-// MutableHashTableOfTensorsV2Attr is an optional argument to MutableHashTableOfTensorsV2.
-type MutableHashTableOfTensorsV2Attr func(optionalAttr)
-
-// MutableHashTableOfTensorsV2Container sets the optional container attribute to value.
-//
-// value: If non-empty, this table is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func MutableHashTableOfTensorsV2Container(value string) MutableHashTableOfTensorsV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
+// FakeQuantWithMinMaxVarsPerChannelAttr is an optional argument to FakeQuantWithMinMaxVarsPerChannel.
+type FakeQuantWithMinMaxVarsPerChannelAttr func(optionalAttr)
 
-// MutableHashTableOfTensorsV2SharedName sets the optional shared_name attribute to value.
-//
-// value: If non-empty, this table is shared under the given name across
-// multiple sessions.
-// If not specified, defaults to ""
-func MutableHashTableOfTensorsV2SharedName(value string) MutableHashTableOfTensorsV2Attr {
+// FakeQuantWithMinMaxVarsPerChannelNumBits sets the optional num_bits attribute to value.
+// If not specified, defaults to 8
+func FakeQuantWithMinMaxVarsPerChannelNumBits(value int64) FakeQuantWithMinMaxVarsPerChannelAttr {
 	return func(m optionalAttr) {
-		m["shared_name"] = value
+		m["num_bits"] = value
 	}
 }
 
-// MutableHashTableOfTensorsV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
+// FakeQuantWithMinMaxVarsPerChannelNarrowRange sets the optional narrow_range attribute to value.
 // If not specified, defaults to false
-func MutableHashTableOfTensorsV2UseNodeNameSharing(value bool) MutableHashTableOfTensorsV2Attr {
-	return func(m optionalAttr) {
-		m["use_node_name_sharing"] = value
-	}
-}
-
-// MutableHashTableOfTensorsV2ValueShape sets the optional value_shape attribute to value.
-// If not specified, defaults to <>
-func MutableHashTableOfTensorsV2ValueShape(value tf.Shape) MutableHashTableOfTensorsV2Attr {
+func FakeQuantWithMinMaxVarsPerChannelNarrowRange(value bool) FakeQuantWithMinMaxVarsPerChannelAttr {
 	return func(m optionalAttr) {
-		m["value_shape"] = value
+		m["narrow_range"] = value
 	}
 }
 
-// Creates an empty hash table.
+// Fake-quantize the 'inputs' tensor of type float and one of the shapes: `[d]`,
 //
-// This op creates a mutable hash table, specifying the type of its keys and
-// values. Each value must be a vector. Data can be inserted into the table using
-// the insert operations. It does not support the initialization operation.
+// `[b, d]` `[b, h, w, d]` via per-channel floats `min` and `max` of shape `[d]`
+// to 'outputs' tensor of same shape as `inputs`.
 //
-// Arguments:
-//	key_dtype: Type of the table keys.
-//	value_dtype: Type of the table values.
+// `[min; max]` define the clamping range for the `inputs` data.
+// `inputs` values are quantized into the quantization range (`[0; 2^num_bits - 1]`
+// when `narrow_range` is false and `[1; 2^num_bits - 1]` when it is true) and
+// then de-quantized and output as floats in `[min; max]` interval.
+// `num_bits` is the bitwidth of the quantization; between 2 and 8, inclusive.
 //
-// Returns Handle to a table.
-func MutableHashTableOfTensorsV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...MutableHashTableOfTensorsV2Attr) (table_handle tf.Output) {
+// This operation has a gradient and thus allows for training `min` and `max`
+// values.
+func FakeQuantWithMinMaxVarsPerChannel(scope *Scope, inputs tf.Output, min tf.Output, max tf.Output, optional ...FakeQuantWithMinMaxVarsPerChannelAttr) (outputs tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
+	attrs := map[string]interface{}{}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MutableHashTableOfTensorsV2",
-
+		Type: "FakeQuantWithMinMaxVarsPerChannel",
+		Input: []tf.Input{
+			inputs, min, max,
+		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceApplyProximalAdagradAttr is an optional argument to ResourceApplyProximalAdagrad.
-type ResourceApplyProximalAdagradAttr func(optionalAttr)
+// FakeQuantWithMinMaxVarsGradientAttr is an optional argument to FakeQuantWithMinMaxVarsGradient.
+type FakeQuantWithMinMaxVarsGradientAttr func(optionalAttr)
 
-// ResourceApplyProximalAdagradUseLocking sets the optional use_locking attribute to value.
+// FakeQuantWithMinMaxVarsGradientNumBits sets the optional num_bits attribute to value.
 //
-// value: If True, updating of the var and accum tensors will be protected by
-// a lock; otherwise the behavior is undefined, but may exhibit less contention.
-// If not specified, defaults to false
-func ResourceApplyProximalAdagradUseLocking(value bool) ResourceApplyProximalAdagradAttr {
+// value: The bitwidth of the quantization; between 2 and 8, inclusive.
+// If not specified, defaults to 8
+func FakeQuantWithMinMaxVarsGradientNumBits(value int64) FakeQuantWithMinMaxVarsGradientAttr {
 	return func(m optionalAttr) {
-		m["use_locking"] = value
+		m["num_bits"] = value
 	}
 }
 
-// Update '*var' and '*accum' according to FOBOS with Adagrad learning rate.
+// FakeQuantWithMinMaxVarsGradientNarrowRange sets the optional narrow_range attribute to value.
 //
-// accum += grad * grad
-// prox_v = var - lr * grad * (1 / sqrt(accum))
-// var = sign(prox_v)/(1+lr*l2) * max{|prox_v|-lr*l1,0}
+// value: Whether to quantize into 2^num_bits - 1 distinct values.
+// If not specified, defaults to false
+func FakeQuantWithMinMaxVarsGradientNarrowRange(value bool) FakeQuantWithMinMaxVarsGradientAttr {
+	return func(m optionalAttr) {
+		m["narrow_range"] = value
+	}
+}
+
+// Compute gradients for a FakeQuantWithMinMaxVars operation.
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	l1: L1 regularization. Must be a scalar.
-//	l2: L2 regularization. Must be a scalar.
-//	grad: The gradient.
+//	gradients: Backpropagated gradients above the FakeQuantWithMinMaxVars operation.
+//	inputs: Values passed as inputs to the FakeQuantWithMinMaxVars operation.
+// min, max: Quantization interval, scalar floats.
 //
-// Returns the created operation.
-func ResourceApplyProximalAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, grad tf.Output, optional ...ResourceApplyProximalAdagradAttr) (o *tf.Operation) {
+//
+//
+// Returns Backpropagated gradients w.r.t. inputs:
+// `gradients * (inputs >= min && inputs <= max)`. Backpropagated gradients w.r.t. min parameter:
+// `sum(gradients * (inputs < min))`. Backpropagated gradients w.r.t. max parameter:
+// `sum(gradients * (inputs > max))`.
+func FakeQuantWithMinMaxVarsGradient(scope *Scope, gradients tf.Output, inputs tf.Output, min tf.Output, max tf.Output, optional ...FakeQuantWithMinMaxVarsGradientAttr) (backprops_wrt_input tf.Output, backprop_wrt_min tf.Output, backprop_wrt_max tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -454,13 +374,14 @@ func ResourceApplyProximalAdagrad(scope *Scope, var_ tf.Output, accum tf.Output,
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceApplyProximalAdagrad",
+		Type: "FakeQuantWithMinMaxVarsGradient",
 		Input: []tf.Input{
-			var_, accum, lr, l1, l2, grad,
+			gradients, inputs, min, max,
 		},
 		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1), op.Output(2)
 }
 
 // MutableHashTableV2Attr is an optional argument to MutableHashTableV2.
@@ -527,50 +448,74 @@ func MutableHashTableV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.Data
 	return op.Output(0)
 }
 
-// MapUnstageNoKeyAttr is an optional argument to MapUnstageNoKey.
-type MapUnstageNoKeyAttr func(optionalAttr)
+// Replaces the contents of the table with the specified keys and values.
+//
+// The tensor `keys` must be of the same type as the keys of the table.
+// The tensor `values` must be of the type of the table values.
+//
+// Arguments:
+//	table_handle: Handle to the table.
+//	keys: Any shape.  Keys to look up.
+//	values: Values to associate with keys.
+//
+// Returns the created operation.
+func LookupTableImportV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "LookupTableImportV2",
+		Input: []tf.Input{
+			table_handle, keys, values,
+		},
+	}
+	return scope.AddOperation(opspec)
+}
 
-// MapUnstageNoKeyCapacity sets the optional capacity attribute to value.
+// MapPeekAttr is an optional argument to MapPeek.
+type MapPeekAttr func(optionalAttr)
+
+// MapPeekCapacity sets the optional capacity attribute to value.
 // If not specified, defaults to 0
 //
 // REQUIRES: value >= 0
-func MapUnstageNoKeyCapacity(value int64) MapUnstageNoKeyAttr {
+func MapPeekCapacity(value int64) MapPeekAttr {
 	return func(m optionalAttr) {
 		m["capacity"] = value
 	}
 }
 
-// MapUnstageNoKeyMemoryLimit sets the optional memory_limit attribute to value.
+// MapPeekMemoryLimit sets the optional memory_limit attribute to value.
 // If not specified, defaults to 0
 //
 // REQUIRES: value >= 0
-func MapUnstageNoKeyMemoryLimit(value int64) MapUnstageNoKeyAttr {
+func MapPeekMemoryLimit(value int64) MapPeekAttr {
 	return func(m optionalAttr) {
 		m["memory_limit"] = value
 	}
 }
 
-// MapUnstageNoKeyContainer sets the optional container attribute to value.
+// MapPeekContainer sets the optional container attribute to value.
 // If not specified, defaults to ""
-func MapUnstageNoKeyContainer(value string) MapUnstageNoKeyAttr {
+func MapPeekContainer(value string) MapPeekAttr {
 	return func(m optionalAttr) {
 		m["container"] = value
 	}
 }
 
-// MapUnstageNoKeySharedName sets the optional shared_name attribute to value.
+// MapPeekSharedName sets the optional shared_name attribute to value.
 // If not specified, defaults to ""
-func MapUnstageNoKeySharedName(value string) MapUnstageNoKeyAttr {
+func MapPeekSharedName(value string) MapPeekAttr {
 	return func(m optionalAttr) {
 		m["shared_name"] = value
 	}
 }
 
-// Op removes and returns a random (key, value)
+// Op peeks at the values at the specified key.  If the
 //
-// from the underlying container.   If the underlying container
-// does not contain elements, the op will block until it does.
-func MapUnstageNoKey(scope *Scope, indices tf.Output, dtypes []tf.DataType, optional ...MapUnstageNoKeyAttr) (key tf.Output, values []tf.Output) {
+// underlying container does not contain this key
+// this op will block until it does.
+func MapPeek(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...MapPeekAttr) (values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -579,9 +524,9 @@ func MapUnstageNoKey(scope *Scope, indices tf.Output, dtypes []tf.DataType, opti
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MapUnstageNoKey",
+		Type: "MapPeek",
 		Input: []tf.Input{
-			indices,
+			key, indices,
 		},
 		Attrs: attrs,
 	}
@@ -591,171 +536,11 @@ func MapUnstageNoKey(scope *Scope, indices tf.Output, dtypes []tf.DataType, opti
 	}
 	var idx int
 	var err error
-	key = op.Output(idx)
 	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("MapUnstageNoKey", err)
+		scope.UpdateErr("MapPeek", err)
 		return
 	}
-	return key, values
-}
-
-// HashTableV2Attr is an optional argument to HashTableV2.
-type HashTableV2Attr func(optionalAttr)
-
-// HashTableV2Container sets the optional container attribute to value.
-//
-// value: If non-empty, this table is placed in the given container.
-// Otherwise, a default container is used.
-// If not specified, defaults to ""
-func HashTableV2Container(value string) HashTableV2Attr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// HashTableV2SharedName sets the optional shared_name attribute to value.
-//
-// value: If non-empty, this table is shared under the given name across
-// multiple sessions.
-// If not specified, defaults to ""
-func HashTableV2SharedName(value string) HashTableV2Attr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// HashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
-//
-// value: If true and shared_name is empty, the table is shared
-// using the node name.
-// If not specified, defaults to false
-func HashTableV2UseNodeNameSharing(value bool) HashTableV2Attr {
-	return func(m optionalAttr) {
-		m["use_node_name_sharing"] = value
-	}
-}
-
-// Creates a non-initialized hash table.
-//
-// This op creates a hash table, specifying the type of its keys and values.
-// Before using the table you will have to initialize it.  After initialization the
-// table will be immutable.
-//
-// Arguments:
-//	key_dtype: Type of the table keys.
-//	value_dtype: Type of the table values.
-//
-// Returns Handle to a table.
-func HashTableV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...HashTableV2Attr) (table_handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "HashTableV2",
-
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Replaces the contents of the table with the specified keys and values.
-//
-// The tensor `keys` must be of the same type as the keys of the table.
-// The tensor `values` must be of the type of the table values.
-//
-// Arguments:
-//	table_handle: Handle to the table.
-//	keys: Any shape.  Keys to look up.
-//	values: Values to associate with keys.
-//
-// Returns the created operation.
-func LookupTableImportV2(scope *Scope, table_handle tf.Output, keys tf.Output, values tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "LookupTableImportV2",
-		Input: []tf.Input{
-			table_handle, keys, values,
-		},
-	}
-	return scope.AddOperation(opspec)
-}
-
-// MapPeekAttr is an optional argument to MapPeek.
-type MapPeekAttr func(optionalAttr)
-
-// MapPeekCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func MapPeekCapacity(value int64) MapPeekAttr {
-	return func(m optionalAttr) {
-		m["capacity"] = value
-	}
-}
-
-// MapPeekMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func MapPeekMemoryLimit(value int64) MapPeekAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// MapPeekContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func MapPeekContainer(value string) MapPeekAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// MapPeekSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func MapPeekSharedName(value string) MapPeekAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Op peeks at the values at the specified key.  If the
-//
-// underlying container does not contain this key
-// this op will block until it does.
-func MapPeek(scope *Scope, key tf.Output, indices tf.Output, dtypes []tf.DataType, optional ...MapPeekAttr) (values []tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "MapPeek",
-		Input: []tf.Input{
-			key, indices,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	if scope.Err() != nil {
-		return
-	}
-	var idx int
-	var err error
-	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
-		scope.UpdateErr("MapPeek", err)
-		return
-	}
-	return values
+	return values
 }
 
 // Returns (x - y)(x - y) element-wise.
@@ -1644,61 +1429,6 @@ func Igammac(scope *Scope, a tf.Output, x tf.Output) (z tf.Output) {
 	return op.Output(0)
 }
 
-// FakeQuantWithMinMaxVarsGradientAttr is an optional argument to FakeQuantWithMinMaxVarsGradient.
-type FakeQuantWithMinMaxVarsGradientAttr func(optionalAttr)
-
-// FakeQuantWithMinMaxVarsGradientNumBits sets the optional num_bits attribute to value.
-//
-// value: The bitwidth of the quantization; between 2 and 8, inclusive.
-// If not specified, defaults to 8
-func FakeQuantWithMinMaxVarsGradientNumBits(value int64) FakeQuantWithMinMaxVarsGradientAttr {
-	return func(m optionalAttr) {
-		m["num_bits"] = value
-	}
-}
-
-// FakeQuantWithMinMaxVarsGradientNarrowRange sets the optional narrow_range attribute to value.
-//
-// value: Whether to quantize into 2^num_bits - 1 distinct values.
-// If not specified, defaults to false
-func FakeQuantWithMinMaxVarsGradientNarrowRange(value bool) FakeQuantWithMinMaxVarsGradientAttr {
-	return func(m optionalAttr) {
-		m["narrow_range"] = value
-	}
-}
-
-// Compute gradients for a FakeQuantWithMinMaxVars operation.
-//
-// Arguments:
-//	gradients: Backpropagated gradients above the FakeQuantWithMinMaxVars operation.
-//	inputs: Values passed as inputs to the FakeQuantWithMinMaxVars operation.
-// min, max: Quantization interval, scalar floats.
-//
-//
-//
-// Returns Backpropagated gradients w.r.t. inputs:
-// `gradients * (inputs >= min && inputs <= max)`.Backpropagated gradients w.r.t. min parameter:
-// `sum(gradients * (inputs < min))`.Backpropagated gradients w.r.t. max parameter:
-// `sum(gradients * (inputs > max))`.
-func FakeQuantWithMinMaxVarsGradient(scope *Scope, gradients tf.Output, inputs tf.Output, min tf.Output, max tf.Output, optional ...FakeQuantWithMinMaxVarsGradientAttr) (backprops_wrt_input tf.Output, backprop_wrt_min tf.Output, backprop_wrt_max tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "FakeQuantWithMinMaxVarsGradient",
-		Input: []tf.Input{
-			gradients, inputs, min, max,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1), op.Output(2)
-}
-
 // LogUniformCandidateSamplerAttr is an optional argument to LogUniformCandidateSampler.
 type LogUniformCandidateSamplerAttr func(optionalAttr)
 
@@ -2429,26 +2159,6 @@ func ReaderNumWorkUnitsCompletedV2(scope *Scope, reader_handle tf.Output) (units
 	return op.Output(0)
 }
 
-// Returns x / y element-wise for real types.
-//
-// If `x` and `y` are reals, this will return the floating-point division.
-//
-// *NOTE*: `Div` supports broadcasting. More about broadcasting
-// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
-func RealDiv(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "RealDiv",
-		Input: []tf.Input{
-			x, y,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // Computes the log of the absolute value of `Gamma(x)` element-wise.
 func Lgamma(scope *Scope, x tf.Output) (y tf.Output) {
 	if scope.Err() != nil {
@@ -4456,59 +4166,31 @@ func MaxPoolGradGradV2(scope *Scope, orig_input tf.Output, orig_output tf.Output
 	return op.Output(0)
 }
 
-// Fast Fourier transform.
+// MaxPoolAttr is an optional argument to MaxPool.
+type MaxPoolAttr func(optionalAttr)
+
+// MaxPoolDataFormat sets the optional data_format attribute to value.
 //
-// Computes the 1-dimensional discrete Fourier transform over the inner-most
-// dimension of `input`.
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", in which case the data is stored in the order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func MaxPoolDataFormat(value string) MaxPoolAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Performs max pooling on the input.
 //
 // Arguments:
-//	input: A complex64 tensor.
-//
-// Returns A complex64 tensor of the same shape as `input`. The inner-most
-//   dimension of `input` is replaced with its 1D Fourier transform.
-//
-// @compatibility(numpy)
-// Equivalent to np.fft.fft
-// @end_compatibility
-func FFT(scope *Scope, input tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "FFT",
-		Input: []tf.Input{
-			input,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// MaxPoolAttr is an optional argument to MaxPool.
-type MaxPoolAttr func(optionalAttr)
-
-// MaxPoolDataFormat sets the optional data_format attribute to value.
-//
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func MaxPoolDataFormat(value string) MaxPoolAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// Performs max pooling on the input.
-//
-// Arguments:
-//	input: 4-D input to pool over.
-//	ksize: The size of the window for each dimension of the input tensor.
-//	strides: The stride of the sliding window for each dimension of the
-// input tensor.
-//	padding: The type of padding algorithm to use.
+//	input: 4-D input to pool over.
+//	ksize: The size of the window for each dimension of the input tensor.
+//	strides: The stride of the sliding window for each dimension of the
+// input tensor.
+//	padding: The type of padding algorithm to use.
 //
 // Returns The max pooled output tensor.
 func MaxPool(scope *Scope, input tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPoolAttr) (output tf.Output) {
@@ -4597,47 +4279,6 @@ func MaxPoolGradWithArgmax(scope *Scope, input tf.Output, grad tf.Output, argmax
 	return op.Output(0)
 }
 
-// CriticalSectionOpAttr is an optional argument to CriticalSectionOp.
-type CriticalSectionOpAttr func(optionalAttr)
-
-// CriticalSectionOpContainer sets the optional container attribute to value.
-//
-// value: the container this critical section is placed in.
-// If not specified, defaults to ""
-func CriticalSectionOpContainer(value string) CriticalSectionOpAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// CriticalSectionOpSharedName sets the optional shared_name attribute to value.
-//
-// value: the name by which this critical section is referred to.
-// If not specified, defaults to ""
-func CriticalSectionOpSharedName(value string) CriticalSectionOpAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Creates a handle to a CriticalSection resource.
-func CriticalSectionOp(scope *Scope, optional ...CriticalSectionOpAttr) (resource tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "CriticalSectionOp",
-
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // FakeQuantWithMinMaxArgsGradientAttr is an optional argument to FakeQuantWithMinMaxArgsGradient.
 type FakeQuantWithMinMaxArgsGradientAttr func(optionalAttr)
 
@@ -5005,6 +4646,78 @@ func DepthwiseConv2dNative(scope *Scope, input tf.Output, filter tf.Output, stri
 	return op.Output(0)
 }
 
+// MaxPoolGradV2Attr is an optional argument to MaxPoolGradV2.
+type MaxPoolGradV2Attr func(optionalAttr)
+
+// MaxPoolGradV2DataFormat sets the optional data_format attribute to value.
+//
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", in which case the data is stored in the order of:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func MaxPoolGradV2DataFormat(value string) MaxPoolGradV2Attr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Computes gradients of the maxpooling function.
+//
+// Arguments:
+//	orig_input: The original input tensor.
+//	orig_output: The original output tensor.
+//	grad: 4-D.  Gradients w.r.t. the output of `max_pool`.
+//	ksize: The size of the window for each dimension of the input tensor.
+//	strides: The stride of the sliding window for each dimension of the
+// input tensor.
+//	padding: The type of padding algorithm to use.
+//
+// Returns Gradients w.r.t. the input to `max_pool`.
+func MaxPoolGradV2(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize tf.Output, strides tf.Output, padding string, optional ...MaxPoolGradV2Attr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "MaxPoolGradV2",
+		Input: []tf.Input{
+			orig_input, orig_output, grad, ksize, strides,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Restore a reader to a previously saved state.
+//
+// Not all Readers support being restored, so this can produce an
+// Unimplemented error.
+//
+// Arguments:
+//	reader_handle: Handle to a Reader.
+//	state: Result of a ReaderSerializeState of a Reader with type
+// matching reader_handle.
+//
+// Returns the created operation.
+func ReaderRestoreStateV2(scope *Scope, reader_handle tf.Output, state tf.Output) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "ReaderRestoreStateV2",
+		Input: []tf.Input{
+			reader_handle, state,
+		},
+	}
+	return scope.AddOperation(opspec)
+}
+
 // TensorArrayGatherV3Attr is an optional argument to TensorArrayGatherV3.
 type TensorArrayGatherV3Attr func(optionalAttr)
 
@@ -5677,111 +5390,6 @@ func QuantizedReluX(scope *Scope, features tf.Output, max_value tf.Output, min_f
 	return op.Output(0), op.Output(1), op.Output(2)
 }
 
-// SummaryWriterAttr is an optional argument to SummaryWriter.
-type SummaryWriterAttr func(optionalAttr)
-
-// SummaryWriterSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func SummaryWriterSharedName(value string) SummaryWriterAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// SummaryWriterContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func SummaryWriterContainer(value string) SummaryWriterAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// Returns a handle to be used to access a summary writer.
-//
-// The summary writer is an in-graph resource which can be used by ops to write
-// summaries to event files.
-//
-// Returns the summary writer resource. Scalar handle.
-func SummaryWriter(scope *Scope, optional ...SummaryWriterAttr) (writer tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "SummaryWriter",
-
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes gradients for SparseSegmentMean.
-//
-// Returns tensor "output" with same shape as grad, except for dimension 0 whose
-// value is output_dim0.
-//
-// Arguments:
-//	grad: gradient propagated to the SparseSegmentMean op.
-//	indices: indices passed to the corresponding SparseSegmentMean op.
-//	segment_ids: segment_ids passed to the corresponding SparseSegmentMean op.
-//	output_dim0: dimension 0 of "data" passed to SparseSegmentMean op.
-func SparseSegmentMeanGrad(scope *Scope, grad tf.Output, indices tf.Output, segment_ids tf.Output, output_dim0 tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SparseSegmentMeanGrad",
-		Input: []tf.Input{
-			grad, indices, segment_ids, output_dim0,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Applies softmax to a batched N-D `SparseTensor`.
-//
-// The inputs represent an N-D SparseTensor  with logical shape `[..., B, C]`
-// (where `N >= 2`), and with indices sorted in the canonical lexicographic order.
-//
-// This op is equivalent to applying the normal `tf.nn.softmax()` to each innermost
-// logical submatrix with shape `[B, C]`, but with the catch that *the implicitly
-// zero elements do not participate*.  Specifically, the algorithm is equivalent
-// to the following:
-//
-//   (1) Applies `tf.nn.softmax()` to a densified view of each innermost submatrix
-//       with shape `[B, C]`, along the size-C dimension;
-//   (2) Masks out the original implicitly-zero locations;
-//   (3) Renormalizes the remaining elements.
-//
-// Hence, the `SparseTensor` result has exactly the same non-zero indices and
-// shape.
-//
-// Arguments:
-//	sp_indices: 2-D.  `NNZ x R` matrix with the indices of non-empty values in a
-// SparseTensor, in canonical ordering.
-//	sp_values: 1-D.  `NNZ` non-empty values corresponding to `sp_indices`.
-//	sp_shape: 1-D.  Shape of the input SparseTensor.
-//
-// Returns 1-D.  The `NNZ` values for the result `SparseTensor`.
-func SparseSoftmax(scope *Scope, sp_indices tf.Output, sp_values tf.Output, sp_shape tf.Output) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "SparseSoftmax",
-		Input: []tf.Input{
-			sp_indices, sp_values, sp_shape,
-		},
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // RandomPoissonAttr is an optional argument to RandomPoisson.
 type RandomPoissonAttr func(optionalAttr)
 
@@ -5823,78 +5431,6 @@ func RandomPoisson(scope *Scope, shape tf.Output, rate tf.Output, optional ...Ra
 	return op.Output(0)
 }
 
-// MaxPoolGradV2Attr is an optional argument to MaxPoolGradV2.
-type MaxPoolGradV2Attr func(optionalAttr)
-
-// MaxPoolGradV2DataFormat sets the optional data_format attribute to value.
-//
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func MaxPoolGradV2DataFormat(value string) MaxPoolGradV2Attr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// Computes gradients of the maxpooling function.
-//
-// Arguments:
-//	orig_input: The original input tensor.
-//	orig_output: The original output tensor.
-//	grad: 4-D.  Gradients w.r.t. the output of `max_pool`.
-//	ksize: The size of the window for each dimension of the input tensor.
-//	strides: The stride of the sliding window for each dimension of the
-// input tensor.
-//	padding: The type of padding algorithm to use.
-//
-// Returns Gradients w.r.t. the input to `max_pool`.
-func MaxPoolGradV2(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize tf.Output, strides tf.Output, padding string, optional ...MaxPoolGradV2Attr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"padding": padding}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "MaxPoolGradV2",
-		Input: []tf.Input{
-			orig_input, orig_output, grad, ksize, strides,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Restore a reader to a previously saved state.
-//
-// Not all Readers support being restored, so this can produce an
-// Unimplemented error.
-//
-// Arguments:
-//	reader_handle: Handle to a Reader.
-//	state: Result of a ReaderSerializeState of a Reader with type
-// matching reader_handle.
-//
-// Returns the created operation.
-func ReaderRestoreStateV2(scope *Scope, reader_handle tf.Output, state tf.Output) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "ReaderRestoreStateV2",
-		Input: []tf.Input{
-			reader_handle, state,
-		},
-	}
-	return scope.AddOperation(opspec)
-}
-
 // ResourceSparseApplyFtrlV2Attr is an optional argument to ResourceSparseApplyFtrlV2.
 type ResourceSparseApplyFtrlV2Attr func(optionalAttr)
 
@@ -7132,67 +6668,32 @@ func DestroyResourceOp(scope *Scope, resource tf.Output, optional ...DestroyReso
 	return scope.AddOperation(opspec)
 }
 
-// CumprodAttr is an optional argument to Cumprod.
-type CumprodAttr func(optionalAttr)
+// SummaryWriterAttr is an optional argument to SummaryWriter.
+type SummaryWriterAttr func(optionalAttr)
 
-// CumprodExclusive sets the optional exclusive attribute to value.
-//
-// value: If `True`, perform exclusive cumprod.
-// If not specified, defaults to false
-func CumprodExclusive(value bool) CumprodAttr {
+// SummaryWriterSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func SummaryWriterSharedName(value string) SummaryWriterAttr {
 	return func(m optionalAttr) {
-		m["exclusive"] = value
+		m["shared_name"] = value
 	}
 }
 
-// CumprodReverse sets the optional reverse attribute to value.
-//
-// value: A `bool` (default: False).
-// If not specified, defaults to false
-func CumprodReverse(value bool) CumprodAttr {
+// SummaryWriterContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func SummaryWriterContainer(value string) SummaryWriterAttr {
 	return func(m optionalAttr) {
-		m["reverse"] = value
+		m["container"] = value
 	}
 }
 
-// Compute the cumulative product of the tensor `x` along `axis`.
-//
-// By default, this op performs an inclusive cumprod, which means that the first
-// element of the input is identical to the first element of the output:
-//
-// ```python
-// tf.cumprod([a, b, c])  # => [a, a * b, a * b * c]
-// ```
-//
-// By setting the `exclusive` kwarg to `True`, an exclusive cumprod is
-// performed instead:
-//
-// ```python
-// tf.cumprod([a, b, c], exclusive=True)  # => [1, a, a * b]
-// ```
-//
-// By setting the `reverse` kwarg to `True`, the cumprod is performed in the
-// opposite direction:
-//
-// ```python
-// tf.cumprod([a, b, c], reverse=True)  # => [a * b * c, b * c, c]
-// ```
-//
-// This is more efficient than using separate `tf.reverse` ops.
-//
-// The `reverse` and `exclusive` kwargs can also be combined:
+// Returns a handle to be used to access a summary writer.
 //
-// ```python
-// tf.cumprod([a, b, c], exclusive=True, reverse=True)  # => [b * c, c, 1]
-// ```
+// The summary writer is an in-graph resource which can be used by ops to write
+// summaries to event files.
 //
-// Arguments:
-//	x: A `Tensor`. Must be one of the following types: `float32`, `float64`,
-// `int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,
-// `complex128`, `qint8`, `quint8`, `qint32`, `half`.
-//	axis: A `Tensor` of type `int32` (default: 0). Must be in the range
-// `[-rank(x), rank(x))`.
-func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr) (out tf.Output) {
+// Returns the summary writer resource. Scalar handle.
+func SummaryWriter(scope *Scope, optional ...SummaryWriterAttr) (writer tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -7201,230 +6702,252 @@ func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr)
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "Cumprod",
-		Input: []tf.Input{
-			x, axis,
-		},
+		Type: "SummaryWriter",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// Computes the mean along segments of a tensor.
-//
-// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
-// segments.
-//
-// Computes a tensor such that
-// \\(output_i = \frac{\sum_j data_j}{N}\\) where `mean` is
-// over `j` such that `segment_ids[j] == i` and `N` is the total number of
-// values summed.
-//
-// If the mean is empty for a given segment ID `i`, `output[i] = 0`.
+// Computes gradients for SparseSegmentMean.
 //
-// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
-// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentMean.png" alt>
-// </div>
+// Returns tensor "output" with same shape as grad, except for dimension 0 whose
+// value is output_dim0.
 //
 // Arguments:
-//
-//	segment_ids: A 1-D tensor whose rank is equal to the rank of `data`'s
-// first dimension.  Values should be sorted and can be repeated.
-//
-// Returns Has same shape as data, except for dimension 0 which
-// has size `k`, the number of segments.
-func SegmentMean(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
+//	grad: gradient propagated to the SparseSegmentMean op.
+//	indices: indices passed to the corresponding SparseSegmentMean op.
+//	segment_ids: segment_ids passed to the corresponding SparseSegmentMean op.
+//	output_dim0: dimension 0 of "data" passed to SparseSegmentMean op.
+func SparseSegmentMeanGrad(scope *Scope, grad tf.Output, indices tf.Output, segment_ids tf.Output, output_dim0 tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "SegmentMean",
+		Type: "SparseSegmentMeanGrad",
 		Input: []tf.Input{
-			data, segment_ids,
+			grad, indices, segment_ids, output_dim0,
 		},
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// ResourceSparseApplyCenteredRMSPropAttr is an optional argument to ResourceSparseApplyCenteredRMSProp.
-type ResourceSparseApplyCenteredRMSPropAttr func(optionalAttr)
-
-// ResourceSparseApplyCenteredRMSPropUseLocking sets the optional use_locking attribute to value.
-//
-// value: If `True`, updating of the var, mg, ms, and mom tensors is
-// protected by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceSparseApplyCenteredRMSPropUseLocking(value bool) ResourceSparseApplyCenteredRMSPropAttr {
-	return func(m optionalAttr) {
-		m["use_locking"] = value
-	}
-}
-
-// Update '*var' according to the centered RMSProp algorithm.
+// Applies softmax to a batched N-D `SparseTensor`.
 //
-// The centered RMSProp algorithm uses an estimate of the centered second moment
-// (i.e., the variance) for normalization, as opposed to regular RMSProp, which
-// uses the (uncentered) second moment. This often helps with training, but is
-// slightly more expensive in terms of computation and memory.
+// The inputs represent an N-D SparseTensor  with logical shape `[..., B, C]`
+// (where `N >= 2`), and with indices sorted in the canonical lexicographic order.
 //
-// Note that in dense implementation of this algorithm, mg, ms, and mom will
-// update even if the grad is zero, but in this sparse implementation, mg, ms,
-// and mom will not update in iterations during which the grad is zero.
+// This op is equivalent to applying the normal `tf.nn.softmax()` to each innermost
+// logical submatrix with shape `[B, C]`, but with the catch that *the implicitly
+// zero elements do not participate*.  Specifically, the algorithm is equivalent
+// to the following:
 //
-// mean_square = decay * mean_square + (1-decay) * gradient ** 2
-// mean_grad = decay * mean_grad + (1-decay) * gradient
-// Delta = learning_rate * gradient / sqrt(mean_square + epsilon - mean_grad ** 2)
+//   (1) Applies `tf.nn.softmax()` to a densified view of each innermost submatrix
+//       with shape `[B, C]`, along the size-C dimension;
+//   (2) Masks out the original implicitly-zero locations;
+//   (3) Renormalizes the remaining elements.
 //
-// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
-// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms + epsilon)
-// var <- var - mom
+// Hence, the `SparseTensor` result has exactly the same non-zero indices and
+// shape.
 //
 // Arguments:
-//	var_: Should be from a Variable().
-//	mg: Should be from a Variable().
-//	ms: Should be from a Variable().
-//	mom: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	rho: Decay rate. Must be a scalar.
-//
-//	epsilon: Ridge term. Must be a scalar.
-//	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var, ms and mom.
+//	sp_indices: 2-D.  `NNZ x R` matrix with the indices of non-empty values in a
+// SparseTensor, in canonical ordering.
+//	sp_values: 1-D.  `NNZ` non-empty values corresponding to `sp_indices`.
+//	sp_shape: 1-D.  Shape of the input SparseTensor.
 //
-// Returns the created operation.
-func ResourceSparseApplyCenteredRMSProp(scope *Scope, var_ tf.Output, mg tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyCenteredRMSPropAttr) (o *tf.Operation) {
+// Returns 1-D.  The `NNZ` values for the result `SparseTensor`.
+func SparseSoftmax(scope *Scope, sp_indices tf.Output, sp_values tf.Output, sp_shape tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
 	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyCenteredRMSProp",
+		Type: "SparseSoftmax",
 		Input: []tf.Input{
-			var_, mg, ms, mom, lr, rho, momentum, epsilon, grad, indices,
+			sp_indices, sp_values, sp_shape,
 		},
-		Attrs: attrs,
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
 
-// Creates a dataset that batches `batch_size` elements from `input_dataset`.
-//
-// Arguments:
+// Partitions `data` into `num_partitions` tensors using indices from `partitions`.
 //
-//	batch_size: A scalar representing the number of elements to accumulate in a
-// batch.
+// For each index tuple `js` of size `partitions.ndim`, the slice `data[js, ...]`
+// becomes part of `outputs[partitions[js]]`.  The slices with `partitions[js] = i`
+// are placed in `outputs[i]` in lexicographic order of `js`, and the first
+// dimension of `outputs[i]` is the number of entries in `partitions` equal to `i`.
+// In detail,
 //
+// ```python
+//     outputs[i].shape = [sum(partitions == i)] + data.shape[partitions.ndim:]
 //
-func BatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+//     outputs[i] = pack([data[js, ...] for js if partitions[js] == i])
+// ```
+//
+// `data.shape` must start with `partitions.shape`.
+//
+// For example:
+//
+// ```python
+//     # Scalar partitions.
+//     partitions = 1
+//     num_partitions = 2
+//     data = [10, 20]
+//     outputs[0] = []  # Empty with shape [0, 2]
+//     outputs[1] = [[10, 20]]
+//
+//     # Vector partitions.
+//     partitions = [0, 0, 1, 1, 0]
+//     num_partitions = 2
+//     data = [10, 20, 30, 40, 50]
+//     outputs[0] = [10, 20, 50]
+//     outputs[1] = [30, 40]
+// ```
+//
+// See `dynamic_stitch` for an example on how to merge partitions back.
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/DynamicPartition.png" alt>
+// </div>
+//
+// Arguments:
+//
+//	partitions: Any shape.  Indices in the range `[0, num_partitions)`.
+//	num_partitions: The number of partitions to output.
+func DynamicPartition(scope *Scope, data tf.Output, partitions tf.Output, num_partitions int64) (outputs []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	attrs := map[string]interface{}{"num_partitions": num_partitions}
 	opspec := tf.OpSpec{
-		Type: "BatchDataset",
+		Type: "DynamicPartition",
 		Input: []tf.Input{
-			input_dataset, batch_size,
+			data, partitions,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	if scope.Err() != nil {
+		return
+	}
+	var idx int
+	var err error
+	if outputs, idx, err = makeOutputList(op, idx, "outputs"); err != nil {
+		scope.UpdateErr("DynamicPartition", err)
+		return
+	}
+	return outputs
 }
 
-// Inverse fast Fourier transform.
+// ResourceApplyAdagradAttr is an optional argument to ResourceApplyAdagrad.
+type ResourceApplyAdagradAttr func(optionalAttr)
+
+// ResourceApplyAdagradUseLocking sets the optional use_locking attribute to value.
 //
-// Computes the inverse 1-dimensional discrete Fourier transform over the
-// inner-most dimension of `input`.
+// value: If `True`, updating of the var and accum tensors will be protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceApplyAdagradUseLocking(value bool) ResourceApplyAdagradAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
+	}
+}
+
+// Update '*var' according to the adagrad scheme.
 //
-// Arguments:
-//	input: A complex64 tensor.
+// accum += grad * grad
+// var -= lr * grad * (1 / sqrt(accum))
 //
-// Returns A complex64 tensor of the same shape as `input`. The inner-most
-//   dimension of `input` is replaced with its inverse 1D Fourier transform.
+// Arguments:
+//	var_: Should be from a Variable().
+//	accum: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	grad: The gradient.
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.ifft
-// @end_compatibility
-func IFFT(scope *Scope, input tf.Output) (output tf.Output) {
+// Returns the created operation.
+func ResourceApplyAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, grad tf.Output, optional ...ResourceApplyAdagradAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "IFFT",
+		Type: "ResourceApplyAdagrad",
 		Input: []tf.Input{
-			input,
+			var_, accum, lr, grad,
 		},
+		Attrs: attrs,
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return scope.AddOperation(opspec)
 }
 
-// LRNAttr is an optional argument to LRN.
-type LRNAttr func(optionalAttr)
+// CumprodAttr is an optional argument to Cumprod.
+type CumprodAttr func(optionalAttr)
 
-// LRNDepthRadius sets the optional depth_radius attribute to value.
+// CumprodExclusive sets the optional exclusive attribute to value.
 //
-// value: 0-D.  Half-width of the 1-D normalization window.
-// If not specified, defaults to 5
-func LRNDepthRadius(value int64) LRNAttr {
+// value: If `True`, perform exclusive cumprod.
+// If not specified, defaults to false
+func CumprodExclusive(value bool) CumprodAttr {
 	return func(m optionalAttr) {
-		m["depth_radius"] = value
+		m["exclusive"] = value
 	}
 }
 
-// LRNBias sets the optional bias attribute to value.
+// CumprodReverse sets the optional reverse attribute to value.
 //
-// value: An offset (usually positive to avoid dividing by 0).
-// If not specified, defaults to 1
-func LRNBias(value float32) LRNAttr {
+// value: A `bool` (default: False).
+// If not specified, defaults to false
+func CumprodReverse(value bool) CumprodAttr {
 	return func(m optionalAttr) {
-		m["bias"] = value
+		m["reverse"] = value
 	}
 }
 
-// LRNAlpha sets the optional alpha attribute to value.
+// Compute the cumulative product of the tensor `x` along `axis`.
 //
-// value: A scale factor, usually positive.
-// If not specified, defaults to 1
-func LRNAlpha(value float32) LRNAttr {
-	return func(m optionalAttr) {
-		m["alpha"] = value
-	}
-}
-
-// LRNBeta sets the optional beta attribute to value.
+// By default, this op performs an inclusive cumprod, which means that the first
+// element of the input is identical to the first element of the output:
 //
-// value: An exponent.
-// If not specified, defaults to 0.5
-func LRNBeta(value float32) LRNAttr {
-	return func(m optionalAttr) {
-		m["beta"] = value
-	}
-}
-
-// Local Response Normalization.
+// ```python
+// tf.cumprod([a, b, c])  # => [a, a * b, a * b * c]
+// ```
 //
-// The 4-D `input` tensor is treated as a 3-D array of 1-D vectors (along the last
-// dimension), and each vector is normalized independently.  Within a given vector,
-// each component is divided by the weighted, squared sum of inputs within
-// `depth_radius`.  In detail,
+// By setting the `exclusive` kwarg to `True`, an exclusive cumprod is
+// performed instead:
 //
-//     sqr_sum[a, b, c, d] =
-//         sum(input[a, b, c, d - depth_radius : d + depth_radius + 1] ** 2)
-//     output = input / (bias + alpha * sqr_sum) ** beta
+// ```python
+// tf.cumprod([a, b, c], exclusive=True)  # => [1, a, a * b]
+// ```
 //
-// For details, see [Krizhevsky et al., ImageNet classification with deep
-// convolutional neural networks (NIPS 2012)](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks).
+// By setting the `reverse` kwarg to `True`, the cumprod is performed in the
+// opposite direction:
+//
+// ```python
+// tf.cumprod([a, b, c], reverse=True)  # => [a * b * c, b * c, c]
+// ```
+//
+// This is more efficient than using separate `tf.reverse` ops.
+//
+// The `reverse` and `exclusive` kwargs can also be combined:
+//
+// ```python
+// tf.cumprod([a, b, c], exclusive=True, reverse=True)  # => [b * c, c, 1]
+// ```
 //
 // Arguments:
-//	input: 4-D.
-func LRN(scope *Scope, input tf.Output, optional ...LRNAttr) (output tf.Output) {
+//	x: A `Tensor`. Must be one of the following types: `float32`, `float64`,
+// `int64`, `int32`, `uint8`, `uint16`, `int16`, `int8`, `complex64`,
+// `complex128`, `qint8`, `quint8`, `qint32`, `half`.
+//	axis: A `Tensor` of type `int32` (default: 0). Must be in the range
+// `[-rank(x), rank(x))`.
+func Cumprod(scope *Scope, x tf.Output, axis tf.Output, optional ...CumprodAttr) (out tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
@@ -7433,26 +6956,9 @@ func LRN(scope *Scope, input tf.Output, optional ...LRNAttr) (output tf.Output)
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "LRN",
-		Input: []tf.Input{
-			input,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Creates a dataset that zips together `input_datasets`.
-func ZipDataset(scope *Scope, input_datasets []tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
-	opspec := tf.OpSpec{
-		Type: "ZipDataset",
+		Type: "Cumprod",
 		Input: []tf.Input{
-			tf.OutputList(input_datasets),
+			x, axis,
 		},
 		Attrs: attrs,
 	}
@@ -7460,45 +6966,294 @@ func ZipDataset(scope *Scope, input_datasets []tf.Output, output_types []tf.Data
 	return op.Output(0)
 }
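The `exclusive` and `reverse` behaviours documented for `Cumprod` can be illustrated without the TensorFlow runtime. The following is a minimal sketch in plain Go over a 1-D slice; the helper name `cumprod` and its signature are hypothetical and are not part of the generated wrapper API.

```go
package main

import "fmt"

// cumprod returns the cumulative product of xs (hypothetical helper, 1-D only).
// exclusive shifts the result so that element i omits xs[i];
// reverse accumulates from the last element toward the first.
func cumprod(xs []float64, exclusive, reverse bool) []float64 {
	n := len(xs)
	out := make([]float64, n)
	acc := 1.0
	for k := 0; k < n; k++ {
		i := k
		if reverse {
			i = n - 1 - k
		}
		if exclusive {
			out[i] = acc // store the product of the elements seen so far
			acc *= xs[i]
		} else {
			acc *= xs[i]
			out[i] = acc // include the current element
		}
	}
	return out
}

func main() {
	xs := []float64{2, 3, 4}
	fmt.Println(cumprod(xs, false, false)) // [2 6 24]
	fmt.Println(cumprod(xs, true, false))  // [1 2 6]
	fmt.Println(cumprod(xs, false, true))  // [24 12 4]
	fmt.Println(cumprod(xs, true, true))   // [12 4 1]
}
```

A single pass handles all four combinations, which matches the note in the doc comment that `reverse` avoids separate reversal steps.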
 
-// Writes a `GraphDef` protocol buffer to a `SummaryWriter`.
+// Computes the mean along segments of a tensor.
+//
+// Read @{$math_ops#segmentation$the section on segmentation} for an explanation of
+// segments.
+//
+// Computes a tensor such that
+// \\(output_i = \frac{\sum_j data_j}{N}\\) where the sum is
+// over `j` such that `segment_ids[j] == i` and `N` is the total number of
+// values summed.
+//
+// If a given segment ID `i` has no entries, `output[i] = 0`.
+//
+// <div style="width:70%; margin:auto; margin-bottom:10px; margin-top:20px;">
+// <img style="width:100%" src="https://www.tensorflow.org/images/SegmentMean.png" alt>
+// </div>
 //
 // Arguments:
-//	writer: Handle of `SummaryWriter`.
-//	step: The step to write the summary for.
-//	tensor: A scalar string of the serialized tf.GraphDef proto.
 //
-// Returns the created operation.
-func WriteGraphSummary(scope *Scope, writer tf.Output, step tf.Output, tensor tf.Output) (o *tf.Operation) {
+//	segment_ids: A 1-D tensor whose size is equal to the size of `data`'s
+// first dimension.  Values should be sorted and can be repeated.
+//
+// Returns Has same shape as data, except for dimension 0 which
+// has size `k`, the number of segments.
+func SegmentMean(scope *Scope, data tf.Output, segment_ids tf.Output) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
 	opspec := tf.OpSpec{
-		Type: "WriteGraphSummary",
+		Type: "SegmentMean",
 		Input: []tf.Input{
-			writer, step, tensor,
+			data, segment_ids,
 		},
 	}
-	return scope.AddOperation(opspec)
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
 }
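As a rough illustration of the segment semantics described for `SegmentMean`, here is a sketch in plain Go that averages the rows of a 2-D slice by segment ID. The helper `segmentMean` is hypothetical; it assumes sorted, non-negative IDs with one entry per row, and it is not how the TensorFlow kernel is implemented.

```go
package main

import "fmt"

// segmentMean averages the rows of data that share a segment ID.
// Assumes segmentIDs is sorted, non-negative, and has one entry per row.
func segmentMean(data [][]float64, segmentIDs []int) [][]float64 {
	if len(data) == 0 {
		return nil
	}
	k := segmentIDs[len(segmentIDs)-1] + 1 // number of segments
	cols := len(data[0])
	out := make([][]float64, k)
	counts := make([]int, k)
	for i := range out {
		out[i] = make([]float64, cols)
	}
	// Accumulate per-segment sums and row counts.
	for row, id := range segmentIDs {
		counts[id]++
		for c, v := range data[row] {
			out[id][c] += v
		}
	}
	// Divide each sum by the number of rows in its segment.
	for id, n := range counts {
		if n == 0 {
			continue // a segment with no rows stays at 0
		}
		for c := range out[id] {
			out[id][c] /= float64(n)
		}
	}
	return out
}

func main() {
	data := [][]float64{{1, 2}, {3, 4}, {5, 6}}
	fmt.Println(segmentMean(data, []int{0, 0, 1})) // [[2 3] [5 6]]
}
```

The output has `k` rows, one per segment, and keeps the remaining dimensions of `data`.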
 
-// ResourceSparseApplyAdagradAttr is an optional argument to ResourceSparseApplyAdagrad.
-type ResourceSparseApplyAdagradAttr func(optionalAttr)
+// ResourceSparseApplyCenteredRMSPropAttr is an optional argument to ResourceSparseApplyCenteredRMSProp.
+type ResourceSparseApplyCenteredRMSPropAttr func(optionalAttr)
 
-// ResourceSparseApplyAdagradUseLocking sets the optional use_locking attribute to value.
+// ResourceSparseApplyCenteredRMSPropUseLocking sets the optional use_locking attribute to value.
 //
-// value: If `True`, updating of the var and accum tensors will be protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
+// value: If `True`, updating of the var, mg, ms, and mom tensors is
+// protected by a lock; otherwise the behavior is undefined, but may exhibit less
 // contention.
 // If not specified, defaults to false
-func ResourceSparseApplyAdagradUseLocking(value bool) ResourceSparseApplyAdagradAttr {
+func ResourceSparseApplyCenteredRMSPropUseLocking(value bool) ResourceSparseApplyCenteredRMSPropAttr {
 	return func(m optionalAttr) {
 		m["use_locking"] = value
 	}
 }
 
-// Update relevant entries in '*var' and '*accum' according to the adagrad scheme.
+// Update '*var' according to the centered RMSProp algorithm.
 //
-// That is for rows we have grad for, we update var and accum as follows:
+// The centered RMSProp algorithm uses an estimate of the centered second moment
+// (i.e., the variance) for normalization, as opposed to regular RMSProp, which
+// uses the (uncentered) second moment. This often helps with training, but is
+// slightly more expensive in terms of computation and memory.
+//
+// Note that in the dense implementation of this algorithm, mg, ms, and mom
+// will update even if the grad is zero, but in this sparse implementation,
+// mg, ms, and mom will not update in iterations during which the grad is zero.
+//
+// mean_square = decay * mean_square + (1-decay) * gradient ** 2
+// mean_grad = decay * mean_grad + (1-decay) * gradient
+// Delta = learning_rate * gradient / sqrt(mean_square + epsilon - mean_grad ** 2)
+//
+// ms <- rho * ms_{t-1} + (1-rho) * grad * grad
+// mom <- momentum * mom_{t-1} + lr * grad / sqrt(ms + epsilon)
+// var <- var - mom
+//
+// Arguments:
+//	var_: Should be from a Variable().
+//	mg: Should be from a Variable().
+//	ms: Should be from a Variable().
+//	mom: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
+//	rho: Decay rate. Must be a scalar.
+//
+//	epsilon: Ridge term. Must be a scalar.
+//	grad: The gradient.
+//	indices: A vector of indices into the first dimension of var, ms and mom.
+//
+// Returns the created operation.
+func ResourceSparseApplyCenteredRMSProp(scope *Scope, var_ tf.Output, mg tf.Output, ms tf.Output, mom tf.Output, lr tf.Output, rho tf.Output, momentum tf.Output, epsilon tf.Output, grad tf.Output, indices tf.Output, optional ...ResourceSparseApplyCenteredRMSPropAttr) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "ResourceSparseApplyCenteredRMSProp",
+		Input: []tf.Input{
+			var_, mg, ms, mom, lr, rho, momentum, epsilon, grad, indices,
+		},
+		Attrs: attrs,
+	}
+	return scope.AddOperation(opspec)
+}
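The update rules in the `ResourceSparseApplyCenteredRMSProp` comment can be sketched in plain Go. The version below treats each indexed row as a single scalar and combines the two formula groups from the comment under one assumption, stated here and in the code: the momentum term is divided by the centered denominator `sqrt(ms - mg^2 + epsilon)`. The type `rmsState` and the function `sparseCenteredRMSProp` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// rmsState holds the variable and its three slots, one scalar per row.
type rmsState struct {
	v, mg, ms, mom []float64
}

// sparseCenteredRMSProp updates only the rows named in indices.
// Assumption: the momentum step uses sqrt(ms - mg*mg + epsilon) as denominator.
func sparseCenteredRMSProp(s *rmsState, lr, rho, momentum, epsilon float64, grad []float64, indices []int) {
	for k, i := range indices {
		g := grad[k]
		s.ms[i] = rho*s.ms[i] + (1-rho)*g*g // running mean of squared gradients
		s.mg[i] = rho*s.mg[i] + (1-rho)*g   // running mean of gradients
		denom := math.Sqrt(s.ms[i] - s.mg[i]*s.mg[i] + epsilon)
		s.mom[i] = momentum*s.mom[i] + lr*g/denom
		s.v[i] -= s.mom[i]
	}
}

func main() {
	s := &rmsState{
		v:   []float64{1, 1, 1},
		mg:  make([]float64, 3),
		ms:  make([]float64, 3),
		mom: make([]float64, 3),
	}
	// Only row 1 receives a gradient; rows 0 and 2 are left untouched.
	sparseCenteredRMSProp(s, 0.1, 0.9, 0.0, 1e-10, []float64{0.5}, []int{1})
	fmt.Println(s.v)
}
```

Rows that do not appear in `indices` keep their `mg`, `ms`, and `mom` values, which is the difference from the dense variant that the comment points out.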
+
+// Creates a dataset that batches `batch_size` elements from `input_dataset`.
+//
+// Arguments:
+//
+//	batch_size: A scalar representing the number of elements to accumulate in a
+// batch.
+//
+//
+func BatchDataset(scope *Scope, input_dataset tf.Output, batch_size tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "BatchDataset",
+		Input: []tf.Input{
+			input_dataset, batch_size,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Inverse fast Fourier transform.
+//
+// Computes the inverse 1-dimensional discrete Fourier transform over the
+// inner-most dimension of `input`.
+//
+// Arguments:
+//	input: A complex64 tensor.
+//
+// Returns A complex64 tensor of the same shape as `input`. The inner-most
+//   dimension of `input` is replaced with its inverse 1D Fourier transform.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.ifft
+// @end_compatibility
+func IFFT(scope *Scope, input tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "IFFT",
+		Input: []tf.Input{
+			input,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// LRNAttr is an optional argument to LRN.
+type LRNAttr func(optionalAttr)
+
+// LRNDepthRadius sets the optional depth_radius attribute to value.
+//
+// value: 0-D.  Half-width of the 1-D normalization window.
+// If not specified, defaults to 5
+func LRNDepthRadius(value int64) LRNAttr {
+	return func(m optionalAttr) {
+		m["depth_radius"] = value
+	}
+}
+
+// LRNBias sets the optional bias attribute to value.
+//
+// value: An offset (usually positive to avoid dividing by 0).
+// If not specified, defaults to 1
+func LRNBias(value float32) LRNAttr {
+	return func(m optionalAttr) {
+		m["bias"] = value
+	}
+}
+
+// LRNAlpha sets the optional alpha attribute to value.
+//
+// value: A scale factor, usually positive.
+// If not specified, defaults to 1
+func LRNAlpha(value float32) LRNAttr {
+	return func(m optionalAttr) {
+		m["alpha"] = value
+	}
+}
+
+// LRNBeta sets the optional beta attribute to value.
+//
+// value: An exponent.
+// If not specified, defaults to 0.5
+func LRNBeta(value float32) LRNAttr {
+	return func(m optionalAttr) {
+		m["beta"] = value
+	}
+}
+
+// Local Response Normalization.
+//
+// The 4-D `input` tensor is treated as a 3-D array of 1-D vectors (along the last
+// dimension), and each vector is normalized independently.  Within a given vector,
+// each component is divided by the weighted, squared sum of inputs within
+// `depth_radius`.  In detail,
+//
+//     sqr_sum[a, b, c, d] =
+//         sum(input[a, b, c, d - depth_radius : d + depth_radius + 1] ** 2)
+//     output = input / (bias + alpha * sqr_sum) ** beta
+//
+// For details, see [Krizhevsky et al., ImageNet classification with deep
+// convolutional neural networks (NIPS 2012)](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks).
+//
+// Arguments:
+//	input: 4-D.
+func LRN(scope *Scope, input tf.Output, optional ...LRNAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "LRN",
+		Input: []tf.Input{
+			input,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
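The normalization formula in the `LRN` comment can be checked on a single depth vector. The sketch below is plain Go with a hypothetical helper `lrnVector`; it assumes the window `[d - depth_radius, d + depth_radius]` is clipped at the ends of the vector, which the comment does not spell out.

```go
package main

import (
	"fmt"
	"math"
)

// lrnVector normalizes one depth vector (the last dimension of the 4-D input).
// Assumption: the window is clipped at the ends of the vector.
func lrnVector(in []float64, depthRadius int, bias, alpha, beta float64) []float64 {
	out := make([]float64, len(in))
	for d := range in {
		lo, hi := d-depthRadius, d+depthRadius+1
		if lo < 0 {
			lo = 0
		}
		if hi > len(in) {
			hi = len(in)
		}
		// Squared sum of the inputs inside the window around d.
		sqrSum := 0.0
		for _, v := range in[lo:hi] {
			sqrSum += v * v
		}
		out[d] = in[d] / math.Pow(bias+alpha*sqrSum, beta)
	}
	return out
}

func main() {
	// bias, alpha, and beta follow the documented attribute defaults;
	// the radius is reduced to 1 so the window is visible on a short vector.
	fmt.Println(lrnVector([]float64{1, 2, 3}, 1, 1, 1, 0.5))
}
```

In the real op the same computation runs independently for every `[a, b, c]` position of the 4-D input.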
+
+// Creates a dataset that zips together `input_datasets`.
+func ZipDataset(scope *Scope, input_datasets []tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
+	opspec := tf.OpSpec{
+		Type: "ZipDataset",
+		Input: []tf.Input{
+			tf.OutputList(input_datasets),
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Writes a `GraphDef` protocol buffer to a `SummaryWriter`.
+//
+// Arguments:
+//	writer: Handle of `SummaryWriter`.
+//	step: The step to write the summary for.
+//	tensor: A scalar string of the serialized tf.GraphDef proto.
+//
+// Returns the created operation.
+func WriteGraphSummary(scope *Scope, writer tf.Output, step tf.Output, tensor tf.Output) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "WriteGraphSummary",
+		Input: []tf.Input{
+			writer, step, tensor,
+		},
+	}
+	return scope.AddOperation(opspec)
+}
+
+// ResourceSparseApplyAdagradAttr is an optional argument to ResourceSparseApplyAdagrad.
+type ResourceSparseApplyAdagradAttr func(optionalAttr)
+
+// ResourceSparseApplyAdagradUseLocking sets the optional use_locking attribute to value.
+//
+// value: If `True`, updating of the var and accum tensors will be protected
+// by a lock; otherwise the behavior is undefined, but may exhibit less
+// contention.
+// If not specified, defaults to false
+func ResourceSparseApplyAdagradUseLocking(value bool) ResourceSparseApplyAdagradAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
+	}
+}
+
+// Update relevant entries in '*var' and '*accum' according to the adagrad scheme.
+//
+// That is for rows we have grad for, we update var and accum as follows:
 // accum += grad * grad
 // var -= lr * grad * (1 / sqrt(accum))
 //
@@ -8016,78 +7771,21 @@ func ResourceScatterNdUpdate(scope *Scope, ref tf.Output, indices tf.Output, upd
 	return scope.AddOperation(opspec)
 }
 
-// StageSizeAttr is an optional argument to StageSize.
-type StageSizeAttr func(optionalAttr)
+// NonMaxSuppressionAttr is an optional argument to NonMaxSuppression.
+type NonMaxSuppressionAttr func(optionalAttr)
 
-// StageSizeCapacity sets the optional capacity attribute to value.
-// If not specified, defaults to 0
+// NonMaxSuppressionIouThreshold sets the optional iou_threshold attribute to value.
 //
-// REQUIRES: value >= 0
-func StageSizeCapacity(value int64) StageSizeAttr {
+// value: A float representing the threshold for deciding whether boxes
+// overlap too much with respect to IOU.
+// If not specified, defaults to 0.5
+func NonMaxSuppressionIouThreshold(value float32) NonMaxSuppressionAttr {
 	return func(m optionalAttr) {
-		m["capacity"] = value
+		m["iou_threshold"] = value
 	}
 }
 
-// StageSizeMemoryLimit sets the optional memory_limit attribute to value.
-// If not specified, defaults to 0
-//
-// REQUIRES: value >= 0
-func StageSizeMemoryLimit(value int64) StageSizeAttr {
-	return func(m optionalAttr) {
-		m["memory_limit"] = value
-	}
-}
-
-// StageSizeContainer sets the optional container attribute to value.
-// If not specified, defaults to ""
-func StageSizeContainer(value string) StageSizeAttr {
-	return func(m optionalAttr) {
-		m["container"] = value
-	}
-}
-
-// StageSizeSharedName sets the optional shared_name attribute to value.
-// If not specified, defaults to ""
-func StageSizeSharedName(value string) StageSizeAttr {
-	return func(m optionalAttr) {
-		m["shared_name"] = value
-	}
-}
-
-// Op returns the number of elements in the underlying container.
-func StageSize(scope *Scope, dtypes []tf.DataType, optional ...StageSizeAttr) (size tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"dtypes": dtypes}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "StageSize",
-
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// NonMaxSuppressionAttr is an optional argument to NonMaxSuppression.
-type NonMaxSuppressionAttr func(optionalAttr)
-
-// NonMaxSuppressionIouThreshold sets the optional iou_threshold attribute to value.
-//
-// value: A float representing the threshold for deciding whether boxes
-// overlap too much with respect to IOU.
-// If not specified, defaults to 0.5
-func NonMaxSuppressionIouThreshold(value float32) NonMaxSuppressionAttr {
-	return func(m optionalAttr) {
-		m["iou_threshold"] = value
-	}
-}
-
-// Greedily selects a subset of bounding boxes in descending order of score,
+// Greedily selects a subset of bounding boxes in descending order of score,
 //
 // pruning away boxes that have high intersection-over-union (IOU) overlap
 // with previously selected boxes.  Bounding boxes are supplied as
@@ -8812,51 +8510,6 @@ func SparseReduceSum(scope *Scope, input_indices tf.Output, input_values tf.Outp
 	return op.Output(0)
 }
 
-// ResourceApplyAdagradAttr is an optional argument to ResourceApplyAdagrad.
-type ResourceApplyAdagradAttr func(optionalAttr)
-
-// ResourceApplyAdagradUseLocking sets the optional use_locking attribute to value.
-//
-// value: If `True`, updating of the var and accum tensors will be protected
-// by a lock; otherwise the behavior is undefined, but may exhibit less
-// contention.
-// If not specified, defaults to false
-func ResourceApplyAdagradUseLocking(value bool) ResourceApplyAdagradAttr {
-	return func(m optionalAttr) {
-		m["use_locking"] = value
-	}
-}
-
-// Update '*var' according to the adagrad scheme.
-//
-// accum += grad * grad
-// var -= lr * grad * (1 / sqrt(accum))
-//
-// Arguments:
-//	var_: Should be from a Variable().
-//	accum: Should be from a Variable().
-//	lr: Scaling factor. Must be a scalar.
-//	grad: The gradient.
-//
-// Returns the created operation.
-func ResourceApplyAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, grad tf.Output, optional ...ResourceApplyAdagradAttr) (o *tf.Operation) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "ResourceApplyAdagrad",
-		Input: []tf.Input{
-			var_, accum, lr, grad,
-		},
-		Attrs: attrs,
-	}
-	return scope.AddOperation(opspec)
-}
-
 // Returns element-wise remainder of division. This emulates C semantics in that
 //
 // the result here is consistent with a truncating divide. E.g. `truncate(x / y) *
@@ -9211,60 +8864,358 @@ func LogSoftmax(scope *Scope, logits tf.Output) (logsoftmax tf.Output) {
 // ```
 //
 // Arguments:
-//	x: 1-D.
+//	x: 1-D.
+//
+// Returns 1-D.
+func InvertPermutation(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "InvertPermutation",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Gradient op for `MirrorPad` op. This op folds a mirror-padded tensor.
+//
+// This operation folds the padded areas of `input` by `MirrorPad` according to the
+// `paddings` you specify. `paddings` must be the same as `paddings` argument
+// given to the corresponding `MirrorPad` op.
+//
+// The folded size of each dimension D of the output is:
+//
+// `input.dim_size(D) - paddings(D, 0) - paddings(D, 1)`
+//
+// For example:
+//
+// ```
+// # 't' is [[1, 2, 3], [4, 5, 6], [7, 8, 9]].
+// # 'paddings' is [[0, 1], [0, 1]].
+// # 'mode' is SYMMETRIC.
+// # rank of 't' is 2.
+// pad(t, paddings) ==> [[ 1,  5]
+//                       [11, 28]]
+// ```
+//
+// Arguments:
+//	input: The input tensor to be folded.
+//	paddings: A two-column matrix specifying the padding sizes. The number of
+// rows must be the same as the rank of `input`.
+//	mode: The mode used in the `MirrorPad` op.
+//
+// Returns The folded tensor.
+func MirrorPadGrad(scope *Scope, input tf.Output, paddings tf.Output, mode string) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"mode": mode}
+	opspec := tf.OpSpec{
+		Type: "MirrorPadGrad",
+		Input: []tf.Input{
+			input, paddings,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// Computes softmax cross entropy cost and gradients to backpropagate.
+//
+// Unlike `SoftmaxCrossEntropyWithLogits`, this operation does not accept
+// a matrix of label probabilities, but rather a single label per row
+// of features.  This label is considered to have probability 1.0 for the
+// given row.
+//
+// Inputs are the logits, not probabilities.
+//
+// Arguments:
+//	features: batch_size x num_classes matrix
+//	labels: batch_size vector with values in [0, num_classes).
+// This is the label for the given minibatch entry.
+//
+// Returns Per example loss (batch_size vector). Backpropagated gradients (batch_size x num_classes matrix).
+func SparseSoftmaxCrossEntropyWithLogits(scope *Scope, features tf.Output, labels tf.Output) (loss tf.Output, backprop tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "SparseSoftmaxCrossEntropyWithLogits",
+		Input: []tf.Input{
+			features, labels,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0), op.Output(1)
+}
+
+// Fast Fourier transform.
+//
+// Computes the 1-dimensional discrete Fourier transform over the inner-most
+// dimension of `input`.
+//
+// Arguments:
+//	input: A complex64 tensor.
+//
+// Returns A complex64 tensor of the same shape as `input`. The inner-most
+//   dimension of `input` is replaced with its 1D Fourier transform.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.fft
+// @end_compatibility
+func FFT(scope *Scope, input tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "FFT",
+		Input: []tf.Input{
+			input,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// ResourceSparseApplyAdagradDAAttr is an optional argument to ResourceSparseApplyAdagradDA.
+type ResourceSparseApplyAdagradDAAttr func(optionalAttr)
+
+// ResourceSparseApplyAdagradDAUseLocking sets the optional use_locking attribute to value.
+//
+// value: If True, updating of the var and accum tensors will be protected by
+// a lock; otherwise the behavior is undefined, but may exhibit less contention.
+// If not specified, defaults to false
+func ResourceSparseApplyAdagradDAUseLocking(value bool) ResourceSparseApplyAdagradDAAttr {
+	return func(m optionalAttr) {
+		m["use_locking"] = value
+	}
+}
+
+// Update entries in '*var' and '*accum' according to the proximal adagrad scheme.
+//
+// Arguments:
+//	var_: Should be from a Variable().
+//	gradient_accumulator: Should be from a Variable().
+//	gradient_squared_accumulator: Should be from a Variable().
+//	grad: The gradient.
+//	indices: A vector of indices into the first dimension of var and accum.
+//	lr: Learning rate. Must be a scalar.
+//	l1: L1 regularization. Must be a scalar.
+//	l2: L2 regularization. Must be a scalar.
+//	global_step: Training step number. Must be a scalar.
+//
+// Returns the created operation.
+func ResourceSparseApplyAdagradDA(scope *Scope, var_ tf.Output, gradient_accumulator tf.Output, gradient_squared_accumulator tf.Output, grad tf.Output, indices tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, global_step tf.Output, optional ...ResourceSparseApplyAdagradDAAttr) (o *tf.Operation) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "ResourceSparseApplyAdagradDA",
+		Input: []tf.Input{
+			var_, gradient_accumulator, gradient_squared_accumulator, grad, indices, lr, l1, l2, global_step,
+		},
+		Attrs: attrs,
+	}
+	return scope.AddOperation(opspec)
+}
+
+// Returns the truth value of NOT x element-wise.
+func LogicalNot(scope *Scope, x tf.Output) (y tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "LogicalNot",
+		Input: []tf.Input{
+			x,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// 3D real-valued fast Fourier transform.
+//
+// Computes the 3-dimensional discrete Fourier transform of a real-valued signal
+// over the inner-most 3 dimensions of `input`.
+//
+// Since the DFT of a real signal is Hermitian-symmetric, `RFFT3D` only returns the
+// `fft_length / 2 + 1` unique components of the FFT for the inner-most dimension
+// of `output`: the zero-frequency term, followed by the `fft_length / 2`
+// positive-frequency terms.
+//
+// Along each axis `RFFT3D` is computed on, if `fft_length` is smaller than the
+// corresponding dimension of `input`, the dimension is cropped. If it is larger,
+// the dimension is padded with zeros.
+//
+// Arguments:
+//	input: A float32 tensor.
+//	fft_length: An int32 tensor of shape [3]. The FFT length for each dimension.
+//
+// Returns A complex64 tensor of the same rank as `input`. The inner-most 3
+//   dimensions of `input` are replaced with their 3D Fourier transform. The
+//   inner-most dimension contains `fft_length / 2 + 1` unique frequency
+//   components.
+//
+// @compatibility(numpy)
+// Equivalent to np.fft.rfftn with 3 dimensions.
+// @end_compatibility
+func RFFT3D(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "RFFT3D",
+		Input: []tf.Input{
+			input, fft_length,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// TensorArrayV3Attr is an optional argument to TensorArrayV3.
+type TensorArrayV3Attr func(optionalAttr)
+
+// TensorArrayV3ElementShape sets the optional element_shape attribute to value.
+//
+// value: The expected shape of an element, if known. Used to
+// validate the shapes of TensorArray elements. If this shape is not
+// fully specified, gathering zero-size TensorArrays is an error.
+// If not specified, defaults to <unknown_rank:true >
+func TensorArrayV3ElementShape(value tf.Shape) TensorArrayV3Attr {
+	return func(m optionalAttr) {
+		m["element_shape"] = value
+	}
+}
+
+// TensorArrayV3DynamicSize sets the optional dynamic_size attribute to value.
+//
+// value: A boolean that determines whether writes to the TensorArray
+// are allowed to grow the size.  By default, this is not allowed.
+// If not specified, defaults to false
+func TensorArrayV3DynamicSize(value bool) TensorArrayV3Attr {
+	return func(m optionalAttr) {
+		m["dynamic_size"] = value
+	}
+}
+
+// TensorArrayV3ClearAfterRead sets the optional clear_after_read attribute to value.
+//
+// value: If true (default), Tensors in the TensorArray are cleared
+// after being read.  This disables multiple read semantics but allows early
+// release of memory.
+// If not specified, defaults to true
+func TensorArrayV3ClearAfterRead(value bool) TensorArrayV3Attr {
+	return func(m optionalAttr) {
+		m["clear_after_read"] = value
+	}
+}
+
+// TensorArrayV3IdenticalElementShapes sets the optional identical_element_shapes attribute to value.
+//
+// value: If true (default is false), then all
+// elements in the TensorArray will be expected to have identical shapes.
+// This allows certain behaviors, like dynamically checking for
+// consistent shapes on write, and being able to fill in properly
+// shaped zero tensors on stack -- even if the element_shape attribute
+// is not fully defined.
+// If not specified, defaults to false
+func TensorArrayV3IdenticalElementShapes(value bool) TensorArrayV3Attr {
+	return func(m optionalAttr) {
+		m["identical_element_shapes"] = value
+	}
+}
+
+// TensorArrayV3TensorArrayName sets the optional tensor_array_name attribute to value.
+//
+// value: Overrides the name used for the temporary tensor_array
+// resource. Default value is the name of the 'TensorArray' op (which
+// is guaranteed unique).
+// If not specified, defaults to ""
+func TensorArrayV3TensorArrayName(value string) TensorArrayV3Attr {
+	return func(m optionalAttr) {
+		m["tensor_array_name"] = value
+	}
+}
+
+// An array of Tensors of given size.
+//
+// Write data via Write and read via Read or Pack.
+//
+// Arguments:
+//	size: The size of the array.
+//	dtype: The type of the elements on the tensor_array.
 //
-// Returns 1-D.
-func InvertPermutation(scope *Scope, x tf.Output) (y tf.Output) {
+// Returns The handle to the TensorArray. A scalar used to control gradient flow.
+func TensorArrayV3(scope *Scope, size tf.Output, dtype tf.DataType, optional ...TensorArrayV3Attr) (handle tf.Output, flow tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"dtype": dtype}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "InvertPermutation",
+		Type: "TensorArrayV3",
 		Input: []tf.Input{
-			x,
+			size,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return op.Output(0), op.Output(1)
 }
 
-// Gradient op for `MirrorPad` op. This op folds a mirror-padded tensor.
-//
-// This operation folds the padded areas of `input` by `MirrorPad` according to the
-// `paddings` you specify. `paddings` must be the same as `paddings` argument
-// given to the corresponding `MirrorPad` op.
-//
-// The folded size of each dimension D of the output is:
-//
-// `input.dim_size(D) - paddings(D, 0) - paddings(D, 1)`
-//
-// For example:
+// MaxPool3DAttr is an optional argument to MaxPool3D.
+type MaxPool3DAttr func(optionalAttr)
+
+// MaxPool3DDataFormat sets the optional data_format attribute to value.
 //
-// ```
-// # 't' is [[1, 2, 3], [4, 5, 6], [7, 8, 9]].
-// # 'paddings' is [[0, 1]], [0, 1]].
-// # 'mode' is SYMMETRIC.
-// # rank of 't' is 2.
-// pad(t, paddings) ==> [[ 1,  5]
-//                       [11, 28]]
-// ```
+// value: The data format of the input and output data. With the
+// default format "NDHWC", the data is stored in the order of:
+//     [batch, in_depth, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCDHW", the data storage order is:
+//     [batch, in_channels, in_depth, in_height, in_width].
+// If not specified, defaults to "NDHWC"
+func MaxPool3DDataFormat(value string) MaxPool3DAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Performs 3D max pooling on the input.
 //
 // Arguments:
-//	input: The input tensor to be folded.
-//	paddings: A two-column matrix specifying the padding sizes. The number of
-// rows must be the same as the rank of `input`.
-//	mode: The mode used in the `MirrorPad` op.
+//	input: Shape `[batch, depth, rows, cols, channels]` tensor to pool over.
+//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
+// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
 //
-// Returns The folded tensor.
-func MirrorPadGrad(scope *Scope, input tf.Output, paddings tf.Output, mode string) (output tf.Output) {
+// Returns The max pooled output tensor.
+func MaxPool3D(scope *Scope, input tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPool3DAttr) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"mode": mode}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "MirrorPadGrad",
+		Type: "MaxPool3D",
 		Input: []tf.Input{
-			input, paddings,
+			input,
 		},
 		Attrs: attrs,
 	}
@@ -9272,64 +9223,65 @@ func MirrorPadGrad(scope *Scope, input tf.Output, paddings tf.Output, mode strin
 	return op.Output(0)
 }
 
-// Computes softmax cross entropy cost and gradients to backpropagate.
-//
-// Unlike `SoftmaxCrossEntropyWithLogits`, this operation does not accept
-// a matrix of label probabilities, but rather a single label per row
-// of features.  This label is considered to have probability 1.0 for the
-// given row.
+// Computes the gradients of 3-D convolution with respect to the input.
 //
-// Inputs are the logits, not probabilities.
+// DEPRECATED at GraphDef version 10: Use Conv3DBackpropInputV2
 //
 // Arguments:
-//	features: batch_size x num_classes matrix
-//	labels: batch_size vector with values in [0, num_classes).
-// This is the label for the given minibatch entry.
-//
-// Returns Per example loss (batch_size vector).backpropagated gradients (batch_size x num_classes matrix).
-func SparseSoftmaxCrossEntropyWithLogits(scope *Scope, features tf.Output, labels tf.Output) (loss tf.Output, backprop tf.Output) {
+//	input: Shape `[batch, depth, rows, cols, in_channels]`.
+//	filter: Shape `[depth, rows, cols, in_channels, out_channels]`.
+// `in_channels` must match between `input` and `filter`.
+//	out_backprop: Backprop signal of shape `[batch, out_depth, out_rows, out_cols,
+// out_channels]`.
+//	strides: 1-D tensor of length 5. The stride of the sliding window for each
+// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
+//	padding: The type of padding algorithm to use.
+func Conv3DBackpropInput(scope *Scope, input tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"strides": strides, "padding": padding}
 	opspec := tf.OpSpec{
-		Type: "SparseSoftmaxCrossEntropyWithLogits",
+		Type: "Conv3DBackpropInput",
 		Input: []tf.Input{
-			features, labels,
+			input, filter, out_backprop,
 		},
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// ResourceSparseApplyAdagradDAAttr is an optional argument to ResourceSparseApplyAdagradDA.
-type ResourceSparseApplyAdagradDAAttr func(optionalAttr)
+// ResourceApplyProximalAdagradAttr is an optional argument to ResourceApplyProximalAdagrad.
+type ResourceApplyProximalAdagradAttr func(optionalAttr)
 
-// ResourceSparseApplyAdagradDAUseLocking sets the optional use_locking attribute to value.
+// ResourceApplyProximalAdagradUseLocking sets the optional use_locking attribute to value.
 //
 // value: If True, updating of the var and accum tensors will be protected by
 // a lock; otherwise the behavior is undefined, but may exhibit less contention.
 // If not specified, defaults to false
-func ResourceSparseApplyAdagradDAUseLocking(value bool) ResourceSparseApplyAdagradDAAttr {
+func ResourceApplyProximalAdagradUseLocking(value bool) ResourceApplyProximalAdagradAttr {
 	return func(m optionalAttr) {
 		m["use_locking"] = value
 	}
 }
 
-// Update entries in '*var' and '*accum' according to the proximal adagrad scheme.
+// Update '*var' and '*accum' according to FOBOS with Adagrad learning rate.
+//
+// accum += grad * grad
+// prox_v = var - lr * grad * (1 / sqrt(accum))
+// var = sign(prox_v)/(1+lr*l2) * max{|prox_v|-lr*l1,0}
 //
 // Arguments:
 //	var_: Should be from a Variable().
-//	gradient_accumulator: Should be from a Variable().
-//	gradient_squared_accumulator: Should be from a Variable().
-//	grad: The gradient.
-//	indices: A vector of indices into the first dimension of var and accum.
-//	lr: Learning rate. Must be a scalar.
+//	accum: Should be from a Variable().
+//	lr: Scaling factor. Must be a scalar.
 //	l1: L1 regularization. Must be a scalar.
 //	l2: L2 regularization. Must be a scalar.
-//	global_step: Training step number. Must be a scalar.
+//	grad: The gradient.
 //
 // Returns the created operation.
-func ResourceSparseApplyAdagradDA(scope *Scope, var_ tf.Output, gradient_accumulator tf.Output, gradient_squared_accumulator tf.Output, grad tf.Output, indices tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, global_step tf.Output, optional ...ResourceSparseApplyAdagradDAAttr) (o *tf.Operation) {
+func ResourceApplyProximalAdagrad(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.Output, l1 tf.Output, l2 tf.Output, grad tf.Output, optional ...ResourceApplyProximalAdagradAttr) (o *tf.Operation) {
 	if scope.Err() != nil {
 		return
 	}
@@ -9338,237 +9290,218 @@ func ResourceSparseApplyAdagradDA(scope *Scope, var_ tf.Output, gradient_accumul
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "ResourceSparseApplyAdagradDA",
+		Type: "ResourceApplyProximalAdagrad",
 		Input: []tf.Input{
-			var_, gradient_accumulator, gradient_squared_accumulator, grad, indices, lr, l1, l2, global_step,
+			var_, accum, lr, l1, l2, grad,
 		},
 		Attrs: attrs,
 	}
 	return scope.AddOperation(opspec)
 }
 
-// Returns the truth value of NOT x element-wise.
-func LogicalNot(scope *Scope, x tf.Output) (y tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	opspec := tf.OpSpec{
-		Type: "LogicalNot",
-		Input: []tf.Input{
-			x,
-		},
+// MutableHashTableOfTensorsV2Attr is an optional argument to MutableHashTableOfTensorsV2.
+type MutableHashTableOfTensorsV2Attr func(optionalAttr)
+
+// MutableHashTableOfTensorsV2Container sets the optional container attribute to value.
+//
+// value: If non-empty, this table is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func MutableHashTableOfTensorsV2Container(value string) MutableHashTableOfTensorsV2Attr {
+	return func(m optionalAttr) {
+		m["container"] = value
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
 }
 
-// 3D real-valued fast Fourier transform.
-//
-// Computes the 3-dimensional discrete Fourier transform of a real-valued signal
-// over the inner-most 3 dimensions of `input`.
+// MutableHashTableOfTensorsV2SharedName sets the optional shared_name attribute to value.
 //
-// Since the DFT of a real signal is Hermitian-symmetric, `RFFT3D` only returns the
-// `fft_length / 2 + 1` unique components of the FFT for the inner-most dimension
-// of `output`: the zero-frequency term, followed by the `fft_length / 2`
-// positive-frequency terms.
+// value: If non-empty, this table is shared under the given name across
+// multiple sessions.
+// If not specified, defaults to ""
+func MutableHashTableOfTensorsV2SharedName(value string) MutableHashTableOfTensorsV2Attr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// MutableHashTableOfTensorsV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
+// If not specified, defaults to false
+func MutableHashTableOfTensorsV2UseNodeNameSharing(value bool) MutableHashTableOfTensorsV2Attr {
+	return func(m optionalAttr) {
+		m["use_node_name_sharing"] = value
+	}
+}
+
+// MutableHashTableOfTensorsV2ValueShape sets the optional value_shape attribute to value.
+// If not specified, defaults to <>
+func MutableHashTableOfTensorsV2ValueShape(value tf.Shape) MutableHashTableOfTensorsV2Attr {
+	return func(m optionalAttr) {
+		m["value_shape"] = value
+	}
+}
+
+// Creates an empty hash table.
 //
-// Along each axis `RFFT3D` is computed on, if `fft_length` is smaller than the
-// corresponding dimension of `input`, the dimension is cropped. If it is larger,
-// the dimension is padded with zeros.
+// This op creates a mutable hash table, specifying the type of its keys and
+// values. Each value must be a vector. Data can be inserted into the table using
+// the insert operations. It does not support the initialization operation.
 //
 // Arguments:
-//	input: A float32 tensor.
-//	fft_length: An int32 tensor of shape [3]. The FFT length for each dimension.
-//
-// Returns A complex64 tensor of the same rank as `input`. The inner-most 3
-//   dimensions of `input` are replaced with the their 3D Fourier transform. The
-//   inner-most dimension contains `fft_length / 2 + 1` unique frequency
-//   components.
+//	key_dtype: Type of the table keys.
+//	value_dtype: Type of the table values.
 //
-// @compatibility(numpy)
-// Equivalent to np.fft.rfftn with 3 dimensions.
-// @end_compatibility
-func RFFT3D(scope *Scope, input tf.Output, fft_length tf.Output) (output tf.Output) {
+// Returns Handle to a table.
+func MutableHashTableOfTensorsV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...MutableHashTableOfTensorsV2Attr) (table_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
+	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
+	for _, a := range optional {
+		a(attrs)
+	}
 	opspec := tf.OpSpec{
-		Type: "RFFT3D",
-		Input: []tf.Input{
-			input, fft_length,
-		},
+		Type: "MutableHashTableOfTensorsV2",
+
+		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
 	return op.Output(0)
 }
 
-// TensorArrayV3Attr is an optional argument to TensorArrayV3.
-type TensorArrayV3Attr func(optionalAttr)
-
-// TensorArrayV3ElementShape sets the optional element_shape attribute to value.
-//
-// value: The expected shape of an element, if known. Used to
-// validate the shapes of TensorArray elements. If this shape is not
-// fully specified, gathering zero-size TensorArrays is an error.
-// If not specified, defaults to <unknown_rank:true >
-func TensorArrayV3ElementShape(value tf.Shape) TensorArrayV3Attr {
-	return func(m optionalAttr) {
-		m["element_shape"] = value
-	}
-}
-
-// TensorArrayV3DynamicSize sets the optional dynamic_size attribute to value.
-//
-// value: A boolean that determines whether writes to the TensorArray
-// are allowed to grow the size.  By default, this is not allowed.
-// If not specified, defaults to false
-func TensorArrayV3DynamicSize(value bool) TensorArrayV3Attr {
-	return func(m optionalAttr) {
-		m["dynamic_size"] = value
-	}
-}
-
-// TensorArrayV3ClearAfterRead sets the optional clear_after_read attribute to value.
+// HashTableV2Attr is an optional argument to HashTableV2.
+type HashTableV2Attr func(optionalAttr)
+
+// HashTableV2Container sets the optional container attribute to value.
 //
-// value: If true (default), Tensors in the TensorArray are cleared
-// after being read.  This disables multiple read semantics but allows early
-// release of memory.
-// If not specified, defaults to true
-func TensorArrayV3ClearAfterRead(value bool) TensorArrayV3Attr {
+// value: If non-empty, this table is placed in the given container.
+// Otherwise, a default container is used.
+// If not specified, defaults to ""
+func HashTableV2Container(value string) HashTableV2Attr {
 	return func(m optionalAttr) {
-		m["clear_after_read"] = value
+		m["container"] = value
 	}
 }
 
-// TensorArrayV3IdenticalElementShapes sets the optional identical_element_shapes attribute to value.
+// HashTableV2SharedName sets the optional shared_name attribute to value.
 //
-// value: If true (default is false), then all
-// elements in the TensorArray will be expected to have have identical shapes.
-// This allows certain behaviors, like dynamically checking for
-// consistent shapes on write, and being able to fill in properly
-// shaped zero tensors on stack -- even if the element_shape attribute
-// is not fully defined.
-// If not specified, defaults to false
-func TensorArrayV3IdenticalElementShapes(value bool) TensorArrayV3Attr {
+// value: If non-empty, this table is shared under the given name across
+// multiple sessions.
+// If not specified, defaults to ""
+func HashTableV2SharedName(value string) HashTableV2Attr {
 	return func(m optionalAttr) {
-		m["identical_element_shapes"] = value
+		m["shared_name"] = value
 	}
 }
 
-// TensorArrayV3TensorArrayName sets the optional tensor_array_name attribute to value.
+// HashTableV2UseNodeNameSharing sets the optional use_node_name_sharing attribute to value.
 //
-// value: Overrides the name used for the temporary tensor_array
-// resource. Default value is the name of the 'TensorArray' op (which
-// is guaranteed unique).
-// If not specified, defaults to ""
-func TensorArrayV3TensorArrayName(value string) TensorArrayV3Attr {
+// value: If true and shared_name is empty, the table is shared
+// using the node name.
+// If not specified, defaults to false
+func HashTableV2UseNodeNameSharing(value bool) HashTableV2Attr {
 	return func(m optionalAttr) {
-		m["tensor_array_name"] = value
+		m["use_node_name_sharing"] = value
 	}
 }
 
-// An array of Tensors of given size.
+// Creates a non-initialized hash table.
 //
-// Write data via Write and read via Read or Pack.
+// This op creates a hash table, specifying the type of its keys and values.
+// Before using the table you will have to initialize it.  After initialization the
+// table will be immutable.
 //
 // Arguments:
-//	size: The size of the array.
-//	dtype: The type of the elements on the tensor_array.
+//	key_dtype: Type of the table keys.
+//	value_dtype: Type of the table values.
 //
-// Returns The handle to the TensorArray.A scalar used to control gradient flow.
-func TensorArrayV3(scope *Scope, size tf.Output, dtype tf.DataType, optional ...TensorArrayV3Attr) (handle tf.Output, flow tf.Output) {
+// Returns Handle to a table.
+func HashTableV2(scope *Scope, key_dtype tf.DataType, value_dtype tf.DataType, optional ...HashTableV2Attr) (table_handle tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"dtype": dtype}
+	attrs := map[string]interface{}{"key_dtype": key_dtype, "value_dtype": value_dtype}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "TensorArrayV3",
-		Input: []tf.Input{
-			size,
-		},
+		Type: "HashTableV2",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0), op.Output(1)
+	return op.Output(0)
 }
 
-// MaxPool3DAttr is an optional argument to MaxPool3D.
-type MaxPool3DAttr func(optionalAttr)
+// MapUnstageNoKeyAttr is an optional argument to MapUnstageNoKey.
+type MapUnstageNoKeyAttr func(optionalAttr)
 
-// MaxPool3DDataFormat sets the optional data_format attribute to value.
+// MapUnstageNoKeyCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
 //
-// value: The data format of the input and output data. With the
-// default format "NDHWC", the data is stored in the order of:
-//     [batch, in_depth, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCDHW", the data storage order is:
-//     [batch, in_channels, in_depth, in_height, in_width].
-// If not specified, defaults to "NDHWC"
-func MaxPool3DDataFormat(value string) MaxPool3DAttr {
+// REQUIRES: value >= 0
+func MapUnstageNoKeyCapacity(value int64) MapUnstageNoKeyAttr {
 	return func(m optionalAttr) {
-		m["data_format"] = value
+		m["capacity"] = value
 	}
 }
 
-// Performs 3D max pooling on the input.
+// MapUnstageNoKeyMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
 //
-// Arguments:
-//	input: Shape `[batch, depth, rows, cols, channels]` tensor to pool over.
-//	ksize: 1-D tensor of length 5. The size of the window for each dimension of
-// the input tensor. Must have `ksize[0] = ksize[4] = 1`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
-//	padding: The type of padding algorithm to use.
+// REQUIRES: value >= 0
+func MapUnstageNoKeyMemoryLimit(value int64) MapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// MapUnstageNoKeyContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func MapUnstageNoKeyContainer(value string) MapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// MapUnstageNoKeySharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func MapUnstageNoKeySharedName(value string) MapUnstageNoKeyAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op removes and returns a random (key, value) pair
 //
-// Returns The max pooled output tensor.
-func MaxPool3D(scope *Scope, input tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPool3DAttr) (output tf.Output) {
+// from the underlying container.   If the underlying container
+// does not contain elements, the op will block until it does.
+func MapUnstageNoKey(scope *Scope, indices tf.Output, dtypes []tf.DataType, optional ...MapUnstageNoKeyAttr) (key tf.Output, values []tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MaxPool3D",
+		Type: "MapUnstageNoKey",
 		Input: []tf.Input{
-			input,
+			indices,
 		},
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
-// Computes the gradients of 3-D convolution with respect to the input.
-//
-// DEPRECATED at GraphDef version 10: Use Conv3DBackpropInputV2
-//
-// Arguments:
-//	input: Shape `[batch, depth, rows, cols, in_channels]`.
-//	filter: Shape `[depth, rows, cols, in_channels, out_channels]`.
-// `in_channels` must match between `input` and `filter`.
-//	out_backprop: Backprop signal of shape `[batch, out_depth, out_rows, out_cols,
-// out_channels]`.
-//	strides: 1-D tensor of length 5. The stride of the sliding window for each
-// dimension of `input`. Must have `strides[0] = strides[4] = 1`.
-//	padding: The type of padding algorithm to use.
-func Conv3DBackpropInput(scope *Scope, input tf.Output, filter tf.Output, out_backprop tf.Output, strides []int64, padding string) (output tf.Output) {
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"strides": strides, "padding": padding}
-	opspec := tf.OpSpec{
-		Type: "Conv3DBackpropInput",
-		Input: []tf.Input{
-			input, filter, out_backprop,
-		},
-		Attrs: attrs,
+	var idx int
+	var err error
+	key = op.Output(idx)
+	if values, idx, err = makeOutputList(op, idx, "values"); err != nil {
+		scope.UpdateErr("MapUnstageNoKey", err)
+		return
 	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
+	return key, values
 }
 
 // Inverse 2D fast Fourier transform.
@@ -11437,6 +11370,54 @@ func ResourceApplyMomentum(scope *Scope, var_ tf.Output, accum tf.Output, lr tf.
 	return scope.AddOperation(opspec)
 }
 
+// MaxPoolGradGradAttr is an optional argument to MaxPoolGradGrad.
+type MaxPoolGradGradAttr func(optionalAttr)
+
+// MaxPoolGradGradDataFormat sets the optional data_format attribute to value.
+//
+// value: Specify the data format of the input and output data. With the
+// default format "NHWC", the data is stored in the order of:
+//     [batch, in_height, in_width, in_channels].
+// Alternatively, the format could be "NCHW", in which case the data storage order is:
+//     [batch, in_channels, in_height, in_width].
+// If not specified, defaults to "NHWC"
+func MaxPoolGradGradDataFormat(value string) MaxPoolGradGradAttr {
+	return func(m optionalAttr) {
+		m["data_format"] = value
+	}
+}
+
+// Computes second-order gradients of the maxpooling function.
+//
+// Arguments:
+//	orig_input: The original input tensor.
+//	orig_output: The original output tensor.
+//	grad: 4-D.  Gradients of gradients w.r.t. the input of `max_pool`.
+//	ksize: The size of the window for each dimension of the input tensor.
+//	strides: The stride of the sliding window for each dimension of the
+// input tensor.
+//	padding: The type of padding algorithm to use.
+//
+// Returns Gradients of gradients w.r.t. the input to `max_pool`.
+func MaxPoolGradGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPoolGradGradAttr) (output tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "MaxPoolGradGrad",
+		Input: []tf.Input{
+			orig_input, orig_output, grad,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Returns the truth value of (x >= y) element-wise.
 //
 // *NOTE*: `GreaterEqual` supports broadcasting. More about broadcasting
@@ -12204,15 +12185,72 @@ func MutableDenseHashTableV2(scope *Scope, empty_key tf.Output, value_dtype tf.D
 	if scope.Err() != nil {
 		return
 	}
-	attrs := map[string]interface{}{"value_dtype": value_dtype}
+	attrs := map[string]interface{}{"value_dtype": value_dtype}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "MutableDenseHashTableV2",
+		Input: []tf.Input{
+			empty_key,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
+// StageSizeAttr is an optional argument to StageSize.
+type StageSizeAttr func(optionalAttr)
+
+// StageSizeCapacity sets the optional capacity attribute to value.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func StageSizeCapacity(value int64) StageSizeAttr {
+	return func(m optionalAttr) {
+		m["capacity"] = value
+	}
+}
+
+// StageSizeMemoryLimit sets the optional memory_limit attribute to value.
+// If not specified, defaults to 0
+//
+// REQUIRES: value >= 0
+func StageSizeMemoryLimit(value int64) StageSizeAttr {
+	return func(m optionalAttr) {
+		m["memory_limit"] = value
+	}
+}
+
+// StageSizeContainer sets the optional container attribute to value.
+// If not specified, defaults to ""
+func StageSizeContainer(value string) StageSizeAttr {
+	return func(m optionalAttr) {
+		m["container"] = value
+	}
+}
+
+// StageSizeSharedName sets the optional shared_name attribute to value.
+// If not specified, defaults to ""
+func StageSizeSharedName(value string) StageSizeAttr {
+	return func(m optionalAttr) {
+		m["shared_name"] = value
+	}
+}
+
+// Op returns the number of elements in the underlying container.
+func StageSize(scope *Scope, dtypes []tf.DataType, optional ...StageSizeAttr) (size tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{"dtypes": dtypes}
 	for _, a := range optional {
 		a(attrs)
 	}
 	opspec := tf.OpSpec{
-		Type: "MutableDenseHashTableV2",
-		Input: []tf.Input{
-			empty_key,
-		},
+		Type: "StageSize",
+
 		Attrs: attrs,
 	}
 	op := scope.AddOperation(opspec)
@@ -12961,6 +12999,56 @@ func Neg(scope *Scope, x tf.Output) (y tf.Output) {
 	return op.Output(0)
 }
 
+// FakeQuantWithMinMaxVarsAttr is an optional argument to FakeQuantWithMinMaxVars.
+type FakeQuantWithMinMaxVarsAttr func(optionalAttr)
+
+// FakeQuantWithMinMaxVarsNumBits sets the optional num_bits attribute to value.
+// If not specified, defaults to 8
+func FakeQuantWithMinMaxVarsNumBits(value int64) FakeQuantWithMinMaxVarsAttr {
+	return func(m optionalAttr) {
+		m["num_bits"] = value
+	}
+}
+
+// FakeQuantWithMinMaxVarsNarrowRange sets the optional narrow_range attribute to value.
+// If not specified, defaults to false
+func FakeQuantWithMinMaxVarsNarrowRange(value bool) FakeQuantWithMinMaxVarsAttr {
+	return func(m optionalAttr) {
+		m["narrow_range"] = value
+	}
+}
+
+// Fake-quantize the 'inputs' tensor of type float via global float scalars `min`
+//
+// and `max` to an 'outputs' tensor of the same shape as `inputs`.
+//
+// `[min; max]` define the clamping range for the `inputs` data.
+// `inputs` values are quantized into the quantization range (`[0; 2^num_bits - 1]`
+// when `narrow_range` is false and `[1; 2^num_bits - 1]` when it is true) and
+// then de-quantized and output as floats in `[min; max]` interval.
+// `num_bits` is the bitwidth of the quantization; between 2 and 8, inclusive.
+//
+// This operation has a gradient and thus allows for training `min` and `max`
+// values.
+func FakeQuantWithMinMaxVars(scope *Scope, inputs tf.Output, min tf.Output, max tf.Output, optional ...FakeQuantWithMinMaxVarsAttr) (outputs tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	attrs := map[string]interface{}{}
+	for _, a := range optional {
+		a(attrs)
+	}
+	opspec := tf.OpSpec{
+		Type: "FakeQuantWithMinMaxVars",
+		Input: []tf.Input{
+			inputs, min, max,
+		},
+		Attrs: attrs,
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Writes a `Summary` protocol buffer with a histogram.
 //
 // The generated
@@ -14994,54 +15082,6 @@ func TensorArrayCloseV3(scope *Scope, handle tf.Output) (o *tf.Operation) {
 	return scope.AddOperation(opspec)
 }
 
-// MaxPoolGradGradAttr is an optional argument to MaxPoolGradGrad.
-type MaxPoolGradGradAttr func(optionalAttr)
-
-// MaxPoolGradGradDataFormat sets the optional data_format attribute to value.
-//
-// value: Specify the data format of the input and output data. With the
-// default format "NHWC", the data is stored in the order of:
-//     [batch, in_height, in_width, in_channels].
-// Alternatively, the format could be "NCHW", the data storage order of:
-//     [batch, in_channels, in_height, in_width].
-// If not specified, defaults to "NHWC"
-func MaxPoolGradGradDataFormat(value string) MaxPoolGradGradAttr {
-	return func(m optionalAttr) {
-		m["data_format"] = value
-	}
-}
-
-// Computes second-order gradients of the maxpooling function.
-//
-// Arguments:
-//	orig_input: The original input tensor.
-//	orig_output: The original output tensor.
-//	grad: 4-D.  Gradients of gradients w.r.t. the input of `max_pool`.
-//	ksize: The size of the window for each dimension of the input tensor.
-//	strides: The stride of the sliding window for each dimension of the
-// input tensor.
-//	padding: The type of padding algorithm to use.
-//
-// Returns Gradients of gradients w.r.t. the input to `max_pool`.
-func MaxPoolGradGrad(scope *Scope, orig_input tf.Output, orig_output tf.Output, grad tf.Output, ksize []int64, strides []int64, padding string, optional ...MaxPoolGradGradAttr) (output tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"ksize": ksize, "strides": strides, "padding": padding}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "MaxPoolGradGrad",
-		Input: []tf.Input{
-			orig_input, orig_output, grad,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // RandomUniformIntAttr is an optional argument to RandomUniformInt.
 type RandomUniformIntAttr func(optionalAttr)
 
@@ -15312,57 +15352,6 @@ func TruncatedNormal(scope *Scope, shape tf.Output, dtype tf.DataType, optional
 	return op.Output(0)
 }
 
-// FakeQuantWithMinMaxVarsPerChannelAttr is an optional argument to FakeQuantWithMinMaxVarsPerChannel.
-type FakeQuantWithMinMaxVarsPerChannelAttr func(optionalAttr)
-
-// FakeQuantWithMinMaxVarsPerChannelNumBits sets the optional num_bits attribute to value.
-// If not specified, defaults to 8
-func FakeQuantWithMinMaxVarsPerChannelNumBits(value int64) FakeQuantWithMinMaxVarsPerChannelAttr {
-	return func(m optionalAttr) {
-		m["num_bits"] = value
-	}
-}
-
-// FakeQuantWithMinMaxVarsPerChannelNarrowRange sets the optional narrow_range attribute to value.
-// If not specified, defaults to false
-func FakeQuantWithMinMaxVarsPerChannelNarrowRange(value bool) FakeQuantWithMinMaxVarsPerChannelAttr {
-	return func(m optionalAttr) {
-		m["narrow_range"] = value
-	}
-}
-
-// Fake-quantize the 'inputs' tensor of type float and one of the shapes: `[d]`,
-//
-// `[b, d]` `[b, h, w, d]` via per-channel floats `min` and `max` of shape `[d]`
-// to 'outputs' tensor of same shape as `inputs`.
-//
-// `[min; max]` define the clamping range for the `inputs` data.
-// `inputs` values are quantized into the quantization range (`[0; 2^num_bits - 1]`
-// when `narrow_range` is false and `[1; 2^num_bits - 1]` when it is true) and
-// then de-quantized and output as floats in `[min; max]` interval.
-// `num_bits` is the bitwidth of the quantization; between 2 and 8, inclusive.
-//
-// This operation has a gradient and thus allows for training `min` and `max`
-// values.
-func FakeQuantWithMinMaxVarsPerChannel(scope *Scope, inputs tf.Output, min tf.Output, max tf.Output, optional ...FakeQuantWithMinMaxVarsPerChannelAttr) (outputs tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "FakeQuantWithMinMaxVarsPerChannel",
-		Input: []tf.Input{
-			inputs, min, max,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // RandomShuffleAttr is an optional argument to RandomShuffle.
 type RandomShuffleAttr func(optionalAttr)
 
@@ -17760,23 +17749,6 @@ func SoftplusGrad(scope *Scope, gradients tf.Output, features tf.Output) (backpr
 	return op.Output(0)
 }
 
-// Creates a dataset that contains the unique elements of `input_dataset`.
-func UniqueDataset(scope *Scope, input_dataset tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{"output_types": output_types, "output_shapes": output_shapes}
-	opspec := tf.OpSpec{
-		Type: "UniqueDataset",
-		Input: []tf.Input{
-			input_dataset,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
-
 // SelfAdjointEigV2Attr is an optional argument to SelfAdjointEigV2.
 type SelfAdjointEigV2Attr func(optionalAttr)
 
@@ -20021,6 +19993,26 @@ func SparseTensorSliceDataset(scope *Scope, indices tf.Output, values tf.Output,
 	return op.Output(0)
 }
 
+// Returns x / y element-wise for real types.
+//
+// If `x` and `y` are reals, this will return the floating-point division.
+//
+// *NOTE*: `RealDiv` supports broadcasting. More about broadcasting
+// [here](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
+func RealDiv(scope *Scope, x tf.Output, y tf.Output) (z tf.Output) {
+	if scope.Err() != nil {
+		return
+	}
+	opspec := tf.OpSpec{
+		Type: "RealDiv",
+		Input: []tf.Input{
+			x, y,
+		},
+	}
+	op := scope.AddOperation(opspec)
+	return op.Output(0)
+}
+
 // Creates a dataset that concatenates `input_dataset` with `another_dataset`.
 func ConcatenateDataset(scope *Scope, input_dataset tf.Output, another_dataset tf.Output, output_types []tf.DataType, output_shapes []tf.Shape) (handle tf.Output) {
 	if scope.Err() != nil {
@@ -28288,53 +28280,3 @@ func QuantizedInstanceNorm(scope *Scope, x tf.Output, x_min tf.Output, x_max tf.
 	op := scope.AddOperation(opspec)
 	return op.Output(0), op.Output(1), op.Output(2)
 }
-
-// FakeQuantWithMinMaxVarsAttr is an optional argument to FakeQuantWithMinMaxVars.
-type FakeQuantWithMinMaxVarsAttr func(optionalAttr)
-
-// FakeQuantWithMinMaxVarsNumBits sets the optional num_bits attribute to value.
-// If not specified, defaults to 8
-func FakeQuantWithMinMaxVarsNumBits(value int64) FakeQuantWithMinMaxVarsAttr {
-	return func(m optionalAttr) {
-		m["num_bits"] = value
-	}
-}
-
-// FakeQuantWithMinMaxVarsNarrowRange sets the optional narrow_range attribute to value.
-// If not specified, defaults to false
-func FakeQuantWithMinMaxVarsNarrowRange(value bool) FakeQuantWithMinMaxVarsAttr {
-	return func(m optionalAttr) {
-		m["narrow_range"] = value
-	}
-}
-
-// Fake-quantize the 'inputs' tensor of type float via global float scalars `min`
-//
-// and `max` to 'outputs' tensor of same shape as `inputs`.
-//
-// `[min; max]` define the clamping range for the `inputs` data.
-// `inputs` values are quantized into the quantization range (`[0; 2^num_bits - 1]`
-// when `narrow_range` is false and `[1; 2^num_bits - 1]` when it is true) and
-// then de-quantized and output as floats in `[min; max]` interval.
-// `num_bits` is the bitwidth of the quantization; between 2 and 8, inclusive.
-//
-// This operation has a gradient and thus allows for training `min` and `max`
-// values.
-func FakeQuantWithMinMaxVars(scope *Scope, inputs tf.Output, min tf.Output, max tf.Output, optional ...FakeQuantWithMinMaxVarsAttr) (outputs tf.Output) {
-	if scope.Err() != nil {
-		return
-	}
-	attrs := map[string]interface{}{}
-	for _, a := range optional {
-		a(attrs)
-	}
-	opspec := tf.OpSpec{
-		Type: "FakeQuantWithMinMaxVars",
-		Input: []tf.Input{
-			inputs, min, max,
-		},
-		Attrs: attrs,
-	}
-	op := scope.AddOperation(opspec)
-	return op.Output(0)
-}
diff --git a/tensorflow/java/maven/libtensorflow/pom.xml b/tensorflow/java/maven/libtensorflow/pom.xml
index d35bb4111271c11839a160517dc9695ead5b46e9..1c84eae540d476ad3d3f5010be5c8aef48af58a5 100644
--- a/tensorflow/java/maven/libtensorflow/pom.xml
+++ b/tensorflow/java/maven/libtensorflow/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.6.0-rc1</version>
+    <version>1.6.0</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow</artifactId>
diff --git a/tensorflow/java/maven/libtensorflow_jni/pom.xml b/tensorflow/java/maven/libtensorflow_jni/pom.xml
index d9ba1bbbfb91170257f64a56f47c6c980e8a9570..cf1a7b6c9c51d2a6f08c2bd89d917af9d45212be 100644
--- a/tensorflow/java/maven/libtensorflow_jni/pom.xml
+++ b/tensorflow/java/maven/libtensorflow_jni/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.6.0-rc1</version>
+    <version>1.6.0</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow_jni</artifactId>
diff --git a/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml b/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
index f6f532c2c10d0a4dad9fc2d7750ea708652000b1..b202dcd5c79b253eb42929ef378392deaf83b8d9 100644
--- a/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
+++ b/tensorflow/java/maven/libtensorflow_jni_gpu/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.6.0-rc1</version>
+    <version>1.6.0</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>libtensorflow_jni_gpu</artifactId>
diff --git a/tensorflow/java/maven/pom.xml b/tensorflow/java/maven/pom.xml
index 0a6b3d23d7d37515cf275e6a46842e32ada4fee1..606805ff33376f639740f324762290ac6732e19d 100644
--- a/tensorflow/java/maven/pom.xml
+++ b/tensorflow/java/maven/pom.xml
@@ -6,7 +6,7 @@
   <modelVersion>4.0.0</modelVersion>
   <groupId>org.tensorflow</groupId>
   <artifactId>parentpom</artifactId>
-  <version>1.6.0-rc1</version>
+  <version>1.6.0</version>
   <packaging>pom</packaging>
 
   <url>https://www.tensorflow.org</url>
diff --git a/tensorflow/java/maven/proto/pom.xml b/tensorflow/java/maven/proto/pom.xml
index 1d8e8723731f959c8142f0648fc805593d7beac8..c6bba4e5365c7b400359d1d4005071cfec32978a 100644
--- a/tensorflow/java/maven/proto/pom.xml
+++ b/tensorflow/java/maven/proto/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.6.0-rc1</version>
+    <version>1.6.0</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>proto</artifactId>
diff --git a/tensorflow/java/maven/tensorflow/pom.xml b/tensorflow/java/maven/tensorflow/pom.xml
index 5c1b55085c5df1ec473a3f4e0bf750b236cfc264..a22663f9f3ead8195b032e45c085ba02f337b594 100644
--- a/tensorflow/java/maven/tensorflow/pom.xml
+++ b/tensorflow/java/maven/tensorflow/pom.xml
@@ -6,7 +6,7 @@
   <parent>
     <groupId>org.tensorflow</groupId>
     <artifactId>parentpom</artifactId>
-    <version>1.6.0-rc1</version>
+    <version>1.6.0</version>
     <relativePath>../</relativePath>
   </parent>
   <artifactId>tensorflow</artifactId>
diff --git a/tensorflow/java/src/main/native/tensor_jni.cc b/tensorflow/java/src/main/native/tensor_jni.cc
index 745abec244d1528e918464473e5d3fb19ad5082c..7e3cf4a88aac5acd4721a07c8316d8d124dce001 100644
--- a/tensorflow/java/src/main/native/tensor_jni.cc
+++ b/tensorflow/java/src/main/native/tensor_jni.cc
@@ -400,7 +400,13 @@ size_t nonScalarTF_STRINGTensorSize(JNIEnv* env, jarray value, int num_dims) {
   for (jsize i = 0; i < len; ++i) {
     jarray elem = static_cast<jarray>(
         env->GetObjectArrayElement(static_cast<jobjectArray>(value), i));
+    if (elem == nullptr) {
+      throwException(env, kNullPointerException,
+                     "null entries in provided array");
+      return ret;
+    }
     ret += nonScalarTF_STRINGTensorSize(env, elem, num_dims - 1);
+    if (env->ExceptionCheck()) return ret;
   }
   return ret;
 }
@@ -421,8 +427,8 @@ void fillNonScalarTF_STRINGTensorData(JNIEnv* env, jarray value, int num_dims,
   for (jsize i = 0; i < len; ++i) {
     jarray elem = static_cast<jarray>(
         env->GetObjectArrayElement(static_cast<jobjectArray>(value), i));
-    if (TF_GetCode(status) != TF_OK) return;
     fillNonScalarTF_STRINGTensorData(env, elem, num_dims - 1, writer, status);
+    if (TF_GetCode(status) != TF_OK) return;
   }
 }
 }  // namespace
@@ -444,6 +450,7 @@ JNIEXPORT jlong JNICALL Java_org_tensorflow_Tensor_allocateNonScalarBytes(
   }
   const size_t encoded_size =
       nonScalarTF_STRINGTensorSize(env, value, num_dims);
+  if (env->ExceptionCheck()) return 0;
   TF_Tensor* t = TF_AllocateTensor(TF_STRING, dims, num_dims,
                                    8 * num_elements + encoded_size);
   if (t == nullptr) {
diff --git a/tensorflow/java/src/test/java/org/tensorflow/TensorTest.java b/tensorflow/java/src/test/java/org/tensorflow/TensorTest.java
index 6538359d11a95eae698cc5aac8430e74ab1ed74c..1bd00a763ddff2f067183f57cfa80fdcbed84fd2 100644
--- a/tensorflow/java/src/test/java/org/tensorflow/TensorTest.java
+++ b/tensorflow/java/src/test/java/org/tensorflow/TensorTest.java
@@ -432,7 +432,7 @@ public class TensorTest {
     try (Tensor<Integer> t = Tensor.create(vector, Integer.class)) {
       fail("Tensor.create() should fail because it was given an array of boxed values");
     } catch (IllegalArgumentException e) {
-        // The expected exception
+      // The expected exception
     }
   }
 
@@ -536,4 +536,15 @@ public class TensorTest {
       assertArrayEquals(matrix, cpy.copyTo(new float[2][3]));
     }
   }
+
+  @Test
+  public void gracefullyFailCreationFromNullArrayForStringTensor() {
+    // Motivated by: https://github.com/tensorflow/tensorflow/issues/17130
+    byte[][] array = new byte[1][];
+    try {
+      Tensors.create(array);
+    } catch (NullPointerException e) {
+      // expected.
+    }
+  }
 }
diff --git a/tensorflow/python/BUILD b/tensorflow/python/BUILD
index cee7c47e00d5673cd2abf2b1e526523ad61bbafd..db17a3fe0237ff03c51358bf5df76c4a912dee6d 100644
--- a/tensorflow/python/BUILD
+++ b/tensorflow/python/BUILD
@@ -765,6 +765,31 @@ py_library(
     ],
 )
 
+py_library(
+    name = "smart_cond",
+    srcs = ["framework/smart_cond.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":control_flow_ops",
+        ":tensor_util",
+    ],
+)
+
+py_test(
+    name = "smart_cond_test",
+    size = "small",
+    srcs = ["framework/smart_cond_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":client_testlib",
+        ":constant_op",
+        ":framework_ops",
+        ":math_ops",
+        ":session",
+        ":smart_cond",
+    ],
+)
+
 py_library(
     name = "sparse_tensor",
     srcs = ["framework/sparse_tensor.py"],
@@ -2857,10 +2882,10 @@ py_library(
     srcs = ["training/checkpointable.py"],
     srcs_version = "PY2AND3",
     deps = [
+        ":array_ops",
         ":dtypes",
         ":io_ops_gen",
         ":ops",
-        ":pywrap_tensorflow",
         ":util",
         "//tensorflow/python/eager:context",
     ],
@@ -3630,6 +3655,7 @@ py_test(
         ":framework_for_generated_wrappers",
         ":math_ops",
         ":state_ops_gen",
+        ":variable_scope",
         ":variables",
         "//tensorflow/core:protos_all_py",
     ],
@@ -3920,7 +3946,10 @@ py_test(
     size = "small",
     srcs = ["training/checkpoint_utils_test.py"],
     srcs_version = "PY2AND3",
-    tags = ["no_windows"],
+    tags = [
+        "no_cuda_on_cpu_tap",
+        "no_windows",
+    ],
     deps = [
         ":client",
         ":client_testlib",
@@ -3929,6 +3958,7 @@ py_test(
         ":partitioned_variables",
         ":platform",
         ":pywrap_tensorflow",
+        ":resource_variable_ops",
         ":state_ops",
         ":training",
         ":variable_scope",
@@ -4092,6 +4122,7 @@ py_library(
         ":control_flow_ops",
         ":framework_for_generated_wrappers",
         ":platform",
+        ":smart_cond",
         ":tensor_util",
         ":util",
         ":variable_scope",
@@ -4108,8 +4139,6 @@ py_library(
         "layers/convolutional.py",
         "layers/core.py",
         "layers/layers.py",
-        "layers/maxout.py",
-        "layers/network.py",
         "layers/normalization.py",
         "layers/pooling.py",
     ],
@@ -4162,25 +4191,6 @@ py_test(
     ],
 )
 
-py_test(
-    name = "layers_network_test",
-    size = "small",
-    srcs = ["layers/network_test.py"],
-    main = "layers/network_test.py",
-    srcs_version = "PY2AND3",
-    deps = [
-        ":array_ops",
-        ":client_testlib",
-        ":framework_for_generated_wrappers",
-        ":framework_test_lib",
-        ":layers",
-        ":layers_base",
-        ":sparse_ops",
-        "//tensorflow/python/eager:context",
-        "//third_party/py/numpy",
-    ],
-)
-
 py_test(
     name = "layers_core_test",
     size = "small",
@@ -4219,22 +4229,6 @@ py_test(
     ],
 )
 
-py_test(
-    name = "layers_maxout_test",
-    size = "small",
-    srcs = ["layers/maxout_test.py"],
-    main = "layers/maxout_test.py",
-    srcs_version = "PY2AND3",
-    deps = [
-        ":client_testlib",
-        ":framework_for_generated_wrappers",
-        ":layers",
-        ":math_ops",
-        ":nn_ops",
-        ":random_ops",
-    ],
-)
-
 py_test(
     name = "layers_utils_test",
     size = "small",
@@ -4630,6 +4624,34 @@ py_test(
     ],
 )
 
+py_library(
+    name = "graph_placer",
+    srcs = [
+        "grappler/controller.py",
+        "grappler/graph_placer.py",
+        "grappler/hierarchical_controller.py",
+    ],
+    deps = [
+        ":python",
+        "//third_party/py/numpy",
+    ],
+)
+
+py_test(
+    name = "graph_placer_test",
+    size = "large",
+    srcs = ["grappler/graph_placer_test.py"],
+    tags = [
+        "grappler",
+        "no_pip",  # graph_placer is not available in pip.
+    ],
+    deps = [
+        ":client_testlib",
+        ":graph_placer",
+        ":math_ops",
+    ],
+)
+
 py_test(
     name = "memory_optimizer_test",
     size = "medium",
@@ -4721,6 +4743,7 @@ py_test(
     srcs_version = "PY2AND3",
     tags = [
         "grappler",
+        "no_cuda_on_cpu_tap",
         "no_pip",
     ],
     deps = [
diff --git a/tensorflow/python/__init__.py b/tensorflow/python/__init__.py
index 02ed5517ca895ab070a89f8810f77dadcff9212b..d6715fa5222b26f5ed7500ec995836ef28b69957 100644
--- a/tensorflow/python/__init__.py
+++ b/tensorflow/python/__init__.py
@@ -198,13 +198,9 @@ tf_export('TensorInfo')(TensorInfo)
 _allowed_symbols.extend([
     'arg_max',
     'arg_min',
-    'mul',  # use tf.multiply instead.
-    'neg',  # use tf.negative instead.
-    'sub',  # use tf.subtract instead.
     'create_partitioned_variables',
     'deserialize_many_sparse',
     'lin_space',
-    'list_diff',  # Use tf.listdiff instead.
     'listdiff',  # Use tf.listdiff instead.
     'parse_single_sequence_example',
     'serialize_many_sparse',
diff --git a/tensorflow/python/client/session.py b/tensorflow/python/client/session.py
index f3c4fecdc0fde0436bea76cc774edaabe1bc07dd..5737047c4b9927fefd9700f0f2a1841c0c561fbd 100644
--- a/tensorflow/python/client/session.py
+++ b/tensorflow/python/client/session.py
@@ -1085,7 +1085,10 @@ class BaseSession(SessionInterface):
           if isinstance(subfeed_val, ops.Tensor):
             raise TypeError('The value of a feed cannot be a tf.Tensor object. '
                             'Acceptable feed values include Python scalars, '
-                            'strings, lists, numpy ndarrays, or TensorHandles.')
+                            'strings, lists, numpy ndarrays, or TensorHandles. '
+                            'For reference, the tensor object was ' +
+                            str(feed_val) + ' which was passed to the '
+                            'feed with key ' + str(feed) + '.')
 
           subfeed_dtype = subfeed_t.dtype.as_numpy_dtype
           if isinstance(subfeed_val, int) and _convert_to_numpy_obj(
diff --git a/tensorflow/python/client/tf_session.i b/tensorflow/python/client/tf_session.i
index 1fd488e7b6388f7953a279dca8f93ab57a85f63d..f305cd271f98bea697ea8ff15be799d3e80db0bf 100644
--- a/tensorflow/python/client/tf_session.i
+++ b/tensorflow/python/client/tf_session.i
@@ -719,6 +719,8 @@ def TF_Reset(target, containers=None, config=None):
   $1 = &types_local;
 }
 
+%unignore SetRequireShapeInferenceFns;
+
 %include "tensorflow/python/client/tf_session_helper.h"
 
 %unignoreall
diff --git a/tensorflow/python/data/kernel_tests/dataset_from_generator_op_test.py b/tensorflow/python/data/kernel_tests/dataset_from_generator_op_test.py
index f129d07b57b96b7869c84467aeb2276c93531ef8..6aabad2f574551cbdc152fe378eb9dc0f5f71995 100644
--- a/tensorflow/python/data/kernel_tests/dataset_from_generator_op_test.py
+++ b/tensorflow/python/data/kernel_tests/dataset_from_generator_op_test.py
@@ -21,9 +21,12 @@ import threading
 
 import numpy as np
 
+from tensorflow.python.client import session
 from tensorflow.python.data.ops import dataset_ops
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
+from tensorflow.python.ops import script_ops
 from tensorflow.python.platform import test
 
 
@@ -302,6 +305,89 @@ class DatasetConstructorTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
+  def testFromGeneratorStopShort(self):
+
+    def generator():
+      yield 0
+      yield 1
+      yield 2
+
+    iterator = (
+        dataset_ops.Dataset.from_generator(
+            generator, output_types=dtypes.int64).make_initializable_iterator())
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(init_op)
+      self.assertAllEqual(0, sess.run(get_next))
+      self.assertAllEqual(1, sess.run(get_next))
+
+  def testFromGeneratorDestructorCalled(self):
+    # Use an `Event` to signal that the generator has been deleted.
+    event = threading.Event()
+
+    class GeneratorWrapper(object):
+
+      def __iter__(self):
+        return self
+
+      def next(self):
+        return self.__next__()
+
+      def __next__(self):
+        return 42
+
+      def __del__(self):
+        event.set()
+
+    iterator = dataset_ops.Dataset.from_generator(
+        GeneratorWrapper,
+        output_types=dtypes.int64).take(2).make_initializable_iterator()
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+
+    with session.Session() as sess:
+      sess.run(init_op)
+      self.assertAllEqual(42, sess.run(get_next))
+      self.assertAllEqual(42, sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+      # Test that `GeneratorWrapper` object is destroyed when the
+      # iterator terminates (and the generator iterator is deleted).
+      self.assertTrue(event.is_set())
+
+  def testGeneratorDatasetFinalizeFunctionCalled(self):
+    # NOTE(mrry): This test tests the internal `_GeneratorDataset`,
+    # which affords more control over what the finalize function can do than
+    # the `Dataset.from_generator()` wrapper.
+
+    # Use an `Event` to signal that the generator has been deleted.
+    event = threading.Event()
+
+    def finalize_fn(_):
+      def finalize_py_func():
+        event.set()
+        return 0
+      return script_ops.py_func(finalize_py_func, [], [dtypes.int64],
+                                stateful=True)
+
+    dummy = constant_op.constant(37)
+    iterator = (dataset_ops._GeneratorDataset(dummy, lambda x: x,
+                                              lambda x: x, finalize_fn)
+                .take(2)
+                .make_initializable_iterator())
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(init_op)
+      self.assertAllEqual(37, sess.run(get_next))
+      self.assertAllEqual(37, sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+      self.assertTrue(event.is_set())
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/data/kernel_tests/list_files_dataset_op_test.py b/tensorflow/python/data/kernel_tests/list_files_dataset_op_test.py
index 4e7691ee8144a19a62476281d86fb5df46dd3e4b..6442eb9ff554e61829796fb904342072d1846a32 100644
--- a/tensorflow/python/data/kernel_tests/list_files_dataset_op_test.py
+++ b/tensorflow/python/data/kernel_tests/list_files_dataset_op_test.py
@@ -46,8 +46,9 @@ class ListFilesDatasetOpTest(test.TestCase):
     dataset = dataset_ops.Dataset.list_files(path.join(self.tmp_dir, '*'))
     with self.test_session() as sess:
       itr = dataset.make_one_shot_iterator()
+      next_element = itr.get_next()
       with self.assertRaises(errors.OutOfRangeError):
-        sess.run(itr.get_next())
+        sess.run(next_element)
 
   def testSimpleDirectory(self):
     filenames = ['a', 'b', 'c']
@@ -56,13 +57,14 @@ class ListFilesDatasetOpTest(test.TestCase):
     dataset = dataset_ops.Dataset.list_files(path.join(self.tmp_dir, '*'))
     with self.test_session() as sess:
       itr = dataset.make_one_shot_iterator()
+      next_element = itr.get_next()
 
       full_filenames = []
       produced_filenames = []
       for filename in filenames:
         full_filenames.append(
             compat.as_bytes(path.join(self.tmp_dir, filename)))
-        produced_filenames.append(compat.as_bytes(sess.run(itr.get_next())))
+        produced_filenames.append(compat.as_bytes(sess.run(next_element)))
       self.assertItemsEqual(full_filenames, produced_filenames)
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(itr.get_next())
@@ -73,12 +75,13 @@ class ListFilesDatasetOpTest(test.TestCase):
 
     with self.test_session() as sess:
       itr = dataset.make_initializable_iterator()
+      next_element = itr.get_next()
       sess.run(
           itr.initializer,
           feed_dict={filename_placeholder: path.join(self.tmp_dir, '*')})
 
       with self.assertRaises(errors.OutOfRangeError):
-        sess.run(itr.get_next())
+        sess.run(next_element)
 
   def testSimpleDirectoryInitializer(self):
     filenames = ['a', 'b', 'c']
@@ -89,6 +92,7 @@ class ListFilesDatasetOpTest(test.TestCase):
 
     with self.test_session() as sess:
       itr = dataset.make_initializable_iterator()
+      next_element = itr.get_next()
       sess.run(
           itr.initializer,
           feed_dict={filename_placeholder: path.join(self.tmp_dir, '*')})
@@ -98,7 +102,7 @@ class ListFilesDatasetOpTest(test.TestCase):
       for filename in filenames:
         full_filenames.append(
             compat.as_bytes(path.join(self.tmp_dir, filename)))
-        produced_filenames.append(compat.as_bytes(sess.run(itr.get_next())))
+        produced_filenames.append(compat.as_bytes(sess.run(next_element)))
 
       self.assertItemsEqual(full_filenames, produced_filenames)
 
@@ -114,6 +118,7 @@ class ListFilesDatasetOpTest(test.TestCase):
 
     with self.test_session() as sess:
       itr = dataset.make_initializable_iterator()
+      next_element = itr.get_next()
       sess.run(
           itr.initializer,
           feed_dict={filename_placeholder: path.join(self.tmp_dir, '*.py')})
@@ -123,7 +128,7 @@ class ListFilesDatasetOpTest(test.TestCase):
       for filename in filenames[1:-1]:
         full_filenames.append(
             compat.as_bytes(path.join(self.tmp_dir, filename)))
-        produced_filenames.append(compat.as_bytes(sess.run(itr.get_next())))
+        produced_filenames.append(compat.as_bytes(sess.run(next_element)))
       self.assertItemsEqual(full_filenames, produced_filenames)
 
       with self.assertRaises(errors.OutOfRangeError):
@@ -138,6 +143,7 @@ class ListFilesDatasetOpTest(test.TestCase):
 
     with self.test_session() as sess:
       itr = dataset.make_initializable_iterator()
+      next_element = itr.get_next()
       sess.run(
           itr.initializer,
           feed_dict={filename_placeholder: path.join(self.tmp_dir, '*.py*')})
@@ -147,13 +153,44 @@ class ListFilesDatasetOpTest(test.TestCase):
       for filename in filenames[1:]:
         full_filenames.append(
             compat.as_bytes(path.join(self.tmp_dir, filename)))
-        produced_filenames.append(compat.as_bytes(sess.run(itr.get_next())))
+        produced_filenames.append(compat.as_bytes(sess.run(next_element)))
 
       self.assertItemsEqual(full_filenames, produced_filenames)
 
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(itr.get_next())
 
+  def testNoShuffle(self):
+    filenames = ['a', 'b', 'c']
+    self._touchTempFiles(filenames)
+
+    # Repeat the list twice and ensure that the order is the same each time.
+    # NOTE(mrry): This depends on an implementation detail of `list_files()`,
+    # which is that the list of files is captured when the iterator is
+    # initialized. Otherwise, or if e.g. the iterator were initialized more than
+    # once, it's possible that the non-determinism of `tf.matching_files()`
+    # would cause this test to fail. However, it serves as a useful confirmation
+    # that the `shuffle=False` argument is working as intended.
+    # TODO(b/73959787): Provide some ordering guarantees so that this test is
+    # more meaningful.
+    dataset = dataset_ops.Dataset.list_files(
+        path.join(self.tmp_dir, '*'), shuffle=False).repeat(2)
+    with self.test_session() as sess:
+      itr = dataset.make_one_shot_iterator()
+      next_element = itr.get_next()
+
+      full_filenames = []
+      produced_filenames = []
+      for filename in filenames * 2:
+        full_filenames.append(
+            compat.as_bytes(path.join(self.tmp_dir, filename)))
+        produced_filenames.append(compat.as_bytes(sess.run(next_element)))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+      self.assertItemsEqual(full_filenames, produced_filenames)
+      self.assertEqual(produced_filenames[:len(filenames)],
+                       produced_filenames[len(filenames):])
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/data/kernel_tests/map_dataset_op_test.py b/tensorflow/python/data/kernel_tests/map_dataset_op_test.py
index 04d1abdb254feea1df6f1b8cfc5a512802107224..0791c614fa88700fdf2d0d673e168fc9784731a5 100644
--- a/tensorflow/python/data/kernel_tests/map_dataset_op_test.py
+++ b/tensorflow/python/data/kernel_tests/map_dataset_op_test.py
@@ -602,6 +602,28 @@ class MapDatasetTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
+  def testParallelMapOutOfRangeError(self):
+    def raising_py_func(i):
+      if i == 100:
+        raise StopIteration()
+      else:
+        return i
+
+    iterator = (
+        dataset_ops.Dataset.range(105)
+        .map(lambda x: script_ops.py_func(raising_py_func, [x], dtypes.int64),
+             num_parallel_calls=2)
+        .make_initializable_iterator())
+    init_op = iterator.initializer
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(init_op)
+      for i in range(100):
+        self.assertEqual(i, sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
 
 class MapDatasetBenchmark(test.Benchmark):
 
diff --git a/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py b/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py
index d7140088c310767d40bd2cf3413c899375acab15..1ddedfda4e1c9d6b6949f796be1870f167435763 100644
--- a/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py
+++ b/tensorflow/python/data/kernel_tests/reader_dataset_ops_test.py
@@ -21,6 +21,7 @@ import gzip
 import os
 import zlib
 
+from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.data.ops import readers
 from tensorflow.python.framework import constant_op
@@ -736,12 +737,43 @@ class TFRecordDatasetTest(test.TestCase):
     one_mebibyte = 2**20
     d = readers.TFRecordDataset(self.test_filenames, buffer_size=one_mebibyte)
     iterator = d.make_one_shot_iterator()
+    next_element = iterator.get_next()
     with self.test_session() as sess:
       for j in range(self._num_files):
         for i in range(self._num_records):
-          self.assertAllEqual(self._record(j, i), sess.run(iterator.get_next()))
+          self.assertAllEqual(self._record(j, i), sess.run(next_element))
       with self.assertRaises(errors.OutOfRangeError):
-        sess.run(iterator.get_next())
+        sess.run(next_element)
+
+  def testReadFromDatasetOfFiles(self):
+    files = dataset_ops.Dataset.from_tensor_slices(self.test_filenames)
+    d = readers.TFRecordDataset(files)
+    iterator = d.make_one_shot_iterator()
+    next_element = iterator.get_next()
+    with self.test_session() as sess:
+      for j in range(self._num_files):
+        for i in range(self._num_records):
+          self.assertAllEqual(self._record(j, i), sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+
+  def testReadTenEpochsFromDatasetOfFilesInParallel(self):
+    files = dataset_ops.Dataset.from_tensor_slices(
+        self.test_filenames).repeat(10)
+    d = readers.TFRecordDataset(files, num_parallel_reads=4)
+    iterator = d.make_one_shot_iterator()
+    next_element = iterator.get_next()
+    expected = []
+    actual = []
+    with self.test_session() as sess:
+      for _ in range(10):
+        for j in range(self._num_files):
+          for i in range(self._num_records):
+            expected.append(self._record(j, i))
+            actual.append(sess.run(next_element))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(next_element)
+      self.assertEqual(sorted(expected), sorted(actual))
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/python/data/kernel_tests/shuffle_dataset_op_test.py b/tensorflow/python/data/kernel_tests/shuffle_dataset_op_test.py
index c089fb08c1082c1cf74d492796550980d6755591..5fcc48831f3ca744e015c92760f12ea4dbef2ff7 100644
--- a/tensorflow/python/data/kernel_tests/shuffle_dataset_op_test.py
+++ b/tensorflow/python/data/kernel_tests/shuffle_dataset_op_test.py
@@ -132,6 +132,33 @@ class ShuffleDatasetTest(test.TestCase):
       with self.assertRaises(errors.OutOfRangeError):
         sess.run(get_next)
 
+  def testSeedZero(self):
+    """Test for same behavior when the seed is a Python or Tensor zero."""
+    iterator = (
+        dataset_ops.Dataset.range(10).shuffle(10, seed=0)
+        .make_one_shot_iterator())
+    get_next = iterator.get_next()
+
+    elems = []
+    with self.test_session() as sess:
+      for _ in range(10):
+        elems.append(sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
+    seed_placeholder = array_ops.placeholder(dtypes.int64, shape=[])
+    iterator = (
+        dataset_ops.Dataset.range(10).shuffle(10, seed=seed_placeholder)
+        .make_initializable_iterator())
+    get_next = iterator.get_next()
+
+    with self.test_session() as sess:
+      sess.run(iterator.initializer, feed_dict={seed_placeholder: 0})
+      for elem in elems:
+        self.assertEqual(elem, sess.run(get_next))
+      with self.assertRaises(errors.OutOfRangeError):
+        sess.run(get_next)
+
   def testDefaultArguments(self):
     components = [0, 1, 2, 3, 4]
     iterator = (dataset_ops.Dataset.from_tensor_slices(components).shuffle(5)
diff --git a/tensorflow/python/data/ops/BUILD b/tensorflow/python/data/ops/BUILD
index f12b358a7dc35c18338171e489fa88ba1a82d11b..a8f2154db8c1555dcba07229edf33d9c581ddad1 100644
--- a/tensorflow/python/data/ops/BUILD
+++ b/tensorflow/python/data/ops/BUILD
@@ -23,6 +23,7 @@ py_library(
         "//tensorflow/python:tensor_util",
         "//tensorflow/python:util",
         "//tensorflow/python/data/util:nest",
+        "//tensorflow/python/data/util:random_seed",
         "//tensorflow/python/data/util:sparse",
         "//third_party/py/numpy",
     ],
@@ -34,6 +35,7 @@ py_library(
     srcs_version = "PY2AND3",
     deps = [
         ":dataset_ops",
+        "//tensorflow/python:array_ops",
         "//tensorflow/python:dataset_ops_gen",
         "//tensorflow/python:dtypes",
         "//tensorflow/python:framework_ops",
diff --git a/tensorflow/python/data/ops/dataset_ops.py b/tensorflow/python/data/ops/dataset_ops.py
index b665443b7acb9eb266b6fcf36a002cfce54875f1..7c5aa4c76791cd9b864ec81f8734272f997018cf 100644
--- a/tensorflow/python/data/ops/dataset_ops.py
+++ b/tensorflow/python/data/ops/dataset_ops.py
@@ -26,16 +26,17 @@ import six
 
 from tensorflow.python.data.ops import iterator_ops
 from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import random_seed
 from tensorflow.python.data.util import sparse
 from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
-from tensorflow.python.framework import random_seed
 from tensorflow.python.framework import sparse_tensor as sparse_tensor_lib
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gen_dataset_ops
 from tensorflow.python.ops import gen_io_ops
 from tensorflow.python.ops import math_ops
@@ -331,10 +332,10 @@ class Dataset(object):
 
     generator_state = Dataset._GeneratorState(generator)
 
-    def get_iterator_id_map_fn(unused_dummy):
+    def get_iterator_id_fn(unused_dummy):
       """Creates a unique `iterator_id` for each pass over the dataset.
 
-      The "iterator_id" disambiguates between multiple concurrently
+      The returned `iterator_id` disambiguates between multiple concurrently
       existing iterators.
 
       Args:
@@ -347,7 +348,7 @@ class Dataset(object):
       return script_ops.py_func(
           generator_state.get_next_id, [], dtypes.int64, stateful=True)
 
-    def generator_map_fn(iterator_id_t):
+    def generator_next_fn(iterator_id_t):
       """Generates the next element from iterator with ID `iterator_id_t`.
 
       We map this function across an infinite repetition of the
@@ -363,11 +364,9 @@ class Dataset(object):
 
       def generator_py_func(iterator_id):
         """A `py_func` that will be called to invoke the iterator."""
-        try:
-          values = next(generator_state.get_iterator(iterator_id))
-        except StopIteration:
-          generator_state.iterator_completed(iterator_id)
-          raise StopIteration("Iteration finished.")
+        # `next()` raises `StopIteration` when there are no more
+        # elements remaining to be generated.
+        values = next(generator_state.get_iterator(iterator_id))
 
         # Use the same _convert function from the py_func() implementation to
         # convert the returned values to arrays early, so that we can inspect
@@ -408,17 +407,31 @@ class Dataset(object):
 
       return nest.pack_sequence_as(output_types, flat_values)
 
+    def finalize_fn(iterator_id_t):
+      """Releases host-side state for the iterator with ID `iterator_id_t`."""
+
+      def finalize_py_func(iterator_id):
+        generator_state.iterator_completed(iterator_id)
+        # We return a dummy value so that the `finalize_fn` has a valid
+        # signature.
+        # NOTE(mrry): Explicitly create an array of `np.int64` because implicit
+        # casting in `py_func()` will create an array of `np.int32` on Windows,
+        # leading to a runtime error.
+        return np.array(0, dtype=np.int64)
+
+      return script_ops.py_func(
+          finalize_py_func, [iterator_id_t], dtypes.int64, stateful=True)
+
     # This function associates each traversal of `generator` with a unique
     # iterator ID.
-    def flat_map_fn(iterator_id_t):
-      # First, generate an infinite dataset containing the iterator ID repeated
-      # forever.
-      repeated_id = Dataset.from_tensors(iterator_id_t).repeat(None)
-
-      # The `generator_map_fn` gets the next element from the iterator with the
-      # relevant ID, and raises StopIteration when that iterator contains no
+    def flat_map_fn(dummy_arg):
+      # The `get_iterator_id_fn` gets a unique ID for the current instance of
+      # the generator.
+      # The `generator_next_fn` gets the next element from the iterator with the
+      # given ID, and raises StopIteration when that iterator contains no
       # more elements.
-      return repeated_id.map(generator_map_fn)
+      return _GeneratorDataset(dummy_arg, get_iterator_id_fn, generator_next_fn,
+                               finalize_fn)
 
     # A single-element dataset that, each time it is evaluated, contains a
     # freshly-generated and unique (for the returned dataset) int64
@@ -426,7 +439,7 @@ class Dataset(object):
     # is encapsulated in `generator_state`, and captured in
     # `get_iterator_id_map_fn`.
     dummy = 0
-    id_dataset = Dataset.from_tensors(dummy).map(get_iterator_id_map_fn)
+    id_dataset = Dataset.from_tensors(dummy)
 
     # A dataset that contains all of the elements generated by a
     # single iterator created from `generator`, identified by the
@@ -545,7 +558,7 @@ class Dataset(object):
     return PrefetchDataset(self, buffer_size)
 
   @staticmethod
-  def list_files(file_pattern):
+  def list_files(file_pattern, shuffle=None):
     """A dataset of all files matching a pattern.
 
     Example:
@@ -558,16 +571,31 @@ class Dataset(object):
         - /path/to/dir/b.py
         - /path/to/dir/c.py
 
-    NOTE: The order of the file names returned can be non-deterministic.
+    NOTE: The order of the file names returned can be non-deterministic even
+    when `shuffle` is `False`.
 
     Args:
       file_pattern: A string or scalar string `tf.Tensor`, representing
         the filename pattern that will be matched.
+      shuffle: (Optional.) If `True`, the file names will be shuffled randomly.
+        Defaults to `True`.
 
     Returns:
      Dataset: A `Dataset` of strings corresponding to file names.
     """
-    return Dataset.from_tensor_slices(gen_io_ops.matching_files(file_pattern))
+    # TODO(b/73959787): Add a `seed` argument and make the `shuffle=False`
+    # behavior deterministic (e.g. by sorting the filenames).
+    if shuffle is None:
+      shuffle = True
+    matching_files = gen_io_ops.matching_files(file_pattern)
+    dataset = Dataset.from_tensor_slices(matching_files)
+    if shuffle:
+      # NOTE(mrry): The shuffle buffer size must be greater than zero, but the
+      # list of files might be empty.
+      buffer_size = math_ops.maximum(
+          array_ops.shape(matching_files, out_type=dtypes.int64)[0], 1)
+      dataset = dataset.shuffle(buffer_size)
+    return dataset
 
   def repeat(self, count=None):
     """Repeats this dataset `count` times.
@@ -1033,6 +1061,196 @@ class SparseTensorSliceDataset(Dataset):
     return (dtypes.int64, self._sparse_tensor.dtype, dtypes.int64)
 
 
+class _GeneratorDataset(Dataset):
+  """A `Dataset` that generates elements by invoking a function."""
+
+  def __init__(self, init_args, init_func, next_func, finalize_func):
+    """Constructs a `_GeneratorDataset`.
+
+    Args:
+      init_args: A nested structure representing the arguments to `init_func`.
+      init_func: A TensorFlow function that will be called on `init_args` each
+        time a C++ iterator over this dataset is constructed. Returns a nested
+        structure representing the "state" of the dataset.
+      next_func: A TensorFlow function that will be called on the result of
+        `init_func` to produce each element, and that raises `OutOfRangeError`
+        to terminate iteration.
+      finalize_func: A TensorFlow function that will be called on the result of
+        `init_func` immediately before a C++ iterator over this dataset is
+        destroyed. The return value is ignored.
+    """
+    super(_GeneratorDataset, self).__init__()
+    # These members will be initialized by `tf_init_func`.
+    self._state_classes = None
+    self._state_shapes = None
+    self._state_types = None
+
+    self._init_args = init_args
+
+    init_args_classes = sparse.get_classes(init_args)
+    init_args_shapes = nest.pack_sequence_as(
+        init_args, [t.get_shape() for t in nest.flatten(init_args)])
+    init_args_types = nest.pack_sequence_as(
+        init_args, [t.dtype for t in nest.flatten(init_args)])
+
+    @function.Defun(*nest.flatten(
+        sparse.as_dense_types(init_args_types, init_args_classes)))
+    def tf_init_func(*args):
+      """A wrapper for Defun that facilitates shape inference."""
+      dense_shapes = sparse.as_dense_shapes(init_args_shapes, init_args_classes)
+      for arg, shape in zip(args, nest.flatten(dense_shapes)):
+        arg.set_shape(shape)
+
+      nested_args = nest.pack_sequence_as(init_args_classes, args)
+      nested_args = sparse.deserialize_sparse_tensors(
+          nested_args, init_args_types, init_args_shapes, init_args_classes)
+      if _should_unpack_args(nested_args):
+        ret = init_func(*nested_args)
+      else:
+        ret = init_func(nested_args)
+
+      # If `init_func` returns a list of tensors, `nest.flatten()` and
+      # `ops.convert_to_tensor()` would conspire to attempt to stack
+      # those tensors into a single tensor, because the customized
+      # version of `nest.flatten()` does not recurse into lists. Since
+      # it is more likely that the list arose from returning the
+      # result of an operation (such as `tf.py_func()`) that returns a
+      # list of not-necessarily-stackable tensors, we treat the
+      # returned value as a `tuple` instead. A user wishing to pack
+      # the return value into a single tensor can use an explicit
+      # `tf.stack()` before returning.
+      if isinstance(ret, list):
+        ret = tuple(ret)
+
+      # Convert any `SparseTensorValue`s to `SparseTensor`s.
+      ret = nest.pack_sequence_as(ret, [
+          sparse_tensor_lib.SparseTensor.from_value(t)
+          if sparse_tensor_lib.is_sparse(t) else t for t in nest.flatten(ret)
+      ])
+
+      self._state_classes = sparse.get_classes(ret)
+      self._state_shapes = nest.pack_sequence_as(
+          ret, [t.get_shape() for t in nest.flatten(ret)])
+      self._state_types = nest.pack_sequence_as(
+          ret, [t.dtype for t in nest.flatten(ret)])
+
+      # Serialize any sparse tensors and convert result to tensors.
+      ret = nest.pack_sequence_as(ret, [
+          ops.convert_to_tensor(t)
+          for t in nest.flatten(sparse.serialize_sparse_tensors(ret))
+      ])
+      return nest.flatten(ret)
+
+    self._init_func = tf_init_func
+    self._init_func.add_to_graph(ops.get_default_graph())
+
+    # These members will be initialized by `tf_next_func`.
+    self._output_classes = None
+    self._output_shapes = None
+    self._output_types = None
+
+    @function.Defun(*nest.flatten(
+        sparse.as_dense_types(self._state_types, self._state_classes)))
+    def tf_next_func(*args):
+      """A wrapper for Defun that facilitates shape inference."""
+      # Pass in shape information from the state.
+      dense_shapes = sparse.as_dense_shapes(self._state_shapes,
+                                            self._state_classes)
+      for arg, shape in zip(args, nest.flatten(dense_shapes)):
+        arg.set_shape(shape)
+
+      nested_args = nest.pack_sequence_as(self._state_classes, args)
+      nested_args = sparse.deserialize_sparse_tensors(
+          nested_args, self._state_types, self._state_shapes,
+          self._state_classes)
+      if _should_unpack_args(nested_args):
+        ret = next_func(*nested_args)
+      else:
+        ret = next_func(nested_args)
+
+      # If `next_func` returns a list of tensors, `nest.flatten()` and
+      # `ops.convert_to_tensor()` would conspire to attempt to stack
+      # those tensors into a single tensor, because the customized
+      # version of `nest.flatten()` does not recurse into lists. Since
+      # it is more likely that the list arose from returning the
+      # result of an operation (such as `tf.py_func()`) that returns a
+      # list of not-necessarily-stackable tensors, we treat the
+      # returned value as a `tuple` instead. A user wishing to pack
+      # the return value into a single tensor can use an explicit
+      # `tf.stack()` before returning.
+      if isinstance(ret, list):
+        ret = tuple(ret)
+
+      # Convert any `SparseTensorValue`s to `SparseTensor`s.
+      ret = nest.pack_sequence_as(ret, [
+          sparse_tensor_lib.SparseTensor.from_value(t)
+          if sparse_tensor_lib.is_sparse(t) else t for t in nest.flatten(ret)
+      ])
+
+      self._output_classes = sparse.get_classes(ret)
+      self._output_shapes = nest.pack_sequence_as(
+          ret, [t.get_shape() for t in nest.flatten(ret)])
+      self._output_types = nest.pack_sequence_as(
+          ret, [t.dtype for t in nest.flatten(ret)])
+
+      # Serialize any sparse tensors and convert result to tensors.
+      ret = nest.pack_sequence_as(ret, [
+          ops.convert_to_tensor(t)
+          for t in nest.flatten(sparse.serialize_sparse_tensors(ret))
+      ])
+      return nest.flatten(ret)
+
+    self._next_func = tf_next_func
+    self._next_func.add_to_graph(ops.get_default_graph())
+
+    @function.Defun(*nest.flatten(
+        sparse.as_dense_types(self._state_types, self._state_classes)))
+    def tf_finalize_func(*args):
+      """A wrapper for Defun that facilitates shape inference."""
+      # Pass in shape information from the state.
+      dense_shapes = sparse.as_dense_shapes(self._state_shapes,
+                                            self._state_classes)
+      for arg, shape in zip(args, nest.flatten(dense_shapes)):
+        arg.set_shape(shape)
+
+      nested_args = nest.pack_sequence_as(self._state_classes, args)
+      nested_args = sparse.deserialize_sparse_tensors(
+          nested_args, self._state_types, self._state_shapes,
+          self._state_classes)
+      if _should_unpack_args(nested_args):
+        return finalize_func(*nested_args)
+      else:
+        return finalize_func(nested_args)
+
+    self._finalize_func = tf_finalize_func
+    self._finalize_func.add_to_graph(ops.get_default_graph())
+
+  def _as_variant_tensor(self):
+    return gen_dataset_ops.generator_dataset(
+        nest.flatten(self._init_args) + self._init_func.captured_inputs,
+        self._next_func.captured_inputs,
+        self._finalize_func.captured_inputs,
+        init_func=self._init_func,
+        next_func=self._next_func,
+        finalize_func=self._finalize_func,
+        output_types=nest.flatten(
+            sparse.as_dense_types(self.output_types, self.output_classes)),
+        output_shapes=nest.flatten(
+            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+
+  @property
+  def output_classes(self):
+    return self._output_classes
+
+  @property
+  def output_shapes(self):
+    return self._output_shapes
+
+  @property
+  def output_types(self):
+    return self._output_types
+
+
 class ZipDataset(Dataset):
   """A `Dataset` that zips its inputs together."""
 
@@ -1282,16 +1500,7 @@ class ShuffleDataset(Dataset):
     self._input_dataset = input_dataset
     self._buffer_size = ops.convert_to_tensor(
         buffer_size, dtype=dtypes.int64, name="buffer_size")
-    seed, seed2 = random_seed.get_seed(seed)
-    if seed is None:
-      self._seed = constant_op.constant(0, dtype=dtypes.int64, name="seed")
-    else:
-      self._seed = ops.convert_to_tensor(seed, dtype=dtypes.int64, name="seed")
-    if seed2 is None:
-      self._seed2 = constant_op.constant(0, dtype=dtypes.int64, name="seed2")
-    else:
-      self._seed2 = ops.convert_to_tensor(
-          seed2, dtype=dtypes.int64, name="seed2")
+    self._seed, self._seed2 = random_seed.get_seed(seed)
     if reshuffle_each_iteration is None:
       self._reshuffle_each_iteration = True
     else:
diff --git a/tensorflow/python/data/ops/readers.py b/tensorflow/python/data/ops/readers.py
index fa7601741b11f018e9b53ed3b77a7561be50d3f4..6c493d8163b051b2e724335923d7b4c721523083 100644
--- a/tensorflow/python/data/ops/readers.py
+++ b/tensorflow/python/data/ops/readers.py
@@ -17,11 +17,15 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.python.data.ops.dataset_ops import Dataset
+from tensorflow.python.data.ops import dataset_ops
 from tensorflow.python.data.util import convert
+from tensorflow.python.data.util import nest
+from tensorflow.python.data.util import sparse
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_shape
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gen_dataset_ops
 from tensorflow.python.util.tf_export import tf_export
 
@@ -31,7 +35,7 @@ _DEFAULT_READER_BUFFER_SIZE_BYTES = 256 * 1024  # 256 KB
 
 
 @tf_export("data.TextLineDataset")
-class TextLineDataset(Dataset):
+class TextLineDataset(dataset_ops.Dataset):
   """A `Dataset` comprising lines from one or more text files."""
 
   def __init__(self, filenames, compression_type=None, buffer_size=None):
@@ -73,8 +77,7 @@ class TextLineDataset(Dataset):
     return dtypes.string
 
 
-@tf_export("data.TFRecordDataset")
-class TFRecordDataset(Dataset):
+class _TFRecordDataset(dataset_ops.Dataset):
   """A `Dataset` comprising records from one or more TFRecord files."""
 
   def __init__(self, filenames, compression_type=None, buffer_size=None):
@@ -87,7 +90,7 @@ class TFRecordDataset(Dataset):
       buffer_size: (Optional.) A `tf.int64` scalar representing the number of
         bytes in the read buffer. 0 means no buffering.
     """
-    super(TFRecordDataset, self).__init__()
+    super(_TFRecordDataset, self).__init__()
     # Force the type to string even if filenames is an empty list.
     self._filenames = ops.convert_to_tensor(
         filenames, dtypes.string, name="filenames")
@@ -118,8 +121,159 @@ class TFRecordDataset(Dataset):
     return dtypes.string
 
 
+class ParallelInterleaveDataset(dataset_ops.Dataset):
+  """A `Dataset` that maps a function over its input and flattens the result."""
+
+  def __init__(self, input_dataset, map_func, cycle_length, block_length,
+               sloppy, buffer_output_elements, prefetch_input_elements):
+    """See `tf.contrib.data.parallel_interleave()` for details."""
+    super(ParallelInterleaveDataset, self).__init__()
+    self._input_dataset = input_dataset
+
+    @function.Defun(*nest.flatten(
+        sparse.as_dense_types(input_dataset.output_types,
+                              input_dataset.output_classes)))
+    def tf_map_func(*args):
+      """A wrapper for Defun that facilitates shape inference."""
+      # Pass in shape information from the input_dataset.
+      dense_shapes = sparse.as_dense_shapes(input_dataset.output_shapes,
+                                            input_dataset.output_classes)
+      for arg, shape in zip(args, nest.flatten(dense_shapes)):
+        arg.set_shape(shape)
+
+      nested_args = nest.pack_sequence_as(input_dataset.output_types, args)
+      nested_args = sparse.deserialize_sparse_tensors(
+          nested_args, input_dataset.output_types, input_dataset.output_shapes,
+          input_dataset.output_classes)
+      if dataset_ops._should_unpack_args(nested_args):  # pylint: disable=protected-access
+        dataset = map_func(*nested_args)
+      else:
+        dataset = map_func(nested_args)
+
+      if not isinstance(dataset, dataset_ops.Dataset):
+        raise TypeError("`map_func` must return a `Dataset` object.")
+
+      self._output_classes = dataset.output_classes
+      self._output_types = dataset.output_types
+      self._output_shapes = dataset.output_shapes
+
+      return dataset._as_variant_tensor()  # pylint: disable=protected-access
+
+    self._map_func = tf_map_func
+    self._map_func.add_to_graph(ops.get_default_graph())
+
+    self._cycle_length = ops.convert_to_tensor(
+        cycle_length, dtype=dtypes.int64, name="cycle_length")
+    self._block_length = ops.convert_to_tensor(
+        block_length, dtype=dtypes.int64, name="block_length")
+    self._sloppy = ops.convert_to_tensor(
+        sloppy, dtype=dtypes.bool, name="sloppy")
+    self._buffer_output_elements = convert.optional_param_to_tensor(
+        "buffer_output_elements",
+        buffer_output_elements,
+        argument_default=2 * block_length)
+    self._prefetch_input_elements = convert.optional_param_to_tensor(
+        "prefetch_input_elements",
+        prefetch_input_elements,
+        argument_default=2 * cycle_length)
+
+  def _as_variant_tensor(self):
+    return gen_dataset_ops.parallel_interleave_dataset(
+        self._input_dataset._as_variant_tensor(),  # pylint: disable=protected-access
+        self._map_func.captured_inputs,
+        self._cycle_length,
+        self._block_length,
+        self._sloppy,
+        self._buffer_output_elements,
+        self._prefetch_input_elements,
+        f=self._map_func,
+        output_types=nest.flatten(
+            sparse.as_dense_types(self.output_types, self.output_classes)),
+        output_shapes=nest.flatten(
+            sparse.as_dense_shapes(self.output_shapes, self.output_classes)))
+
+  @property
+  def output_classes(self):
+    return self._output_classes
+
+  @property
+  def output_shapes(self):
+    return self._output_shapes
+
+  @property
+  def output_types(self):
+    return self._output_types
+
+
+@tf_export("data.TFRecordDataset")
+class TFRecordDataset(dataset_ops.Dataset):
+  """A `Dataset` comprising records from one or more TFRecord files."""
+
+  def __init__(self, filenames, compression_type=None, buffer_size=None,
+               num_parallel_reads=None):
+    """Creates a `TFRecordDataset` to read one or more TFRecord files.
+
+    NOTE: The `num_parallel_reads` argument can be used to improve performance
+    when reading from a remote filesystem.
+
+    Args:
+      filenames: A `tf.string` tensor or `tf.data.Dataset` containing one or
+        more filenames.
+      compression_type: (Optional.) A `tf.string` scalar evaluating to one of
+        `""` (no compression), `"ZLIB"`, or `"GZIP"`.
+      buffer_size: (Optional.) A `tf.int64` scalar representing the number of
+        bytes in the read buffer. 0 means no buffering.
+      num_parallel_reads: (Optional.) A `tf.int64` scalar representing the
+        number of files to read in parallel. Defaults to reading files
+        sequentially.
+
+    Raises:
+      TypeError: If any argument does not have the expected type.
+      ValueError: If any argument does not have the expected shape.
+    """
+    super(TFRecordDataset, self).__init__()
+    if isinstance(filenames, dataset_ops.Dataset):
+      if filenames.output_types != dtypes.string:
+        raise TypeError(
+            "`filenames` must be a `tf.data.Dataset` of `tf.string` elements.")
+      if not filenames.output_shapes.is_compatible_with(tensor_shape.scalar()):
+        raise ValueError(
+            "`filenames` must be a `tf.data.Dataset` of scalar `tf.string` "
+            "elements.")
+    else:
+      filenames = ops.convert_to_tensor(filenames, dtype=dtypes.string)
+      filenames = array_ops.reshape(filenames, [-1], name="flat_filenames")
+      filenames = dataset_ops.Dataset.from_tensor_slices(filenames)
+
+    def read_one_file(filename):
+      return _TFRecordDataset(filename, compression_type, buffer_size)
+
+    if num_parallel_reads is None:
+      self._impl = filenames.flat_map(read_one_file)
+    else:
+      self._impl = ParallelInterleaveDataset(
+          filenames, read_one_file, cycle_length=num_parallel_reads,
+          block_length=1, sloppy=False, buffer_output_elements=None,
+          prefetch_input_elements=None)
+
+  def _as_variant_tensor(self):
+    return self._impl._as_variant_tensor()  # pylint: disable=protected-access
+
+  @property
+  def output_classes(self):
+    return self._impl.output_classes
+
+  @property
+  def output_shapes(self):
+    return self._impl.output_shapes
+
+  @property
+  def output_types(self):
+    return self._impl.output_types
+
+
 @tf_export("data.FixedLengthRecordDataset")
-class FixedLengthRecordDataset(Dataset):
+class FixedLengthRecordDataset(dataset_ops.Dataset):
   """A `Dataset` of fixed-length records from one or more binary files."""
 
   def __init__(self,
diff --git a/tensorflow/python/data/util/BUILD b/tensorflow/python/data/util/BUILD
index e32c7b54a48dd887c2748897c3ce3661aab9f497..b1bdbdab37b63667b475c732df7a47d9e57f2b19 100644
--- a/tensorflow/python/data/util/BUILD
+++ b/tensorflow/python/data/util/BUILD
@@ -86,6 +86,30 @@ py_test(
     ],
 )
 
+py_library(
+    name = "random_seed",
+    srcs = ["random_seed.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        "//tensorflow/python:constant_op",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:framework",
+    ],
+)
+
+py_test(
+    name = "random_seed_test",
+    size = "small",
+    srcs = ["random_seed_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":random_seed",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:util",
+    ],
+)
+
 filegroup(
     name = "all_files",
     srcs = glob(
diff --git a/tensorflow/python/data/util/random_seed.py b/tensorflow/python/data/util/random_seed.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2c9d8672f94587fd3164f25f97b44a97526be07
--- /dev/null
+++ b/tensorflow/python/data/util/random_seed.py
@@ -0,0 +1,58 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Utilities for generating Tensor-valued random seeds."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import random_seed
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import math_ops
+
+
+def get_seed(seed):
+  """Returns the local seeds an operation should use given an op-specific seed.
+
+  See @{tf.get_seed} for more details. This wrapper adds support for the case
+  where `seed` may be a tensor.
+
+  Args:
+    seed: An integer or a @{tf.int64} scalar tensor.
+
+  Returns:
+    A tuple of two @{tf.int64} scalar tensors that should be used for the local
+    seed of the calling dataset.
+  """
+  seed, seed2 = random_seed.get_seed(seed)
+  if seed is None:
+    seed = constant_op.constant(0, dtype=dtypes.int64, name="seed")
+  else:
+    seed = ops.convert_to_tensor(seed, dtype=dtypes.int64, name="seed")
+  if seed2 is None:
+    seed2 = constant_op.constant(0, dtype=dtypes.int64, name="seed2")
+  else:
+    with ops.name_scope("seed2") as scope:
+      seed2 = ops.convert_to_tensor(seed2, dtype=dtypes.int64)
+      seed2 = array_ops.where(
+          math_ops.logical_and(
+              math_ops.equal(seed, 0), math_ops.equal(seed2, 0)),
+          constant_op.constant(2**31 - 1, dtype=dtypes.int64),
+          seed2,
+          name=scope)
+  return seed, seed2
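Reviewer note on the `get_seed` wrapper above: a minimal pure-Python sketch of the post-processing it applies to the `(seed, seed2)` pair returned by the framework-level `random_seed.get_seed`. The helper name `remap_seeds` and the constant `MAX_INT32` are hypothetical and exist only for illustration; the real wrapper operates on `tf.int64` tensors and uses `array_ops.where`, which this sketch replaces with ordinary Python conditionals.

```python
# Hypothetical pure-Python analogue of the seed post-processing in
# tensorflow/python/data/util/random_seed.py. It is not part of the patch.
MAX_INT32 = 2**31 - 1


def remap_seeds(seed, seed2):
    """Mirrors the wrapper's handling of the (seed, seed2) pair.

    `None` values become 0. If `seed2` was provided and both values are 0,
    `seed2` is replaced with 2**31 - 1 so the pair is never (0, 0), which
    the dataset kernels would treat as "pick a nondeterministic seed".
    """
    seed = 0 if seed is None else seed
    if seed2 is None:
        # No op-level seed: the wrapper emits a constant 0 and does no remap.
        return seed, 0
    if seed == 0 and seed2 == 0:
        seed2 = MAX_INT32
    return seed, seed2


if __name__ == "__main__":
    for pair in [(None, None), (0, 0), (1, 1), (0, MAX_INT32)]:
        print(pair, "->", remap_seeds(*pair))
```

The sketch only models the part of the wrapper that runs after the framework call; how the graph-level and op-level seeds are combined is left to `random_seed.get_seed` and is not reproduced here.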
diff --git a/tensorflow/python/data/util/random_seed_test.py b/tensorflow/python/data/util/random_seed_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..c3a2dc05379e198baacab28a813219d059f34a40
--- /dev/null
+++ b/tensorflow/python/data/util/random_seed_test.py
@@ -0,0 +1,83 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for utilities for generating Tensor-valued random seeds."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.data.util import random_seed as data_random_seed
+from tensorflow.python.eager import context
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import random_seed
+from tensorflow.python.framework import test_util
+from tensorflow.python.platform import test
+
+
+class RandomSeedTest(test.TestCase):
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testRandomSeed(self):
+    zero_t = constant_op.constant(0, dtype=dtypes.int64, name='zero')
+    one_t = constant_op.constant(1, dtype=dtypes.int64, name='one')
+    intmax_t = constant_op.constant(
+        2**31 - 1, dtype=dtypes.int64, name='intmax')
+    test_cases = [
+        # Each test case is a tuple with input to get_seed:
+        # (input_graph_seed, input_op_seed)
+        # and output from get_seed:
+        # (output_graph_seed, output_op_seed)
+        ((None, None), (0, 0)),
+        ((None, 1), (random_seed.DEFAULT_GRAPH_SEED, 1)),
+        ((1, 1), (1, 1)),
+        ((0, 0), (0, 2**31 - 1)),  # Avoid nondeterministic (0, 0) output
+        ((2**31 - 1, 0), (0, 2**31 - 1)),  # Don't wrap to (0, 0) either
+        ((0, 2**31 - 1), (0, 2**31 - 1)),  # Wrapping for the other argument
+        # Once more, with tensor-valued arguments
+        ((None, one_t), (random_seed.DEFAULT_GRAPH_SEED, 1)),
+        ((1, one_t), (1, 1)),
+        ((0, zero_t), (0, 2**31 - 1)),  # Avoid nondeterministic (0, 0) output
+        ((2**31 - 1, zero_t), (0, 2**31 - 1)),  # Don't wrap to (0, 0) either
+        ((0, intmax_t), (0, 2**31 - 1)),  # Wrapping for the other argument
+    ]
+    for tc in test_cases:
+      tinput, toutput = tc[0], tc[1]
+      random_seed.set_random_seed(tinput[0])
+      g_seed, op_seed = data_random_seed.get_seed(tinput[1])
+      g_seed = self.evaluate(g_seed)
+      op_seed = self.evaluate(op_seed)
+      msg = 'test_case = {0}, got {1}, want {2}'.format(
+          tinput, (g_seed, op_seed), toutput)
+      self.assertEqual((g_seed, op_seed), toutput, msg=msg)
+      random_seed.set_random_seed(None)
+
+    if context.in_graph_mode():
+      random_seed.set_random_seed(1)
+      tinput = (1, None)
+      toutput = (1, ops.get_default_graph()._last_id)  # pylint: disable=protected-access
+      random_seed.set_random_seed(tinput[0])
+      g_seed, op_seed = data_random_seed.get_seed(tinput[1])
+      g_seed = self.evaluate(g_seed)
+      op_seed = self.evaluate(op_seed)
+      msg = 'test_case = {0}, got {1}, want {2}'.format(
+          tinput, (g_seed, op_seed), toutput)
+      self.assertEqual((g_seed, op_seed), toutput, msg=msg)
+      random_seed.set_random_seed(None)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/debug/BUILD b/tensorflow/python/debug/BUILD
index f0e90f67772d114142ccc218ed9f42b723a1b556..253588fc3b2986af3ab8c6be5b0b85f178c06336 100644
--- a/tensorflow/python/debug/BUILD
+++ b/tensorflow/python/debug/BUILD
@@ -957,7 +957,7 @@ cuda_py_test(
 
 cuda_py_test(
     name = "session_debug_grpc_test",
-    size = "medium",
+    size = "large",
     srcs = ["lib/session_debug_grpc_test.py"],
     additional_deps = [
         ":debug_data",
diff --git a/tensorflow/python/debug/lib/debug_gradients.py b/tensorflow/python/debug/lib/debug_gradients.py
index 16f51a4b32f711b97077643cec669bb8970e0b21..589a13db7f798aef3bb82dfbd442deabfbcf2a41 100644
--- a/tensorflow/python/debug/lib/debug_gradients.py
+++ b/tensorflow/python/debug/lib/debug_gradients.py
@@ -156,11 +156,12 @@ class GradientsDebugger(object):
     # TODO(cais): Implement value_stack.
     grad_debug_op_name = _tensor_to_grad_debug_op_name(input_tensor, self._uuid)
     # pylint: disable=protected-access
-    identity_op = (gen_array_ops._debug_gradient_ref_identity
-                   if input_tensor.dtype._is_ref_dtype
-                   else gen_array_ops._debug_gradient_identity)
-    debug_grad_identity = identity_op(input_tensor, name=grad_debug_op_name)
+    identity_op = (
+        gen_array_ops.debug_gradient_ref_identity
+        if input_tensor.dtype._is_ref_dtype else
+        gen_array_ops.debug_gradient_identity)
     # pylint: enable=protected-access
+    debug_grad_identity = identity_op(input_tensor, name=grad_debug_op_name)
     assert debug_grad_identity.dtype == input_tensor.dtype
     if debug_grad_identity.op.name != grad_debug_op_name:
       raise ValueError(
diff --git a/tensorflow/python/eager/backprop.py b/tensorflow/python/eager/backprop.py
index 5c235382652811ff83ec800c0a28a3beccd45f0f..14bcc60006228eeaabea241ee18d960174a9dbea 100644
--- a/tensorflow/python/eager/backprop.py
+++ b/tensorflow/python/eager/backprop.py
@@ -137,112 +137,6 @@ _gradient_functions_lock = threading.Lock()
 _tracing = False
 
 
-# TODO(apassos) replace this with a mechanism which can happen at the op
-# gradient function registration site, to be less error-prone
-# TODO(apassos) add ops other than those in nn_grad and math_grad
-_ops_which_dont_need_outputs = set([
-    "Identity",
-    "MatMul",
-    "Conv2DBackpropInput",
-    "Conv2DBackpropFilter",
-    "Conv3D",
-    "Conv3DBackpropInputV2",
-    "AvgPool3D",
-    "AvgPool3DGrad",
-    "MaxPool3D",
-    "MaxPool3DGrad",
-    "MaxPool3DGradGrad",
-    "BiasAdd",
-    "BiasAddV1",
-    "BiasAddGrad",
-    "Relu6",
-    "Softplus",
-    "SoftplusGrad",
-    "Softsign",
-    "ReluGrad",
-    "Conv2D",
-    "DepthwiseConv2dNative",
-    "Dilation2D",
-    "AvgPool",
-    "AvgPoolGrad",
-    "BatchNormWithGlobalNormalization",
-    "L2Loss",
-    "Sum",
-    "Prod",
-    "SegmentSum",
-    "SegmentMean",
-    "SparseSegmentSum",
-    "SparseSegmentMean",
-    "SparseSegmentSqrtN",
-    "SegmentMin",
-    "SegmentMax",
-    "UnsortedSegmentSum",
-    "UnsortedSegmentMax",
-    "UnsortedSegmentMin",
-    "UnsortedSegmentProd",
-    "Abs",
-    "Neg",
-    "ReciprocalGrad",
-    "Square",
-    "Expm1",
-    "Log",
-    "Log1p",
-    "TanhGrad",
-    "SigmoidGrad",
-    "Sign",
-    "Sin",
-    "Cos",
-    "Tan",
-    "Add",
-    "Sub",
-    "Mul",
-    "Div",
-    "RealDiv",
-    "Maximum",
-    "Minimum",
-    "SquaredDifference",
-    "Select",
-    "SparseMatMul",
-    "BatchMatMul",
-    "Complex",
-    "Real",
-    "Imag",
-    "Angle",
-    "Conj",
-    "Cast",
-    "Cross",
-    "Cumsum",
-    "Cumprod",
-    "ReadVariableOp",
-    "VarHandleOp",
-    "Shape",
-])
-
-_ops_which_dont_need_inputs = set([
-    "Identity",
-    "Softmax",
-    "LogSoftmax",
-    "BiasAdd",
-    "Relu",
-    "Elu",
-    "Selu",
-    "SparseSoftmaxCrossEntropyWithLogits",
-    "Neg",
-    "Inv",
-    "Reciprocal",
-    "Sqrt",
-    "Exp",
-    "Tanh",
-    "Sigmoid",
-    "Real",
-    "Imag",
-    "Conj",
-    "ReadVariableOp",
-    "VarHandleOp",
-    "Shape",
-])
-
-
 # TODO(agarwal): use an automatic mechanism for handling None arguments to
 # gradient functions.
 # Some gradient functions can accept None arguments for gradients. The following
@@ -261,57 +155,25 @@ _grad_fn_accepts_none_for_indices = {
 }
 
 
-def _record_gradient(op_name, inputs, attrs, results, name):
-  """Records gradients for a TensorFlow operation.
-
-  Args:
-    op_name: Name of the TensorFlow operation (see REGISTER_OP in C++ code) to
-      execute.
-    inputs: A flat list of Tensor object inputs to the operation.
-    attrs: A tuple with alternating string attr names and attr values for this
-      operation.
-    results: The results of the operation (as a flat list).
-    name: Customized name for the operation.
-
-  Returns:
-    A list of maybe-wrapped results. Either Tensors or TensorNodes.
-
-  Raises:
-    An exception on error.
-  """
-  if not tape.could_possibly_record():
-    return
-
-  if op_name in _ops_which_dont_need_outputs:
-    op_outputs = None
-  else:
-    # TODO(apassos) this line creates a weak circular reference where the
-    # backprop function keeps an output alive which in turn keeps the tape entry
-    # alive which keeps the backprop function alive. Figure out how to break
-    # this up without breaking second derivatives of ops like Exp whose
-    # gradients depend only on the outputs.
-    op_outputs = results
-
-  if op_name in _ops_which_dont_need_inputs:
-    op_inputs = None
-  else:
-    op_inputs = inputs
-
-  num_inputs = len(inputs)
+def _get_backward_fn(op_name, attrs, num_inputs, op_inputs, op_outputs):
 
   def grad_fn(*orig_outputs):
-    """Generated gradient function."""
     result = _magic_gradient_function(op_name, attrs, num_inputs,
                                       op_inputs, op_outputs, orig_outputs)
     if _tracing:
-      print("Gradient for", (name if name else op_name), "inputs", op_inputs,
-            "output_grads", orig_outputs, "gradients", result)
+      print("Gradient for", op_name, "inputs", op_inputs, "output_grads",
+            orig_outputs, "gradients", result)
     return nest.flatten(result)
 
-  tape.record_operation(op_name, results, inputs, grad_fn)
-  if _tracing:
-    print("Computed op", (name if name else op_name), "inputs", inputs,
-          "outputs", results)
+  return grad_fn
+
+
+pywrap_tensorflow.TFE_Py_RegisterBackwardFunctionGetter(_get_backward_fn)
+
+
+def _record_gradient(op_name, inputs, attrs, results, name):
+  return pywrap_tensorflow.TFE_Py_RecordGradient(op_name, inputs, attrs,
+                                                 results, name)
 
 
 execute.record_gradient = _record_gradient
@@ -378,6 +240,7 @@ def implicit_val_and_grad(f):
       tape.pop_tape(this_tape)
     # Sorting variables by id, which is monotonically increasing in construction
     # order. This ensures unique order across executions.
+    # TODO(josh11b): Move the sort to the C++ implementation in pywrap_tfe_src.cc.
     variables = list(sorted(this_tape.watched_variables(),
                             key=lambda v: v.handle._id))  # pylint: disable=protected-access
     sources = [x.handle for x in variables]
@@ -639,7 +502,7 @@ def val_and_grad_function(f, params=None):
   return decorated
 
 
-def make_vjp(f, params=None):
+def make_vjp(f, params=None, persistent=True):
   """Returns a function that computes f and is vjp w.r.t. params.
 
   The term "vjp" here is an abbreviation for vector-jacobian product.
@@ -648,6 +511,8 @@ def make_vjp(f, params=None):
     f: the function to be differentiated.
     params: the parameters (numbers or names) to differentiate with respect to.
        A value of None will differentiate with respect to all parameters.
+    persistent: Boolean controlling whether the VJP function can be re-used.
+      Must be True or False.
 
   Returns:
     A function, which when called, returns a tuple (value, vjp), where:
@@ -675,7 +540,7 @@ def make_vjp(f, params=None):
     """Computes the value and gradient of the decorated function."""
     parameter_positions = _get_arg_spec(f, params, args)
     assert not kwds, "The gradient function can't take keyword arguments."
-    this_tape = tape.push_new_tape()
+    this_tape = tape.push_new_tape(persistent=persistent)
     try:
       sources = []
       args = [
@@ -884,7 +749,11 @@ class GradientTape(object):
       tape.watch(t)
 
   def watched_variables(self):
-    return self._tape.watched_variables()
+    # Sorting variables by id, which is monotonically increasing in construction
+    # order. This ensures unique order across executions.
+    # TODO(josh11b): Move the sort to the C++ implementation in pywrap_tfe_src.cc.
+    return list(sorted(self._tape.watched_variables(),
+                       key=lambda v: v.handle._id))  # pylint: disable=protected-access
 
   def gradient(self, target, sources, output_gradients=None):
     """Computes the gradient using information traced by the tape.
diff --git a/tensorflow/python/eager/backprop_test.py b/tensorflow/python/eager/backprop_test.py
index a12113893ab3eac671e8138472bc95e9d8b89499..48fd1707643511413f501e8b09ba3d86fcd8e904 100644
--- a/tensorflow/python/eager/backprop_test.py
+++ b/tensorflow/python/eager/backprop_test.py
@@ -115,6 +115,19 @@ class BackpropTest(test.TestCase):
     with self.assertRaises(RuntimeError):
       backprop.gradients_function(f)(constant_op.constant(1.0))
 
+  def testGradientsFunctionInCustomGradient(self):
+
+    @custom_gradient.custom_gradient
+    def f(x):
+      (y,) = backprop.gradients_function(lambda x: x * x)(x)
+
+      def grad(dy):
+        return [2 * dy]
+
+      return y, grad
+
+    self.assertAllEqual(f(1.0), 2.0)
+
   def testImplicitGradOverEmbeddingLookup(self):
     batch_size = 8
     embedding_size = 512
@@ -205,11 +218,22 @@ class BackpropTest(test.TestCase):
     def f(x):
       return x * x
 
-    wrapped_fn = backprop.make_vjp(f)
+    wrapped_fn = backprop.make_vjp(f, persistent=False)
     result, vjp = wrapped_fn(constant_op.constant(3.0))
     self.assertAllEqual(result, 9.0)
     self.assertAllEqual(vjp(2.0)[0], 12.0)
 
+  def testPersistentMakeVJP(self):
+
+    def f(x):
+      return x * x
+
+    wrapped_fn = backprop.make_vjp(f, persistent=True)
+    _, vjp = wrapped_fn(constant_op.constant(3.0))
+    self.assertAllEqual(vjp(2.0)[0], 12.0)
+    # A persistent VJP can be called again and yields the same result.
+    self.assertAllEqual(vjp(2.0)[0], 12.0)
+
   @test_util.assert_no_new_tensors
   def testGradGrad(self):
 
diff --git a/tensorflow/python/eager/benchmarks_test.py b/tensorflow/python/eager/benchmarks_test.py
index b56cbe80a7ab6b90d715187b0f0a44847038fc37..551d5647dda25b5e3b7f59981e072e0774422825 100644
--- a/tensorflow/python/eager/benchmarks_test.py
+++ b/tensorflow/python/eager/benchmarks_test.py
@@ -35,7 +35,6 @@ from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.eager import backprop  # pylint: disable=unused-import
 from tensorflow.python.eager import context
 from tensorflow.python.eager import core
-from tensorflow.python.eager import execute
 from tensorflow.python.eager import function
 from tensorflow.python.eager import test
 from tensorflow.python.framework import dtypes
@@ -60,7 +59,7 @@ def c_tfe_py_fastpath_execute(a,
   ), "The prototype doesn't contain C code for graph construction"
   try:
     return pywrap_tensorflow.TFE_Py_FastPathExecute(
-        ctx._handle, ctx.device_name, "MatMul", execute.record_gradient, name,
+        ctx._handle, ctx.device_name, "MatMul", name,
         ctx._post_execution_callbacks, a, b, "transpose_a", transpose_a,
         "transpose_b", transpose_b)
   except core._NotOkStatusException as e:
@@ -243,7 +242,8 @@ class MicroBenchmarks(test.Benchmark):
 
   def _benchmark_gen_math_ops_matmul(self, m, transpose_b, num_iters):
     def func():
-      gen_math_ops._mat_mul(m, m, transpose_b=transpose_b)
+      gen_math_ops.mat_mul(m, m, transpose_b=transpose_b)
+
     self._run(func, num_iters)
 
   def _benchmark_tfe_py_fastpath_execute_matmul(self, m, transpose_b,
@@ -275,6 +275,16 @@ class MicroBenchmarks(test.Benchmark):
   def _benchmark_read_variable(self, m, num_iters):
     self._run(m.value, num_iters)
 
+  def _benchmark_matmul_read_variable(self, m, num_iters):
+    self._benchmark_gen_math_ops_matmul(
+        m, transpose_b=False, num_iters=num_iters)
+
+  def _benchmark_matmul_read_variable_with_tape(self, m, num_iters):
+    with backprop.GradientTape() as tape:
+      tape.watch(m)
+      self._benchmark_gen_math_ops_matmul(
+          m, transpose_b=False, num_iters=num_iters)
+
   def _benchmark_read_variable_with_tape(self, m, num_iters):
     with backprop.GradientTape() as tape:
       tape.watch(m)
@@ -416,6 +426,17 @@ class MicroBenchmarks(test.Benchmark):
       self._benchmark_defun_matmul(
           m, transpose_b=True, num_iters=self._num_iters_100_by_784)
 
+  def benchmark_matmul_read_variable_op_2_by_2_CPU(self):
+    with context.device(CPU):
+      m = resource_variable_ops.ResourceVariable(self._m_2_by_2)
+      self._benchmark_matmul_read_variable(m, num_iters=self._num_iters_2_by_2)
+
+  def benchmark_matmul_read_variable_op_with_tape_2_by_2_CPU(self):
+    with context.device(CPU):
+      m = resource_variable_ops.ResourceVariable(self._m_2_by_2)
+      self._benchmark_matmul_read_variable_with_tape(
+          m, num_iters=self._num_iters_2_by_2)
+
   def benchmark_read_variable_op_2_by_2_CPU(self):
     with context.device(CPU):
       m = resource_variable_ops.ResourceVariable(self._m_2_by_2)
diff --git a/tensorflow/python/eager/core_test.py b/tensorflow/python/eager/core_test.py
index ee3c10633e1cb849e319f2f5490e5beb5dd15c80..e418be5fae4da46615f7b1467252ae6b26b9e6a3 100644
--- a/tensorflow/python/eager/core_test.py
+++ b/tensorflow/python/eager/core_test.py
@@ -33,7 +33,10 @@ from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import gen_resource_variable_ops
 from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import resource_variable_ops
 
 
 def execute(op_name, num_outputs, inputs, attrs=None):
@@ -99,6 +102,18 @@ class TFETest(test_util.TensorFlowTestCase):
     self.assertEqual(len(cpu_stats.node_stats), 1)
     self.assertEqual(cpu_stats.node_stats[0].node_name, 'Add')
 
+  def testShouldCopy(self):
+    if not context.context().num_gpus():
+      self.skipTest('No devices other than CPUs found')
+    with ops.device('gpu:0'):
+      x = constant_op.constant(1.0)
+    y = array_ops.identity(x)
+    # The value we test y.device against depends on the default placement used
+    # when no device is explicitly specified in the context. This behavior is
+    # subject to change (for example, in the future we may want to use GPUs, if
+    # available, when no device is explicitly provided).
+    self.assertEqual(y.device, '/job:localhost/replica:0/task:0/device:CPU:0')
+
   def testContextStackContainsEagerMode(self):
     # Eager execution has been enabled, and no other context
     # switch has occurred, so `context_stack` should contain
@@ -168,6 +183,18 @@ class TFETest(test_util.TensorFlowTestCase):
         attrs=('T', x.dtype.as_datatype_enum))[0].cpu().numpy()
     self.assertEqual(3, result)
 
+  def testResourceTensorPlacement(self):
+    if not context.context().num_gpus():
+      self.skipTest('No GPUs found')
+
+    with context.device('gpu:0'):
+      v = resource_variable_ops.ResourceVariable(1.0)
+    with context.device('cpu:0'):
+      # Check that even though we specified the CPU device, the read op runs on
+      # the device where the handle is.
+      self.assertAllEqual(
+          gen_resource_variable_ops.read_variable_op(v.handle, v.dtype), 1.0)
+
   def testCopyBetweenDevices(self):
     if not context.context().num_gpus():
       self.skipTest('No GPUs found')
diff --git a/tensorflow/python/eager/custom_gradient.py b/tensorflow/python/eager/custom_gradient.py
index 05460ff9968312528d87f5fc2ad0495b4da2ad1a..fb932a937206a9500996e6d1ae721a8294c676d0 100644
--- a/tensorflow/python/eager/custom_gradient.py
+++ b/tensorflow/python/eager/custom_gradient.py
@@ -71,11 +71,10 @@ def custom_gradient(f):
 
     input_tensors = [tf_ops.convert_to_tensor(x) for x in args]
 
-    with tape.stop_recording():
-      result, grad_fn = f(*args, **kwargs)
-      flat_result = nest.flatten(result)
-      # TODO(apassos) consider removing the identity below.
-      flat_result = [gen_array_ops.identity(x) for x in flat_result]
+    result, grad_fn = f(*args, **kwargs)
+    flat_result = nest.flatten(result)
+    # TODO(apassos) consider removing the identity below.
+    flat_result = [gen_array_ops.identity(x) for x in flat_result]
 
     def actual_grad_fn(*outputs):
       return nest.flatten(grad_fn(*outputs))
diff --git a/tensorflow/python/eager/function.py b/tensorflow/python/eager/function.py
index 28f5289ffc0ace6f9b6cad7cdd1160a184f882c7..655eaf3a1ec5a38dc820f0b7706b0ea4dcf09d7a 100644
--- a/tensorflow/python/eager/function.py
+++ b/tensorflow/python/eager/function.py
@@ -36,6 +36,7 @@ from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes as dtypes_module
 from tensorflow.python.framework import errors
 from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.util import compat
@@ -162,31 +163,15 @@ class CapturingGraph(ops.Graph):
       op_def=None,
       compute_shapes=True,
       compute_device=True):
-    # TODO(apassos) probably control flow has to be handled delicately here as
-    # in if a resource is accessed inside a control flow context we need the
-    # control dependency to point to something outside the context which is
-    # guaranteed to happen after the access.
-    #
     # TODO(apassos) this should do some form of alias analysis as ops which
     # forward the resources such as Identity and Switch can cause serialization
     # to fail.
-    resource_inputs = set()
-    control_inputs = set()
     for i, inp in enumerate(inputs):
       if inp.graph is not self:
         inputs[i] = capture_value(self.captures, inp, inp.dtype, inp.op.name)
-      inp = inputs[i]
-      if inp.dtype == dtypes_module.resource:
-        if inp.name in self._last_op_using_resource_tensor:
-          control_inputs.add(self._last_op_using_resource_tensor[inp.name])
-        resource_inputs.add(inp.name)
-    with self.control_dependencies(list(control_inputs)):
-      op = super(CapturingGraph, self).create_op(
-          op_type, inputs, dtypes, input_types, name, attrs, op_def,
-          compute_shapes, compute_device)
-    for name in resource_inputs:
-      self._last_op_using_resource_tensor[name] = op
-    return op
+    return super(CapturingGraph, self).create_op(
+        op_type, inputs, dtypes, input_types, name, attrs, op_def,
+        compute_shapes, compute_device)
 
 
 # TODO(apassos): it'd be really nice if we could scope this registration.
@@ -196,33 +181,66 @@ ops.register_tensor_conversion_function(
     ops.EagerTensor, _convert_to_graph_tensor, priority=-1)
 
 
-class _CapturingContext(object):
-  """Tracks references to Tensors outside this context while it is active."""
+# pylint: disable=invalid-name
+class HelperContext(object):
+  """ControlFlowContext with a customizable AddOp method."""
 
-  def __init__(self):
-    # known_ops are ops which are created while this context is active
-    self.known_ops = set()
+  def __init__(self, add_op_internal):
+    self._add_op_internal = add_op_internal
+    self._values = set()  # control flow code sometimes updates this.
+
+  def _AddOpInternal(self, op):
+    self._add_op_internal(op)
+
+  @property
+  def outer_context(self):
+    return self._outer_context
+
+  def GetWhileContext(self):
+    if self._outer_context:
+      return self._outer_context.GetWhileContext()
 
-    # captured_tensors are all tensors referenced to by ops in this context but
-    # not produced in it
-    self.captured_tensors = set()
+  def IsWhileContext(self):
+    return False
+
+  def IsCondContext(self):
+    return False
+
+  def IsXLAContext(self):
+    return False
 
   def AddOp(self, op):  # pylint: disable=invalid-name
-    if op.type in ["Variable", "VariableV2", "VarHandleOp"]:
-      raise ValueError("tfe.defun cannot capture variables created without "
-                       "using tf.get_variable. Op: %s" % op)
-    self.known_ops.add(op)
-    for i in op.inputs:
-      if i.op not in self.known_ops:
-        self.captured_tensors.add(i)
+    self._AddOpInternal(op)
+    if self._outer_context:
+      self._outer_context.AddOp(op)
+
+  def AddName(self, _):
+    pass
+
+  def AddInnerOp(self, op):
+    self._AddOpInternal(op)
+    if self._outer_context:
+      self._outer_context.AddInnerOp(op)
+
+  def AddValue(self, val):
+    if self._outer_context:
+      return self._outer_context.AddValue(val)
+    else:
+      return val
 
   def __enter__(self):
+    # pylint: disable=protected-access
     self._g = ops.get_default_graph()
-    self._old = self._g._get_control_flow_context()  # pylint: disable=protected-access
-    self._g._set_control_flow_context(self)  # pylint: disable=protected-access
+    self._outer_context = self._g._get_control_flow_context()
+    self._g._set_control_flow_context(self)
+    self._nested_contexts = (
+        self._outer_context._nested_contexts
+        if self._outer_context is not None else None)
+    # pylint: enable=protected-access
 
-  def __exit__(self, _, __, ___):  # pylint: disable=invalid-name
-    self._g._set_control_flow_context(self._old)  # pylint: disable=protected-access
+  def __exit__(self, *_):
+    self._g._set_control_flow_context(self._outer_context)  # pylint: disable=protected-access
+# pylint: enable=invalid-name
 
 
 def _forward_name(n):
@@ -368,7 +386,20 @@ class GraphModeFunction(object):
   def _construct_backprop_function(self):
     """Constructs the backprop function object for this function."""
     with self._graph.as_default(), context.graph_mode():
-      c = _CapturingContext()
+      c_known_ops = set()
+      c_captured_tensors = set()
+
+      def add_op_internal(op):
+        if op.type in ["Variable", "VariableV2", "VarHandleOp"]:
+          raise ValueError("tfe.defun cannot capture variables created without "
+                           "using tf.get_variable. Op: %s" % op)
+        c_known_ops.add(op)
+        for i in op.inputs:
+          if i.op not in c_known_ops:
+            c_captured_tensors.add(i)
+
+      c = HelperContext(add_op_internal)
+
       with c:
         filtered_outputs = [x for x in self._returns if x is not None]
         self._out_grad_placeholders = [
@@ -382,7 +413,7 @@ class GraphModeFunction(object):
         grad for grad in _flatten(in_gradients) if grad is not None)
     output_shapes = tuple(grad.shape for grad in backward_outputs)
 
-    captures = list(sorted(c.captured_tensors, key=lambda x: x.name))
+    captures = list(sorted(c_captured_tensors, key=lambda x: x.name))
     forward_name = _forward_name(self._func_name)
     self._forward_fdef = _EagerDefinedFunction(
         forward_name, self._graph, self._ops, self._input_placeholders,
@@ -395,7 +426,7 @@ class GraphModeFunction(object):
     # means rerunning the function-defining code will always define the same
     # function, which is useful if we serialize this etc.
     function_def_ops = tuple(x
-                             for x in sorted(c.known_ops, key=lambda x: x.name)
+                             for x in sorted(c_known_ops, key=lambda x: x.name)
                              if x not in all_ignored_ops)
     bname = _backward_name(self._func_name)
     self._backward_function = GraphModeFunction(
@@ -590,13 +621,15 @@ def _defun_internal(name, func, args, kwds):
     for collection in curr_graph.collections:
       tmp_graph.get_collection_ref(collection)[:] = curr_graph.get_collection(
           collection)
-    with tmp_graph.as_default():
+    with tmp_graph.as_default(), AutomaticControlDependencies() as a:
       func_inputs = _get_defun_inputs(args)
 
       def convert(x):
         if x is None:
           return None
-        return ops.convert_to_tensor_or_indexed_slices(x)
+        x = ops.convert_to_tensor_or_indexed_slices(x)
+        x = a.mark_as_return(x)
+        return x
 
       with capture_tensors(captures):
         this_tape = tape.push_new_tape()
@@ -841,7 +874,36 @@ class AutomaticControlDependencies(object):
     self._returned_tensors = set()
 
   def mark_as_return(self, tensor):
+    """Acts like identity but marks the `Tensor` as a return value.
+
+    This will possibly return a copy of the `Tensor`. Usage:
+
+    ```
+      with AutomaticControlDependencies() as a:
+        ...
+        t = a.mark_as_return(t)
+      _ = ...(t...)  # i.e. it's safe to use t here
+    ```
+
+    Args:
+      tensor: the `Tensor` to be marked
+
+    Returns:
+      a copy of the `Tensor`.
+    """
+    if isinstance(tensor, ops.IndexedSlices):
+      values = array_ops.identity(tensor.values)
+      indices = array_ops.identity(tensor.indices)
+      self._returned_tensors.add(indices)
+      self._returned_tensors.add(values)
+      return ops.IndexedSlices(values, indices, dense_shape=tensor.dense_shape)
+    # We want to make the return values depend on the stateful operations, but
+    # we don't want to introduce a cycle, so we make the return value the result
+    # of a new identity operation that the stateful operations definitely don't
+    # depend on.
+    tensor = array_ops.identity(tensor)
     self._returned_tensors.add(tensor)
+    return tensor
 
   def __enter__(self):
     if context.in_eager_mode():
@@ -962,7 +1024,8 @@ class AutomaticControlDependencies(object):
     for op in new_operations:
       control_inputs = set()
       # Ensure stateful ops run
-      if self._graph._registered_ops[op.type].is_stateful:  # pylint: disable=protected-access
+      if (op.type not in self._graph._registered_ops  # pylint: disable=protected-access
+          or self._graph._registered_ops[op.type].is_stateful):  # pylint: disable=protected-access
         ops_which_must_run.add(op)
       # Ignore switches (they're handled separately)
       if op.type == "Switch" and op.inputs[0].dtype == dtypes_module.resource:
@@ -998,9 +1061,10 @@ class AutomaticControlDependencies(object):
 
     # Ensure all ops which must run do run
     for r in self._returned_tensors:
-      r.op._add_control_inputs(  # pylint: disable=protected-access
-          [o for o in ops_which_must_run
-           if o._control_flow_context is r.op._control_flow_context])  # pylint: disable=protected-access
+      if ops_which_must_run:
+        r.op._add_control_inputs(  # pylint: disable=protected-access
+            [o for o in ops_which_must_run
+             if o._control_flow_context is r.op._control_flow_context])  # pylint: disable=protected-access
 
 
 def automatic_control_dependencies(f):
@@ -1020,8 +1084,7 @@ def automatic_control_dependencies(f):
   def wrapper(*args, **kwds):
     with AutomaticControlDependencies() as a:
       result = f(*args, **kwds)
-      for t in nest.flatten(result):
-        a.mark_as_return(t)
-      return result
+      result_flat = [a.mark_as_return(t) for t in nest.flatten(result)]
+      return nest.pack_sequence_as(result, result_flat)
 
   return tf_decorator.make_decorator(f, wrapper)
diff --git a/tensorflow/python/eager/function_test.py b/tensorflow/python/eager/function_test.py
index 431d9388c0ee97eda197142ec97b9448d985b04b..b9cde16867d498d73715535f028a5eb2bea97ea6 100644
--- a/tensorflow/python/eager/function_test.py
+++ b/tensorflow/python/eager/function_test.py
@@ -606,7 +606,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
         v.assign(v + 1)
         v.assign(2 * v)
         val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(), 4.0)
 
   def testCondMustRun(self):
@@ -626,7 +626,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
 
         control_flow_ops.cond(p, true_fn, false_fn)
         val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(feed_dict={p: False}), 5.0)
       self.assertAllEqual(val.eval(feed_dict={p: True}), 6.0)
 
@@ -647,7 +647,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
 
         control_flow_ops.cond(p, true_fn, false_fn)
         one = constant_op.constant(1.0)
-        c.mark_as_return(one)
+        one = c.mark_as_return(one)
       one.eval(feed_dict={p: False})
       self.assertAllEqual(v.read_value().eval(), 5.0)
       one.eval(feed_dict={p: True})
@@ -681,7 +681,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
         control_flow_ops.cond(p, true_fn, false_fn)
         with ops.name_scope('final'):
           val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(feed_dict={p: False, q: False}), 3.0)
       self.assertAllEqual(val.eval(feed_dict={p: False, q: True}), 6.0)
       self.assertAllEqual(val.eval(feed_dict={p: True, q: True}), 7.0)
@@ -703,7 +703,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
 
         control_flow_ops.cond(p, true_fn, false_fn)
         val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(feed_dict={p: False}), 5.0)
       self.assertAllEqual(val.eval(feed_dict={p: True}), 5.0)
 
@@ -724,7 +724,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
 
         control_flow_ops.cond(p, true_fn, false_fn)
         val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(feed_dict={p: False}), 6.0)
       self.assertAllEqual(val.eval(feed_dict={p: True}), 12.0)
 
@@ -745,7 +745,7 @@ class AutomaticControlDependenciesTest(test.TestCase):
         control_flow_ops.cond(p, true_fn, false_fn)
         v.assign(v * 2)
         val = v.read_value()
-        c.mark_as_return(val)
+        val = c.mark_as_return(val)
       self.assertAllEqual(val.eval(feed_dict={p: False}), 10.0)
       self.assertAllEqual(val.eval(feed_dict={p: True}), 20.0)
 
diff --git a/tensorflow/python/eager/graph_callable.py b/tensorflow/python/eager/graph_callable.py
index 62106bf0e2809e3c056e4a357f3d05251b7dca68..623f3564ad6c178d6259c06b040e003897ed6ca4 100644
--- a/tensorflow/python/eager/graph_callable.py
+++ b/tensorflow/python/eager/graph_callable.py
@@ -279,9 +279,12 @@ def _graph_callable_internal(func, shape_and_dtypes):
       # scope's view of which variables exist.
       variable_captures = _VariableCapturingScope()
       with variable_captures.initializing_scope(), function.capture_tensors(
-          captures):
+          captures), function.AutomaticControlDependencies() as a:
         func_outputs = func(*func_inputs)
-      outputs_list = nest.flatten(func_outputs)
+        outputs_list = nest.flatten(func_outputs)
+        for i, x in enumerate(outputs_list):
+          if x is not None:
+            outputs_list[i] = a.mark_as_return(x)
       if len(outputs_list) == 1 and outputs_list[0] is None:
         outputs_list = []
       output_shapes = [x.shape for x in outputs_list]
@@ -294,9 +297,12 @@ def _graph_callable_internal(func, shape_and_dtypes):
       # knows about all variables.
       tmp_graph.clear_resource_control_flow_state()
       with variable_captures.capturing_scope(), function.capture_tensors(
-          captures):
+          captures), function.AutomaticControlDependencies() as a:
         captured_outputs = func(*func_inputs)
       captured_outlist = nest.flatten(captured_outputs)
+      for i, x in enumerate(captured_outlist):
+        if x is not None:
+          captured_outlist[i] = a.mark_as_return(x)
       capturing_operations = tmp_graph.get_operations()[
           len(initializing_operations):]
 
diff --git a/tensorflow/python/eager/ops_test.py b/tensorflow/python/eager/ops_test.py
index f2e70341d975fb06bce7f2ce6cba7d8c3bc9826c..f70c7544d6c9e8095e95d0629b94384bc1cbe35b 100644
--- a/tensorflow/python/eager/ops_test.py
+++ b/tensorflow/python/eager/ops_test.py
@@ -19,6 +19,7 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.core.protobuf import config_pb2
 from tensorflow.python.eager import context
 from tensorflow.python.eager import execute
 from tensorflow.python.eager import test
@@ -130,8 +131,12 @@ class OpsTest(test_util.TensorFlowTestCase):
                                    dtype=dtypes.int64)
     values = constant_op.constant([2, 3, 5, 7, 11])
     shape = constant_op.constant([2, 7], dtype=dtypes.int64)
-    result = sparse_ops.gen_sparse_ops._sparse_split(  # pylint: disable=protected-access
-        split_dim, indices, values, shape, num_split=2)
+    result = sparse_ops.gen_sparse_ops.sparse_split(
+        split_dim,
+        indices,
+        values,
+        shape,
+        num_split=2)
     output_indices, output_values, output_shape = result
     self.assertEqual(2, len(output_indices))
     self.assertEqual(2, len(output_values))
@@ -277,6 +282,25 @@ class OpsTest(test_util.TensorFlowTestCase):
       context._context = context.Context()
     # pylint: enable=protected-access
 
+  def testSoftPlacement(self):
+    if not context.context().num_gpus():
+      self.skipTest('No GPUs found')
+    # Temporarily replace the context
+    # pylint: disable=protected-access
+    del context._context
+    try:
+      context._context = context.Context(
+          device_policy=context.DEVICE_PLACEMENT_SILENT,
+          config=config_pb2.ConfigProto(allow_soft_placement=True))
+      cpu_tensor = constant_op.constant(1.0)
+      result = cpu_tensor + cpu_tensor
+      self.assertEqual(result.device,
+                       '/job:localhost/replica:0/task:0/device:GPU:0')
+    finally:
+      del context._context
+      context._context = context.Context()
+    # pylint: enable=protected-access
+
   def testRandomUniform(self):
     scalar_shape = constant_op.constant([], dtype=dtypes.int32)
 
diff --git a/tensorflow/python/eager/python_eager_op_gen.cc b/tensorflow/python/eager/python_eager_op_gen.cc
index e6d03297e0b85856ff165af310149c79e494ab36..3de7445a50a8f1df52c1b904e45e3e55571c3c5b 100644
--- a/tensorflow/python/eager/python_eager_op_gen.cc
+++ b/tensorflow/python/eager/python_eager_op_gen.cc
@@ -712,9 +712,9 @@ bool GenEagerPythonOp::AddEagerFallbackCode(
 }
 
 void GenEagerPythonOp::AddEagerFastPathExecute() {
-  string fastpath_execute_params = strings::StrCat(
-      "_ctx._handle, _ctx.device_name, \"", op_def_.name(), "\", ",
-      "_execute.record_gradient, name, _ctx._post_execution_callbacks");
+  string fastpath_execute_params =
+      strings::StrCat("_ctx._handle, _ctx.device_name, \"", op_def_.name(),
+                      "\", ", "name, _ctx._post_execution_callbacks");
   string fallback_params;
 
   for (int i = 0; i < api_def_.in_arg_size(); i++) {
@@ -955,10 +955,10 @@ from tensorflow.python.util.tf_export import tf_export
     if (api_def->visibility() == ApiDef::SKIP) {
       continue;
     }
-
     // An op is hidden if either its ApiDef visibility is HIDDEN
     // or it is in the hidden_ops list.
     bool is_hidden = api_def->visibility() == ApiDef::HIDDEN;
+    bool hidden_by_api_def = is_hidden;
     if (!is_hidden) {
       for (const string& hidden : hidden_ops) {
         if (op_def.name() == hidden) {
@@ -971,13 +971,22 @@ from tensorflow.python.util.tf_export import tf_export
     string function_name;
     python_op_gen_internal::GenerateLowerCaseOpName(op_def.name(),
                                                     &function_name);
-    if (is_hidden) function_name = strings::StrCat("_", function_name);
-
-    // When users create custom python wrappers, they may link in the
-    // default op registry by accident, and because they can't
-    // enumerate all 'hidden' symbols, this guard is to prevent
-    // instantiating a python reserved word in their wrapper.
-    if (python_op_gen_internal::IsPythonReserved(function_name)) {
+    bool is_reserved = python_op_gen_internal::IsPythonReserved(function_name);
+
+    // Prefix an op with an underscore if it is listed in hidden_ops, its name
+    // is reserved, or it is one of the exceptions in IsOpWithUnderscorePrefix.
+    // Do not add underscores to ops set to HIDDEN in ApiDef otherwise.
+    // TODO(annarev): don't prefix with underscores even if op is in hidden_ops.
+    if (is_hidden) {
+      if (!hidden_by_api_def || is_reserved ||
+          python_op_gen_internal::IsOpWithUnderscorePrefix(function_name)) {
+        function_name = strings::StrCat("_", function_name);
+      }
+    } else if (is_reserved) {
+      // When users create custom python wrappers, they may link in the
+      // default op registry by accident, and because they can't
+      // enumerate all 'hidden' symbols, this guard is to prevent
+      // instantiating a python reserved word in their wrapper.
       continue;
     }
 
diff --git a/tensorflow/python/eager/pywrap_tensor.cc b/tensorflow/python/eager/pywrap_tensor.cc
index 6fa076507d11ab9c88891cbeb0a4fb3959e4e99d..8338bc43432ef8b214cfc43cb9819a2c29c957dc 100644
--- a/tensorflow/python/eager/pywrap_tensor.cc
+++ b/tensorflow/python/eager/pywrap_tensor.cc
@@ -185,6 +185,16 @@ typedef struct EagerTensor {
 
   // This stores `_keras_mask` object and is set by Tensorflow layers.
   PyObject* keras_mask;
+
+  // This stores `_tensor_shape`, a cached `TensorShape` object, and is set the
+  // first time that `_EagerTensorBase`'s `shape` property is called.
+  PyObject* tensor_shape;
+
+  // We store a status object here as an optimization to avoid allocating a new
+  // Status object in each function that operates on EagerTensor and needs to
+  // use a TF_Status object. Note, however, that accesses to `status` are not
+  // thread-safe.
+  TF_Status* status;
 } EagerTensor;
 
 // tp_init for EagerTensor.
@@ -195,6 +205,9 @@ int EagerTensor_init(EagerTensor* self, PyObject* args, PyObject* kwds) {
   self->handle_data = Py_None;
   Py_INCREF(Py_None);
   self->keras_mask = Py_None;
+  Py_INCREF(Py_None);
+  self->tensor_shape = Py_None;
+  self->status = TF_NewStatus();
   PyObject* value;
   PyObject* context = nullptr;
   PyObject* device = nullptr;
@@ -269,17 +282,17 @@ int EagerTensor_init(EagerTensor* self, PyObject* args, PyObject* kwds) {
   }
   TF_DataType handle_dtype = TFE_TensorHandleDataType(handle.get());
   if (desired_dtype >= 0 && desired_dtype != handle_dtype) {
-    auto out_status = tensorflow::make_safe(TF_NewStatus());
     handle = tensorflow::make_safe(
         EagerCast(GetContext(context), handle.get(), handle_dtype,
-                  static_cast<TF_DataType>(desired_dtype), out_status.get()));
-    if (TF_GetCode(out_status.get()) != TF_OK) {
-      PyErr_SetString(
-          PyExc_ValueError,
-          tensorflow::strings::StrCat("Error while casting from DataType ",
-                                      handle_dtype, " to ", desired_dtype, ". ",
-                                      TF_Message(out_status.get()))
-              .c_str());
+                  static_cast<TF_DataType>(desired_dtype), self->status));
+    if (TF_GetCode(self->status) != TF_OK) {
+      PyErr_SetString(PyExc_ValueError,
+                      tensorflow::strings::StrCat(
+                          "Error while casting from DataType ", handle_dtype,
+                          " to ", desired_dtype, ". ", TF_Message(self->status))
+                          .c_str());
+      // Cleanup self->status before returning.
+      TF_SetStatus(self->status, TF_OK, "");
       return -1;
     }
     handle_dtype = TFE_TensorHandleDataType(handle.get());
@@ -323,8 +336,10 @@ int EagerTensor_init(EagerTensor* self, PyObject* args, PyObject* kwds) {
 
 // tp_dealloc for EagerTensor.
 void EagerTensor_dealloc(EagerTensor* self) {
+  TF_DeleteStatus(self->status);
   Py_DECREF(self->handle_data);
   Py_DECREF(self->keras_mask);
+  Py_DECREF(self->tensor_shape);
   TFE_DeleteTensorHandle(self->handle);
   self->handle = nullptr;
   // We have the global interpreter lock, so use this chance to perform delayed
@@ -348,12 +363,21 @@ static PyObject* EagerTensor_datatype_enum(EagerTensor* self) {
 // Getter for `_shape_tuple`.
 static PyObject* EagerTensor_shape_tuple(EagerTensor* self) {
   auto handle = self->handle;
-  int n = TFE_TensorHandleNumDims(handle);
+  int n = TFE_TensorHandleNumDims(handle, self->status);
+  if (MaybeRaiseExceptionFromTFStatus(self->status, PyExc_ValueError)) {
+    // Cleanup self->status before returning.
+    TF_SetStatus(self->status, TF_OK, "");
+    return nullptr;
+  }
   PyObject* shape = PyTuple_New(n);
   if (PyErr_Occurred()) return nullptr;
   for (int i = 0; i < n; ++i) {
-    PyObject* dim = PyLong_FromLongLong(TFE_TensorHandleDim(handle, i));
-    if (dim == nullptr || PyTuple_SetItem(shape, i, dim) != 0) {
+    PyObject* dim =
+        PyLong_FromLongLong(TFE_TensorHandleDim(handle, i, self->status));
+    if (MaybeRaiseExceptionFromTFStatus(self->status, PyExc_ValueError) ||
+        dim == nullptr || PyTuple_SetItem(shape, i, dim) != 0) {
+      // Cleanup self->status before returning.
+      TF_SetStatus(self->status, TF_OK, "");
       Py_DECREF(shape);
       if (dim != nullptr) Py_DECREF(dim);
       PyErr_SetString(PyExc_RuntimeError, "Error while creating shape");
@@ -365,10 +389,16 @@ static PyObject* EagerTensor_shape_tuple(EagerTensor* self) {
 
 // Getter for `_rank`.
 static PyObject* EagerTensor_rank(EagerTensor* self) {
+  int num_dims = TFE_TensorHandleNumDims(self->handle, self->status);
+  if (MaybeRaiseExceptionFromTFStatus(self->status, PyExc_ValueError)) {
+    // Cleanup self->status before returning.
+    TF_SetStatus(self->status, TF_OK, "");
+    return nullptr;
+  }
 #if PY_MAJOR_VERSION < 3
-  return PyInt_FromLong(TFE_TensorHandleNumDims(self->handle));
+  return PyInt_FromLong(num_dims);
 #else
-  return PyLong_FromLong(TFE_TensorHandleNumDims(self->handle));
+  return PyLong_FromLong(num_dims);
 #endif
 }
 
@@ -397,6 +427,19 @@ static int EagerTensor_setkeras_mask(EagerTensor* self, PyObject* value,
   self->keras_mask = value;
   return 0;
 }
+
+static PyObject* EagerTensor_tensor_shape(EagerTensor* self, void* unused) {
+  Py_INCREF(self->tensor_shape);
+  return self->tensor_shape;
+}
+
+static int EagerTensor_settensor_shape(EagerTensor* self, PyObject* value,
+                                       void* unused) {
+  Py_DECREF(self->tensor_shape);
+  Py_INCREF(value);
+  self->tensor_shape = value;
+  return 0;
+}
 // Function `_copy_to_device`.
 static PyObject* EagerTensor_copy_to_device(EagerTensor* self, PyObject* args,
                                             PyObject* kwds) {
@@ -437,10 +480,16 @@ static PyObject* EagerTensor_numpy(EagerTensor* self) {
 
 // Getter `device`.
 static PyObject* EagerTensor_device(EagerTensor* self) {
+  const char* device = TFE_TensorHandleDeviceName(self->handle, self->status);
+  if (MaybeRaiseExceptionFromTFStatus(self->status, PyExc_ValueError)) {
+    // Cleanup self->status before returning.
+    TF_SetStatus(self->status, TF_OK, "");
+    return nullptr;
+  }
 #if PY_MAJOR_VERSION >= 3
-  return PyUnicode_FromString(TFE_TensorHandleDeviceName(self->handle));
+  return PyUnicode_FromString(device);
 #else
-  return PyBytes_FromString(TFE_TensorHandleDeviceName(self->handle));
+  return PyBytes_FromString(device);
 #endif
 }
 
@@ -455,6 +504,9 @@ static PyGetSetDef EagerTensor_getseters[] = {
     {const_cast<char*>("_keras_mask"), (getter)EagerTensor_keras_mask,
      (setter)EagerTensor_setkeras_mask, const_cast<char*>("_keras_mask"),
      nullptr},
+    {const_cast<char*>("_tensor_shape"), (getter)EagerTensor_tensor_shape,
+     (setter)EagerTensor_settensor_shape, const_cast<char*>("_tensor_shape"),
+     nullptr},
     {nullptr} /* Sentinel */
 };
 
@@ -491,16 +543,11 @@ PyTypeObject* EagerTensorType = nullptr;
 
 #if PY_MAJOR_VERSION >= 3
 static PyType_Slot EagerTensor_Type_slots[] = {
-    Py_tp_dealloc,
-    reinterpret_cast<void*>(EagerTensor_dealloc),
-    Py_tp_methods,
-    reinterpret_cast<void*>(EagerTensor_methods),
-    Py_tp_getset,
-    reinterpret_cast<void*>(EagerTensor_getseters),
-    Py_tp_init,
-    reinterpret_cast<void*>(EagerTensor_init),
-    0,
-    nullptr,
+    {Py_tp_dealloc, reinterpret_cast<void*>(EagerTensor_dealloc)},
+    {Py_tp_methods, reinterpret_cast<void*>(EagerTensor_methods)},
+    {Py_tp_getset, reinterpret_cast<void*>(EagerTensor_getseters)},
+    {Py_tp_init, reinterpret_cast<void*>(EagerTensor_init)},
+    {0, nullptr},
 };
 
 PyType_Spec EagerTensor_Type_spec = {"EagerTensor", sizeof(EagerTensor), 0,
@@ -575,7 +622,10 @@ PyObject* EagerTensorFromHandle(TFE_TensorHandle* handle) {
     t->handle_data = Py_None;
     Py_INCREF(Py_None);
     t->keras_mask = Py_None;
+    Py_INCREF(Py_None);
+    t->tensor_shape = Py_None;
     t->handle = handle;
+    t->status = TF_NewStatus();
   }
   return reinterpret_cast<PyObject*>(t);
 }
@@ -673,6 +723,7 @@ PyObject* TFE_Py_TensorShapeSlice(PyObject* tensor_list, int slice_dim) {
   auto tensor = tensorflow::make_safe(TF_AllocateTensor(
       TF_INT32, &num_tensors_int, /*num_dims=*/1, /*len=*/4 * num_tensors_int));
   int32_t* data = reinterpret_cast<int32_t*>(TF_TensorData(tensor.get()));
+  auto status = tensorflow::make_safe(TF_NewStatus());
   for (Py_ssize_t i = 0; i < num_tensors; ++i) {
     PyObject* tensor_obj = PyList_GET_ITEM(tensor_list, i);
     if (!EagerTensor_CheckExact(tensor_obj)) {
@@ -687,21 +738,27 @@ PyObject* TFE_Py_TensorShapeSlice(PyObject* tensor_list, int slice_dim) {
 
     EagerTensor* t = reinterpret_cast<EagerTensor*>(tensor_obj);
     TFE_TensorHandle* handle = t->handle;
-    if (slice_dim >= TFE_TensorHandleNumDims(handle)) {
-      PyErr_SetString(PyExc_IndexError,
-                      tensorflow::strings::StrCat(
-                          "Slice dimension (", slice_dim,
-                          ") must be smaller than rank of all "
-                          "tensors, but tensor at index ",
-                          i, " has rank ", TFE_TensorHandleNumDims(handle))
-                          .c_str());
+    int num_dims = TFE_TensorHandleNumDims(handle, status.get());
+    if (MaybeRaiseExceptionFromTFStatus(status.get(), PyExc_ValueError)) {
+      return nullptr;
+    }
+    if (slice_dim >= num_dims) {
+      PyErr_SetString(
+          PyExc_IndexError,
+          tensorflow::strings::StrCat("Slice dimension (", slice_dim,
+                                      ") must be smaller than rank of all "
+                                      "tensors, but tensor at index ",
+                                      i, " has rank ", num_dims)
+              .c_str());
+      return nullptr;
+    }
+    int64_t dim = TFE_TensorHandleDim(handle, slice_dim, status.get());
+    if (MaybeRaiseExceptionFromTFStatus(status.get(), PyExc_ValueError)) {
       return nullptr;
     }
-    int64_t dim = TFE_TensorHandleDim(handle, slice_dim);
     data[i] = dim;
   }
 
-  auto status = tensorflow::make_safe(TF_NewStatus());
   TFE_TensorHandle* handle = TFE_NewTensorHandle(tensor.get(), status.get());
   if (TF_GetCode(status.get()) != TF_OK) {
     PyErr_SetString(
diff --git a/tensorflow/python/eager/pywrap_tfe.h b/tensorflow/python/eager/pywrap_tfe.h
index 16b7d1a119a409d1d0a77b220d5d0945b280b638..32d731d0f68910b8e41a57cb32ae60c3ea6742f7 100644
--- a/tensorflow/python/eager/pywrap_tfe.h
+++ b/tensorflow/python/eager/pywrap_tfe.h
@@ -51,6 +51,13 @@ void TFE_Py_Execute(TFE_Context* ctx, const char* device_name,
 // This function is not thread-safe.
 PyObject* TFE_Py_RegisterExceptionClass(PyObject* e);
 
+// Registers e as the type of the ResourceVariable class.
+// Returns Py_None if registration succeeds, else throws a TypeError and returns
+// NULL.
+//
+// This function is not thread-safe.
+PyObject* TFE_Py_RegisterResourceVariableType(PyObject* e);
+
 // Registers e as the Exception to be raised when the conditions of
 // TFE_Py_FastPathExecute_C have not been met. When this exception is set, it
 // is a signal to the calling code that it should fall back to the safer (and
@@ -59,6 +66,15 @@ PyObject* TFE_Py_RegisterExceptionClass(PyObject* e);
 // This function is not thread-safe.
 PyObject* TFE_Py_RegisterFallbackExceptionClass(PyObject* e);
 
+// Registers e as the backward_function_getter.
+// The registered function creates a backward function (a function that can
+// return the gradient of an op's inputs given the gradient of its outputs).
+// The registered function will be passed the following arguments:
+//    op_name, attrs, num_inputs, op_inputs, op_outputs
+//
+// This function is not thread-safe.
+PyObject* TFE_Py_RegisterBackwardFunctionGetter(PyObject* e);
+
 // Returns 0 if 'status' is TF_OK. Otherwise, raises an exception (using
 // `exception` if not nullptr, else using the class registered via
 // TFE_Py_RegisterExceptionClass), and returns -1.
@@ -151,13 +167,10 @@ PyObject* TFE_Py_TapeGradient(PyObject* tape, PyObject* vspace,
 //  Item 2: device_name: Name of the device on which to execute the operation,
 //          or NULL for automatic selection.
 //  Item 3: op_name: Name of the TensorFlow op to execute.
-//  Item 4: record_gradient_callback: Callback that records the gradient of the
-//          result. The callback takes (op_name, inputs, attrs, result, name)
-//          - all sequences and records the gradient.
-//  Item 5: name: An optional name for the operation.
-//  Item 6: List representing all callbacks to execute after successful
+//  Item 4: name: An optional name for the operation.
+//  Item 5: List representing all callbacks to execute after successful
 //  op execute.
-//  Item 7 onwards: inputs - This is a list of inputs followed by a list of
+//  Item 6 onwards: inputs - This is a list of inputs followed by a list of
 //        attrs. It is not necessary for type attrs to be present.
 //
 // This is named _C since there doesn't seem to be any way to make it visible
@@ -165,6 +178,11 @@ PyObject* TFE_Py_TapeGradient(PyObject* tape, PyObject* vspace,
 // directive.
 PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args);
 
+// Record the gradient for a given op.
+PyObject* TFE_Py_RecordGradient(PyObject* op_name, PyObject* inputs,
+                                PyObject* attrs, PyObject* results,
+                                PyObject* name);
+
 // Returns the set of variables watched by the given tape.
 PyObject* TFE_Py_TapeWatchedVariables(PyObject* tape);
 
diff --git a/tensorflow/python/eager/pywrap_tfe_src.cc b/tensorflow/python/eager/pywrap_tfe_src.cc
index cabbcc48fd56563a50591cc6adabc3af75918401..27c9d05081754216850b4d9eca5e33b38203f734 100644
--- a/tensorflow/python/eager/pywrap_tfe_src.cc
+++ b/tensorflow/python/eager/pywrap_tfe_src.cc
@@ -24,18 +24,37 @@ limitations under the License.
 #include "tensorflow/core/lib/gtl/cleanup.h"
 #include "tensorflow/core/lib/gtl/compactptrset.h"
 #include "tensorflow/core/lib/gtl/flatmap.h"
+#include "tensorflow/core/lib/gtl/flatset.h"
 #include "tensorflow/core/lib/strings/strcat.h"
 #include "tensorflow/core/lib/strings/stringprintf.h"
 #include "tensorflow/core/platform/mutex.h"
 #include "tensorflow/core/platform/protobuf.h"
 #include "tensorflow/core/platform/types.h"
 #include "tensorflow/python/eager/pywrap_tensor.h"
+#include "tensorflow/python/lib/core/safe_ptr.h"
 
 using tensorflow::string;
 using tensorflow::strings::Printf;
 
 namespace {
 
+struct FastPathOpExecInfo {
+  TFE_Context* ctx;
+  const char* device_name;
+  // The op def of the main op being executed.
+  const tensorflow::OpDef* op_def;
+
+  bool run_callbacks;
+  bool run_post_exec_callbacks;
+  bool run_gradient_callback;
+
+  // The op name of the main op being executed.
+  PyObject* name;
+  // The op type name of the main op being executed.
+  PyObject* op_name;
+  PyObject* callbacks;
+};
+
 #define PARSE_VALUE(fn_name, type, check_fn, parse_fn)                       \
   bool fn_name(const string& key, PyObject* py_value, TF_Status* status,     \
                type* value) {                                                \
@@ -118,6 +137,11 @@ bool ParseTypeValue(const string& key, PyObject* py_value, TF_Status* status,
 
   PyObject* py_type_enum = PyObject_GetAttrString(py_value, "_type_enum");
   if (py_type_enum == nullptr) {
+    TF_SetStatus(
+        status, TF_INVALID_ARGUMENT,
+        tensorflow::strings::StrCat("Expecting a DType.dtype for attr ", key,
+                                    ", got ", py_value->ob_type->tp_name)
+            .c_str());
     return false;
   }
 
@@ -575,6 +599,11 @@ PyObject* exception_class GUARDED_BY(exception_class_mutex) = nullptr;
 // Python subclass of Exception that is created to signal fallback.
 PyObject* fallback_exception_class = nullptr;
 
+// Python function that returns a backward_function.
+PyObject* backward_function_getter = nullptr;
+
+PyTypeObject* resource_variable_type = nullptr;
+
 tensorflow::mutex _uid_mutex(tensorflow::LINKER_INITIALIZED);
 tensorflow::int64 _uid GUARDED_BY(_uid_mutex) = 0;
 
@@ -623,11 +652,28 @@ PyObject* TFE_Py_RegisterExceptionClass(PyObject* e) {
                     "TFE_Py_RegisterExceptionClass: "
                     "Registered class should be subclass of Exception.");
     return nullptr;
-  } else {
-    Py_INCREF(e);
-    exception_class = e;
-    Py_RETURN_NONE;
   }
+
+  Py_INCREF(e);
+  exception_class = e;
+  Py_RETURN_NONE;
+}
+
+PyObject* TFE_Py_RegisterResourceVariableType(PyObject* e) {
+  if (!PyType_Check(e)) {
+    PyErr_SetString(
+        PyExc_TypeError,
+        "TFE_Py_RegisterResourceVariableType: Need to register a type.");
+    return nullptr;
+  }
+
+  if (resource_variable_type != nullptr) {
+    Py_DECREF(resource_variable_type);
+  }
+
+  Py_INCREF(e);
+  resource_variable_type = reinterpret_cast<PyTypeObject*>(e);
+  Py_RETURN_NONE;
 }
 
 PyObject* TFE_Py_RegisterFallbackExceptionClass(PyObject* e) {
@@ -647,6 +693,23 @@ PyObject* TFE_Py_RegisterFallbackExceptionClass(PyObject* e) {
   }
 }
 
+PyObject* TFE_Py_RegisterBackwardFunctionGetter(PyObject* e) {
+  if (backward_function_getter != nullptr) {
+    Py_DECREF(backward_function_getter);
+  }
+  if (!PyCallable_Check(e)) {
+    backward_function_getter = nullptr;
+    PyErr_SetString(PyExc_TypeError,
+                    "TFE_Py_RegisterBackwardFunctionGetter: "
+                    "Registered object should be function.");
+    return nullptr;
+  } else {
+    Py_INCREF(e);
+    backward_function_getter = e;
+    Py_RETURN_NONE;
+  }
+}
+
 void RaiseFallbackException(const char* message) {
   if (fallback_exception_class != nullptr) {
     PyErr_SetObject(fallback_exception_class, Py_BuildValue("s", message));
@@ -1062,16 +1125,10 @@ PyObject* TFE_Py_TapeWatchedVariables(PyObject* tape) {
   return result;
 }
 
-void TFE_Py_TapeSetRecordOperation(PyObject* op_type, PyObject* output_tensors,
-                                   PyObject* input_tensors,
-                                   PyObject* backward_function) {
-  if (GetTapeSet()->empty() || *ThreadTapeIsStopped()) {
-    return;
-  }
-  std::vector<tensorflow::int64> input_ids = MakeTensorIDList(input_tensors);
-  if (PyErr_Occurred()) {
-    return;
-  }
+namespace {
+void TapeSetRecordOperation(PyObject* op_type, PyObject* output_tensors,
+                            const std::vector<tensorflow::int64>& input_ids,
+                            PyObject* backward_function) {
   std::vector<tensorflow::eager::TapeTensor> output_info;
   PyObject* seq = PySequence_Fast(output_tensors,
                                   "expected a sequence of integer tensor ids");
@@ -1110,6 +1167,19 @@ void TFE_Py_TapeSetRecordOperation(PyObject* op_type, PyObject* output_tensors,
         [backward_function]() { Py_DECREF(backward_function); });
   }
 }
+}  // namespace
+
+void TFE_Py_TapeSetRecordOperation(PyObject* op_type, PyObject* output_tensors,
+                                   PyObject* input_tensors,
+                                   PyObject* backward_function) {
+  if (GetTapeSet()->empty() || *ThreadTapeIsStopped()) {
+    return;
+  }
+  std::vector<tensorflow::int64> input_ids = MakeTensorIDList(input_tensors);
+  if (PyErr_Occurred()) return;
+
+  TapeSetRecordOperation(op_type, output_tensors, input_ids, backward_function);
+}
 
 void TFE_Py_TapeSetDeleteTrace(tensorflow::int64 tensor_id) {
   for (TFE_Py_Tape* tape : SafeTapeSet()) {
@@ -1336,7 +1406,7 @@ PyObject* TFE_Py_TapeGradient(PyObject* tape, PyObject* vspace,
 }
 
 namespace {
-static const int kFastPathExecuteInputStartIndex = 6;
+static const int kFastPathExecuteInputStartIndex = 5;
 
 PyObject* GetPythonObjectFromString(const char* s) {
 #if PY_MAJOR_VERSION >= 3
@@ -1346,8 +1416,12 @@ PyObject* GetPythonObjectFromString(const char* s) {
 #endif
 }
 
-bool CheckEagerTensors(PyObject* seq, int start_index,
-                       const tensorflow::OpDef& op_def) {
+bool CheckResourceVariable(PyObject* item) {
+  return PyObject_TypeCheck(item, resource_variable_type);
+}
+
+bool CheckInputsOk(PyObject* seq, int start_index,
+                   const tensorflow::OpDef& op_def) {
   for (int i = 0; i < op_def.input_arg_size(); i++) {
     PyObject* item = PyTuple_GET_ITEM(seq, i + start_index);
     if (!op_def.input_arg(i).number_attr().empty() ||
@@ -1355,9 +1429,234 @@ bool CheckEagerTensors(PyObject* seq, int start_index,
       // This item should be a list input.
       if (!PyList_Check(item)) return false;
       for (Py_ssize_t j = 0; j < PyList_Size(item); j++) {
-        if (!EagerTensor_CheckExact(PyList_GET_ITEM(item, j))) return false;
+        PyObject* inner_item = PyList_GET_ITEM(item, j);
+        if (!EagerTensor_CheckExact(inner_item) &&
+            !CheckResourceVariable(inner_item)) {
+          return false;
+        }
       }
-    } else if (!EagerTensor_CheckExact(item)) {
+    } else if (!EagerTensor_CheckExact(item) && !CheckResourceVariable(item)) {
+      return false;
+    }
+  }
+
+  return true;
+}
+
+bool OpDoesntRequireOutput(const string& op_name) {
+  static tensorflow::gtl::FlatSet<string>* ops_that_dont_require_outputs =
+      new tensorflow::gtl::FlatSet<string>({
+          "Identity",
+          "MatMul",
+          "Conv2DBackpropInput",
+          "Conv2DBackpropFilter",
+          "Conv3D",
+          "Conv3DBackpropInputV2",
+          "AvgPool3D",
+          "AvgPool3DGrad",
+          "MaxPool3D",
+          "MaxPool3DGrad",
+          "MaxPool3DGradGrad",
+          "BiasAdd",
+          "BiasAddV1",
+          "BiasAddGrad",
+          "Relu6",
+          "Softplus",
+          "SoftplusGrad",
+          "Softsign",
+          "ReluGrad",
+          "Conv2D",
+          "DepthwiseConv2dNative",
+          "Dilation2D",
+          "AvgPool",
+          "AvgPoolGrad",
+          "BatchNormWithGlobalNormalization",
+          "L2Loss",
+          "Sum",
+          "Prod",
+          "SegmentSum",
+          "SegmentMean",
+          "SparseSegmentSum",
+          "SparseSegmentMean",
+          "SparseSegmentSqrtN",
+          "SegmentMin",
+          "SegmentMax",
+          "UnsortedSegmentSum",
+          "UnsortedSegmentMax",
+          "Abs",
+          "Neg",
+          "ReciprocalGrad",
+          "Square",
+          "Expm1",
+          "Log",
+          "Log1p",
+          "TanhGrad",
+          "SigmoidGrad",
+          "Sign",
+          "Sin",
+          "Cos",
+          "Tan",
+          "Add",
+          "Sub",
+          "Mul",
+          "Div",
+          "RealDiv",
+          "Maximum",
+          "Minimum",
+          "SquaredDifference",
+          "Select",
+          "SparseMatMul",
+          "BatchMatMul",
+          "Complex",
+          "Real",
+          "Imag",
+          "Angle",
+          "Conj",
+          "Cast",
+          "Cross",
+          "Cumsum",
+          "Cumprod",
+          "ReadVariableOp",
+          "VarHandleOp",
+          "Shape",
+      });
+
+  return ops_that_dont_require_outputs->find(op_name) !=
+         ops_that_dont_require_outputs->end();
+}
+
+bool OpDoesntRequireInput(const string& op_name) {
+  static tensorflow::gtl::FlatSet<string>* ops_that_dont_require_inputs =
+      new tensorflow::gtl::FlatSet<string>({
+          "Identity",
+          "Softmax",
+          "LogSoftmax",
+          "BiasAdd",
+          "Relu",
+          "Elu",
+          "Selu",
+          "SparseSoftmaxCrossEntropyWithLogits",
+          "Neg",
+          "Inv",
+          "Reciprocal",
+          "Sqrt",
+          "Exp",
+          "Tanh",
+          "Sigmoid",
+          "Real",
+          "Imag",
+          "Conj",
+          "ReadVariableOp",
+          "VarHandleOp",
+          "Shape",
+      });
+
+  return ops_that_dont_require_inputs->find(op_name) !=
+         ops_that_dont_require_inputs->end();
+}
+
+PyObject* RecordGradient(PyObject* op_name, PyObject* inputs, PyObject* attrs,
+                         PyObject* results, PyObject* name) {
+  std::vector<tensorflow::int64> input_ids = MakeTensorIDList(inputs);
+  if (PyErr_Occurred()) return nullptr;
+
+  bool should_record = false;
+  for (TFE_Py_Tape* tape : SafeTapeSet()) {
+    if (tape->tape->ShouldRecord(input_ids)) {
+      should_record = true;
+      break;
+    }
+  }
+  if (!should_record) Py_RETURN_NONE;
+
+  string c_op_name = TFE_GetPythonString(op_name);
+  PyObject* op_outputs;
+  if (OpDoesntRequireOutput(c_op_name)) {
+    op_outputs = Py_None;
+  } else {
+    op_outputs = results;
+  }
+
+  PyObject* op_inputs;
+  if (OpDoesntRequireInput(c_op_name)) {
+    op_inputs = Py_None;
+  } else {
+    op_inputs = inputs;
+  }
+
+  PyObject* num_inputs = PyLong_FromLong(PySequence_Size(inputs));
+  PyObject* callback_args =
+      Py_BuildValue("OOOOO", op_name, attrs, num_inputs, op_inputs, op_outputs);
+
+  PyObject* backward_function =
+      PyObject_CallObject(backward_function_getter, callback_args);
+  Py_DECREF(callback_args);
+  if (backward_function == nullptr) return nullptr;
+
+  TapeSetRecordOperation(op_name, results, input_ids, backward_function);
+
+  Py_DECREF(backward_function);
+
+  Py_RETURN_NONE;
+}
+
+void MaybeWatchVariable(PyObject* input) {
+  DCHECK(CheckResourceVariable(input));
+  DCHECK(PyObject_HasAttrString(input, "_trainable"));
+
+  tensorflow::Safe_PyObjectPtr trainable(
+      PyObject_GetAttrString(input, "_trainable"));
+  if (trainable.get() == Py_False) return;
+  TFE_Py_TapeSetWatchVariable(input);
+}
+
+bool ReadVariableOp(const FastPathOpExecInfo& parent_op_exec_info,
+                    PyObject* input, tensorflow::Safe_PyObjectPtr* output,
+                    TF_Status* status) {
+  MaybeWatchVariable(input);
+
+  TFE_Op* op = TFE_NewOp(parent_op_exec_info.ctx, "ReadVariableOp", status);
+  auto cleaner = tensorflow::gtl::MakeCleanup([op] { TFE_DeleteOp(op); });
+  if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) return false;
+
+  // Set dtype
+  DCHECK(PyObject_HasAttrString(input, "_dtype"));
+  tensorflow::Safe_PyObjectPtr dtype(PyObject_GetAttrString(input, "_dtype"));
+  int value;
+  if (!ParseTypeValue("_dtype", dtype.get(), status, &value)) {
+    return false;
+  }
+  TFE_OpSetAttrType(op, "dtype", static_cast<TF_DataType>(value));
+
+  TFE_OpSetDevice(op, parent_op_exec_info.device_name, status);
+  if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) return false;
+
+  // Get handle
+  tensorflow::Safe_PyObjectPtr handle(PyObject_GetAttrString(input, "_handle"));
+  if (!EagerTensor_CheckExact(handle.get())) return false;
+  TFE_OpAddInput(op, EagerTensor_Handle(handle.get()), status);
+  if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) return false;
+
+  int num_retvals = 1;
+  TFE_TensorHandle* output_handle;
+  TFE_Execute(op, &output_handle, &num_retvals, status);
+  if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) return false;
+
+  // Always create the py object (and correctly DECREF it) from the returned
+  // value, else the data will leak.
+  output->reset(EagerTensorFromHandle(output_handle));
+
+  // TODO(nareshmodi): Should we run post exec callbacks here?
+  if (parent_op_exec_info.run_gradient_callback) {
+    tensorflow::Safe_PyObjectPtr inputs(PyTuple_New(1));
+    PyTuple_SET_ITEM(inputs.get(), 0, handle.release());
+
+    tensorflow::Safe_PyObjectPtr outputs(PyTuple_New(1));
+    Py_INCREF(output->get());  // Keep alive: the tuple steals a reference.
+    PyTuple_SET_ITEM(outputs.get(), 0, output->get());
+
+    if (!RecordGradient(GetPythonObjectFromString("ReadVariableOp"),
+                        inputs.get(), Py_None, outputs.get(), Py_None)) {
       return false;
     }
   }
@@ -1365,30 +1664,60 @@ bool CheckEagerTensors(PyObject* seq, int start_index,
   return true;
 }
 
+// Supports only 2 cases at the moment:
+//  i) input is an EagerTensor - in this case, it is returned as is.
+//  ii) input is a ResourceVariable - in this case, its value is read via
+//  ReadVariableOp.
+bool ConvertToTensor(const FastPathOpExecInfo& op_exec_info, PyObject* input,
+                     tensorflow::Safe_PyObjectPtr* output_handle,
+                     TF_Status* status) {
+  if (CheckResourceVariable(input)) {
+    return ReadVariableOp(op_exec_info, input, output_handle, status);
+  }
+
+  Py_INCREF(input);
+  output_handle->reset(input);
+
+  return true;
+}
+
 // Adds input and type attr to the op, and to the list of flattened
 // inputs/attrs.
-bool AddInputToOp(PyObject* input, const tensorflow::OpDef::ArgDef* input_arg,
-                  std::vector<PyObject*>* flattened_attrs,
-                  std::vector<PyObject*>* flattened_inputs, TFE_Op* op,
-                  TF_Status* status) {
-  TFE_TensorHandle* input_handle = EagerTensor_Handle(input);
+bool AddInputToOp(const FastPathOpExecInfo& op_exec_info, PyObject* input,
+                  const tensorflow::OpDef::ArgDef* input_arg,
+                  std::vector<tensorflow::Safe_PyObjectPtr>* flattened_attrs,
+                  std::vector<tensorflow::Safe_PyObjectPtr>* flattened_inputs,
+                  TFE_Op* op, TF_Status* status) {
+  // py_eager_tensor's ownership is transferred to flattened_inputs if it is
+  // required, else the object is destroyed and DECREF'd when the object goes
+  // out of scope in this function.
+  tensorflow::Safe_PyObjectPtr py_eager_tensor = nullptr;
+
+  if (!ConvertToTensor(op_exec_info, input, &py_eager_tensor, status)) {
+    return false;
+  }
+
+  TFE_TensorHandle* input_handle = EagerTensor_Handle(py_eager_tensor.get());
+
   if (input_arg != nullptr && !input_arg->type_attr().empty()) {
     auto dtype = TFE_TensorHandleDataType(input_handle);
     TFE_OpSetAttrType(op, input_arg->type_attr().data(), dtype);
     if (flattened_attrs != nullptr) {
-      flattened_attrs->push_back(
+      flattened_attrs->emplace_back(
           GetPythonObjectFromString(input_arg->type_attr().data()));
-      flattened_attrs->push_back(PyLong_FromLong(dtype));
+      flattened_attrs->emplace_back(PyLong_FromLong(dtype));
     }
   }
 
   if (flattened_inputs != nullptr) {
-    flattened_inputs->push_back(input);
+    flattened_inputs->emplace_back(std::move(py_eager_tensor));
   }
+
   TFE_OpAddInput(op, input_handle, status);
   if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) {
     return false;
   }
+
   return true;
 }
 
@@ -1430,67 +1759,53 @@ bool RaiseIfNotPyList(PyObject* list, const string& attr_name) {
   return true;
 }
 
-bool RunCallbacks(bool run_gradient_callback, bool run_post_exec_callbacks,
-                  const tensorflow::OpDef* op_def, PyObject* args,
-                  const std::vector<PyObject*>& flattened_inputs,
-                  const std::vector<PyObject*>& flattened_attrs,
-                  PyObject* flattened_result, PyObject* op_name, PyObject* name,
-                  PyObject* record_gradient_callback, PyObject* callbacks) {
-  PyObject* inputs = PyTuple_New(flattened_inputs.size());
+bool RunCallbacks(
+    const FastPathOpExecInfo& op_exec_info, PyObject* args,
+    const std::vector<tensorflow::Safe_PyObjectPtr>& flattened_inputs,
+    const std::vector<tensorflow::Safe_PyObjectPtr>& flattened_attrs,
+    PyObject* flattened_result) {
+  if (!op_exec_info.run_callbacks) return true;
+
+  tensorflow::Safe_PyObjectPtr inputs(PyTuple_New(flattened_inputs.size()));
   for (int i = 0; i < flattened_inputs.size(); i++) {
-    PyObject* input = flattened_inputs[i];
+    PyObject* input = flattened_inputs[i].get();
     Py_INCREF(input);
-    PyTuple_SET_ITEM(inputs, i, input);
+    PyTuple_SET_ITEM(inputs.get(), i, input);
   }
 
   int num_non_inferred_attrs = PyTuple_GET_SIZE(args) -
-                               op_def->input_arg_size() -
+                               op_exec_info.op_def->input_arg_size() -
                                kFastPathExecuteInputStartIndex;
   int num_attrs = flattened_attrs.size() + num_non_inferred_attrs;
-  PyObject* attrs = PyTuple_New(num_attrs);
+  tensorflow::Safe_PyObjectPtr attrs(PyTuple_New(num_attrs));
 
   for (int i = 0; i < num_non_inferred_attrs; i++) {
-    auto* attr = PyTuple_GET_ITEM(
-        args, kFastPathExecuteInputStartIndex + op_def->input_arg_size() + i);
+    auto* attr =
+        PyTuple_GET_ITEM(args, kFastPathExecuteInputStartIndex +
+                                   op_exec_info.op_def->input_arg_size() + i);
     Py_INCREF(attr);
-    PyTuple_SET_ITEM(attrs, i, attr);
+    PyTuple_SET_ITEM(attrs.get(), i, attr);
   }
   for (int i = num_non_inferred_attrs; i < num_attrs; i++) {
-    // Not INCREFing anything in flattened_attrs as each of those is a new
-    // reference, so allow the attrs tuple to steal the reference.
-    PyTuple_SET_ITEM(attrs, i, flattened_attrs.at(i - num_non_inferred_attrs));
+    PyObject* attr_or_name =
+        flattened_attrs.at(i - num_non_inferred_attrs).get();
+    Py_INCREF(attr_or_name);
+    PyTuple_SET_ITEM(attrs.get(), i, attr_or_name);
   }
 
-  PyObject* callback_args =
-      Py_BuildValue("OOOOO", op_name, inputs, attrs, flattened_result, name);
-
-  auto cleaner = tensorflow::gtl::MakeCleanup([inputs, attrs, callback_args] {
-    Py_DECREF(inputs);
-    Py_DECREF(attrs);
-    Py_DECREF(callback_args);
-  });
-
-  if (run_gradient_callback) {
-    if (!PyCallable_Check(record_gradient_callback)) {
-      PyErr_SetString(PyExc_TypeError,
-                      Printf("expected a function for "
-                             "record_gradient_callback, got %s instead",
-                             record_gradient_callback->ob_type->tp_name)
-                          .c_str());
+  if (op_exec_info.run_gradient_callback) {
+    if (!RecordGradient(op_exec_info.op_name, inputs.get(), attrs.get(),
+                        flattened_result, op_exec_info.name)) {
       return false;
     }
-
-    PyObject* callback_result =
-        PyObject_CallObject(record_gradient_callback, callback_args);
-    if (!callback_result) {
-      return false;
-    }
-    Py_DECREF(callback_result);
   }
 
-  if (run_post_exec_callbacks) {
-    for (Py_ssize_t i = 0; i < PyList_Size(callbacks); i++) {
-      PyObject* callback_fn = PyList_GET_ITEM(callbacks, i);
+  if (op_exec_info.run_post_exec_callbacks) {
+    tensorflow::Safe_PyObjectPtr callback_args(
+        Py_BuildValue("OOOOO", op_exec_info.op_name, inputs.get(), attrs.get(),
+                      flattened_result, op_exec_info.name));
+    for (Py_ssize_t i = 0; i < PyList_Size(op_exec_info.callbacks); i++) {
+      PyObject* callback_fn = PyList_GET_ITEM(op_exec_info.callbacks, i);
       if (!PyCallable_Check(callback_fn)) {
         PyErr_SetString(
             PyExc_TypeError,
@@ -1501,7 +1816,7 @@ bool RunCallbacks(bool run_gradient_callback, bool run_post_exec_callbacks,
         return false;
       }
       PyObject* callback_result =
-          PyObject_CallObject(callback_fn, callback_args);
+          PyObject_CallObject(callback_fn, callback_args.get());
       if (!callback_result) {
         return false;
       }
@@ -1525,15 +1840,30 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
     return nullptr;
   }
 
-  TFE_Context* ctx = reinterpret_cast<TFE_Context*>(
+  FastPathOpExecInfo op_exec_info;
+
+  op_exec_info.ctx = reinterpret_cast<TFE_Context*>(
       PyCapsule_GetPointer(PyTuple_GET_ITEM(args, 0), nullptr));
-  const char* device_name = GetDeviceName(PyTuple_GET_ITEM(args, 1));
-  PyObject* op_name = PyTuple_GET_ITEM(args, 2);
-  const tensorflow::OpDef* op_def = GetOpDef(op_name);
-  if (op_def == nullptr) return nullptr;
-  PyObject* record_gradient_callback = PyTuple_GET_ITEM(args, 3);
-  PyObject* name = PyTuple_GET_ITEM(args, 4);
-  PyObject* callbacks = PyTuple_GET_ITEM(args, 5);
+  op_exec_info.device_name = GetDeviceName(PyTuple_GET_ITEM(args, 1));
+  op_exec_info.op_name = PyTuple_GET_ITEM(args, 2);
+  op_exec_info.op_def = GetOpDef(op_exec_info.op_name);
+  if (op_exec_info.op_def == nullptr) return nullptr;
+  op_exec_info.name = PyTuple_GET_ITEM(args, 3);
+  op_exec_info.callbacks = PyTuple_GET_ITEM(args, 4);
+
+  const tensorflow::OpDef* op_def = op_exec_info.op_def;
+
+  // TODO(nareshmodi): Add a benchmark for the fast-path with gradient callbacks
+  // (similar to benchmark_tf_gradient_function_*). Also consider using an
+  // InlinedVector for flattened_attrs and flattened_inputs if the benchmarks
+  // point out problems with heap allocs.
+  op_exec_info.run_gradient_callback =
+      !*ThreadTapeIsStopped() && !GetTapeSet()->empty();
+  op_exec_info.run_post_exec_callbacks =
+      op_exec_info.callbacks != Py_None &&
+      PyList_Size(op_exec_info.callbacks) > 0;
+  op_exec_info.run_callbacks = op_exec_info.run_gradient_callback ||
+                               op_exec_info.run_post_exec_callbacks;
 
   if (args_size < kFastPathExecuteInputStartIndex + op_def->input_arg_size()) {
     PyErr_SetString(
@@ -1546,7 +1876,7 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
     return nullptr;
   }
 
-  if (!CheckEagerTensors(args, kFastPathExecuteInputStartIndex, *op_def)) {
+  if (!CheckInputsOk(args, kFastPathExecuteInputStartIndex, *op_def)) {
     RaiseFallbackException(
         "This function does not handle the case of the path where "
         "all inputs are not already EagerTensors.");
@@ -1554,7 +1884,7 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
   }
 
   TF_Status* status = TF_NewStatus();
-  TFE_Op* op = TFE_NewOp(ctx, op_def->name().c_str(), status);
+  TFE_Op* op = TFE_NewOp(op_exec_info.ctx, op_def->name().c_str(), status);
   auto cleaner = tensorflow::gtl::MakeCleanup([status, op] {
     TF_DeleteStatus(status);
     TFE_DeleteOp(op);
@@ -1581,8 +1911,8 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
     // OpRegistrationData.
     for (const auto& attr : op_def->attr()) {
       if (attr_name == attr.name()) {
-        SetOpAttrWithDefaults(ctx, op, attr, attr_name.data(), py_attr_value,
-                              &attr_list_sizes, status);
+        SetOpAttrWithDefaults(op_exec_info.ctx, op, attr, attr_name.data(),
+                              py_attr_value, &attr_list_sizes, status);
 
         if (TF_GetCode(status) != TF_OK) {
           RaiseFallbackException(TF_Message(status));
@@ -1594,34 +1924,28 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
     }
   }
 
-  TFE_OpSetDevice(op, device_name, status);
+  TFE_OpSetDevice(op, op_exec_info.device_name, status);
   if (MaybeRaiseExceptionFromTFStatus(status, nullptr)) {
     return nullptr;
   }
 
-  // TODO(nareshmodi): Add a benchmark for the fast-path with gradient callbacks
-  // (similar to benchmark_tf_gradient_function_*). Also consider using an
-  // InlinedVector for flattened_attrs and flattened_inputs if the benchmarks
-  // point out problems with heap allocs.
-  bool run_gradient_callback = !*ThreadTapeIsStopped() &&
-                               !GetTapeSet()->empty() &&
-                               record_gradient_callback != Py_None;
-  bool run_post_exec_callbacks =
-      callbacks != Py_None && PyList_Size(callbacks) > 0;
-  bool run_callbacks = run_gradient_callback || run_post_exec_callbacks;
   // Flat attrs and inputs as required by the record_gradient call. The attrs
   // here only contain inferred attrs (non-inferred attrs are added directly
   // from the input args).
-  // All items in flattened_attrs contain new references.
-  // All items in flattened_inputs contain borrowed references.
+  // All items in flattened_attrs and flattened_inputs contain
+  // Safe_PyObjectPtr - any time something steals a reference to this, it must
+  // INCREF.
   // TODO(nareshmodi): figure out why PyList_New/PyList_Append don't work
   // directly.
-  std::unique_ptr<std::vector<PyObject*>> flattened_attrs = nullptr;
-  std::unique_ptr<std::vector<PyObject*>> flattened_inputs = nullptr;
+  std::unique_ptr<std::vector<tensorflow::Safe_PyObjectPtr>> flattened_attrs =
+      nullptr;
+  std::unique_ptr<std::vector<tensorflow::Safe_PyObjectPtr>> flattened_inputs =
+      nullptr;
 
-  if (run_callbacks) {
-    flattened_attrs.reset(new std::vector<PyObject*>);
-    flattened_inputs.reset(new std::vector<PyObject*>);
+  // TODO(nareshmodi): Encapsulate callbacks information into a struct.
+  if (op_exec_info.run_callbacks) {
+    flattened_attrs.reset(new std::vector<tensorflow::Safe_PyObjectPtr>);
+    flattened_inputs.reset(new std::vector<tensorflow::Safe_PyObjectPtr>);
   }
 
   // Add inferred attrs and inputs.
@@ -1641,16 +1965,16 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
       Py_ssize_t len = PyList_Size(input);
 
       TFE_OpSetAttrInt(op, input_arg.number_attr().data(), len);
-      if (run_callbacks) {
-        flattened_attrs->push_back(
+      if (op_exec_info.run_callbacks) {
+        flattened_attrs->emplace_back(
             GetPythonObjectFromString(input_arg.number_attr().data()));
-        flattened_attrs->push_back(PyLong_FromLong(len));
+        flattened_attrs->emplace_back(PyLong_FromLong(len));
       }
       attr_list_sizes[input_arg.number_attr()] = len;
 
       if (len > 0) {
         // First item adds the type attr.
-        if (!AddInputToOp(PyList_GET_ITEM(input, 0), &input_arg,
+        if (!AddInputToOp(op_exec_info, PyList_GET_ITEM(input, 0), &input_arg,
                           flattened_attrs.get(), flattened_inputs.get(), op,
                           status)) {
           return nullptr;
@@ -1658,7 +1982,8 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
 
         for (Py_ssize_t j = 1; j < len; j++) {
           // Since the list is homogeneous, we don't need to re-add the attr.
-          if (!AddInputToOp(PyList_GET_ITEM(input, j), nullptr /* input_arg */,
+          if (!AddInputToOp(op_exec_info, PyList_GET_ITEM(input, j),
+                            nullptr /* input_arg */,
                             nullptr /* flattened_attrs */,
                             flattened_inputs.get(), op, status)) {
             return nullptr;
@@ -1672,12 +1997,20 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
       Py_ssize_t len = PyList_Size(input);
       tensorflow::gtl::InlinedVector<TF_DataType, 4> attr_value(len);
       PyObject* py_attr_value = nullptr;
-      if (run_callbacks) {
+      if (op_exec_info.run_callbacks) {
         py_attr_value = PyTuple_New(len);
       }
       for (Py_ssize_t j = 0; j < len; j++) {
         PyObject* py_input = PyList_GET_ITEM(input, j);
-        TFE_TensorHandle* input_handle = EagerTensor_Handle(py_input);
+        tensorflow::Safe_PyObjectPtr py_eager_tensor;
+        if (!ConvertToTensor(op_exec_info, py_input, &py_eager_tensor,
+                             status)) {
+          return nullptr;
+        }
+
+        TFE_TensorHandle* input_handle =
+            EagerTensor_Handle(py_eager_tensor.get());
+
         attr_value[j] = TFE_TensorHandleDataType(input_handle);
 
         TFE_OpAddInput(op, input_handle, status);
@@ -1685,22 +2018,23 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
           return nullptr;
         }
 
-        if (run_callbacks) {
-          flattened_inputs->push_back(py_input);
+        if (op_exec_info.run_callbacks) {
+          flattened_inputs->emplace_back(std::move(py_eager_tensor));
 
           PyTuple_SET_ITEM(py_attr_value, j, PyLong_FromLong(attr_value[j]));
         }
       }
-      if (run_callbacks) {
-        flattened_attrs->push_back(GetPythonObjectFromString(attr_name.data()));
-        flattened_attrs->push_back(py_attr_value);
+      if (op_exec_info.run_callbacks) {
+        flattened_attrs->emplace_back(
+            GetPythonObjectFromString(attr_name.data()));
+        flattened_attrs->emplace_back(py_attr_value);
       }
       TFE_OpSetAttrTypeList(op, attr_name.data(), attr_value.data(),
                             attr_value.size());
       attr_list_sizes[attr_name] = len;
     } else {
       // The item is a single item.
-      if (!AddInputToOp(input, &input_arg, flattened_attrs.get(),
+      if (!AddInputToOp(op_exec_info, input, &input_arg, flattened_attrs.get(),
                         flattened_inputs.get(), op, status)) {
         return nullptr;
       }
@@ -1724,12 +2058,14 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
   Py_BEGIN_ALLOW_THREADS;
   TFE_Execute(op, retvals.data(), &num_retvals, status);
   Py_END_ALLOW_THREADS;
+
   if (TF_GetCode(status) != TF_OK) {
     // Augment the status with the op_name for easier debugging similar to
     // TFE_Py_Execute.
     TF_SetStatus(status, TF_GetCode(status),
-                 tensorflow::strings::StrCat(TF_Message(status), " [Op:",
-                                             TFE_GetPythonString(op_name), "]")
+                 tensorflow::strings::StrCat(
+                     TF_Message(status),
+                     " [Op:", TFE_GetPythonString(op_exec_info.op_name), "]")
                      .c_str());
 
     MaybeRaiseExceptionFromTFStatus(status, nullptr);
@@ -1741,10 +2077,8 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
     PyList_SET_ITEM(flat_result, i, EagerTensorFromHandle(retvals[i]));
   }
 
-  if (run_callbacks &&
-      !RunCallbacks(run_gradient_callback, run_post_exec_callbacks, op_def,
-                    args, *flattened_inputs, *flattened_attrs, flat_result,
-                    op_name, name, record_gradient_callback, callbacks)) {
+  if (!RunCallbacks(op_exec_info, args, *flattened_inputs, *flattened_attrs,
+                    flat_result)) {
     return nullptr;
   }
 
@@ -1796,3 +2130,13 @@ PyObject* TFE_Py_FastPathExecute_C(PyObject*, PyObject* args) {
   Py_DECREF(flat_result);
   return result;
 }
+
+PyObject* TFE_Py_RecordGradient(PyObject* op_name, PyObject* inputs,
+                                PyObject* attrs, PyObject* results,
+                                PyObject* name) {
+  if (*ThreadTapeIsStopped() || GetTapeSet()->empty()) {
+    Py_RETURN_NONE;
+  }
+
+  return RecordGradient(op_name, inputs, attrs, results, name);
+}
diff --git a/tensorflow/python/eager/pywrap_tfe_test.py b/tensorflow/python/eager/pywrap_tfe_test.py
index 49323e6640e664ef5f98b227964f9dd4e248ca39..46c5601f47ad746dc3869e390a2db18df4b89134 100644
--- a/tensorflow/python/eager/pywrap_tfe_test.py
+++ b/tensorflow/python/eager/pywrap_tfe_test.py
@@ -21,13 +21,13 @@ from __future__ import print_function
 from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.eager import backprop
 from tensorflow.python.eager import context
-from tensorflow.python.eager import execute
 from tensorflow.python.eager import test
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import random_ops
+from tensorflow.python.ops import resource_variable_ops
 
 
 class Tests(test.TestCase):
@@ -46,15 +46,28 @@ class Tests(test.TestCase):
     self.assertAllClose(
         math_ops.matmul(a_2_by_2, b_2_by_2),
         pywrap_tensorflow.TFE_Py_FastPathExecute(
-            ctx._handle, ctx.device_name, "MatMul", execute.record_gradient,
-            None, None, a_2_by_2, b_2_by_2, "transpose_a", False, "transpose_b",
-            False))
+            ctx._handle, ctx.device_name, "MatMul", None, None, a_2_by_2,
+            b_2_by_2, "transpose_a", False, "transpose_b", False))
     self.assertAllClose(
         math_ops.matmul(a_100_by_784, b_100_by_784, transpose_b=True),
         pywrap_tensorflow.TFE_Py_FastPathExecute(
-            ctx._handle, ctx.device_name, "MatMul", execute.record_gradient,
-            None, None, a_100_by_784, b_100_by_784, "transpose_a", False,
-            "transpose_b", True))
+            ctx._handle, ctx.device_name, "MatMul", None, None, a_100_by_784,
+            b_100_by_784, "transpose_a", False, "transpose_b", True))
+
+  @test_util.assert_no_new_tensors
+  @test_util.assert_no_garbage_created
+  def testFastpathExecute_ResourceVariableMatMulCorrectResponse(self):
+    ctx = context.context()
+    a_2_by_2 = constant_op.constant(1.0, shape=[2, 2])
+    m = resource_variable_ops.ResourceVariable(a_2_by_2)
+    x = pywrap_tensorflow.TFE_Py_FastPathExecute(
+        ctx._handle, ctx.device_name, "MatMul", None, None, m, m, "transpose_a",
+        False, "transpose_b", False)
+    y = pywrap_tensorflow.TFE_Py_FastPathExecute(
+        ctx._handle, ctx.device_name, "MatMul", None, None, a_2_by_2, a_2_by_2,
+        "transpose_a", False, "transpose_b", False)
+
+    self.assertAllEqual(x, y)
 
   @test_util.assert_no_new_tensors
   @test_util.assert_no_garbage_created
@@ -64,12 +77,27 @@ class Tests(test.TestCase):
       a_2_by_2 = constant_op.constant(1.0, shape=[2, 2])
       tape.watch(a_2_by_2)
       z = pywrap_tensorflow.TFE_Py_FastPathExecute(
-          ctx._handle, ctx.device_name, "MatMul", execute.record_gradient, None,
-          None, a_2_by_2, a_2_by_2, "transpose_a", False, "transpose_b", False)
+          ctx._handle, ctx.device_name, "MatMul", None, None, a_2_by_2,
+          a_2_by_2, "transpose_a", False, "transpose_b", False)
     dz_dy = tape.gradient(z, [a_2_by_2])[0]
     self.assertAllEqual(dz_dy.numpy(),
                         constant_op.constant(4.0, shape=[2, 2]).numpy())
 
+  @test_util.assert_no_new_tensors
+  @test_util.assert_no_garbage_created
+  def testFastpathExecute_ResourceVariableTapeWrite(self):
+    ctx = context.context()
+    with backprop.GradientTape(persistent=True) as tape:
+      a_2_by_2 = constant_op.constant(1.0, shape=[2, 2])
+      m = resource_variable_ops.ResourceVariable(a_2_by_2)
+      tape.watch(m)
+      z = pywrap_tensorflow.TFE_Py_FastPathExecute(
+          ctx._handle, ctx.device_name, "MatMul", None, None, m, m,
+          "transpose_a", False, "transpose_b", False)
+    dz_dy = tape.gradient(z, [m])[0]
+    self.assertAllEqual(dz_dy.numpy(),
+                        constant_op.constant(4.0, shape=[2, 2]).numpy())
+
   # Tests homogeneous list op
   @test_util.assert_no_new_tensors
   @test_util.assert_no_garbage_created
@@ -80,9 +108,9 @@ class Tests(test.TestCase):
 
     self.assertAllClose(
         math_ops.add_n([a_2_by_2, b_2_by_2]),
-        pywrap_tensorflow.TFE_Py_FastPathExecute(
-            ctx._handle, ctx.device_name, "AddN", execute.record_gradient, None,
-            None, [a_2_by_2, b_2_by_2]))
+        pywrap_tensorflow.TFE_Py_FastPathExecute(ctx._handle, ctx.device_name,
+                                                 "AddN", None, None,
+                                                 [a_2_by_2, b_2_by_2]))
 
   # Tests homogeneous list op
   @test_util.assert_no_new_tensors
@@ -96,8 +124,8 @@ class Tests(test.TestCase):
       tape.watch(a_2_by_2)
       tape.watch(b_2_by_2)
       z1 = pywrap_tensorflow.TFE_Py_FastPathExecute(
-          ctx._handle, ctx.device_name, "AddN", execute.record_gradient, None,
-          None, [a_2_by_2, b_2_by_2])
+          ctx._handle, ctx.device_name, "AddN", None, None,
+          [a_2_by_2, b_2_by_2])
       z2 = math_ops.add_n([a_2_by_2, b_2_by_2])
     dz1_dy = tape.gradient(z1, [a_2_by_2])[0]
     dz2_dy = tape.gradient(z2, [a_2_by_2])[0]
@@ -113,9 +141,9 @@ class Tests(test.TestCase):
 
     self.assertAllClose(
         array_ops.identity_n([a_2_by_2, b_2_by_2]),
-        pywrap_tensorflow.TFE_Py_FastPathExecute(
-            ctx._handle, ctx.device_name, "IdentityN", execute.record_gradient,
-            None, None, [a_2_by_2, b_2_by_2]))
+        pywrap_tensorflow.TFE_Py_FastPathExecute(ctx._handle, ctx.device_name,
+                                                 "IdentityN", None, None,
+                                                 [a_2_by_2, b_2_by_2]))
 
   # Tests heterogeneous list op
   @test_util.assert_no_new_tensors
@@ -129,8 +157,8 @@ class Tests(test.TestCase):
       tape.watch(a_2_by_2)
       tape.watch(b_2_by_2)
       z1 = pywrap_tensorflow.TFE_Py_FastPathExecute(
-          ctx._handle, ctx.device_name, "IdentityN", execute.record_gradient,
-          None, None, [a_2_by_2, b_2_by_2])
+          ctx._handle, ctx.device_name, "IdentityN", None, None,
+          [a_2_by_2, b_2_by_2])
       z2 = array_ops.identity_n([a_2_by_2, b_2_by_2])
     dz1_dy = tape.gradient(z1[0], [a_2_by_2])[0]
     dz2_dy = tape.gradient(z2[0], [a_2_by_2])[0]
@@ -147,22 +175,20 @@ class Tests(test.TestCase):
 
     # Not enough base params
     with self.assertRaisesRegexp(ValueError,
-                                 "at least 6 items in the input tuple"):
+                                 "at least 5 items in the input tuple"):
       pywrap_tensorflow.TFE_Py_FastPathExecute(ctx_handle, ctx.device_name,
                                                "Identity")
 
     # Not enough inputs
     with self.assertRaisesRegexp(ValueError,
-                                 "Expected to be at least 7, was 6"):
-      pywrap_tensorflow.TFE_Py_FastPathExecute(
-          ctx_handle, ctx_handle, "Identity", backprop._record_gradient, None,
-          [])
+                                 "Expected to be at least 6, was 5"):
+      pywrap_tensorflow.TFE_Py_FastPathExecute(ctx_handle, ctx_handle,
+                                               "Identity", None, [])
 
     # Bad type
     with self.assertRaisesRegexp(TypeError, "expected a string for op_name"):
-      pywrap_tensorflow.TFE_Py_FastPathExecute(
-          ctx_handle, ctx.device_name, ctx_handle, backprop._record_gradient,
-          None, [], a_2_by_2)
+      pywrap_tensorflow.TFE_Py_FastPathExecute(ctx_handle, ctx.device_name,
+                                               ctx_handle, None, [], a_2_by_2)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/python/estimator/canned/head.py b/tensorflow/python/estimator/canned/head.py
index cb9e3fc6ca116ac0f48a37cea92fa4119754f324..8d742a2c6147e86619d4c0aad59b69459384bd4d 100644
--- a/tensorflow/python/estimator/canned/head.py
+++ b/tensorflow/python/estimator/canned/head.py
@@ -1156,6 +1156,7 @@ def _regression_head_with_mean_squared_error_loss(
     label_dimension=1,
     loss_reduction=losses.Reduction.SUM,
     loss_fn=None,
+    inverse_link_fn=None,
     name=None):
   """Creates a `_Head` for regression using the `mean_squared_error` loss.
 
@@ -1174,10 +1175,16 @@ def _regression_head_with_mean_squared_error_loss(
   `[D0, D1, ... DN]`, `[D0, D1, ... DN, 1]` or
   `[D0, D1, ... DN, label_dimension]`.
 
-  Also supports custom `loss_fn`. `loss_fn` takes `(labels, logits)` or
+  Supports custom `loss_fn`. `loss_fn` takes `(labels, logits)` or
   `(labels, logits, features)` as arguments and returns unreduced loss with
   shape `[D0, D1, ... DN, label_dimension]`.
 
+  Also supports custom `inverse_link_fn`, also known as 'mean function'.
+  `inverse_link_fn` takes `logits` as argument and returns predicted values.
+  This function is the inverse of the link function defined in
+  https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function
+  For example, for Poisson regression, set `inverse_link_fn=tf.exp`.
+
   Args:
     weight_column: A string or a `_NumericColumn` created by
       `tf.feature_column.numeric_column` defining feature column representing
@@ -1188,7 +1195,9 @@ def _regression_head_with_mean_squared_error_loss(
       `[batch_size, label_dimension]`).
     loss_reduction: One of `tf.losses.Reduction` except `NONE`. Describes how to
       reduce training loss over batch. Defaults to `SUM`.
-    loss_fn: Optional loss function.
+    loss_fn: Optional loss function. Defaults to `mean_squared_error`.
+    inverse_link_fn: Optional inverse link function, also known as 'mean
+      function'. Defaults to identity.
     name: name of the head. If provided, summary and metrics keys will be
       suffixed by `"/" + name`. Also used as `name_scope` when creating ops.
 
@@ -1208,6 +1217,7 @@ def _regression_head_with_mean_squared_error_loss(
       label_dimension=label_dimension,
       loss_reduction=loss_reduction,
       loss_fn=loss_fn,
+      inverse_link_fn=inverse_link_fn,
       name=name)
 
 
@@ -1220,6 +1230,7 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
       weight_column=None,
       loss_reduction=losses.Reduction.SUM,
       loss_fn=None,
+      inverse_link_fn=None,
       name=None):
     """`Head` for regression."""
     if label_dimension < 1:
@@ -1228,6 +1239,7 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
     self._weight_column = weight_column
     self._loss_reduction = loss_reduction
     self._loss_fn = loss_fn
+    self._inverse_link_fn = inverse_link_fn
     self._name = name
 
   @property
@@ -1294,9 +1306,19 @@ class _RegressionHeadWithMeanSquaredErrorLoss(_Head):
     # Predict.
     with ops.name_scope(self._name, 'head'):
       logits = _check_logits_final_dim(logits, self._logits_dimension)
-      predictions = {prediction_keys.PredictionKeys.PREDICTIONS: logits}
+      if self._inverse_link_fn:
+        predicted_value = self._inverse_link_fn(logits)
+        predictions = {
+            prediction_keys.PredictionKeys.PREDICTIONS: predicted_value,
+            prediction_keys.PredictionKeys.LOGITS: logits,
+        }
+      else:
+        predicted_value = logits
+        predictions = {
+            prediction_keys.PredictionKeys.PREDICTIONS: predicted_value}
       if mode == model_fn.ModeKeys.PREDICT:
-        regression_output = export_output.RegressionOutput(value=logits)
+        regression_output = export_output.RegressionOutput(
+            value=predicted_value)
         return model_fn.EstimatorSpec(
             mode=model_fn.ModeKeys.PREDICT,
             predictions=predictions,
diff --git a/tensorflow/python/estimator/canned/head_test.py b/tensorflow/python/estimator/canned/head_test.py
index c09f88262af3cdbb952a2ebadf2b2bdaf2a651cb..a300f315c18f60e77f262a3b961c5ef6306bc235 100644
--- a/tensorflow/python/estimator/canned/head_test.py
+++ b/tensorflow/python/estimator/canned/head_test.py
@@ -2703,10 +2703,9 @@ class RegressionHeadWithMeanSquaredErrorLossTest(test.TestCase):
     self.assertIsNone(spec.loss)
     self.assertEqual({}, spec.eval_metric_ops)
     self.assertIsNone(spec.train_op)
+    default_serving_key = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
     self.assertItemsEqual(
-        (signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY,
-         'predict',
-         'regression'),
+        (default_serving_key, 'predict', 'regression'),
         spec.export_outputs.keys())
     _assert_no_hooks(self, spec)
 
@@ -2714,6 +2713,54 @@ class RegressionHeadWithMeanSquaredErrorLossTest(test.TestCase):
     with self.test_session():
       _initialize_variables(self, spec.scaffold)
       self.assertAllClose(logits, spec.predictions[prediction_key].eval())
+      self.assertAllClose(
+          logits, spec.export_outputs[default_serving_key].value.eval())
+      self.assertAllClose(
+          logits, spec.export_outputs['regression'].value.eval())
+      self.assertAllClose(
+          logits, spec.export_outputs['predict'].outputs['predictions'].eval())
+
+  def test_predict_with_inverse_link_fn(self):
+    def _inverse_link_fn(logits):
+      return logits - 10.
+    head = head_lib._regression_head_with_mean_squared_error_loss(
+        inverse_link_fn=_inverse_link_fn)
+
+    # Create estimator spec.
+    logits = np.array(((45,), (41,),), dtype=np.int32)
+    expected_predictions = np.array(((35,), (31,),), dtype=np.int32)
+    spec = head.create_estimator_spec(
+        features={'x': np.array(((42.,),), dtype=np.int32)},
+        mode=model_fn.ModeKeys.PREDICT,
+        logits=logits)
+
+    # Assert spec contains expected tensors.
+    keys = prediction_keys.PredictionKeys
+    self.assertItemsEqual(
+        (keys.PREDICTIONS, keys.LOGITS), spec.predictions.keys())
+    self.assertEqual(dtypes.float32, spec.predictions[keys.PREDICTIONS].dtype)
+    self.assertEqual(dtypes.float32, spec.predictions[keys.LOGITS].dtype)
+    default_serving_key = signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY
+    self.assertItemsEqual(
+        (default_serving_key, 'predict', 'regression'),
+        spec.export_outputs.keys())
+
+    # Assert predictions.
+    with self.test_session():
+      _initialize_variables(self, spec.scaffold)
+      self.assertAllClose(
+          expected_predictions, spec.predictions[keys.PREDICTIONS].eval())
+      self.assertAllClose(logits, spec.predictions[keys.LOGITS].eval())
+      self.assertAllClose(
+          expected_predictions,
+          spec.export_outputs[default_serving_key].value.eval())
+      self.assertAllClose(
+          expected_predictions, spec.export_outputs['regression'].value.eval())
+      self.assertAllClose(
+          expected_predictions,
+          spec.export_outputs['predict'].outputs['predictions'].eval())
+      self.assertAllClose(
+          logits, spec.export_outputs['predict'].outputs['logits'].eval())
 
   def test_eval_create_loss(self):
     head = head_lib._regression_head_with_mean_squared_error_loss()
diff --git a/tensorflow/python/estimator/estimator.py b/tensorflow/python/estimator/estimator.py
index 1167b3834eb6a79abf670f629ec2cbc37957d191..60351471f19bad6c0abb42693345f7f77e5736a4 100644
--- a/tensorflow/python/estimator/estimator.py
+++ b/tensorflow/python/estimator/estimator.py
@@ -49,6 +49,7 @@ from tensorflow.python.saved_model import builder as saved_model_builder
 from tensorflow.python.saved_model import tag_constants
 from tensorflow.python.summary import summary
 from tensorflow.python.summary.writer import writer_cache
+from tensorflow.python.training import device_setter
 from tensorflow.python.training import evaluation
 from tensorflow.python.training import monitored_session
 from tensorflow.python.training import saver
@@ -570,7 +571,7 @@ class Estimator(object):
       export_dir_base: A string containing a directory in which to create
         timestamped subdirectories containing exported SavedModels.
       serving_input_receiver_fn: A function that takes no argument and
-        returns a `ServingInputReceiver`.
+        returns a `ServingInputReceiver` or `TensorServingInputReceiver`.
       assets_extra: A dict specifying how to populate the assets.extra directory
         within the exported SavedModel, or `None` if no extra assets are needed.
       as_text: whether to write the SavedModel proto in text format.
@@ -1007,13 +1008,6 @@ def _get_replica_device_setter(config):
   Returns:
     A replica device setter, or None.
   """
-  ps_ops = [
-      'Variable', 'VariableV2', 'AutoReloadVariable', 'MutableHashTable',
-      'MutableHashTableV2', 'MutableHashTableOfTensors',
-      'MutableHashTableOfTensorsV2', 'MutableDenseHashTable',
-      'MutableDenseHashTableV2', 'VarHandleOp'
-  ]
-
   if config.task_type:
     worker_device = '/job:%s/task:%d' % (config.task_type, config.task_id)
   else:
@@ -1024,7 +1018,7 @@ def _get_replica_device_setter(config):
         ps_tasks=config.num_ps_replicas,
         worker_device=worker_device,
         merge_devices=True,
-        ps_ops=ps_ops,
+        ps_ops=list(device_setter.STANDARD_PS_OPS),
         cluster=config.cluster_spec)
   else:
     return None
diff --git a/tensorflow/python/estimator/estimator_test.py b/tensorflow/python/estimator/estimator_test.py
index b0a7752ec74913c959fc176e3eb9001f7418b4a2..ac0ff41dd260ce76e214af740f9334b58e725563 100644
--- a/tensorflow/python/estimator/estimator_test.py
+++ b/tensorflow/python/estimator/estimator_test.py
@@ -48,6 +48,7 @@ from tensorflow.python.ops import check_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import lookup_ops
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import metrics as metrics_lib
 from tensorflow.python.ops import parsing_ops
 from tensorflow.python.ops import state_ops
@@ -1291,8 +1292,8 @@ class EstimatorEvaluateTest(test.TestCase):
     writer_cache.FileWriterCache.clear()
 
     # Get last evaluation Event written.
-    if check_eventfile_for_keyword('image', 
-                                   os.path.join(est.model_dir, 'eval')):
+    if check_eventfile_for_keyword('image', os.path.join(est.model_dir,
+                                                         'eval')):
       return
     self.fail('{} should be part of reported summaries.'.format('image'))
 
@@ -1936,6 +1937,60 @@ class EstimatorExportTest(test.TestCase):
     # cleanup
     gfile.DeleteRecursively(tmpdir)
 
+  def test_export_savedmodel_tensor_features(self):
+    """Test that models accepting a single raw Tensor can be exported.
+
+    See https://github.com/tensorflow/tensorflow/issues/11674
+
+    If the model_fn and receiver_fn accept raw tensors rather than dictionaries
+    as input, export_savedmodel should be okay with that, too.
+
+    """
+
+    tmpdir = tempfile.mkdtemp()
+
+    def _input_fn_tensor_features():
+      t = array_ops.constant([1, 2, 3], dtype=dtypes.float32, shape=[1, 3])
+      return (t, None)
+
+    def _model_fn_tensor_features(features, labels, mode):
+      _ = labels
+      prediction = math_ops.matmul(features, features, transpose_b=True)
+
+      return model_fn_lib.EstimatorSpec(
+          mode,
+          predictions=prediction,
+          loss=constant_op.constant(1.),
+          train_op=state_ops.assign_add(training.get_global_step(), 1),
+          export_outputs={
+              'test': export_output.PredictOutput({'prediction': prediction})
+          })
+
+    def _serving_input_receiver_fn():
+      feat = array_ops.placeholder(dtype=dtypes.float32)
+      return export.TensorServingInputReceiver(
+          features=feat, receiver_tensors=feat)
+
+    est = estimator.Estimator(model_fn=_model_fn_tensor_features)
+    est.train(input_fn=_input_fn_tensor_features, steps=1)
+
+    # Perform the export.
+    export_dir_base = os.path.join(
+        compat.as_bytes(tmpdir), compat.as_bytes('export'))
+    export_dir = est.export_savedmodel(
+        export_dir_base, _serving_input_receiver_fn)
+
+    # Restore, to validate that the export was well-formed.
+    with ops.Graph().as_default() as graph:
+      with session.Session(graph=graph) as sess:
+        loader.load(sess, [tag_constants.SERVING], export_dir)
+        graph_ops = [x.name.lower() for x in graph.get_operations()]
+        self.assertTrue('const' in graph_ops)
+        self.assertTrue('matmul' in graph_ops)
+
+    # Clean up.
+    gfile.DeleteRecursively(tmpdir)
+
   def test_scaffold_is_used_for_saver(self):
     tmpdir = tempfile.mkdtemp()
 
diff --git a/tensorflow/python/estimator/export/export.py b/tensorflow/python/estimator/export/export.py
index 83251c79fc561e16ebddb638668b92b3c69b8af4..f240e11478bac4071fde87a07e6168b5a4a7b286 100644
--- a/tensorflow/python/estimator/export/export.py
+++ b/tensorflow/python/estimator/export/export.py
@@ -120,6 +120,62 @@ class ServingInputReceiver(collections.namedtuple(
         receiver_tensors_alternatives=receiver_tensors_alternatives)
 
 
+@tf_export('estimator.export.TensorServingInputReceiver')
+class TensorServingInputReceiver(collections.namedtuple(
+    'TensorServingInputReceiver',
+    ['features', 'receiver_tensors', 'receiver_tensors_alternatives'])):
+  """A return type for a serving_input_receiver_fn.
+
+  This is for use with models that expect a single `Tensor` or `SparseTensor`
+  as an input feature, as opposed to a dict of features.
+
+  The normal `ServingInputReceiver` always returns a feature dict, even if it
+  contains only one entry, and so can be used only with models that accept such
+  a dict.  For models that accept only a single raw feature, the
+  `serving_input_receiver_fn` provided to `Estimator.export_savedmodel()` should
+  return this `TensorServingInputReceiver` instead.  See:
+  https://github.com/tensorflow/tensorflow/issues/11674
+
+  Note that the receiver_tensors and receiver_tensors_alternatives arguments
+  will be automatically converted to the dict representation in either case,
+  because the SavedModel format requires each input `Tensor` to have a name
+  (provided by the dict key).
+
+  The expected return values are:
+    features: A single `Tensor` or `SparseTensor`, representing the feature
+      to be passed to the model.
+    receiver_tensors: a `Tensor`, or a dict of string to `Tensor`, specifying
+      input nodes where this receiver expects to be fed by default.  Typically,
+      this is a single placeholder expecting serialized `tf.Example` protos.
+    receiver_tensors_alternatives: a dict of string to additional
+      groups of receiver tensors, each of which may be a `Tensor` or a dict of
+      string to `Tensor`.  These named receiver tensor alternatives generate
+      additional serving signatures, which may be used to feed inputs at
+      different points within the input receiver subgraph.  A typical usage is
+      to allow feeding raw feature `Tensor`s *downstream* of the
+      tf.parse_example() op.  Defaults to None.
+  """
+
+  def __new__(cls, features, receiver_tensors,
+              receiver_tensors_alternatives=None):
+    if features is None:
+      raise ValueError('features must be defined.')
+    if not (isinstance(features, ops.Tensor)
+            or isinstance(features, sparse_tensor.SparseTensor)):
+      raise ValueError('feature must be a Tensor or SparseTensor.')
+
+    receiver = ServingInputReceiver(
+        features=features,
+        receiver_tensors=receiver_tensors,
+        receiver_tensors_alternatives=receiver_tensors_alternatives)
+
+    return super(TensorServingInputReceiver, cls).__new__(
+        cls,
+        features=receiver.features[_SINGLE_FEATURE_DEFAULT_NAME],
+        receiver_tensors=receiver.receiver_tensors,
+        receiver_tensors_alternatives=receiver.receiver_tensors_alternatives)
+
+
 @tf_export('estimator.export.build_parsing_serving_input_receiver_fn')
 def build_parsing_serving_input_receiver_fn(feature_spec,
                                             default_batch_size=None):
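Review note (illustrative, not part of the patch): the class added above is a namedtuple subclass whose `__new__` validates its arguments before the tuple is built, and which accepts exactly one feature rather than a dict. The sketch below shows that pattern in plain Python with no TensorFlow dependency. `FakeTensor`, `SingleFeatureReceiver`, and the default key `'input'` are hypothetical names chosen for illustration; the real class delegates its checks to `ServingInputReceiver`.

```python
import collections


class FakeTensor(object):
  """Hypothetical stand-in for a Tensor; not a TensorFlow type."""

  def __init__(self, name):
    self.name = name


class SingleFeatureReceiver(collections.namedtuple(
    'SingleFeatureReceiver', ['features', 'receiver_tensors'])):
  """Holds exactly one feature instead of a dict of features."""

  def __new__(cls, features, receiver_tensors):
    if features is None:
      raise ValueError('features must be defined.')
    if not isinstance(features, FakeTensor):
      raise ValueError('feature must be a FakeTensor.')
    # As the patch docstring describes, a bare receiver tensor is wrapped in
    # a dict so that every input has a name. The key 'input' is an assumption.
    if not isinstance(receiver_tensors, dict):
      receiver_tensors = {'input': receiver_tensors}
    return super(SingleFeatureReceiver, cls).__new__(
        cls, features=features, receiver_tensors=receiver_tensors)
```

Validating in `__new__` rather than `__init__` matters here because a namedtuple is immutable: by the time `__init__` runs, the fields can no longer be normalized.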
diff --git a/tensorflow/python/estimator/export/export_lib.py b/tensorflow/python/estimator/export/export_lib.py
index 99cd81d678bc04e7ed52de721a1fdf799c116795..226fc97fd3a3aefe61c4b88088873ce7489168c7 100644
--- a/tensorflow/python/estimator/export/export_lib.py
+++ b/tensorflow/python/estimator/export/export_lib.py
@@ -22,6 +22,7 @@ from __future__ import print_function
 from tensorflow.python.estimator.export.export import build_parsing_serving_input_receiver_fn
 from tensorflow.python.estimator.export.export import build_raw_serving_input_receiver_fn
 from tensorflow.python.estimator.export.export import ServingInputReceiver
+from tensorflow.python.estimator.export.export import TensorServingInputReceiver
 from tensorflow.python.estimator.export.export_output import ClassificationOutput
 from tensorflow.python.estimator.export.export_output import ExportOutput
 from tensorflow.python.estimator.export.export_output import PredictOutput
@@ -34,6 +35,7 @@ _allowed_symbols = [
     'build_parsing_serving_input_receiver_fn',
     'build_raw_serving_input_receiver_fn',
     'ServingInputReceiver',
+    'TensorServingInputReceiver',
     'ClassificationOutput',
     'ExportOutput',
     'PredictOutput',
diff --git a/tensorflow/python/estimator/export/export_test.py b/tensorflow/python/estimator/export/export_test.py
index 8442bf04accbd0bc15f5958069bf3060debd42bc..eb9688bc973666554b6057f5f546b9a2d18461d6 100644
--- a/tensorflow/python/estimator/export/export_test.py
+++ b/tensorflow/python/estimator/export/export_test.py
@@ -385,5 +385,67 @@ class ExportTest(test_util.TensorFlowTestCase):
     self.assertTrue(int(time_2) < int(time_3))
 
 
+class TensorServingReceiverTest(test_util.TensorFlowTestCase):
+
+  def test_tensor_serving_input_receiver_constructor(self):
+    features = constant_op.constant([0])
+    receiver_tensors = {
+        "example0": array_ops.placeholder(dtypes.string, name="example0"),
+        u"example1": array_ops.placeholder(dtypes.string, name="example1"),
+    }
+    r = export.TensorServingInputReceiver(features, receiver_tensors)
+    self.assertTrue(isinstance(r.features, ops.Tensor))
+    self.assertTrue(isinstance(r.receiver_tensors, dict))
+
+  def test_tensor_serving_input_receiver_sparse(self):
+    features = sparse_tensor.SparseTensor(
+        indices=[[0, 0]], values=[1], dense_shape=[1, 1])
+    receiver_tensors = {
+        "example0": array_ops.placeholder(dtypes.string, name="example0"),
+        u"example1": array_ops.placeholder(dtypes.string, name="example1"),
+    }
+    r = export.TensorServingInputReceiver(features, receiver_tensors)
+    self.assertTrue(isinstance(r.features, sparse_tensor.SparseTensor))
+    self.assertTrue(isinstance(r.receiver_tensors, dict))
+
+  def test_serving_input_receiver_features_invalid(self):
+    receiver_tensors = {
+        "example0": array_ops.placeholder(dtypes.string, name="example0"),
+        u"example1": array_ops.placeholder(dtypes.string, name="example1"),
+    }
+
+    with self.assertRaisesRegexp(ValueError, "features must be defined"):
+      export.TensorServingInputReceiver(
+          features=None,
+          receiver_tensors=receiver_tensors)
+
+    with self.assertRaisesRegexp(ValueError, "feature must be a Tensor"):
+      export.TensorServingInputReceiver(
+          features={"1": constant_op.constant([1])},
+          receiver_tensors=receiver_tensors)
+
+  def test_serving_input_receiver_receiver_tensors_invalid(self):
+    features = constant_op.constant([0])
+
+    with self.assertRaisesRegexp(
+        ValueError, "receiver_tensors must be defined"):
+      export.TensorServingInputReceiver(
+          features=features,
+          receiver_tensors=None)
+
+    with self.assertRaisesRegexp(
+        ValueError, "receiver_tensors keys must be strings"):
+      export.TensorServingInputReceiver(
+          features=features,
+          receiver_tensors={
+              1: array_ops.placeholder(dtypes.string, name="example0")})
+
+    with self.assertRaisesRegexp(
+        ValueError, "receiver_tensor example1 must be a Tensor"):
+      export.TensorServingInputReceiver(
+          features=features,
+          receiver_tensors={"example1": [1]})
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/estimator/run_config.py b/tensorflow/python/estimator/run_config.py
index 3e021242c4cc914990c6b38736b8f725213b5b7e..62f035bce558f57c5fd39d60b44cf8eb0130ce38 100644
--- a/tensorflow/python/estimator/run_config.py
+++ b/tensorflow/python/estimator/run_config.py
@@ -345,7 +345,7 @@ class RunConfig(object):
       os.environ['TF_CONFIG'] = json.dumps(
           {'cluster': cluster,
            'task': {'type': 'worker', 'index': 1}})
-      config = ClusterConfig()
+      config = RunConfig()
       assert config.master == 'host4:2222'
       assert config.task_id == 1
       assert config.num_ps_replicas == 2
@@ -363,7 +363,7 @@ class RunConfig(object):
       os.environ['TF_CONFIG'] = json.dumps(
           {'cluster': cluster,
            'task': {'type': 'chief', 'index': 0}})
-      config = ClusterConfig()
+      config = RunConfig()
       assert config.master == 'host0:2222'
       assert config.task_id == 0
       assert config.num_ps_replicas == 2
@@ -381,7 +381,7 @@ class RunConfig(object):
       os.environ['TF_CONFIG'] = json.dumps(
           {'cluster': cluster,
            'task': {'type': 'evaluator', 'index': 0}})
-      config = ClusterConfig()
+      config = RunConfig()
       assert config.master == ''
       assert config.evaluator_master == ''
       assert config.task_id == 0
diff --git a/tensorflow/python/estimator/training.py b/tensorflow/python/estimator/training.py
index 63328dcfb55646ce2aaf8929d5517c8522c418f2..2cc3331a15867e9a984847391857bf84baee7424 100644
--- a/tensorflow/python/estimator/training.py
+++ b/tensorflow/python/estimator/training.py
@@ -455,15 +455,21 @@ class _TrainingExecutor(object):
                train_hooks=None,
                continuous_eval_listener=None):
     if not isinstance(estimator, estimator_lib.Estimator):
-      raise TypeError('`estimator` must have type `tf.estimator.Estimator`.')
+      raise TypeError(
+          '`estimator` must have type `tf.estimator.Estimator`. '
+          'Got: {}'.format(type(estimator)))
     self._estimator = estimator
 
     if not isinstance(train_spec, TrainSpec):
-      raise TypeError('`train_spec` must have type `tf.estimator.TrainSpec`.')
+      raise TypeError(
+          '`train_spec` must have type `tf.estimator.TrainSpec`. '
+          'Got: {}'.format(type(train_spec)))
     self._train_spec = train_spec
 
     if not isinstance(eval_spec, EvalSpec):
-      raise TypeError('`eval_spec` must have type `tf.estimator.EvalSpec`.')
+      raise TypeError(
+          '`eval_spec` must have type `tf.estimator.EvalSpec`. '
+          'Got: {}'.format(type(eval_spec)))
     self._eval_spec = eval_spec
 
     self._train_hooks = _validate_hooks(train_hooks)
diff --git a/tensorflow/python/feature_column/BUILD b/tensorflow/python/feature_column/BUILD
index a758f8a4fc4898713772c4e919acda48b0f6ad0b..238a90b67d9d0039c25a6f3800aad25a2db9e36f 100644
--- a/tensorflow/python/feature_column/BUILD
+++ b/tensorflow/python/feature_column/BUILD
@@ -74,7 +74,10 @@ py_test(
     srcs = ["feature_column_test.py"],
     data = [":vocabulary_testdata"],
     srcs_version = "PY2AND3",
-    tags = ["no_pip"],
+    tags = [
+        "no_cuda_on_cpu_tap",
+        "no_pip",
+    ],
     deps = [
         ":feature_column",
         ":feature_column_py",
diff --git a/tensorflow/python/framework/function.py b/tensorflow/python/framework/function.py
index cba225e749d88a45c43266e45172a7335a8e0b71..caa604999c2fad4ce111d910a77e4b99399c11ca 100644
--- a/tensorflow/python/framework/function.py
+++ b/tensorflow/python/framework/function.py
@@ -353,8 +353,10 @@ class _DefinedFunction(object):
           outputs = (outputs,)
         if any([_ is None for _ in outputs]):
           raise ValueError("Function can not return None.")
-      # Ensures each output is a Tensor.
-      outputs = [ops.convert_to_tensor(_) for _ in outputs]
+      # Ensures each output is a Tensor in the function graph.
+      outputs = [ops.convert_to_tensor(t) for t in outputs]
+      outputs = [temp_graph.capture(t) if t.graph is not temp_graph else t
+                 for t in outputs]
     self._extra_inputs = temp_graph.extra_inputs
     inputs.extend(temp_graph.extra_args)
     # pylint: disable=protected-access
@@ -683,28 +685,34 @@ class _FuncGraph(ops.Graph):
   def create_op(self, op_type, inputs, data_types, **kwargs):
     for i, x in enumerate(inputs):
       if isinstance(x, ops.EagerTensor) or x.graph is not self:
-        # Referring to a tensor from other graph.
-        if x in self._captured:
-          # Captured already.
-          inputs[i] = self._captured[x]
-        elif self._capture_by_value:
-          inputs[i] = self._add_tensor_and_parents(x)
-        else:
-          # Substitute with a placeholder.
-          self.extra_inputs.append(x)
-          # Hoist the new input placeholder out of any control flow context
-          # we're currently in.
-          with ops.control_dependencies(None):
-            ph = array_ops.placeholder(x.dtype, shape=x.get_shape())
-          # pylint: disable=protected-access
-          ph._handle_data = x._handle_data
-          # pylint: enable=protected-access
-          inputs[i] = ph
-          self._captured[x] = ph
-          self.extra_args.append(ph)
+        inputs[i] = self.capture(x)
     return super(_FuncGraph, self).create_op(op_type, inputs, data_types,
                                              **kwargs)
 
+  def capture(self, tensor):
+    """Adds the given tensor to this graph and returns the captured tensor."""
+    if tensor in self._captured:
+      # Captured already.
+      return self._captured[tensor]
+    elif self._capture_by_value:
+      return self._add_tensor_and_parents(tensor)
+    else:
+      return self._capture_tensor_as_extra_input(tensor)
+
+  def _capture_tensor_as_extra_input(self, tensor):
+    # Substitute with a placeholder.
+    self.extra_inputs.append(tensor)
+    # Hoist the new input placeholder out of any control flow context
+    # we're currently in.
+    with ops.control_dependencies(None):
+      ph = array_ops.placeholder(tensor.dtype, shape=tensor.get_shape())
+    # pylint: disable=protected-access
+    ph._handle_data = tensor._handle_data
+    # pylint: enable=protected-access
+    self._captured[tensor] = ph
+    self.extra_args.append(ph)
+    return ph
+
   def _add_tensor_and_parents(self, tensor):
     op = self._add_op_and_parents(tensor.op)
     return op.outputs[tensor.value_index]
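Review note (illustrative, not part of the patch): the refactoring above moves the capture logic out of `create_op` into a reusable `capture` method, so that function outputs that live in an outer graph can be captured as well. The memoization at its core can be sketched in plain Python as follows. `Placeholder` and `CapturingGraph` are hypothetical stand-ins, and the capture-by-value branch of the real method is deliberately omitted.

```python
class Placeholder(object):
  """Hypothetical stand-in for a graph placeholder."""

  def __init__(self, source):
    self.source = source


class CapturingGraph(object):
  """Toy model of a function graph that captures outer values as inputs."""

  def __init__(self):
    self._captured = {}     # outer value -> placeholder inside this graph
    self.extra_inputs = []  # outer values, in capture order
    self.extra_args = []    # matching placeholders, in capture order

  def capture(self, value):
    if value in self._captured:
      # Captured already: reuse the same placeholder.
      return self._captured[value]
    ph = Placeholder(value)
    self.extra_inputs.append(value)
    self.extra_args.append(ph)
    self._captured[value] = ph
    return ph
```

Because the lookup happens first, capturing the same outer value twice adds only one extra input to the function signature.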
diff --git a/tensorflow/python/framework/function_test.py b/tensorflow/python/framework/function_test.py
index 301a7f682dde8dbeccd1e81675b0059433990a09..65ca801cbe922b36e3bc72bc2fbcd88f66aa5290 100644
--- a/tensorflow/python/framework/function_test.py
+++ b/tensorflow/python/framework/function_test.py
@@ -193,7 +193,7 @@ class FunctionTest(test.TestCase):
 
     @function.Defun(dtypes.float32, dtypes.float32)
     def XSquarePlusOneGrad(x, dy):
-      dx = functional_ops._symbolic_gradient(
+      dx = functional_ops.symbolic_gradient(
           input=[x, dy], Tout=[dtypes.float32], f="XSquarePlusOneFn", name="dx")
       return dx
 
@@ -295,7 +295,7 @@ class FunctionTest(test.TestCase):
       # gradient function is (x, y, dz) -> (dx, dy).  dx's shape
       # should be the same as x's; and dy's shape should be the same
       # as y's.
-      dx, dy = functional_ops._symbolic_gradient(
+      dx, dy = functional_ops.symbolic_gradient(
           input=[x, y, dz], Tout=[dtypes.float32] * 2, f="Foo")
       self.assertEqual(x.get_shape(), dx.get_shape())
       self.assertEqual(y.get_shape(), dy.get_shape())
@@ -725,9 +725,16 @@ class FunctionTest(test.TestCase):
 
       y = Foo(constant_op.constant([[10.]]))
 
+      @function.Defun()
+      def Bar():
+        return w
+
+      z = Bar()
+
     with self.test_session(graph=g):
       variables.global_variables_initializer().run()
       self.assertAllEqual(y.eval(), [[12.0]])
+      self.assertAllEqual(z.eval(), [[1.0]])
 
   def testCaptureControls(self):
     g = ops.Graph()
diff --git a/tensorflow/python/framework/graph_util_impl.py b/tensorflow/python/framework/graph_util_impl.py
index 5a543317e665a940841714fd72d834a430f8406a..910364364c8be84b1a629dbdaae5e69443d07e75 100644
--- a/tensorflow/python/framework/graph_util_impl.py
+++ b/tensorflow/python/framework/graph_util_impl.py
@@ -235,7 +235,7 @@ def convert_variables_to_constants(sess,
   variable_names = []
   variable_dict_names = []
   for node in inference_graph.node:
-    if node.op in ["Variable", "VariableV2"]:
+    if node.op in ["Variable", "VariableV2", "VarHandleOp"]:
       variable_name = node.name
       if ((variable_names_whitelist is not None and
            variable_name not in variable_names_whitelist) or
@@ -243,7 +243,10 @@ def convert_variables_to_constants(sess,
            variable_name in variable_names_blacklist)):
         continue
       variable_dict_names.append(variable_name)
-      variable_names.append(variable_name + ":0")
+      if node.op == "VarHandleOp":
+        variable_names.append(variable_name + "/Read/ReadVariableOp:0")
+      else:
+        variable_names.append(variable_name + ":0")
   if variable_names:
     returned_variables = sess.run(variable_names)
   else:
@@ -266,6 +269,17 @@ def convert_variables_to_constants(sess,
               tensor=tensor_util.make_tensor_proto(
                   data, dtype=dtype.type, shape=data.shape)))
       how_many_converted += 1
+    elif input_node.op == "ReadVariableOp" and (
+        input_node.input[0] in found_variables):
+      # The preceding branch converts all VarHandleOps of ResourceVariables to
+      # constants, so we need to convert the associated ReadVariableOps to
+      # Identity ops.
+      output_node.op = "Identity"
+      output_node.name = input_node.name
+      output_node.input.extend([input_node.input[0]])
+      output_node.attr["T"].CopyFrom(input_node.attr["dtype"])
+      if "_class" in input_node.attr:
+        output_node.attr["_class"].CopyFrom(input_node.attr["_class"])
     else:
       output_node.CopyFrom(input_node)
     output_graph_def.node.extend([output_node])
diff --git a/tensorflow/python/framework/graph_util_test.py b/tensorflow/python/framework/graph_util_test.py
index 0421837d49de753d642aed59d1524619a243dcb8..b618152b0256fd043dc7259960d867278ba55b0a 100644
--- a/tensorflow/python/framework/graph_util_test.py
+++ b/tensorflow/python/framework/graph_util_test.py
@@ -32,6 +32,7 @@ from tensorflow.python.framework import tensor_util
 from tensorflow.python.ops import gen_state_ops
 from tensorflow.python.ops import math_ops  # pylint: disable=unused-import
 from tensorflow.python.ops import math_ops as math_ops_lib
+from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 
@@ -47,46 +48,46 @@ class DeviceFunctionsTest(test.TestCase):
 
   def testTwoDeviceFunctions(self):
     with ops.Graph().as_default() as g:
-      var_0 = gen_state_ops._variable(
+      var_0 = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="var_0",
           container="",
           shared_name="")
       with g.device(test_device_func_pin_variable_to_cpu):
-        var_1 = gen_state_ops._variable(
+        var_1 = gen_state_ops.variable(
             shape=[1],
             dtype=dtypes.float32,
             name="var_1",
             container="",
             shared_name="")
-      var_2 = gen_state_ops._variable(
+      var_2 = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="var_2",
           container="",
           shared_name="")
-      var_3 = gen_state_ops._variable(
+      var_3 = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="var_3",
           container="",
           shared_name="")
       with g.device(test_device_func_pin_variable_to_cpu):
-        var_4 = gen_state_ops._variable(
+        var_4 = gen_state_ops.variable(
             shape=[1],
             dtype=dtypes.float32,
             name="var_4",
             container="",
             shared_name="")
         with g.device("/device:GPU:0"):
-          var_5 = gen_state_ops._variable(
+          var_5 = gen_state_ops.variable(
               shape=[1],
               dtype=dtypes.float32,
               name="var_5",
               container="",
               shared_name="")
-        var_6 = gen_state_ops._variable(
+        var_6 = gen_state_ops.variable(
             shape=[1],
             dtype=dtypes.float32,
             name="var_6",
@@ -226,52 +227,62 @@ class DeviceFunctionsTest(test.TestCase):
                          constant_graph_def.library)
 
   def testConvertVariablesToConsts(self):
-    with ops.Graph().as_default():
-      variable_node = variables.Variable(1.0, name="variable_node")
-      _ = variables.Variable(1.0, name="unused_variable_node")
-      output_node = math_ops_lib.multiply(
-          variable_node, 2.0, name="output_node")
-      with session.Session() as sess:
-        init = variables.initialize_variables([variable_node])
-        sess.run(init)
-        output = sess.run(output_node)
-        self.assertNear(2.0, output, 0.00001)
-        variable_graph_def = sess.graph.as_graph_def()
-        # First get the constant_graph_def when variable_names_whitelist is set,
-        # note that if variable_names_whitelist is not set an error will be
-        # thrown because unused_variable_node is not initialized.
-        constant_graph_def = graph_util.convert_variables_to_constants(
-            sess,
-            variable_graph_def, ["output_node"],
-            variable_names_whitelist=set(["variable_node"]))
+    self._test_variable_to_const_conversion(use_resource=False)
 
-        # Then initialize the unused variable, and get another
-        # constant_graph_def when variable_names_whitelist is not set.
-        sess.run(variables.global_variables_initializer())
-        constant_graph_def_without_variable_whitelist = (
-            graph_util.convert_variables_to_constants(sess, variable_graph_def,
-                                                      ["output_node"]))
-
-        # The unused variable should be cleared so the two graphs should be
-        # equivalent.
-        self.assertEqual(
-            str(constant_graph_def),
-            str(constant_graph_def_without_variable_whitelist))
-
-        # Test variable name black list. This should result in the variable not
-        # being a const.
-        sess.run(variables.global_variables_initializer())
-        constant_graph_def_with_blacklist = (
-            graph_util.convert_variables_to_constants(
-                sess,
-                variable_graph_def, ["output_node"],
-                variable_names_blacklist=set(["variable_node"])))
-        variable_node = None
-        for node in constant_graph_def_with_blacklist.node:
-          if node.name == "variable_node":
-            variable_node = node
-        self.assertIsNotNone(variable_node)
-        self.assertEqual(variable_node.op, "VariableV2")
+  def testConvertResourceVariablesToConsts(self):
+    self._test_variable_to_const_conversion(use_resource=True)
+
+  def _test_variable_to_const_conversion(self, use_resource):
+    with ops.Graph().as_default():
+      with variable_scope.variable_scope("", use_resource=use_resource):
+        variable_node = variable_scope.get_variable(
+            "variable_node", initializer=1.0)
+        another_variable = variable_scope.get_variable(
+            "unused_variable_node", initializer=1.0)
+        output_node = math_ops_lib.multiply(
+            variable_node, 2.0, name="output_node")
+        with session.Session() as sess:
+          sess.run(variable_node.initializer)
+          output = sess.run(output_node)
+          self.assertNear(2.0, output, 0.00001)
+          variable_graph_def = sess.graph.as_graph_def()
+          # First get the constant_graph_def when variable_names_whitelist is
+          # set. If variable_names_whitelist is not set, an error will be
+          # thrown because unused_variable_node is not initialized.
+          constant_graph_def = graph_util.convert_variables_to_constants(
+              sess,
+              variable_graph_def, ["output_node"],
+              variable_names_whitelist=set(["variable_node"]))
+
+          # Then initialize the unused variable, and get another
+          # constant_graph_def when variable_names_whitelist is not set.
+          sess.run(another_variable.initializer)
+          constant_graph_def_without_variable_whitelist = (
+              graph_util.convert_variables_to_constants(
+                  sess, variable_graph_def, ["output_node"]))
+
+          # The unused variable should be cleared so the two graphs should be
+          # equivalent.
+          self.assertEqual(
+              str(constant_graph_def),
+              str(constant_graph_def_without_variable_whitelist))
+
+          # Test variable name blacklist. This should result in the variable
+          # not being a const.
+          constant_graph_def_with_blacklist = (
+              graph_util.convert_variables_to_constants(
+                  sess,
+                  variable_graph_def, ["output_node"],
+                  variable_names_blacklist=set(["variable_node"])))
+          variable_node = None
+          for node in constant_graph_def_with_blacklist.node:
+            if node.name == "variable_node":
+              variable_node = node
+          self.assertIsNotNone(variable_node)
+          if use_resource:
+            self.assertEqual(variable_node.op, "VarHandleOp")
+          else:
+            self.assertEqual(variable_node.op, "VariableV2")
 
     # Now we make sure the variable is now a constant, and that the graph still
     # produces the expected result.
@@ -279,8 +290,9 @@ class DeviceFunctionsTest(test.TestCase):
       _ = importer.import_graph_def(constant_graph_def, name="")
       self.assertEqual(4, len(constant_graph_def.node))
       for node in constant_graph_def.node:
-        self.assertNotEqual("Variable", node.op)
-        self.assertNotEqual("VariableV2", node.op)
+        self.assertNotIn(
+            node.op,
+            ["Variable", "VariableV2", "VarHandleOp", "ReadVariableOp"])
       with session.Session() as sess:
         output_node = sess.graph.get_tensor_by_name("output_node:0")
         output = sess.run(output_node)
diff --git a/tensorflow/python/framework/meta_graph.py b/tensorflow/python/framework/meta_graph.py
index 8c03a5f19dee31a6609590e46d608af9a686c5fe..4c1bd736d727e974375ad9008a579361137fb9d6 100644
--- a/tensorflow/python/framework/meta_graph.py
+++ b/tensorflow/python/framework/meta_graph.py
@@ -741,6 +741,7 @@ def import_scoped_meta_graph(meta_graph_or_file,
         producer_op_list=producer_op_list)
 
     # Restores all the other collections.
+    variable_objects = {}
     for key, col_def in sorted(meta_graph_def.collection_def.items()):
       # Don't add unbound_inputs to the new graph.
       if key == unbound_inputs_col_name:
@@ -756,11 +757,23 @@ def import_scoped_meta_graph(meta_graph_or_file,
       from_proto = ops.get_from_proto_function(key)
       if from_proto and kind == "bytes_list":
         proto_type = ops.get_collection_proto_type(key)
-        for value in col_def.bytes_list.value:
-          proto = proto_type()
-          proto.ParseFromString(value)
-          graph.add_to_collection(
-              key, from_proto(proto, import_scope=scope_to_prepend_to_names))
+        if key in ops.GraphKeys._VARIABLE_COLLECTIONS:  # pylint: disable=protected-access
+          for value in col_def.bytes_list.value:
+            variable = variable_objects.get(value, None)
+            if variable is None:
+              proto = proto_type()
+              proto.ParseFromString(value)
+              variable = from_proto(
+                  proto, import_scope=scope_to_prepend_to_names)
+              variable_objects[value] = variable
+            graph.add_to_collection(key, variable)
+        else:
+          for value in col_def.bytes_list.value:
+            proto = proto_type()
+            proto.ParseFromString(value)
+            graph.add_to_collection(
+                key, from_proto(
+                    proto, import_scope=scope_to_prepend_to_names))
       else:
         field = getattr(col_def, kind)
         if key in _COMPAT_COLLECTION_LIST:
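Review note (illustrative, not part of the patch): the change above keys a cache on the serialized bytes of each variable proto, so that a variable listed in several collections is deserialized once and the same Python object is added to each collection. A minimal sketch of that idea in plain Python, assuming a hypothetical `from_bytes` deserializer and a hypothetical `Var` class:

```python
class Var(object):
  """Hypothetical stand-in for a restored variable object."""

  def __init__(self, raw):
    self.raw = raw


def restore_collections(collection_defs, from_bytes):
  """Rebuilds collections, sharing one object per distinct serialized value."""
  objects = {}  # serialized bytes -> deserialized object
  restored_collections = {}
  for key in sorted(collection_defs):
    restored = []
    for value in collection_defs[key]:
      obj = objects.get(value)
      if obj is None:
        obj = from_bytes(value)
        objects[value] = obj
      restored.append(obj)
    restored_collections[key] = restored
  return restored_collections


cols = restore_collections(
    {'variables': [b'v0', b'v1'], 'trainable_variables': [b'v0']}, Var)
```

Without the cache, each collection would receive its own freshly built object for `b'v0'`, which is the duplication the new test `testVariableObjectsAreSharedAmongCollections` guards against.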
diff --git a/tensorflow/python/framework/meta_graph_test.py b/tensorflow/python/framework/meta_graph_test.py
index f2f1e83da15eacdbb4f194967b51559d279ae1a4..21963d0beee398da8e90c2c829b2d4607ec6cc42 100644
--- a/tensorflow/python/framework/meta_graph_test.py
+++ b/tensorflow/python/framework/meta_graph_test.py
@@ -261,6 +261,29 @@ class SimpleMetaGraphTest(test.TestCase):
       self.assertEqual(node_def.attr["attr_1"].i, 1)
       self.assertTrue(meta_graph_def.meta_info_def.stripped_default_attrs)
 
+  def testVariableObjectsAreSharedAmongCollections(self):
+    with ops.Graph().as_default() as graph1:
+      v = variables.Variable(3.0)
+      # A single instance of Variable is shared among the collections:
+      global_vars = graph1.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+      trainable_vars = graph1.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      self.assertEqual(len(global_vars), 1)
+      self.assertEqual(len(trainable_vars), 1)
+      self.assertIs(global_vars[0], trainable_vars[0])
+      self.assertIs(v, global_vars[0])
+
+    orig_meta_graph, _ = meta_graph.export_scoped_meta_graph(graph=graph1)
+    del graph1  # To avoid accidental references in code involving graph2.
+
+    with ops.Graph().as_default() as graph2:
+      meta_graph.import_scoped_meta_graph(orig_meta_graph)
+      global_vars = graph2.get_collection(ops.GraphKeys.GLOBAL_VARIABLES)
+      trainable_vars = graph2.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES)
+      self.assertEqual(len(global_vars), 1)
+      self.assertEqual(len(trainable_vars), 1)
+      # A single instance of Variable is shared among the collections:
+      self.assertIs(global_vars[0], trainable_vars[0])
+
 
 @test_util.with_c_api
 class ScopedMetaGraphTest(test.TestCase):
@@ -882,22 +905,12 @@ class ExportImportAcrossScopesTest(test.TestCase):
       with variable_scope.variable_scope("importA/keepA"):
         graph_fn(use_resource=use_resource)
 
-      if use_resource:
-        # Bringing in a collection that contains ResourceVariables adds ops
-        # to the graph, so mimic the same behavior.
-        for collection_key in sorted([
-            ops.GraphKeys.GLOBAL_VARIABLES,
-            ops.GraphKeys.TRAINABLE_VARIABLES,
-        ]):
-          for var in expected_graph.get_collection(collection_key):
-            var._read_variable_op()
-
     result = meta_graph.export_scoped_meta_graph(graph=imported_graph)[0]
     expected = meta_graph.export_scoped_meta_graph(graph=expected_graph)[0]
 
     if use_resource:
       # Clear all shared_name attributes before comparing, since they are
-      # supposed to be orthogonal to scopes.
+      # orthogonal to scopes and are not updated on export/import.
       for meta_graph_def in [result, expected]:
         for node in meta_graph_def.graph_def.node:
           shared_name_attr = "shared_name"
diff --git a/tensorflow/python/framework/ops.py b/tensorflow/python/framework/ops.py
index 398b3f67e20660dc23f8fb339774ad0e3b2eff9d..0a85b153de820baaf131b0d5d1cea6a539b44500 100644
--- a/tensorflow/python/framework/ops.py
+++ b/tensorflow/python/framework/ops.py
@@ -368,8 +368,8 @@ class Tensor(_TensorLike):
       A `TensorShape` representing the shape of this tensor.
 
     """
-    if _USE_C_API:
-      graph = self._op._graph._c_graph  # pylint: disable=protected-access
+    graph = self._op._graph._c_graph  # pylint: disable=protected-access
+    if graph:
       with errors.raise_exception_on_not_ok_status() as status:
         num_dims = c_api.TF_GraphGetTensorNumDims(graph, self._as_tf_output(),
                                                   status)
@@ -466,7 +466,7 @@ class Tensor(_TensorLike):
       ValueError: If `shape` is not compatible with the current shape of
         this tensor.
     """
-    if not _USE_C_API:
+    if not self._op._graph._c_graph:  # pylint: disable=protected-access
       self._shape_val = self._shape_val.merge_with(shape)
       return
     if not isinstance(shape, tensor_shape.TensorShape):
@@ -782,7 +782,11 @@ class _EagerTensorBase(Tensor):
 
   @property
   def shape(self):
-    return tensor_shape.TensorShape(self._shape_tuple())
+    if self._tensor_shape is None:  # pylint: disable=access-member-before-definition
+      # `_tensor_shape` is declared and defined in the definition of
+      # `EagerTensor`, in C.
+      self._tensor_shape = tensor_shape.TensorShape(self._shape_tuple())
+    return self._tensor_shape
 
   def get_shape(self):
     """Alias of Tensor.shape."""
@@ -2707,15 +2711,21 @@ class Graph(object):
     self._name_stack = ""
     # Maps a name used in the graph to the next id to use for that name.
     self._names_in_use = {}
+    self._stack_state_is_thread_local = False
+    self._thread_local = threading.local()
     # Functions that will be applied to choose a device if none is specified.
-    self._device_function_stack = []
+    # After switch_to_thread_local(), self._thread_local._device_function_stack
+    # is used instead.
+    self._graph_device_function_stack = []
     # Default original_op applied to new ops.
     self._default_original_op = None
     # Current control flow context. It could be either CondContext or
     # WhileContext defined in ops/control_flow_ops.py
     self._control_flow_context = None
     # A new node will depend of the union of all of the nodes in the stack.
-    self._control_dependencies_stack = []
+    # After switch_to_thread_local(),
+    # self._thread_local._control_dependencies_stack is used instead.
+    self._graph_control_dependencies_stack = []
     # Arbitrary collections of objects.
     self._collections = {}
     # The graph-level random seed
@@ -2737,8 +2747,9 @@ class Graph(object):
         producer=versions.GRAPH_DEF_VERSION,
         min_consumer=versions.GRAPH_DEF_VERSION_MIN_CONSUMER)
     self._building_function = False
-    # Stack of colocate_with ops
-    self._colocation_stack = []
+    # Stack of colocate_with ops. After switch_to_thread_local(),
+    # self._thread_local._colocation_stack is used instead.
+    self._graph_colocation_stack = []
     # Set of tensors that are dangerous to feed!
     self._unfeedable_tensors = set()
     # Set of operations that are dangerous to fetch!
@@ -2761,8 +2772,12 @@ class Graph(object):
 
     # TODO(skyewm): fold as much of the above as possible into the C
     # implementation
-    if _USE_C_API or self._use_c_api_hack():
+    if self._use_c_api_hack():
       self._scoped_c_graph = c_api_util.ScopedTFGraph()
+      # The C API requires all ops to have shape functions. Disable this
+      # requirement (many custom ops do not have shape functions, and we don't
+      # want to break these existing cases).
+      c_api.SetRequireShapeInferenceFns(self._c_graph, False)
     else:
       self._scoped_c_graph = None
     self._variable_creator_stack = []
@@ -2770,7 +2785,7 @@ class Graph(object):
   # TODO(apassos) remove once the C API is used by default.
   def _use_c_api_hack(self):
     """Temporary hack; can be overridden to force C API usage."""
-    return False
+    return _USE_C_API
 
   def _convert_stack(self, stack, include_func_start_lineno=False):
     """Converts a stack extracted using _extract_stack() to a traceback stack.
@@ -3030,7 +3045,7 @@ class Graph(object):
 
     """
     # pylint: enable=line-too-long
-    if _USE_C_API:
+    if self._c_graph:
       with self._lock:
         with c_api_util.tf_buffer() as buf:
           with errors.raise_exception_on_not_ok_status() as status:
@@ -3350,9 +3365,9 @@ class Graph(object):
           if (op.device and pydev.canonical_name(op.device) !=
               pydev.canonical_name(colocation_op.device)):
             logging.warning("Tried to colocate %s with an op %s that had "
-                            "a different device: %s vs %s. "
-                            "Ignoring colocation property.", op.name,
-                            colocation_op.name, op.device,
+                            "a different device: %s vs %s. Postponing "
+                            "error-checking until all devices are assigned.",
+                            op.name, colocation_op.name, op.device,
                             colocation_op.device)
           else:
             op._set_device(colocation_op.device)  # pylint: disable=protected-access
@@ -4669,6 +4684,79 @@ class Graph(object):
     else:
       return tensor_or_op not in self._unfetchable_ops
 
+  def switch_to_thread_local(self):
+    """Make device, colocation and dependencies stacks thread-local.
+
+    Device, colocation and dependencies stacks are not thread-local by default.
+    If multiple threads access them, then the state is shared.  This means that
+    one thread may affect the behavior of another thread.
+
+    After this method is called, the stacks become thread-local.  If multiple
+    threads access them, then the state is not shared.  Each thread uses its own
+    value; a thread doesn't affect other threads by mutating such a stack.
+
+    The initial value for every thread's stack is set to the current value
+    of the stack when `switch_to_thread_local()` was first called.
+    """
+    if not self._stack_state_is_thread_local:
+      self._stack_state_is_thread_local = True
+
+  @property
+  def _device_function_stack(self):
+    if self._stack_state_is_thread_local:
+      # This may be called from a thread where device_function_stack doesn't yet
+      # exist.
+      if not hasattr(self._thread_local, "_device_function_stack"):
+        self._thread_local._device_function_stack = (
+            self._graph_device_function_stack[:])
+      return self._thread_local._device_function_stack
+    else:
+      return self._graph_device_function_stack
+
+  @_device_function_stack.setter
+  def _device_function_stack(self, device_function_stack):
+    if self._stack_state_is_thread_local:
+      self._thread_local._device_function_stack = device_function_stack
+    else:
+      self._graph_device_function_stack = device_function_stack
+
+  @property
+  def _colocation_stack(self):
+    if self._stack_state_is_thread_local:
+      # This may be called from a thread where colocation_stack doesn't yet
+      # exist.
+      if not hasattr(self._thread_local, "_colocation_stack"):
+        self._thread_local._colocation_stack = self._graph_colocation_stack[:]
+      return self._thread_local._colocation_stack
+    else:
+      return self._graph_colocation_stack
+
+  @_colocation_stack.setter
+  def _colocation_stack(self, colocation_stack):
+    if self._stack_state_is_thread_local:
+      self._thread_local._colocation_stack = colocation_stack
+    else:
+      self._graph_colocation_stack = colocation_stack
+
+  @property
+  def _control_dependencies_stack(self):
+    if self._stack_state_is_thread_local:
+      # This may be called from a thread where control_dependencies_stack
+      # doesn't yet exist.
+      if not hasattr(self._thread_local, "_control_dependencies_stack"):
+        self._thread_local._control_dependencies_stack = (
+            self._graph_control_dependencies_stack[:])
+      return self._thread_local._control_dependencies_stack
+    else:
+      return self._graph_control_dependencies_stack
+
+  @_control_dependencies_stack.setter
+  def _control_dependencies_stack(self, control_dependencies):
+    if self._stack_state_is_thread_local:
+      self._thread_local._control_dependencies_stack = control_dependencies
+    else:
+      self._graph_control_dependencies_stack = control_dependencies
+
 
 # TODO(agarwal): currently device directives in an outer eager scope will not
 # apply to inner graph mode code. Fix that.
@@ -4721,7 +4809,14 @@ def container(container_name):
 @tf_export("colocate_with")
 def colocate_with(op, ignore_existing=False):
   if context.in_graph_mode():
-    return get_default_graph().colocate_with(op, ignore_existing)
+    default_graph = get_default_graph()
+    if isinstance(op, EagerTensor):
+      if default_graph.building_function:
+        op = internal_convert_to_tensor(op)
+      else:
+        raise ValueError("Encountered an Eager-defined Tensor during graph "
+                         "construction, but a function was not being built.")
+    return default_graph.colocate_with(op, ignore_existing)
   else:
     if op is not None:
       return device(op.device)
@@ -5012,38 +5107,50 @@ def init_scope():
   """
   # pylint: enable=g-doc-return-or-yield,line-too-long
 
-  in_graph_mode = context.in_graph_mode()
-  # Retrieve the active name scope: entering an `init_scope` preserves
-  # the name scope of the current context.
-  if in_graph_mode:
+  if context.in_eager_mode():
+    # Fastpath.
+    with tape.stop_recording():
+      yield
+  else:
+    # Retrieve the active name scope: entering an `init_scope` preserves
+    # the name scope of the current context.
     default_graph = get_default_graph()
     scope = default_graph.get_name_scope()
-  else:
-    scope = context.context().scope_name
-  if scope and scope[-1] != '/':
-    # Names that end with trailing slashes are treated by `name_scope` as
-    # absolute.
-    scope = scope + '/'
-
-  outer_context = None
-  if in_graph_mode and not _default_graph_stack.stack:
-    outer_context = default_graph.as_default
-  else:
-    for stack_entry in reversed(context.context_stack.stack):
-      if not stack_entry.is_building_function:
-        outer_context = stack_entry.enter_context_fn
-        break
+    if scope and scope[-1] != '/':
+      # Names that end with trailing slashes are treated by `name_scope` as
+      # absolute.
+      scope = scope + '/'
+
+    outer_context = None
+    if not _default_graph_stack.stack:
+      # If the default graph stack is empty, then we cannot be building a
+      # function. Install the global graph (which, in this case, is also the
+      # default graph) as the outer context.
+      if default_graph.building_function:
+        raise RuntimeError("The global graph is building a function.")
+      outer_context = default_graph.as_default
+    else:
+      # Find a context that is not building a function.
+      for stack_entry in reversed(context.context_stack.stack):
+        if not stack_entry.is_building_function:
+          outer_context = stack_entry.enter_context_fn
+          break
+
+      if outer_context is None:
+        # As a last resort, obtain the global default graph; this graph doesn't
+        # necessarily live on the graph stack (and hence it doesn't necessarily
+        # live on the context stack), but it is stored in the graph stack's
+        # encapsulating object.
+        outer_context = _default_graph_stack._GetGlobalDefaultGraph().as_default  # pylint: disable=protected-access
 
-  if outer_context is None:
-    raise AssertionError("All graphs are building functions, and no "
+    if outer_context is None:
+      # Sanity check; this shouldn't be triggered.
+      raise RuntimeError("All graphs are building functions, and no "
                          "eager context was previously active.")
 
-  try:
     with outer_context(), name_scope(scope), control_dependencies(
         None), tape.stop_recording():
       yield
-  finally:
-    pass
 
 
 def enable_eager_execution(config=None, device_policy=None):
diff --git a/tensorflow/python/framework/ops_test.py b/tensorflow/python/framework/ops_test.py
index c6deafd89eb1bdc4892a65ba3ab8c7900915390f..55576f0e885ac87d8dc3665db2205e6754ee9960 100644
--- a/tensorflow/python/framework/ops_test.py
+++ b/tensorflow/python/framework/ops_test.py
@@ -19,6 +19,7 @@ from __future__ import division
 from __future__ import print_function
 
 import gc
+import threading
 import weakref
 
 from tensorflow.core.framework import attr_value_pb2
@@ -1381,6 +1382,180 @@ class DeviceTest(test_util.TensorFlowTestCase):
     """, gd)
 
 
+@test_util.with_c_api
+class MultithreadedGraphStateTest(test_util.TensorFlowTestCase):
+
+  class TestThread(threading.Thread):
+
+    def __init__(self, graph, replica_id):
+      super(MultithreadedGraphStateTest.TestThread, self).__init__()
+      self._graph = graph
+      self._replica_id = replica_id
+      # This thread sets this event when it has mutated the graph.  The caller
+      # can wait for that.
+      self.has_mutated_graph = threading.Event()
+      # This thread waits for when it should continue.  The caller can set this
+      # event.
+      self.should_continue = threading.Event()
+
+    def run(self):
+      # Mutate a graph's stack, then set `has_mutated_graph`, then wait for
+      # `should_continue`, then add an op to the graph affected by the graph's
+      # stack.
+      raise NotImplementedError("must be implemented in descendants")
+
+  def testDeviceFunctionStack(self):
+
+    class DeviceSettingThread(self.TestThread):
+
+      def run(self):
+        with g.device("/job:worker/replica:{}".format(self._replica_id)):
+          self.has_mutated_graph.set()
+          self.should_continue.wait()
+          self.should_continue.clear()
+          g.create_op(
+              "FloatOutput", [], [dtypes.float32],
+              name="FloatOutput_{}".format(self._replica_id))
+
+    g = ops.Graph()
+    # If `switch_to_thread_local` isn't called, then device placement of the ops
+    # below is not deterministic.
+    g.switch_to_thread_local()
+    threads = [DeviceSettingThread(g, i) for i in range(3)]
+    for t in threads:
+      t.start()
+      t.has_mutated_graph.wait()
+      t.has_mutated_graph.clear()
+    for t in threads:
+      t.should_continue.set()
+      t.join()
+
+    gd = g.as_graph_def()
+    self.assertProtoEqualsVersion("""
+      node { name: "FloatOutput_0" op: "FloatOutput"
+             device: "/job:worker/replica:0" }
+      node { name: "FloatOutput_1" op: "FloatOutput"
+             device: "/job:worker/replica:1" }
+      node { name: "FloatOutput_2" op: "FloatOutput"
+             device: "/job:worker/replica:2" }
+    """, gd)
+
+  def testColocateWith(self):
+
+    class ColocatingThread(self.TestThread):
+
+      def __init__(self, graph, replica_id, op_to_colocate_with):
+        super(ColocatingThread, self).__init__(graph, replica_id)
+        self._op_to_colocate_with = op_to_colocate_with
+
+      def run(self):
+        with g.colocate_with(self._op_to_colocate_with):
+          self.has_mutated_graph.set()
+          self.should_continue.wait()
+          self.should_continue.clear()
+          g.create_op(
+              "FloatOutput", [], [dtypes.float32],
+              name="FloatOutput_{}".format(self._replica_id))
+
+    g = ops.Graph()
+    ops_to_colocate_with = []
+    for i in range(3):
+      with g.device("/job:worker/replica:{}".format(i)):
+        ops_to_colocate_with.append(
+            g.create_op(
+                "FloatOutput", [], [dtypes.float32],
+                name="ColocateWithMe_{}".format(i)))
+
+    # If `switch_to_thread_local` isn't called, then `device` and `attr` values
+    # for the ops below are not deterministic.
+    g.switch_to_thread_local()
+    threads = [
+        ColocatingThread(g, i, ops_to_colocate_with[i]) for i in range(3)
+    ]
+    for t in threads:
+      t.start()
+      t.has_mutated_graph.wait()
+      t.has_mutated_graph.clear()
+    for t in threads:
+      t.should_continue.set()
+      t.join()
+
+    gd = g.as_graph_def()
+    self.assertProtoEqualsVersion("""
+      node { name: "ColocateWithMe_0" op: "FloatOutput"
+             device: "/job:worker/replica:0" }
+      node { name: "ColocateWithMe_1" op: "FloatOutput"
+             device: "/job:worker/replica:1" }
+      node { name: "ColocateWithMe_2" op: "FloatOutput"
+             device: "/job:worker/replica:2" }
+      node { name: "FloatOutput_0" op: "FloatOutput"
+             device: "/job:worker/replica:0"
+             attr { key: "_class"
+               value { list {
+                 s: "loc:@ColocateWithMe_0"}}}}
+      node { name: "FloatOutput_1" op: "FloatOutput"
+             device: "/job:worker/replica:1"
+             attr { key: "_class"
+               value { list {
+                 s: "loc:@ColocateWithMe_1"}}}}
+      node { name: "FloatOutput_2" op: "FloatOutput"
+             device: "/job:worker/replica:2"
+             attr { key: "_class"
+               value { list {
+                 s: "loc:@ColocateWithMe_2"}}}}
+    """, gd)
+
+  def testControlDependencies(self):
+
+    class DependingThread(self.TestThread):
+
+      def __init__(self, graph, replica_id, dependency_op):
+        super(DependingThread, self).__init__(graph, replica_id)
+        self._dependency_op = dependency_op
+
+      def run(self):
+        with g.control_dependencies([self._dependency_op]):
+          self.has_mutated_graph.set()
+          self.should_continue.wait()
+          self.should_continue.clear()
+          g.create_op(
+              "FloatOutput", [], [dtypes.float32],
+              name="FloatOutput_{}".format(self._replica_id))
+
+    g = ops.Graph()
+    dependency_ops = []
+    for i in range(3):
+      dependency_ops.append(
+          g.create_op(
+              "FloatOutput", [], [dtypes.float32],
+              name="ColocateWithMe_{}".format(i)))
+
+    # If `switch_to_thread_local` isn't called, then `input` values for the ops
+    # below are not deterministic.
+    g.switch_to_thread_local()
+    threads = [DependingThread(g, i, dependency_ops[i]) for i in range(3)]
+    for t in threads:
+      t.start()
+      t.has_mutated_graph.wait()
+      t.has_mutated_graph.clear()
+    for t in threads:
+      t.should_continue.set()
+      t.join()
+
+    gd = g.as_graph_def()
+    self.assertProtoEqualsVersion("""
+      node { name: "ColocateWithMe_0" op: "FloatOutput" }
+      node { name: "ColocateWithMe_1" op: "FloatOutput" }
+      node { name: "ColocateWithMe_2" op: "FloatOutput" }
+      node { name: "FloatOutput_0" op: "FloatOutput"
+             input: "^ColocateWithMe_0" }
+      node { name: "FloatOutput_1" op: "FloatOutput"
+             input: "^ColocateWithMe_1" }
+      node { name: "FloatOutput_2" op: "FloatOutput"
+             input: "^ColocateWithMe_2" }
+    """, gd)
+
+
 @test_util.with_c_api
 class ObjectWithName(object):
 
@@ -1981,14 +2156,6 @@ class InitScopeTest(test_util.TensorFlowTestCase):
           self.assertIs(g, ops.get_default_graph())
           self.assertTrue(context.in_graph_mode())
 
-  def testAllGraphsBuildingFunctionsRaisesError(self):
-    g = ops.Graph()
-    g._building_function = True  # pylint: disable=protected-access
-    with g.as_default():
-      with self.assertRaises(AssertionError):
-        with ops.init_scope():
-          pass
-
   def testStaysInEagerWhenOnlyEagerContextActive(self):
     with context.eager_mode():
       with ops.init_scope():
@@ -2066,6 +2233,29 @@ class InitScopeTest(test_util.TensorFlowTestCase):
       self.assertEqual(4, int(compiled_outer(inner=compiled_inner)))
       self.assertEqual(7, int(compiled_outer(inner=compiled_inner)))
 
+  def testFallsBackToGlobalGraphWhenAllGraphsAreBuildingFunctions(self):
+    with context.graph_mode():
+      ops.reset_default_graph()
+      # This doesn't push anything onto the graph stack, but it does
+      # set the stack's global graph.
+      global_graph = ops.get_default_graph()
+      fn_graph = ops.Graph()
+
+      # pylint: disable=protected-access
+      fn_graph._building_function = True
+      self.assertEqual(len(ops._default_graph_stack.stack), 0)
+      with fn_graph.as_default():
+        self.assertEqual(len(ops._default_graph_stack.stack), 1)
+        with ops.init_scope():
+          self.assertGreater(len(ops._default_graph_stack.stack), 1)
+          dummy = constant_op.constant(1.0)
+        self.assertEqual(len(ops._default_graph_stack.stack), 1)
+      # Note that the global graph is _not_ on the graph stack.
+      self.assertEqual(len(ops._default_graph_stack.stack), 0)
+      # Ensure that `dummy` was added to the global graph.
+      self.assertEqual(global_graph, dummy.graph)
+      # pylint: enable=protected-access
+
   def testInstallsDefaultGraphWhenGraphStackIsEmptyInGraphMode(self):
     with context.graph_mode():
       # pylint: disable=protected-access
@@ -2702,7 +2892,7 @@ class OutputTypesTest(test_util.TensorFlowTestCase):
     with g.as_default():
       x = constant_op.constant([1, 1, 2, 4, 4, 4, 7, 8, 8],
                                dtype=dtypes.double)
-      y, _ = gen_array_ops._unique(x)
+      y, _ = gen_array_ops.unique(x)
       self.assertEqual([types_pb2.DT_DOUBLE, types_pb2.DT_INT32],
                        y.op._output_types)  # pylint: disable=protected-access
 
diff --git a/tensorflow/python/framework/python_op_gen.cc b/tensorflow/python/framework/python_op_gen.cc
index c95149d177990e364c3d6b9daeae5dc535cf0070..64d214a07fdf8997d57363c47b2636619756e3bf 100644
--- a/tensorflow/python/framework/python_op_gen.cc
+++ b/tensorflow/python/framework/python_op_gen.cc
@@ -75,6 +75,36 @@ bool IsPythonReserved(const string& s) {
   return kPythonReserved->count(s) > 0;
 }
 
+bool IsOpWithUnderscorePrefix(const string& s) {
+  static const std::set<string>* const kUnderscoreOps = new std::set<string>(
+      {// Lowercase built-in functions and types in Python, from:
+       // [x for x in dir(__builtins__) if x[0].islower()]
+       // These need to be excluded so they don't conflict with actual built-in
+       // functions since we use '*' imports.
+       "abs", "all", "any", "apply", "bin", "bool", "buffer", "bytearray",
+       "bytes", "callable", "chr", "classmethod", "cmp", "coerce", "compile",
+       "complex", "copyright", "credits", "delattr", "dict", "dir", "divmod",
+       "enumerate", "eval", "execfile", "exit", "file", "filter", "float",
+       "format", "frozenset", "getattr", "globals", "hasattr", "hash", "help",
+       "hex", "id", "input", "int", "intern", "isinstance", "issubclass",
+       "iter", "len", "license", "list", "locals", "long", "map", "max",
+       "memoryview", "min", "next", "object", "oct", "open", "ord", "pow",
+       "print", "property", "quit", "range", "raw_input", "reduce", "reload",
+       "repr", "reversed", "round", "set", "setattr", "slice", "sorted",
+       "staticmethod", "str", "sum", "super", "tuple", "type", "unichr",
+       "unicode", "vars", "xrange", "zip",
+       // These have the same name as ops defined in Python and might be used
+       // incorrectly depending on order of '*' imports.
+       // TODO(annarev): reduce usage of '*' imports and remove these from the
+       // list.
+       "fused_batch_norm", "histogram_fixed_width", "stack",
+       "batch_norm_with_global_normalization",
+       // TODO(annarev): replace these ops in the next change.
+       "broadcast_gradient_args", "concat", "enter", "histogram_summary",
+       "ref_enter", "ref_identity", "scalar_summary"});
+  return kUnderscoreOps->count(s) > 0;
+}
+
 string AvoidPythonReserved(const string& s) {
   if (IsPythonReserved(s)) return strings::StrCat(s, "_");
   return s;
@@ -816,6 +846,7 @@ from tensorflow.python.util.tf_export import tf_export
     // An op is hidden if either its ApiDef visibility is HIDDEN
     // or it is in the hidden_ops list.
     bool is_hidden = api_def->visibility() == ApiDef::HIDDEN;
+    bool hidden_by_api_def = is_hidden;
     if (!is_hidden) {
       for (const string& hidden : hidden_ops) {
         if (op_def.name() == hidden) {
@@ -828,13 +859,22 @@ from tensorflow.python.util.tf_export import tf_export
     string function_name;
     python_op_gen_internal::GenerateLowerCaseOpName(op_def.name(),
                                                     &function_name);
-    if (is_hidden) function_name = strings::StrCat("_", function_name);
-
-    // When users create custom python wrappers, they may link in the
-    // default op registry by accident, and because they can't
-    // enumerate all 'hidden' symbols, this guard is to prevent
-    // instantiating a python reserved word in their wrapper.
-    if (python_op_gen_internal::IsPythonReserved(function_name)) {
+    bool is_reserved = python_op_gen_internal::IsPythonReserved(function_name);
+
+    // Prefix an op with underscore if it is listed in hidden_ops, its name is
+    // reserved, or it is one of the exceptions in IsOpWithUnderscorePrefix.
+    // Do not add underscores to ops set to HIDDEN in ApiDef otherwise.
+    // TODO(annarev): don't prefix with underscores even if op is in hidden_ops.
+    if (is_hidden) {
+      if (!hidden_by_api_def || is_reserved ||
+          python_op_gen_internal::IsOpWithUnderscorePrefix(function_name)) {
+        function_name = strings::StrCat("_", function_name);
+      }
+    } else if (is_reserved) {
+      // When users create custom python wrappers, they may link in the
+      // default op registry by accident, and because they can't
+      // enumerate all 'hidden' symbols, this guard is to prevent
+      // instantiating a python reserved word in their wrapper.
       continue;
     }
 
diff --git a/tensorflow/python/framework/python_op_gen_internal.h b/tensorflow/python/framework/python_op_gen_internal.h
index 4319e5a7820b33283df8153fdc76e0e567813a17..e0cfb05f4bdf8afd09957c62a9ba3af1fd0882a6 100644
--- a/tensorflow/python/framework/python_op_gen_internal.h
+++ b/tensorflow/python/framework/python_op_gen_internal.h
@@ -29,6 +29,9 @@ namespace python_op_gen_internal {
 // Returns true if s is a Python keyword or built-in.
 bool IsPythonReserved(const string& s);
 
+// Whether the op should be prefixed with underscore.
+bool IsOpWithUnderscorePrefix(const string& s);
+
 // Add a _ to the end of s if necessary to avoid a Python keyword or built-in.
 string AvoidPythonReserved(const string& s);
 
diff --git a/tensorflow/python/framework/smart_cond.py b/tensorflow/python/framework/smart_cond.py
new file mode 100644
index 0000000000000000000000000000000000000000..f97bb01f54bbe2a75072e2bc959ae85b86f79dd0
--- /dev/null
+++ b/tensorflow/python/framework/smart_cond.py
@@ -0,0 +1,79 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""smart_cond and related utilities."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.ops import control_flow_ops
+
+
+def smart_cond(pred, true_fn=None, false_fn=None, name=None):
+  """Return `true_fn()` if predicate `pred` is true, else `false_fn()`.
+
+  If `pred` is a bool or has a constant value, we return either `true_fn()`
+  or `false_fn()`, otherwise we use `tf.cond` to dynamically route to both.
+
+  Arguments:
+    pred: A scalar determining whether to return the result of `true_fn` or
+      `false_fn`.
+    true_fn: The callable to be performed if pred is true.
+    false_fn: The callable to be performed if pred is false.
+    name: Optional name prefix when using `tf.cond`.
+
+  Returns:
+    Tensors returned by the call to either `true_fn` or `false_fn`.
+
+  Raises:
+    TypeError: If `true_fn` or `false_fn` is not callable.
+  """
+  if not callable(true_fn):
+    raise TypeError("`true_fn` must be callable.")
+  if not callable(false_fn):
+    raise TypeError("`false_fn` must be callable.")
+
+  pred_value = smart_constant_value(pred)
+  if pred_value is not None:
+    if pred_value:
+      return true_fn()
+    else:
+      return false_fn()
+  else:
+    return control_flow_ops.cond(pred, true_fn=true_fn, false_fn=false_fn,
+                                 name=name)
+
+
+def smart_constant_value(pred):
+  """Return the bool value for `pred`, or None if `pred` had a dynamic value.
+
+  Arguments:
+    pred: A scalar, either a Python bool or tensor.
+
+  Returns:
+    True or False if `pred` has a constant boolean value, None otherwise.
+
+  Raises:
+    TypeError: If `pred` is not a Tensor or bool.
+  """
+  if isinstance(pred, bool):
+    pred_value = pred
+  elif isinstance(pred, ops.Tensor):
+    pred_value = tensor_util.constant_value(pred)
+  else:
+    raise TypeError("`pred` must be a Tensor or a Python bool.")
+  return pred_value
diff --git a/tensorflow/python/framework/smart_cond_test.py b/tensorflow/python/framework/smart_cond_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..b682506da057af9a645f7f71301564268ed3b20d
--- /dev/null
+++ b/tensorflow/python/framework/smart_cond_test.py
@@ -0,0 +1,66 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.client import session
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import smart_cond
+from tensorflow.python.framework import test_util
+from tensorflow.python.ops import math_ops
+from tensorflow.python.platform import googletest
+
+
+@test_util.with_c_api
+class SmartCondTest(test_util.TensorFlowTestCase):
+
+  def testSmartCondTrue(self):
+    with ops.Graph().as_default():
+      with session.Session():
+        x = constant_op.constant(2)
+        y = constant_op.constant(5)
+        z = smart_cond.smart_cond(True, lambda: math_ops.multiply(x, 16),
+                                  lambda: math_ops.multiply(y, 5))
+        self.assertEqual(z.eval(), 32)
+
+  def testSmartCondFalse(self):
+    with ops.Graph().as_default():
+      with session.Session():
+        x = constant_op.constant(4)
+        y = constant_op.constant(3)
+        z = smart_cond.smart_cond(False, lambda: math_ops.multiply(x, 16),
+                                  lambda: math_ops.multiply(y, 3))
+        self.assertEqual(z.eval(), 9)
+
+  def testSmartCondMissingArg1(self):
+    with ops.Graph().as_default():
+      with session.Session():
+        x = constant_op.constant(1)
+        with self.assertRaises(TypeError):
+          smart_cond.smart_cond(True, false_fn=lambda: x)
+
+  def testSmartCondMissingArg2(self):
+    with ops.Graph().as_default():
+      with session.Session():
+        x = constant_op.constant(1)
+        with self.assertRaises(TypeError):
+          smart_cond.smart_cond(True, lambda: x)
+
+
+if __name__ == "__main__":
+  googletest.main()
diff --git a/tensorflow/python/framework/tensor_util.py b/tensorflow/python/framework/tensor_util.py
index cbba112841c0fbe8220ec6ef610049b31409d329..135562e831081c6687c892caf937e5d477853073 100644
--- a/tensorflow/python/framework/tensor_util.py
+++ b/tensorflow/python/framework/tensor_util.py
@@ -557,17 +557,18 @@ def MakeNdarray(tensor):
   dtype = tensor_dtype.as_numpy_dtype
 
   if tensor.tensor_content:
-    return np.frombuffer(tensor.tensor_content, dtype=dtype).reshape(shape)
-  elif tensor_dtype == dtypes.float16:
+    return (np.frombuffer(tensor.tensor_content, dtype=dtype).copy()
+            .reshape(shape))
+  elif tensor_dtype == dtypes.float16 or tensor_dtype == dtypes.bfloat16:
     # the half_val field of the TensorProto stores the binary representation
     # of the fp16: we need to reinterpret this as a proper float16
     if len(tensor.half_val) == 1:
       tmp = np.array(tensor.half_val[0], dtype=np.uint16)
-      tmp.dtype = np.float16
+      tmp.dtype = tensor_dtype.as_numpy_dtype
       return np.repeat(tmp, num_elements).reshape(shape)
     else:
       tmp = np.fromiter(tensor.half_val, dtype=np.uint16)
-      tmp.dtype = np.float16
+      tmp.dtype = tensor_dtype.as_numpy_dtype
       return tmp.reshape(shape)
   elif tensor_dtype == dtypes.float32:
     if len(tensor.float_val) == 1:
@@ -585,8 +586,7 @@ def MakeNdarray(tensor):
       return np.fromiter(tensor.double_val, dtype=dtype).reshape(shape)
   elif tensor_dtype in [
       dtypes.int32, dtypes.uint8, dtypes.uint16, dtypes.int16, dtypes.int8,
-      dtypes.qint32, dtypes.quint8, dtypes.qint8, dtypes.qint16, dtypes.quint16,
-      dtypes.bfloat16
+      dtypes.qint32, dtypes.quint8, dtypes.qint8, dtypes.qint16, dtypes.quint16
   ]:
     if len(tensor.int_val) == 1:
       return np.repeat(np.array(tensor.int_val[0], dtype=dtype),
diff --git a/tensorflow/python/framework/tensor_util_test.py b/tensorflow/python/framework/tensor_util_test.py
index f2de69e159646b4a085645fa1bfef7782e78cd59..35fff80c61b98e7603d3b7b5df3cabdb59059a72 100644
--- a/tensorflow/python/framework/tensor_util_test.py
+++ b/tensorflow/python/framework/tensor_util_test.py
@@ -199,6 +199,25 @@ class TensorUtilTest(test.TestCase):
               dtype=nptype),
           a)
 
+  def testFloatMutateArray(self):
+    t = tensor_util.make_tensor_proto([10.0, 20.0, 30.0], dtype=dtypes.float32)
+    a = tensor_util.MakeNdarray(t)
+    a[0] = 5.0
+    self.assertEquals(np.float32, a.dtype)
+    self.assertAllClose(np.array([5.0, 20.0, 30.0], dtype=np.float32), a)
+    if sys.byteorder == "big":
+      self.assertProtoEquals("""
+        dtype: DT_FLOAT
+        tensor_shape { dim { size: 3 } }
+        tensor_content: "A \000\000A\240\000\000A\360\000\000"
+        """, t)
+    else:
+      self.assertProtoEquals("""
+        dtype: DT_FLOAT
+        tensor_shape { dim { size: 3 } }
+        tensor_content: "\000\000 A\000\000\240A\000\000\360A"
+        """, t)
+
   def testHalf(self):
     t = tensor_util.make_tensor_proto(np.array([10.0, 20.0], dtype=np.float16))
     self.assertProtoEquals("""
@@ -216,6 +235,26 @@ class TensorUtilTest(test.TestCase):
     self.assertEquals(np.float16, a.dtype)
     self.assertAllClose(np.array([10.0, 20.0], dtype=np.float16), a)
 
+  def testBfloat16(self):
+    test_type = dtypes.bfloat16.as_numpy_dtype
+    t = tensor_util.make_tensor_proto(np.array([10.0, 20.0], dtype=test_type))
+    # 10.0: 16672 = 010000010(130) 0100000: (1+0/2+1/4) * 2^(130-127)
+    # 20.0: 16800 = 010000011(131) 0100000: (1+0/2+1/4) * 2^(131-127)
+    self.assertProtoEquals("""
+      dtype: DT_BFLOAT16
+      tensor_shape {
+        dim {
+          size: 2
+        }
+      }
+      half_val: 16672
+      half_val: 16800
+      """, t)
+
+    a = tensor_util.MakeNdarray(t)
+    self.assertEquals(test_type, a.dtype)
+    self.assertAllClose(np.array([10.0, 20.0], dtype=test_type), a)
+
   def testInt(self):
     t = tensor_util.make_tensor_proto(10)
     self.assertProtoEquals("""
@@ -749,7 +788,7 @@ class ConstantValueTest(test.TestCase):
     self.assertAllClose(np_val, tensor_util.constant_value(tf_val))
 
   def testUnknown(self):
-    tf_val = gen_state_ops._variable(
+    tf_val = gen_state_ops.variable(
         shape=[3, 4, 7],
         dtype=dtypes.float32,
         name="tf_val",
diff --git a/tensorflow/python/framework/test_util.py b/tensorflow/python/framework/test_util.py
index 310bd75d4ee6624617fa9e45d7f2c97f03f982e6..7389730d91cf9fd35c861ad85040c79108e5eb77 100644
--- a/tensorflow/python/framework/test_util.py
+++ b/tensorflow/python/framework/test_util.py
@@ -419,6 +419,11 @@ def with_c_api(cls):
   Returns:
     cls with new test methods added
   """
+  # If the C API is already enabled, don't do anything. Some tests break if the
+  # same test is run twice, so this allows us to turn on the C API by default
+  # without breaking these tests.
+  if ops._USE_C_API: return cls
+
   for name, value in cls.__dict__.copy().items():
     if callable(value) and name.startswith("test"):
       setattr(cls, name + "WithCApi", enable_c_api(value))
@@ -501,6 +506,30 @@ def assert_no_garbage_created(f):
     previous_garbage = len(gc.garbage)
     f(self, **kwargs)
     gc.collect()
+    if len(gc.garbage) > previous_garbage:
+      logging.error(
+          "The decorated test created work for Python's garbage collector, "
+          "likely due to a reference cycle. New objects in cycle(s):")
+      for i, obj in enumerate(gc.garbage[previous_garbage:]):
+        try:
+          logging.error(
+              "Object %d of %d" % (i, len(gc.garbage) - previous_garbage))
+          def _safe_object_str(obj):
+            return "<%s %d>" % (obj.__class__.__name__, id(obj))
+          logging.error("  Object type: %s" % (_safe_object_str(obj),))
+          logging.error("  Referrer types: %s" % (
+              ', '.join([_safe_object_str(ref)
+                         for ref in gc.get_referrers(obj)]),))
+          logging.error("  Referent types: %s" % (
+              ', '.join([_safe_object_str(ref)
+                         for ref in gc.get_referents(obj)]),))
+          logging.error("  Object attribute names: %s" % (dir(obj),))
+          logging.error("  Object __str__:")
+          logging.error(obj)
+          logging.error("  Object __repr__:")
+          logging.error(repr(obj))
+        except Exception:
+          logging.error("(Exception while printing object)")
     # This will fail if any garbage has been created, typically because of a
     # reference cycle.
     self.assertEqual(previous_garbage, len(gc.garbage))
@@ -559,6 +588,7 @@ def run_in_graph_and_eager_modes(__unused__=None,
         # This decorator runs the wrapped test twice.
         # Reset the test environment between runs.
         self.tearDown()
+        self._tempdir = None
         self.setUp()
 
       def run_eager_mode(self, **kwargs):
@@ -1101,7 +1131,12 @@ class TensorFlowTestCase(googletest.TestCase):
       np.testing.assert_allclose(
           a, b, rtol=rtol, atol=atol, err_msg=msg, equal_nan=True)
 
-  def _assertAllCloseRecursive(self, a, b, rtol=1e-6, atol=1e-6, path=None,
+  def _assertAllCloseRecursive(self,
+                               a,
+                               b,
+                               rtol=1e-6,
+                               atol=1e-6,
+                               path=None,
                                msg=None):
     path = path or []
     path_str = (("[" + "][".join([str(p) for p in path]) + "]") if path else "")
@@ -1248,7 +1283,7 @@ class TensorFlowTestCase(googletest.TestCase):
     a = self._GetNdArray(a)
     b = self._GetNdArray(b)
     self.assertEqual(a.shape, b.shape, "Shape mismatch: expected %s, got %s."
-                                       " %s" % (a.shape, b.shape, msg))
+                     " %s" % (a.shape, b.shape, msg))
     same = (a == b)
 
     if a.dtype == np.float32 or a.dtype == np.float64:
@@ -1330,8 +1365,8 @@ class TensorFlowTestCase(googletest.TestCase):
       raise TypeError("np_array must be a Numpy ndarray or Numpy scalar")
     if not isinstance(tf_tensor, ops.Tensor):
       raise TypeError("tf_tensor must be a Tensor")
-    self.assertAllEqual(np_array.shape, tf_tensor.get_shape().as_list(),
-                        msg=msg)
+    self.assertAllEqual(
+        np_array.shape, tf_tensor.get_shape().as_list(), msg=msg)
 
   def assertDeviceEqual(self, device1, device2, msg=None):
     """Asserts that the two given devices are the same.
diff --git a/tensorflow/python/grappler/cluster.i b/tensorflow/python/grappler/cluster.i
index 8079cb307bb1f5904b71bae891d5ef5f1e749e66..067c8213d4741936e4c28aaedf4f30639b8cdc41 100644
--- a/tensorflow/python/grappler/cluster.i
+++ b/tensorflow/python/grappler/cluster.i
@@ -206,7 +206,7 @@ static PyObject* TF_ListDevices(GCluster cluster) {
   return result;
 }
 
-static std::vector<string> TF_ListAvailableOps() {
+static PyObject* TF_ListAvailableOps() {
   tensorflow::OpRegistry* registry = tensorflow::OpRegistry::Global();
   std::vector<tensorflow::OpDef> ops;
   registry->GetRegisteredOps(&ops);
@@ -215,7 +215,14 @@ static std::vector<string> TF_ListAvailableOps() {
     op_names.push_back(op.name());
   }
   std::sort(op_names.begin(), op_names.end());
-  return op_names;
+
+  PyGILState_STATE gstate = PyGILState_Ensure();
+  PyObject* result = PyList_New(op_names.size());
+  for (int i = 0; i < op_names.size(); ++i) {
+    PyList_SetItem(result, i, PyString_FromString(op_names[i].c_str()));
+  }
+  PyGILState_Release(gstate);
+  return result;
 }
 
 static PyObject* TF_GetSupportedDevices(GCluster cluster, GItem item) {
@@ -432,7 +439,7 @@ static GCluster TF_NewVirtualCluster(
     TF_Status* out_status);
 static void TF_ShutdownCluster(GCluster cluster);
 static PyObject* TF_ListDevices(GCluster cluster);
-static std::vector<string> TF_ListAvailableOps();
+static PyObject* TF_ListAvailableOps();
 static PyObject* TF_GetSupportedDevices(GCluster cluster, GItem item);
 static float TF_EstimatePerformance(const tensorflow::NamedDevice& device);
 static PyObject* TF_MeasureCosts(
diff --git a/tensorflow/python/grappler/cluster_test.py b/tensorflow/python/grappler/cluster_test.py
index caae5b114e1f896fc758613380a6f702853d98a5..a3c4c2bbeba7c4ee5d00268c0e475e11a31fa7eb 100644
--- a/tensorflow/python/grappler/cluster_test.py
+++ b/tensorflow/python/grappler/cluster_test.py
@@ -131,8 +131,8 @@ class ClusterTest(test.TestCase):
   def testAvailableOps(self):
     with cluster.Provision() as gcluster:
       op_names = gcluster.ListAvailableOps()
-      self.assertTrue(b'Add' in op_names)
-      self.assertTrue(b'MatMul' in op_names)
+      self.assertTrue('Add' in op_names)
+      self.assertTrue('MatMul' in op_names)
       self.assertEqual(op_names, sorted(op_names))
 
   def testSupportDevices(self):
diff --git a/tensorflow/python/grappler/controller.py b/tensorflow/python/grappler/controller.py
new file mode 100644
index 0000000000000000000000000000000000000000..5677f4f52310dd68dc80c87275b50be95ba86b60
--- /dev/null
+++ b/tensorflow/python/grappler/controller.py
@@ -0,0 +1,142 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Controller Class."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from collections import defaultdict
+
+
+class Controller(object):
+  """Controller class."""
+
+  def __init__(self, item, cluster):
+    """Controller class initializer.
+
+    Args:
+      item: The grappler item wrapping the metagraph to place.
+      cluster: A cluster of devices on which to place the item.
+    """
+    self.item = item
+
+    self._node = {}
+    for node in item.metagraph.graph_def.node:
+      self._node[node.name] = node
+
+    self._fanout = defaultdict(list)
+    for node in item.metagraph.graph_def.node:
+      for fanin in self._get_node_fanin(node):
+        self._fanout[fanin.name].append(node)
+
+    important_op_names = item.IdentifyImportantOps(sort_topologically=True)
+
+    # List of important ops (these are the ops to place) sorted in topological
+    # order. The order of this collection is deterministic.
+    self.important_ops = []
+    for name in important_op_names:
+      self.important_ops.append(self._node[name])
+
+    self.node_properties = item.GetOpProperties()
+
+    self.cluster = cluster
+    self.devices = cluster.ListDevices()
+
+    self.colocation_constraints = item.GetColocationGroups()
+
+    self.placement_constraints = cluster.GetSupportedDevices(item)
+    for node_name, dev in self.placement_constraints.items():
+      if len(dev) == 1:
+        # Place the node on the supported device
+        node = self._node[node_name]
+        node.device = dev[0]
+        fanout = self.get_node_fanout(node)
+        # Update the fanout of the fanin to bypass the node
+        for fanin in self._get_node_fanin(node):
+          fanout_of_fanin = self.get_node_fanout(fanin)
+          fanout_of_fanin += fanout
+          fanout_of_fanin.remove(node)
+        # Remove node from the list of important ops since we don't need to
+        # place the node.
+        if node in self.important_ops:
+          self.important_ops.remove(node)
+          important_op_names.remove(node.name)
+
+    # List of important op names, in non-deterministic order.
+    self.important_op_names = frozenset(important_op_names)
+
+  @property
+  def input_graph_def(self):
+    return self.item.metagraph.graph_def
+
+  @property
+  def num_devices(self):
+    return len(self.devices)
+
+  def get_node_by_name(self, node_name):
+    return self._node[node_name]
+
+  def get_node_fanout(self, node):
+    return self._fanout[node.name]
+
+  def get_placements(self, *args, **kwargs):
+    """Returns two TF ops.
+
+    Args:
+      *args: "".
+      **kwargs: "".
+
+    Returns:
+      y_preds: tensor of size [batch_size, num_ops]
+      log_probs: python dict of at least two fields: "sample", "target" each
+      containing a tensor of size [batch_size], corresponding to the log_probs.
+    """
+    raise NotImplementedError
+
+  def eval_placement(self, sess, *args, **kwargs):
+    """At this time, this method evaluates ONLY ONE placement.
+
+    Args:
+      sess: a tf.Session() object used to retrieve cached assignment info.
+      *args: "".
+      **kwargs: "".
+
+    Returns:
+      run_time: scalar
+    """
+    raise NotImplementedError
+
+  def export_placement(self, metagraph):
+    """Annotate the placement onto the specified metagraph.
+
+    Args:
+      metagraph: the metagraph to annotate with the placement.
+    """
+    for node in metagraph.graph_def.node:
+      if node.name in self.important_op_names:
+        node.device = self.get_node_by_name(node.name).device
+
+  # Get the nodes in the immediate fanin of node.
+  # Beware: this doesn't take into account the nodes that may be skipped
+  # since placement constraints force their placement.
+  def _get_node_fanin(self, node):
+    input_ops = []
+    for fanin_name in node.input:
+      if fanin_name[0] == "^":
+        fanin_name = fanin_name[1:]
+      fanin_name = fanin_name.split(":")[0]
+      input_ops.append(self.get_node_by_name(fanin_name))
+    return input_ops
diff --git a/tensorflow/python/grappler/cost_analyzer.cc b/tensorflow/python/grappler/cost_analyzer.cc
index 88bf900dca6d97773959eb309a4a3c5931fdcb88..b474e19894957d01c7c8978282c547df81a9b2b3 100644
--- a/tensorflow/python/grappler/cost_analyzer.cc
+++ b/tensorflow/python/grappler/cost_analyzer.cc
@@ -30,11 +30,12 @@ CostAnalyzer::CostAnalyzer(const GrapplerItem& item, Cluster* cluster,
       analytical_estimator_(cluster, false),
       suffix_(suffix) {}
 
-Status CostAnalyzer::GenerateReport(std::ostream& os, bool per_node_report) {
+Status CostAnalyzer::GenerateReport(std::ostream& os, bool per_node_report,
+                                    bool verbose) {
   GatherCosts();
   PreprocessCosts();
   AnalyzeCosts();
-  PrintAnalysis(os, per_node_report);
+  PrintAnalysis(os, per_node_report, verbose);
   return Status::OK();
 }
 
@@ -158,7 +159,8 @@ void CostAnalyzer::AnalyzeCosts() {
   }
 }
 
-void CostAnalyzer::PrintAnalysis(std::ostream& os, bool per_node_report) const {
+void CostAnalyzer::PrintAnalysis(std::ostream& os, bool per_node_report,
+                                 bool verbose) const {
   os << std::endl;
   os << std::left << std::setw(50)
      << "Total time measured in ns (serialized): " << std::right
@@ -227,10 +229,55 @@ void CostAnalyzer::PrintAnalysis(std::ostream& os, bool per_node_report) const {
   os << std::endl;
 
   if (per_node_report) {
-    os << "Below is the per-node report:" << std::endl;
-    os << op_perf_.DebugString();
+    if (verbose) {
+      os << "Below is the full per-node report:" << std::endl;
+      os << op_perf_.DebugString();
+    } else {
+      os << "Below is the per-node report summary:" << std::endl;
+      int width = 35;
+      int width_narrow = 15;
+      int width_wide = 20;
+      os << std::setw(width + 1) << "Op,";
+      os << std::setw(width_wide + 1) << "Measured time (ns),";
+      os << std::setw(width_wide + 1) << "Compute time (ns),";
+      os << std::setw(width_wide + 1) << "Memory time (ns),";
+      os << std::setw(width_narrow + 2) << "Compute eff,";
+      os << std::setw(width_narrow + 2) << "Memory eff,";
+      os << "    Inputs" << std::endl;
+      for (int i = 0; i < op_perf_.op_performance_size(); i++) {
+        const auto& perf = op_perf_.op_performance(i);
+        string op_name = perf.op().op();
+        os << std::setw(width) << op_name << ",";
+        os << std::setw(width_wide) << perf.compute_cost() << ",";
+        os << std::setw(width_wide) << perf.compute_time() << ",";
+        os << std::setw(width_wide) << perf.memory_time() << ",";
+        os << std::setw(width_narrow) << std::setprecision(2)
+           << perf.compute_efficiency() * 100 << "%,";
+        os << std::setw(width_narrow) << std::setprecision(2)
+           << perf.memory_efficiency() * 100 << "%,";
+        os << "    [";
+        for (int j = 0; j < perf.op().inputs_size(); j++) {
+          const auto& shape = perf.op().inputs(j).shape();
+          if (shape.dim_size() > 0) {
+            os << "(";
+            std::vector<int> dims;
+            for (int k = 0; k < shape.dim_size(); k++) {
+              os << shape.dim(k).size();
+              if (k < shape.dim_size() - 1) {
+                os << ", ";
+              }
+            }
+            os << ")";
+            if (j < perf.op().inputs_size() - 1) {
+              os << ", ";
+            }
+          }
+        }
+        os << "]" << std::endl;
+      }
+      os << std::endl;
+    }
   }
 }
-
 }  // end namespace grappler
 }  // end namespace tensorflow
diff --git a/tensorflow/python/grappler/cost_analyzer.h b/tensorflow/python/grappler/cost_analyzer.h
index 0e860e0fee9923510292d3cf1a8069435787476f..b5364aa37ab2fbbeb0a33e6764539cca795f2fa6 100644
--- a/tensorflow/python/grappler/cost_analyzer.h
+++ b/tensorflow/python/grappler/cost_analyzer.h
@@ -19,6 +19,7 @@ limitations under the License.
 #include <iostream>
 #include "tensorflow/core/framework/cost_graph.pb.h"
 #include "tensorflow/core/framework/graph.pb.h"
+#include "tensorflow/core/framework/tensor_shape.pb.h"
 #include "tensorflow/core/grappler/clusters/cluster.h"
 #include "tensorflow/core/grappler/costs/analytical_cost_estimator.h"
 #include "tensorflow/core/grappler/costs/cost_estimator.h"
@@ -50,7 +51,7 @@ class CostAnalyzer {
  public:
   explicit CostAnalyzer(const GrapplerItem& item, Cluster* cluster,
                         const string& suffix);
-  Status GenerateReport(std::ostream& os, bool per_node_report);
+  Status GenerateReport(std::ostream& os, bool per_node_report, bool verbose);
 
  private:
   void PredictCosts(CostEstimator* cost_estimator, CostGraphDef* cost_graph,
@@ -59,7 +60,8 @@ class CostAnalyzer {
   void PreprocessCosts();
   void AnalyzeCosts();
   void SortOpsByTime(std::map<string, OpPerfSummary> ops);
-  void PrintAnalysis(std::ostream& os, bool per_node_report) const;
+  void PrintAnalysis(std::ostream& os, bool per_node_report,
+                     bool verbose) const;
 
   const GrapplerItem* item_;
   MeasuringCostEstimator measure_estimator_;
diff --git a/tensorflow/python/grappler/cost_analyzer.i b/tensorflow/python/grappler/cost_analyzer.i
index 4c0953435ba3fa6423bbc869fcca909d0c2ccb25..8f7fdb47f267bea582e371eb9ea6982b6e9341ad 100644
--- a/tensorflow/python/grappler/cost_analyzer.i
+++ b/tensorflow/python/grappler/cost_analyzer.i
@@ -44,7 +44,7 @@ limitations under the License.
 
 %{
 string GenerateCostReport(const tensorflow::MetaGraphDef& metagraph, bool per_node_report,
-                          GCluster cluster) {
+                          bool verbose, GCluster cluster) {
   tensorflow::grappler::ItemConfig cfg;
   cfg.apply_optimizations = false;
   std::unique_ptr<tensorflow::grappler::GrapplerItem> item =
@@ -57,11 +57,11 @@ string GenerateCostReport(const tensorflow::MetaGraphDef& metagraph, bool per_no
   tensorflow::grappler::CostAnalyzer analyzer(*item, cluster.get(), suffix);
 
   std::stringstream os;
-  analyzer.GenerateReport(os, per_node_report);
+  analyzer.GenerateReport(os, per_node_report, verbose);
   return os.str();
 }
 
 %}
 
 string GenerateCostReport(const tensorflow::MetaGraphDef& metagraph, bool per_node_report,
-                          GCluster cluster);
+                          bool verbose, GCluster cluster);
diff --git a/tensorflow/python/grappler/cost_analyzer.py b/tensorflow/python/grappler/cost_analyzer.py
index a1ff915c61ba14d9a899d7f6c9a2c49855969b00..6a4690e91ba981706eed0d9fdfae2e64359d0416 100644
--- a/tensorflow/python/grappler/cost_analyzer.py
+++ b/tensorflow/python/grappler/cost_analyzer.py
@@ -24,7 +24,10 @@ from tensorflow.python.grappler import cluster as gcluster
 from tensorflow.python.grappler import item as gitem
 
 
-def GenerateCostReport(metagraph, per_node_report=False, cluster=None):
+def GenerateCostReport(metagraph,
+                       per_node_report=False,
+                       verbose=False,
+                       cluster=None):
   """Analyze the cost of each TensorFlow op and node in the provided metagraph.
 
   Args:
@@ -32,6 +35,7 @@ def GenerateCostReport(metagraph, per_node_report=False, cluster=None):
     per_node_report: by default the report contains stats aggregated on a per op
       type basis, setting per_node_report to True adds results for each
       individual node to the report.
+    verbose: Prints out the entire operation proto instead of a summary table.
     cluster: Analyze the costs using the specified cluster, or the local machine
       if no cluster was specified.
 
@@ -42,8 +46,9 @@ def GenerateCostReport(metagraph, per_node_report=False, cluster=None):
     cluster = gcluster.Cluster(disable_detailed_stats=False)
 
   with errors.raise_exception_on_not_ok_status():
-    ret_from_swig = tf_wrap.GenerateCostReport(
-        metagraph.SerializeToString(), per_node_report, cluster.tf_cluster)
+    ret_from_swig = tf_wrap.GenerateCostReport(metagraph.SerializeToString(),
+                                               per_node_report, verbose,
+                                               cluster.tf_cluster)
   return ret_from_swig
 
 
diff --git a/tensorflow/python/grappler/cost_analyzer_test.py b/tensorflow/python/grappler/cost_analyzer_test.py
index 511908c79ce47d6849bf97d11bc42f2f1bb13f18..b8225b81a52f1a2ee10663544d54f1c9bd7ee785 100644
--- a/tensorflow/python/grappler/cost_analyzer_test.py
+++ b/tensorflow/python/grappler/cost_analyzer_test.py
@@ -48,7 +48,7 @@ class CostAnalysisTest(test.TestCase):
     train_op.append(d)
     mg = meta_graph.create_meta_graph_def(graph=ops.get_default_graph())
 
-    report = cost_analyzer.GenerateCostReport(mg)
+    report = cost_analyzer.GenerateCostReport(mg, per_node_report=True)
 
     # Check the report headers
     self.assertTrue(b"Total time measured in ns (serialized):" in report)
@@ -57,6 +57,26 @@ class CostAnalysisTest(test.TestCase):
     self.assertTrue(b"Total time analytical in ns (lower bound):" in report)
     self.assertTrue(b"Overall efficiency (analytical upper/actual):" in report)
     self.assertTrue(b"Overall efficiency (analytical lower/actual):" in report)
+    self.assertTrue(b"Below is the per-node report summary:" in report)
+
+    # Also print the report to make it easier to debug
+    print("{}".format(report))
+
+  def testVerbose(self):
+    """Make sure the full report is generated with verbose=True."""
+    a = constant_op.constant(10, name="a")
+    b = constant_op.constant(20, name="b")
+    c = math_ops.add_n([a, b], name="c")
+    d = math_ops.add_n([b, c], name="d")
+    train_op = ops.get_collection_ref(ops.GraphKeys.TRAIN_OP)
+    train_op.append(d)
+    mg = meta_graph.create_meta_graph_def(graph=ops.get_default_graph())
+
+    report = cost_analyzer.GenerateCostReport(
+        mg, per_node_report=True, verbose=True)
+
+    # Check the report headers
+    self.assertTrue(b"Below is the full per-node report:" in report)
 
     # Also print the report to make it easier to debug
     print("{}".format(report))
diff --git a/tensorflow/python/grappler/cost_analyzer_tool.py b/tensorflow/python/grappler/cost_analyzer_tool.py
index 86db87d51530621d71b9b90f145e0f10d0b72443..0853db252406966cec36b63efafec9ec755c7e87 100644
--- a/tensorflow/python/grappler/cost_analyzer_tool.py
+++ b/tensorflow/python/grappler/cost_analyzer_tool.py
@@ -35,7 +35,8 @@ from tensorflow.python.platform import gfile
 from tensorflow.python.training import saver
 
 
-def main(_):
+def get_metagraph():
+  """Constructs and returns a MetaGraphDef from the input file."""
   if FLAGS.metagraphdef:
     with gfile.GFile(FLAGS.metagraphdef) as meta_file:
       metagraph = meta_graph_pb2.MetaGraphDef()
@@ -45,7 +46,8 @@ def main(_):
         metagraph.ParseFromString(meta_file.read())
     if FLAGS.fetch is not None:
       fetch_collection = meta_graph_pb2.CollectionDef()
-      fetch_collection.node_list.value.append(FLAGS.fetch)
+      for fetch in FLAGS.fetch.split(","):
+        fetch_collection.node_list.value.append(fetch)
       metagraph.collection_def["train_op"].CopyFrom(fetch_collection)
   else:
     with gfile.GFile(FLAGS.graphdef) as graph_file:
@@ -56,21 +58,28 @@ def main(_):
         graph_def.ParseFromString(graph_file.read())
       importer.import_graph_def(graph_def, name="")
       graph = ops.get_default_graph()
-      fetch = graph.get_operation_by_name(FLAGS.fetch)
-      graph.add_to_collection("train_op", fetch)
+      for fetch in FLAGS.fetch.split(","):
+        fetch_op = graph.get_operation_by_name(fetch)
+        graph.add_to_collection("train_op", fetch_op)
       metagraph = saver.export_meta_graph(
           graph_def=graph.as_graph_def(), graph=graph)
+  return metagraph
+
 
+def main(_):
+  metagraph = get_metagraph()
   rewriter_config = rewriter_config_pb2.RewriterConfig()
   if FLAGS.rewriter_config is not None:
     text_format.Merge(FLAGS.rewriter_config, rewriter_config)
   optimized_graph = tf_optimizer.OptimizeGraph(rewriter_config, metagraph)
   metagraph.graph_def.CopyFrom(optimized_graph)
 
-  report = cost_analyzer.GenerateCostReport(metagraph, FLAGS.per_node_report)
-  print(report)
-  report = cost_analyzer.GenerateMemoryReport(metagraph)
+  report = cost_analyzer.GenerateCostReport(metagraph, FLAGS.per_node_report,
+                                            FLAGS.verbose)
   print(report)
+  if FLAGS.memory_report:
+    report = cost_analyzer.GenerateMemoryReport(metagraph)
+    print(report)
 
 
 if __name__ == "__main__":
@@ -89,9 +98,7 @@ if __name__ == "__main__":
       "--fetch",
       type=str,
       default=None,
-      help=
-      "The name of the fetch node."
-  )
+      help="The names of the fetch nodes, delimited by commas.")
   parser.add_argument(
       "--rewriter_config",
       type=str,
@@ -107,5 +114,13 @@ if __name__ == "__main__":
       help="Generate per-node report. By default the report contains stats "
       "aggregated on a per op type basis, per_node_report adds results "
       "for each individual node to the report.")
+  parser.add_argument(
+      "--memory_report",
+      action="store_true",
+      help="Generate memory usage report.")
+  parser.add_argument(
+      "--verbose",
+      action="store_true",
+      help="Generate verbose reports. By default, succinct reports are used.")
   FLAGS, unparsed = parser.parse_known_args()
   app.run(main=main, argv=[sys.argv[0]] + unparsed)
diff --git a/tensorflow/python/grappler/graph_placer.py b/tensorflow/python/grappler/graph_placer.py
new file mode 100644
index 0000000000000000000000000000000000000000..1cd51df4d962583555e08ae973ab43d15ba01997
--- /dev/null
+++ b/tensorflow/python/grappler/graph_placer.py
@@ -0,0 +1,120 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Graph Placer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import time
+from tensorflow.core.protobuf import meta_graph_pb2
+from tensorflow.core.protobuf import rewriter_config_pb2
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops as tf_ops
+from tensorflow.python.grappler import cluster as gcluster
+from tensorflow.python.grappler import hierarchical_controller
+from tensorflow.python.grappler import item as gitem
+from tensorflow.python.grappler import tf_optimizer
+from tensorflow.python.training import training
+
+
+def PlaceGraph(metagraph,
+               cluster=None,
+               allotted_time=3600,
+               hparams=None,
+               verbose=False):
+  """Place the provided metagraph.
+
+  Args:
+    metagraph: the metagraph to place.
+    cluster: an optional set of hardware resources to optimize the placement
+      for. If none is specified, we'll optimize the placement for the hardware
+      available on the local machine.
+    allotted_time: the maximum amount of time in seconds to spend optimizing
+      the placement.
+    hparams: hyperparameters used to fine tune the placer.
+    verbose: prints debug information if True.
+
+  Returns:
+    The placed metagraph.
+  """
+  if cluster is None:
+    cluster = gcluster.Cluster()
+
+  # Optimize the metagraph to speed up the placement.
+  rewriter_config = rewriter_config_pb2.RewriterConfig()
+  rewriter_config.optimizers.append("pruning")
+  rewriter_config.optimizers.append("constfold")
+  rewriter_config.optimizers.append("arithmetic")
+  rewriter_config.optimizers.append("dependency")
+  rewriter_config.optimizers.append("pruning")
+  optimized_graph = tf_optimizer.OptimizeGraph(
+      rewriter_config, metagraph, verbose=verbose, cluster=cluster)
+  optimized_metagraph = meta_graph_pb2.MetaGraphDef()
+  optimized_metagraph.CopyFrom(metagraph)
+  optimized_metagraph.graph_def.CopyFrom(optimized_graph)
+
+  item = gitem.Item(optimized_metagraph)
+
+  if hparams is None:
+    hparams = hierarchical_controller.hierarchical_controller_hparams()
+  # We run with a single child
+  hparams.num_children = 1
+
+  # Measure the runtime achievable with the original placement.
+  try:
+    _, original_run_time, _ = cluster.MeasureCosts(item)
+    if verbose:
+      print("Runtime for original placement: " + str(original_run_time))
+  except errors.OpError as e:
+    if verbose:
+      print("Original placement isn't feasible: " + str(e))
+    original_run_time = hparams.failing_signal
+
+  with tf_ops.Graph().as_default():
+    # Place all the nodes of the controller on the CPU. We don't want them to
+    # fight for accelerator memory with the model to optimize.
+    with tf_ops.device("/device:CPU:0"):
+      model = hierarchical_controller.HierarchicalController(
+          hparams, item, cluster)
+      ops = model.build_controller()
+      session_creator = training.ChiefSessionCreator()
+      with training.MonitoredSession(session_creator=session_creator) as sess:
+        start_time = time.time()
+        current_time = start_time
+        while current_time - start_time < allotted_time:
+          grouping_actions = model.generate_grouping(sess)
+          input_to_seq2seq = model.create_group_embeddings(
+              grouping_actions, verbose=verbose)
+          model.generate_placement(input_to_seq2seq, sess)
+          try:
+            run_time = model.eval_placement(
+                sess,
+                verbose=verbose)
+          except errors.OpError as e:
+            if verbose:
+              print("Failed to run graph: " + str(e))
+            run_time = hparams.failing_signal
+          updated = model.update_reward(sess, run_time, verbose=verbose)
+          if updated and run_time < original_run_time:
+            if verbose:
+              print("Found better placement, with runtime " + str(run_time))
+            model.export_placement(metagraph)
+
+          model.process_reward(sess)
+
+          current_time = time.time()
+
+  return metagraph
diff --git a/tensorflow/python/grappler/graph_placer_test.py b/tensorflow/python/grappler/graph_placer_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..9eabe3cd5437022eb3b98010d0f384cc9f6bac2a
--- /dev/null
+++ b/tensorflow/python/grappler/graph_placer_test.py
@@ -0,0 +1,140 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests the graph placer."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+from tensorflow.core.protobuf import device_properties_pb2
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import meta_graph
+from tensorflow.python.framework import ops as tf_ops
+from tensorflow.python.grappler import cluster
+from tensorflow.python.grappler import graph_placer
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import random_ops
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.platform import test
+
+
+class GraphPlacerTest(test.TestCase):
+
+  @staticmethod
+  def _buildMnist(batch_size=128,
+                  input_size=256,
+                  num_classes=1024,
+                  num_layers=10,
+                  hidden_size=256,
+                  name='mnist'):
+    g = tf_ops.get_default_graph()
+    with g.as_default():
+      ops = {}
+      x = random_ops.random_uniform(
+          [batch_size, input_size], -0.1, 0.1, dtype=dtypes.float32)
+      for layer_id in range(num_layers):
+        with variable_scope.variable_scope('layer_{}'.format(layer_id)):
+          a = input_size if layer_id == 0 else hidden_size
+          b = hidden_size if layer_id < num_layers - 1 else num_classes
+          w = variable_scope.get_variable('w', [a, b])
+          x = math_ops.matmul(x, w)
+          x = nn_ops.relu(x)
+      ops['y_preds'] = math_ops.argmax(x, axis=1)
+
+    train_op = g.get_collection_ref(tf_ops.GraphKeys.TRAIN_OP)
+    train_op.append(ops['y_preds'])
+    return g
+
+  @staticmethod
+  def _buildCluster(num_cpus=1, num_gpus=1):
+    devices = []
+    if num_gpus > 0:
+      device_properties = device_properties_pb2.DeviceProperties(
+          type='GPU',
+          vendor='NVidia',
+          model='GeForce GTX TITAN X',
+          frequency=1076,
+          num_cores=24,
+          environment={'architecture': '5.2',
+                       'cuda': '8000',
+                       'cudnn': '6021'},
+          num_registers=65536,
+          l1_cache_size=24576,
+          l2_cache_size=3145728,
+          shared_memory_size_per_multiprocessor=98304,
+          memory_size=12783648768,
+          bandwidth=336480000)
+      for i in range(num_gpus):
+        devices.append(
+            device_properties_pb2.NamedDevice(
+                properties=device_properties, name='/GPU:' + str(i)))
+
+    assert num_cpus > 0
+    device_properties = device_properties_pb2.DeviceProperties(
+        type='CPU',
+        frequency=2000,
+        num_cores=4,
+        l1_cache_size=32768,
+        l2_cache_size=262144,
+        l3_cache_size=12582912)
+    for i in range(num_cpus):
+      devices.append(
+          device_properties_pb2.NamedDevice(
+              properties=device_properties, name='/CPU:' + str(i)))
+
+    return cluster.Cluster(devices=devices)
+
+  def testBasic(self):
+    """Place a trivial graph."""
+    a = constant_op.constant(10, name='a')
+    b = constant_op.constant(20, name='b')
+    c = math_ops.add_n([a, b], name='c')
+    d = math_ops.add_n([b, c], name='d')
+    train_op = tf_ops.get_collection_ref(tf_ops.GraphKeys.TRAIN_OP)
+    train_op.append(d)
+    mg = meta_graph.create_meta_graph_def(graph=tf_ops.get_default_graph())
+
+    gcluster = cluster.Cluster()
+    placed_mg = graph_placer.PlaceGraph(mg, allotted_time=15, cluster=gcluster)
+
+    self.assertEqual(4, len(placed_mg.graph_def.node))
+    self.assertItemsEqual([node.name for node in placed_mg.graph_def.node],
+                          [node.name for node in mg.graph_def.node])
+
+    available_devices = [device.name for device in gcluster.ListDevices()]
+    for node in placed_mg.graph_def.node:
+      # The constant nodes are optimized away before the placer is run, and
+      # therefore won't be placed.
+      self.assertTrue(not node.device or node.device in available_devices)
+
+  def testMNIST(self):
+    graph = GraphPlacerTest._buildMnist()
+    mg = meta_graph.create_meta_graph_def(graph=graph)
+    gcluster = GraphPlacerTest._buildCluster(num_gpus=1)
+    # Spend 15 seconds trying to optimize the placement of the model. This
+    # should give us enough time to exercise the code, but not enough to find
+    # a good placement, so we'll just check for legality.
+    placed_mg = graph_placer.PlaceGraph(mg, allotted_time=15, cluster=gcluster)
+    self.assertEqual(len(placed_mg.graph_def.node), len(mg.graph_def.node))
+    self.assertItemsEqual([node.name for node in placed_mg.graph_def.node],
+                          [node.name for node in mg.graph_def.node])
+    available_devices = [device.name for device in gcluster.ListDevices()]
+    for node in placed_mg.graph_def.node:
+      self.assertTrue(not node.device or node.device in available_devices)
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/grappler/hierarchical_controller.py b/tensorflow/python/grappler/hierarchical_controller.py
new file mode 100644
index 0000000000000000000000000000000000000000..c0866c1069ac7f7e25cbd12cb5a490e2ed5e4bec
--- /dev/null
+++ b/tensorflow/python/grappler/hierarchical_controller.py
@@ -0,0 +1,1117 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""HierarchicalController Class.
+
+The HierarchicalController encompasses the entire lifecycle of training the
+device placement policy, including generating op embeddings, getting groups for
+each op, placing those groups and running the predicted placements.
+
+Different assignment models can inherit from this class.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import math
+import numpy as np
+import six
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import errors
+from tensorflow.python.framework import ops as tf_ops
+from tensorflow.python.grappler.controller import Controller
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import clip_ops
+from tensorflow.python.ops import control_flow_ops
+from tensorflow.python.ops import embedding_ops
+from tensorflow.python.ops import init_ops
+from tensorflow.python.ops import linalg_ops
+from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import random_ops
+from tensorflow.python.ops import state_ops
+from tensorflow.python.ops import tensor_array_ops
+from tensorflow.python.ops import variable_scope
+from tensorflow.python.summary import summary
+from tensorflow.python.training import adam
+from tensorflow.python.training import gradient_descent
+from tensorflow.python.training import learning_rate_decay
+from tensorflow.python.training import training_util
+
+
+class PlacerParams(object):
+  """Class to hold a set of placement parameters as name-value pairs.
+
+  A typical usage is as follows:
+
+  ```python
+  # Create a PlacerParams object specifying names and values of the model
+  # parameters:
+  params = PlacerParams(hidden_size=128, decay_steps=50)
+
+  # The parameters are available as attributes of the PlacerParams object:
+  params.hidden_size ==> 128
+  params.decay_steps ==> 50
+  ```
+
+  """
+
+  def __init__(self, **kwargs):
+    """Create an instance of `PlacerParams` from keyword arguments.
+
+    The keyword arguments specify name-value pairs for the parameters.
+    The parameter types are inferred from the type of the values passed.
+
+    The parameter names are added as attributes of `PlacerParams` object,
+    and they can be accessed directly with the dot notation `params._name_`.
+
+    Example:
+
+    ```python
+    # Define 1 parameter: 'hidden_size'
+    params = PlacerParams(hidden_size=128)
+    params.hidden_size ==> 128
+    ```
+
+    Args:
+      **kwargs: Key-value pairs where the key is the parameter name and
+        the value is the value for the parameter.
+    """
+    for name, value in six.iteritems(kwargs):
+      self.add_param(name, value)
+
+  def add_param(self, name, value):
+    """Adds {name, value} pair to hyperparameters.
+
+    Args:
+      name: Name of the hyperparameter.
+      value: Value of the hyperparameter. Can be one of the following types:
+        int, float, string, int list, float list, or string list.
+
+    Raises:
+      ValueError: if one of the arguments is invalid.
+    """
+    # Keys in kwargs are unique, but 'name' could be the name of a pre-existing
+    # attribute of this object.  In that case we refuse to use it as a
+    # parameter name.
+    if getattr(self, name, None) is not None:
+      raise ValueError("Parameter name is reserved: %s" % name)
+    setattr(self, name, value)
+
+
+def hierarchical_controller_hparams():
+  """Hyperparameters for hierarchical planner."""
+  return PlacerParams(
+      hidden_size=512,
+      forget_bias_init=1.0,
+      temperature=1.0,
+      logits_std_noise=0.5,
+      stop_noise_step=750,
+      decay_steps=50,
+      max_num_outputs=5,
+      max_output_size=5,
+      tanh_constant=1.0,
+      adj_embed_dim=20,
+      grouping_hidden_size=64,
+      num_groups=None,
+      bi_lstm=True,
+      failing_signal=100,
+      stop_sampling=500,
+      start_with_failing_signal=True,
+      always_update_baseline=False,
+      bl_dec=0.9,
+      grad_bound=1.0,
+      lr=0.1,
+      lr_dec=0.95,
+      start_decay_step=400,
+      optimizer_type="adam",
+      stop_updating_after_steps=1000,
+      name="hierarchical_controller",
+      keep_prob=1.0,
+      reward_function="sqrt",
+      seed=1234,
+      # distributed training params
+      num_children=1)
+
+
+class HierarchicalController(Controller):
+  """HierarchicalController class."""
+
+  def __init__(self, hparams, item, cluster, controller_id=0):
+    """HierarchicalController class initializer.
+
+    Args:
+      hparams: All hyper-parameters.
+      item: The metagraph to place.
+      cluster: The cluster of hardware devices to optimize for.
+      controller_id: the id of the controller in a multi-controller setup.
+    """
+    super(HierarchicalController, self).__init__(item, cluster)
+    self.ctrl_id = controller_id
+    self.hparams = hparams
+
+    if self.hparams.num_groups is None:
+      self.num_groups = min(256, 20 * self.num_devices)
+    else:
+      self.num_groups = self.hparams.num_groups
+
+    # creates self.op_embeddings and self.type_dict
+    self.create_op_embeddings(verbose=False)
+    # TODO(azalia) clean up embedding/group_embedding_size names
+    self.group_emb_size = (
+        2 * self.num_groups + len(self.type_dict) +
+        self.hparams.max_num_outputs * self.hparams.max_output_size)
+    self.embedding_size = self.group_emb_size
+    self.initializer = init_ops.glorot_uniform_initializer(
+        seed=self.hparams.seed)
+
+    with variable_scope.variable_scope(
+        self.hparams.name,
+        initializer=self.initializer,
+        reuse=variable_scope.AUTO_REUSE):
+      # define parameters of feedforward
+      variable_scope.get_variable("w_grouping_ff", [
+          1 + self.hparams.max_num_outputs * self.hparams.max_output_size +
+          self.hparams.adj_embed_dim, self.hparams.grouping_hidden_size
+      ])
+      variable_scope.get_variable(
+          "w_grouping_softmax",
+          [self.hparams.grouping_hidden_size, self.num_groups])
+      if self.hparams.bi_lstm:
+        variable_scope.get_variable("encoder_lstm_forward", [
+            self.embedding_size + self.hparams.hidden_size / 2,
+            2 * self.hparams.hidden_size
+        ])
+        variable_scope.get_variable("encoder_lstm_backward", [
+            self.embedding_size + self.hparams.hidden_size / 2,
+            2 * self.hparams.hidden_size
+        ])
+        variable_scope.get_variable(
+            "device_embeddings", [self.num_devices, self.hparams.hidden_size])
+        variable_scope.get_variable(
+            "decoder_lstm",
+            [2 * self.hparams.hidden_size, 4 * self.hparams.hidden_size])
+        variable_scope.get_variable(
+            "device_softmax", [2 * self.hparams.hidden_size, self.num_devices])
+        variable_scope.get_variable("device_go_embedding",
+                                    [1, self.hparams.hidden_size])
+        variable_scope.get_variable(
+            "encoder_forget_bias",
+            shape=1,
+            dtype=dtypes.float32,
+            initializer=init_ops.constant_initializer(
+                self.hparams.forget_bias_init))
+        variable_scope.get_variable(
+            "decoder_forget_bias",
+            shape=1,
+            dtype=dtypes.float32,
+            initializer=init_ops.constant_initializer(
+                self.hparams.forget_bias_init))
+        variable_scope.get_variable(
+            "attn_w_1", [self.hparams.hidden_size, self.hparams.hidden_size])
+        variable_scope.get_variable(
+            "attn_w_2", [self.hparams.hidden_size, self.hparams.hidden_size])
+        variable_scope.get_variable("attn_v", [self.hparams.hidden_size, 1])
+
+      else:
+        variable_scope.get_variable("encoder_lstm", [
+            self.embedding_size + self.hparams.hidden_size,
+            4 * self.hparams.hidden_size
+        ])
+        variable_scope.get_variable(
+            "device_embeddings", [self.num_devices, self.hparams.hidden_size])
+        variable_scope.get_variable(
+            "decoder_lstm",
+            [2 * self.hparams.hidden_size, 4 * self.hparams.hidden_size])
+        variable_scope.get_variable(
+            "device_softmax", [2 * self.hparams.hidden_size, self.num_devices])
+        variable_scope.get_variable("device_go_embedding",
+                                    [1, self.hparams.hidden_size])
+        variable_scope.get_variable(
+            "encoder_forget_bias",
+            shape=1,
+            dtype=dtypes.float32,
+            initializer=init_ops.constant_initializer(
+                self.hparams.forget_bias_init))
+        variable_scope.get_variable(
+            "decoder_forget_bias",
+            shape=1,
+            dtype=dtypes.float32,
+            initializer=init_ops.constant_initializer(
+                self.hparams.forget_bias_init))
+        variable_scope.get_variable(
+            "attn_w_1", [self.hparams.hidden_size, self.hparams.hidden_size])
+        variable_scope.get_variable(
+            "attn_w_2", [self.hparams.hidden_size, self.hparams.hidden_size])
+        variable_scope.get_variable("attn_v", [self.hparams.hidden_size, 1])
+    seq2seq_input_layer = array_ops.placeholder_with_default(
+        array_ops.zeros([self.hparams.num_children,
+                         self.num_groups,
+                         self.group_emb_size],
+                        dtypes.float32),
+        shape=(self.hparams.num_children, self.num_groups, self.group_emb_size))
+    self.seq2seq_input_layer = seq2seq_input_layer
+
+  def compute_reward(self, run_time):
+    if self.hparams.reward_function == "id":
+      reward = run_time
+    elif self.hparams.reward_function == "sqrt":
+      reward = math.sqrt(run_time)
+    elif self.hparams.reward_function == "log":
+      reward = math.log1p(run_time)
+    else:
+      raise NotImplementedError(
+          "Unrecognized reward function '%s', check your "
+          "--reward_function flag value." % self.hparams.reward_function)
+    return reward
+
+  def build_controller(self):
+    """RL optimization interface.
+
+    Returns:
+      ops: A dictionary holding handles of the model used for training.
+    """
+
+    self._global_step = training_util.get_or_create_global_step()
+    ops = {}
+    ops["loss"] = 0
+
+    failing_signal = self.compute_reward(self.hparams.failing_signal)
+
+    ctr = {}
+
+    with tf_ops.name_scope("controller_{}".format(self.ctrl_id)):
+      with variable_scope.variable_scope("controller_{}".format(self.ctrl_id)):
+        ctr["reward"] = {"value": [], "ph": [], "update": []}
+        ctr["ready"] = {"value": [], "ph": [], "update": []}
+        ctr["best_reward"] = {"value": [], "update": []}
+        for i in range(self.hparams.num_children):
+          reward_value = variable_scope.get_local_variable(
+              "reward_{}".format(i),
+              initializer=0.0,
+              dtype=dtypes.float32,
+              trainable=False)
+          reward_ph = array_ops.placeholder(
+              dtypes.float32, shape=(), name="reward_ph_{}".format(i))
+          reward_update = state_ops.assign(
+              reward_value, reward_ph, use_locking=True)
+          ctr["reward"]["value"].append(reward_value)
+          ctr["reward"]["ph"].append(reward_ph)
+          ctr["reward"]["update"].append(reward_update)
+          best_reward = variable_scope.get_local_variable(
+              "best_reward_{}".format(i),
+              initializer=failing_signal,
+              dtype=dtypes.float32,
+              trainable=False)
+          ctr["best_reward"]["value"].append(best_reward)
+          ctr["best_reward"]["update"].append(
+              state_ops.assign(best_reward,
+                               math_ops.minimum(best_reward, reward_update)))
+
+          ready_value = variable_scope.get_local_variable(
+              "ready_{}".format(i),
+              initializer=True,
+              dtype=dtypes.bool,
+              trainable=False)
+          ready_ph = array_ops.placeholder(
+              dtypes.bool, shape=(), name="ready_ph_{}".format(i))
+          ready_update = state_ops.assign(
+              ready_value, ready_ph, use_locking=True)
+          ctr["ready"]["value"].append(ready_value)
+          ctr["ready"]["ph"].append(ready_ph)
+          ctr["ready"]["update"].append(ready_update)
+
+      ctr["grouping_y_preds"], ctr["grouping_log_probs"] = self.get_groupings()
+      summary.histogram(
+          "grouping_actions",
+          array_ops.slice(ctr["grouping_y_preds"]["sample"], [0, 0],
+                          [1, array_ops.shape(self.op_embeddings)[0]]))
+
+      with variable_scope.variable_scope("controller_{}".format(self.ctrl_id)):
+        ctr["baseline"] = variable_scope.get_local_variable(
+            "baseline",
+            initializer=failing_signal
+            if self.hparams.start_with_failing_signal else 0.0,
+            dtype=dtypes.float32,
+            trainable=False)
+
+      new_baseline = self.hparams.bl_dec * ctr["baseline"] + (
+          1 - self.hparams.bl_dec) * math_ops.reduce_mean(
+              ctr["reward"]["value"])
+      if not self.hparams.always_update_baseline:
+        baseline_mask = math_ops.less(ctr["reward"]["value"], failing_signal)
+        selected_reward = array_ops.boolean_mask(ctr["reward"]["value"],
+                                                 baseline_mask)
+        selected_baseline = control_flow_ops.cond(
+            math_ops.reduce_any(baseline_mask),
+            lambda: math_ops.reduce_mean(selected_reward),
+            lambda: constant_op.constant(0, dtype=dtypes.float32))
+        ctr["pos_reward"] = selected_baseline
+        pos_ = math_ops.less(
+            constant_op.constant(0, dtype=dtypes.float32), selected_baseline)
+        selected_baseline = self.hparams.bl_dec * ctr["baseline"] + (
+            1 - self.hparams.bl_dec) * selected_baseline
+        selected_baseline = control_flow_ops.cond(
+            pos_, lambda: selected_baseline, lambda: ctr["baseline"])
+        new_baseline = control_flow_ops.cond(
+            math_ops.less(self.global_step,
+                          self.hparams.stop_updating_after_steps),
+            lambda: new_baseline, lambda: selected_baseline)
+      ctr["baseline_update"] = state_ops.assign(
+          ctr["baseline"], new_baseline, use_locking=True)
+
+      ctr["y_preds"], ctr["log_probs"] = self.get_placements()
+      summary.histogram("actions", ctr["y_preds"]["sample"])
+      mask = math_ops.less(ctr["reward"]["value"], failing_signal)
+      ctr["loss"] = ctr["reward"]["value"] - ctr["baseline"]
+      ctr["loss"] *= (
+          ctr["log_probs"]["sample"] + ctr["grouping_log_probs"]["sample"])
+
+      selected_loss = array_ops.boolean_mask(ctr["loss"], mask)
+      selected_loss = control_flow_ops.cond(
+          math_ops.reduce_any(mask),
+          lambda: math_ops.reduce_mean(-selected_loss),
+          lambda: constant_op.constant(0, dtype=dtypes.float32))
+
+      ctr["loss"] = control_flow_ops.cond(
+          math_ops.less(self.global_step,
+                        self.hparams.stop_updating_after_steps),
+          lambda: math_ops.reduce_mean(-ctr["loss"]), lambda: selected_loss)
+
+      ctr["reward_s"] = math_ops.reduce_mean(ctr["reward"]["value"])
+      summary.scalar("loss", ctr["loss"])
+      summary.scalar("avg_reward", ctr["reward_s"])
+      summary.scalar("best_reward_so_far", best_reward)
+      summary.scalar(
+          "advantage",
+          math_ops.reduce_mean(ctr["reward"]["value"] - ctr["baseline"]))
+
+    with variable_scope.variable_scope(
+        "optimizer", reuse=variable_scope.AUTO_REUSE):
+      (ctr["train_op"], ctr["lr"], ctr["grad_norm"],
+       ctr["grad_norms"]) = self._get_train_ops(
+           ctr["loss"],
+           tf_ops.get_collection(tf_ops.GraphKeys.TRAINABLE_VARIABLES),
+           self.global_step,
+           grad_bound=self.hparams.grad_bound,
+           lr_init=self.hparams.lr,
+           lr_dec=self.hparams.lr_dec,
+           start_decay_step=self.hparams.start_decay_step,
+           decay_steps=self.hparams.decay_steps,
+           optimizer_type=self.hparams.optimizer_type)
+
+    summary.scalar("gradnorm", ctr["grad_norm"])
+    summary.scalar("lr", ctr["lr"])
+    ctr["summary"] = summary.merge_all()
+    ops["controller"] = ctr
+
+    self.ops = ops
+    return ops
+
+  @property
+  def global_step(self):
+    return self._global_step
+
+  def create_op_embeddings(self, verbose=False):
+    if verbose:
+      print("process input graph for op embeddings")
+    self.num_ops = len(self.important_ops)
+    # topological sort of important nodes
+    topo_order = [op.name for op in self.important_ops]
+
+    # create name to index map for topologically sorted important nodes
+    name_to_topo_order_index = {}
+    for idx, x in enumerate(topo_order):
+      name_to_topo_order_index[x] = idx
+    self.name_to_topo_order_index = name_to_topo_order_index
+
+    # create adj matrix
+    adj_dict = {}
+    for idx, op in enumerate(self.important_ops):
+      for output_op in self.get_node_fanout(op):
+        output_op_name = output_op.name
+        if output_op_name in self.important_op_names:
+          if name_to_topo_order_index[op.name] not in adj_dict:
+            adj_dict[name_to_topo_order_index[op.name]] = []
+          adj_dict[name_to_topo_order_index[op.name]].extend(
+              [name_to_topo_order_index[output_op_name], 1])
+          if name_to_topo_order_index[output_op_name] not in adj_dict:
+            adj_dict[name_to_topo_order_index[output_op_name]] = []
+          adj_dict[name_to_topo_order_index[output_op_name]].extend(
+              [name_to_topo_order_index[op.name], -1])
+
+    # get op_type op_output_shape, and adj info
+    output_embed_dim = (self.hparams.max_num_outputs *
+                        self.hparams.max_output_size)
+
+    # TODO(bsteiner): don't filter based on used ops so that we can generalize
+    # to models that use other types of ops.
+    used_ops = set()
+    for node in self.important_ops:
+      op_type = str(node.op)
+      used_ops.add(op_type)
+
+    self.type_dict = {}
+    for op_type in self.cluster.ListAvailableOps():
+      if op_type in used_ops:
+        self.type_dict[op_type] = len(self.type_dict)
+
+    op_types = np.zeros([self.num_ops], dtype=np.int32)
+    op_output_shapes = np.full(
+        [self.num_ops, output_embed_dim], -1.0, dtype=np.float32)
+    for idx, node in enumerate(self.important_ops):
+      op_types[idx] = self.type_dict[node.op]
+      # output shape
+      op_name = node.name
+      for i, output_prop in enumerate(self.node_properties[op_name]):
+        if output_prop.shape.__str__() == "<unknown>":
+          continue
+        shape = output_prop.shape
+        for j, dim in enumerate(shape.dim):
+          if dim.size >= 0:
+            if i * self.hparams.max_output_size + j >= output_embed_dim:
+              break
+            op_output_shapes[idx,
+                             i * self.hparams.max_output_size + j] = dim.size
+    # adj for padding
+    op_adj = np.full(
+        [self.num_ops, self.hparams.adj_embed_dim], 0, dtype=np.float32)
+    for idx in adj_dict:
+      neighbors = adj_dict[int(idx)]
+      min_dim = min(self.hparams.adj_embed_dim, len(neighbors))
+      padding_size = self.hparams.adj_embed_dim - min_dim
+      neighbors = neighbors[:min_dim] + [0] * padding_size
+      op_adj[int(idx)] = neighbors
+
+    # op_embedding starts here
+    op_embeddings = np.zeros(
+        [
+            self.num_ops,
+            1 + self.hparams.max_num_outputs * self.hparams.max_output_size +
+            self.hparams.adj_embed_dim
+        ],
+        dtype=np.float32)
+    for idx, op_name in enumerate(topo_order):
+      op_embeddings[idx] = np.concatenate(
+          (np.array([op_types[idx]]), op_output_shapes[idx], op_adj[int(idx)]))
+    self.op_embeddings = constant_op.constant(
+        op_embeddings, dtype=dtypes.float32)
+    if verbose:
+      print("num_ops = {}".format(self.num_ops))
+      print("num_types = {}".format(len(self.type_dict)))
+
+  def get_groupings(self, *args, **kwargs):
+    num_children = self.hparams.num_children
+    with variable_scope.variable_scope("controller_{}".format(self.ctrl_id)):
+      grouping_actions_cache = variable_scope.get_local_variable(
+          "grouping_actions_cache",
+          initializer=init_ops.zeros_initializer,
+          dtype=dtypes.int32,
+          shape=[num_children, self.num_ops],
+          trainable=False)
+    input_layer = self.op_embeddings
+    input_layer = array_ops.expand_dims(input_layer, 0)
+    feed_ff_input_layer = array_ops.tile(input_layer, [num_children, 1, 1])
+    grouping_actions, grouping_log_probs = {}, {}
+    grouping_actions["sample"], grouping_log_probs[
+        "sample"] = self.make_grouping_predictions(feed_ff_input_layer)
+
+    grouping_actions["sample"] = state_ops.assign(grouping_actions_cache,
+                                                  grouping_actions["sample"])
+    self.grouping_actions_cache = grouping_actions_cache
+
+    return grouping_actions, grouping_log_probs
+
+  def make_grouping_predictions(self, input_layer, reuse=None):
+    """Model that predicts grouping (grouping_actions).
+
+    Args:
+      input_layer: group_input_layer, of size [batch_size, num_ops, embedding_dim]
+      reuse: unused; variables are always fetched with reuse=True
+
+    Returns:
+      grouping_actions: actions
+      grouping_log_probs: log probabilities corresponding to actions
+    """
+    with variable_scope.variable_scope(self.hparams.name, reuse=True):
+      # input_layer: tensor of size [batch_size, num_ops, embedding_dim]
+      w_grouping_ff = variable_scope.get_variable("w_grouping_ff")
+      w_grouping_softmax = variable_scope.get_variable("w_grouping_softmax")
+
+    batch_size = array_ops.shape(input_layer)[0]
+    embedding_dim = array_ops.shape(input_layer)[2]
+
+    reshaped = array_ops.reshape(input_layer,
+                                 [batch_size * self.num_ops, embedding_dim])
+    ff_output = math_ops.matmul(reshaped, w_grouping_ff)
+    logits = math_ops.matmul(ff_output, w_grouping_softmax)
+    if self.hparams.logits_std_noise > 0:
+      num_in_logits = math_ops.cast(
+          array_ops.size(logits), dtype=dtypes.float32)
+      avg_norm = math_ops.divide(
+          linalg_ops.norm(logits), math_ops.sqrt(num_in_logits))
+      logits_noise = random_ops.random_normal(
+          array_ops.shape(logits),
+          stddev=self.hparams.logits_std_noise * avg_norm)
+      logits = control_flow_ops.cond(
+          self.global_step > self.hparams.stop_noise_step, lambda: logits,
+          lambda: logits + logits_noise)
+    logits = array_ops.reshape(logits,
+                               [batch_size * self.num_ops, self.num_groups])
+    actions = random_ops.multinomial(logits, 1, seed=self.hparams.seed)
+    actions = math_ops.to_int32(actions)
+    actions = array_ops.reshape(actions, [batch_size, self.num_ops])
+    action_label = array_ops.reshape(actions, [-1])
+    log_probs = nn_ops.sparse_softmax_cross_entropy_with_logits(
+        logits=logits, labels=action_label)
+    log_probs = array_ops.reshape(log_probs, [batch_size, -1])
+    log_probs = math_ops.reduce_sum(log_probs, 1)
+    grouping_actions = actions
+    grouping_log_probs = log_probs
+    return grouping_actions, grouping_log_probs
+
+  def create_group_embeddings(self, grouping_actions, verbose=False):
+    """Approximating the blocks of a TF graph from a graph_def.
+
+    Args:
+      grouping_actions: grouping predictions.
+      verbose: whether to print debug information.
+
+    Returns:
+      groups: array of per-child group embeddings, stacked along axis 0.
+    """
+    groups = [
+        self._create_group_embeddings(grouping_actions, i, verbose) for
+        i in range(self.hparams.num_children)
+    ]
+    return np.stack(groups, axis=0)
+
+  def _create_group_embeddings(self, grouping_actions, child_id, verbose=False):
+    """Approximating the blocks of a TF graph from a graph_def for each child.
+
+    Args:
+      grouping_actions: grouping predictions.
+      child_id: child_id for the group.
+      verbose: whether to print debug information.
+
+    Returns:
+      groups: group embedding for the child_id.
+    """
+    if verbose:
+      print("Processing input_graph")
+
+    # TODO(azalia): Build inter-adjacencies dag matrix.
+    # record dag_matrix
+    dag_matrix = np.zeros([self.num_groups, self.num_groups], dtype=np.float32)
+    for op in self.important_ops:
+      topo_op_index = self.name_to_topo_order_index[op.name]
+      group_index = grouping_actions[child_id][topo_op_index]
+      for output_op in self.get_node_fanout(op):
+        if output_op.name not in self.important_op_names:
+          continue
+        output_group_index = (
+            grouping_actions[child_id][self.name_to_topo_order_index[
+                output_op.name]])
+        dag_matrix[group_index, output_group_index] += 1.0
+    num_connections = np.sum(dag_matrix)
+    num_intra_group_connections = dag_matrix.trace()
+    num_inter_group_connections = num_connections - num_intra_group_connections
+    if verbose:
+      print("grouping evaluation metric")
+      print(("num_connections={} num_intra_group_connections={} "
+             "num_inter_group_connections={}").format(
+                 num_connections, num_intra_group_connections,
+                 num_inter_group_connections))
+    self.dag_matrix = dag_matrix
+
+    # output_shape
+    op_output_shapes = np.zeros(
+        [
+            len(self.important_ops),
+            self.hparams.max_num_outputs * self.hparams.max_output_size
+        ],
+        dtype=np.float32)
+
+    for idx, op in enumerate(self.important_ops):
+      for i, output_properties in enumerate(self.node_properties[op.name]):
+        if output_properties.shape.__str__() == "<unknown>":
+          continue
+        if i >= self.hparams.max_num_outputs:
+          break
+        shape = output_properties.shape
+        for j, dim in enumerate(shape.dim):
+          if dim.size > 0:
+            k = i * self.hparams.max_output_size + j
+            if k >= self.hparams.max_num_outputs * self.hparams.max_output_size:
+              break
+            op_output_shapes[idx, k] = dim.size
+
+    # group_embedding
+    group_embedding = np.zeros(
+        [
+            self.num_groups, len(self.type_dict) +
+            self.hparams.max_num_outputs * self.hparams.max_output_size
+        ],
+        dtype=np.float32)
+    for op_index, op in enumerate(self.important_ops):
+      group_index = grouping_actions[child_id][
+          self.name_to_topo_order_index[op.name]]
+      type_name = str(op.op)
+      type_index = self.type_dict[type_name]
+      group_embedding[group_index, type_index] += 1
+      group_embedding[group_index, :self.hparams.max_num_outputs * self.hparams.
+                      max_output_size] += (
+                          op_output_shapes[op_index])
+    grouping_adjacencies = np.concatenate(
+        [dag_matrix, np.transpose(dag_matrix)], axis=1)
+    group_embedding = np.concatenate(
+        [grouping_adjacencies, group_embedding], axis=1)
+    group_normalizer = np.amax(group_embedding, axis=1, keepdims=True)
+    group_embedding /= (group_normalizer + 1.0)
+    if verbose:
+      print("Finished Processing Input Graph")
+    return group_embedding
+
+  def get_placements(self, *args, **kwargs):
+    num_children = self.hparams.num_children
+    with variable_scope.variable_scope("controller_{}".format(self.ctrl_id)):
+      actions_cache = variable_scope.get_local_variable(
+          "actions_cache",
+          initializer=init_ops.zeros_initializer,
+          dtype=dtypes.int32,
+          shape=[num_children, self.num_groups],
+          trainable=False)
+
+    x = self.seq2seq_input_layer
+    last_c, last_h, attn_mem = self.encode(x)
+    actions, log_probs = {}, {}
+    actions["sample"], log_probs["sample"] = (
+        self.decode(
+            x, last_c, last_h, attn_mem, mode="sample"))
+    actions["target"], log_probs["target"] = (
+        self.decode(
+            x,
+            last_c,
+            last_h,
+            attn_mem,
+            mode="target",
+            y=actions_cache))
+    actions["greedy"], log_probs["greedy"] = (
+        self.decode(
+            x, last_c, last_h, attn_mem, mode="greedy"))
+    actions["sample"] = control_flow_ops.cond(
+        self.global_step < self.hparams.stop_sampling,
+        lambda: state_ops.assign(actions_cache, actions["sample"]),
+        lambda: state_ops.assign(actions_cache, actions["target"]))
+    self.actions_cache = actions_cache
+
+    return actions, log_probs
+
+  def encode(self, x):
+    """Encoder using LSTM.
+
+    Args:
+      x: tensor of size [num_children, num_groups, embedding_size]
+
+    Returns:
+      last_c, last_h: tensors of size [num_children, hidden_size], the final
+        LSTM states
+      attn_mem: tensor of size [num_children, num_groups, hidden_size], the
+        attention memory, i.e. the concatenation of all hidden states (one
+        per group), linearly transformed by the attention matrix
+        attn_w_1.
+    """
+    if self.hparams.bi_lstm:
+      with variable_scope.variable_scope(self.hparams.name, reuse=True):
+        w_lstm_forward = variable_scope.get_variable("encoder_lstm_forward")
+        w_lstm_backward = variable_scope.get_variable("encoder_lstm_backward")
+        forget_bias = variable_scope.get_variable("encoder_forget_bias")
+        attn_w_1 = variable_scope.get_variable("attn_w_1")
+    else:
+      with variable_scope.variable_scope(self.hparams.name, reuse=True):
+        w_lstm = variable_scope.get_variable("encoder_lstm")
+        forget_bias = variable_scope.get_variable("encoder_forget_bias")
+        attn_w_1 = variable_scope.get_variable("attn_w_1")
+
+    embedding_size = array_ops.shape(x)[2]
+
+    signals = array_ops.split(x, self.num_groups, axis=1)
+    for i in range(len(signals)):
+      signals[i] = array_ops.reshape(
+          signals[i], [self.hparams.num_children, embedding_size])
+
+    if self.hparams.bi_lstm:
+
+      def body(i, prev_c_forward, prev_h_forward, prev_c_backward,
+               prev_h_backward):
+        """while loop for LSTM."""
+        signal_forward = signals[i]
+        next_c_forward, next_h_forward = lstm(signal_forward, prev_c_forward,
+                                              prev_h_forward, w_lstm_forward,
+                                              forget_bias)
+
+        signal_backward = signals[self.num_groups - 1 - i]
+        next_c_backward, next_h_backward = lstm(
+            signal_backward, prev_c_backward, prev_h_backward, w_lstm_backward,
+            forget_bias)
+
+        next_h = array_ops.concat([next_h_forward, next_h_backward], axis=1)
+        all_h.append(next_h)
+
+        return (next_c_forward, next_h_forward, next_c_backward,
+                next_h_backward)
+
+      c_forward = array_ops.zeros(
+          [self.hparams.num_children, self.hparams.hidden_size // 2],
+          dtype=dtypes.float32)
+      h_forward = array_ops.zeros(
+          [self.hparams.num_children, self.hparams.hidden_size // 2],
+          dtype=dtypes.float32)
+
+      c_backward = array_ops.zeros(
+          [self.hparams.num_children, self.hparams.hidden_size // 2],
+          dtype=dtypes.float32)
+      h_backward = array_ops.zeros(
+          [self.hparams.num_children, self.hparams.hidden_size // 2],
+          dtype=dtypes.float32)
+      all_h = []
+
+      for i in range(0, self.num_groups):
+        c_forward, h_forward, c_backward, h_backward = body(
+            i, c_forward, h_forward, c_backward, h_backward)
+
+      last_c = array_ops.concat([c_forward, c_backward], axis=1)
+      last_h = array_ops.concat([h_forward, h_backward], axis=1)
+      attn_mem = array_ops.stack(all_h)
+
+    else:
+
+      def body(i, prev_c, prev_h):
+        signal = signals[i]
+        next_c, next_h = lstm(signal, prev_c, prev_h, w_lstm, forget_bias)
+        all_h.append(next_h)
+        return next_c, next_h
+
+      c = array_ops.zeros(
+          [self.hparams.num_children, self.hparams.hidden_size],
+          dtype=dtypes.float32)
+      h = array_ops.zeros(
+          [self.hparams.num_children, self.hparams.hidden_size],
+          dtype=dtypes.float32)
+      all_h = []
+
+      for i in range(0, self.num_groups):
+        c, h = body(i, c, h)
+
+      last_c = c
+      last_h = h
+      attn_mem = array_ops.stack(all_h)
+
+    attn_mem = array_ops.transpose(attn_mem, [1, 0, 2])
+    attn_mem = array_ops.reshape(
+        attn_mem,
+        [self.hparams.num_children * self.num_groups, self.hparams.hidden_size])
+    attn_mem = math_ops.matmul(attn_mem, attn_w_1)
+    attn_mem = array_ops.reshape(
+        attn_mem,
+        [self.hparams.num_children, self.num_groups, self.hparams.hidden_size])
+
+    return last_c, last_h, attn_mem
+
+  def decode(self,
+             x,
+             last_c,
+             last_h,
+             attn_mem,
+             mode="target",
+             y=None):
+    """Decoder using LSTM.
+
+    Args:
+      x: tensor of size [num_children, num_groups, embedding_size].
+      last_c: tensor of size [num_children, hidden_size], the final LSTM states
+          computed by self.encode.
+      last_h: same as last_c.
+      attn_mem: tensor of size [num_children, num_groups, hidden_size].
+      mode: "target", "sample", or "greedy".
+      y: tensor of size [num_children, num_groups], the device placements.
+
+    Returns:
+      actions: tensor of size [num_children, num_groups], device placements.
+      log_probs: tensor of size [num_children], summed over all groups.
+    """
+    with variable_scope.variable_scope(self.hparams.name, reuse=True):
+      w_lstm = variable_scope.get_variable("decoder_lstm")
+      forget_bias = variable_scope.get_variable("decoder_forget_bias")
+      device_embeddings = variable_scope.get_variable("device_embeddings")
+      device_softmax = variable_scope.get_variable("device_softmax")
+      device_go_embedding = variable_scope.get_variable("device_go_embedding")
+      attn_w_2 = variable_scope.get_variable("attn_w_2")
+      attn_v = variable_scope.get_variable("attn_v")
+
+    actions = tensor_array_ops.TensorArray(
+        dtypes.int32,
+        size=self.num_groups,
+        infer_shape=False,
+        clear_after_read=False)
+
+    # pylint: disable=unused-argument
+    def condition(i, *args):
+      return math_ops.less(i, self.num_groups)
+
+    # pylint: disable=missing-docstring
+    def body(i, prev_c, prev_h, actions, log_probs):
+      # pylint: disable=g-long-lambda
+      signal = control_flow_ops.cond(
+          math_ops.equal(i, 0),
+          lambda: array_ops.tile(device_go_embedding,
+                                 [self.hparams.num_children, 1]),
+          lambda: embedding_ops.embedding_lookup(device_embeddings,
+                                                 actions.read(i - 1))
+      )
+      if self.hparams.keep_prob is not None:
+        signal = nn_ops.dropout(signal, self.hparams.keep_prob)
+      next_c, next_h = lstm(signal, prev_c, prev_h, w_lstm, forget_bias)
+      query = math_ops.matmul(next_h, attn_w_2)
+      query = array_ops.reshape(
+          query, [self.hparams.num_children, 1, self.hparams.hidden_size])
+      query = math_ops.tanh(query + attn_mem)
+      query = array_ops.reshape(query, [
+          self.hparams.num_children * self.num_groups, self.hparams.hidden_size
+      ])
+      query = math_ops.matmul(query, attn_v)
+      query = array_ops.reshape(query,
+                                [self.hparams.num_children, self.num_groups])
+      query = nn_ops.softmax(query)
+      query = array_ops.reshape(query,
+                                [self.hparams.num_children, self.num_groups, 1])
+      query = math_ops.reduce_sum(attn_mem * query, axis=1)
+      query = array_ops.concat([next_h, query], axis=1)
+      logits = math_ops.matmul(query, device_softmax)
+      logits /= self.hparams.temperature
+      if self.hparams.tanh_constant > 0:
+        logits = math_ops.tanh(logits) * self.hparams.tanh_constant
+      if self.hparams.logits_std_noise > 0:
+        num_in_logits = math_ops.cast(
+            array_ops.size(logits), dtype=dtypes.float32)
+        avg_norm = math_ops.divide(
+            linalg_ops.norm(logits), math_ops.sqrt(num_in_logits))
+        logits_noise = random_ops.random_normal(
+            array_ops.shape(logits),
+            stddev=self.hparams.logits_std_noise * avg_norm)
+        logits = control_flow_ops.cond(
+            self.global_step > self.hparams.stop_noise_step, lambda: logits,
+            lambda: logits + logits_noise)
+
+      if mode == "sample":
+        next_y = random_ops.multinomial(logits, 1, seed=self.hparams.seed)
+      elif mode == "greedy":
+        next_y = math_ops.argmax(logits, 1)
+      elif mode == "target":
+        next_y = array_ops.slice(y, [0, i], [-1, 1])
+      else:
+        raise NotImplementedError
+      next_y = math_ops.to_int32(next_y)
+      next_y = array_ops.reshape(next_y, [self.hparams.num_children])
+      actions = actions.write(i, next_y)
+      log_probs += nn_ops.sparse_softmax_cross_entropy_with_logits(
+          logits=logits, labels=next_y)
+      return i + 1, next_c, next_h, actions, log_probs
+
+    loop_vars = [
+        constant_op.constant(0, dtype=dtypes.int32), last_c, last_h, actions,
+        array_ops.zeros([self.hparams.num_children], dtype=dtypes.float32)
+    ]
+    loop_outputs = control_flow_ops.while_loop(condition, body, loop_vars)
+
+    last_c = loop_outputs[-4]
+    last_h = loop_outputs[-3]
+    actions = loop_outputs[-2].stack()
+    actions = array_ops.transpose(actions, [1, 0])
+    log_probs = loop_outputs[-1]
+    return actions, log_probs
+
+  def eval_placement(self,
+                     sess,
+                     child_id=0,
+                     verbose=False):
+    grouping_actions, actions = sess.run([
+        self.grouping_actions_cache,
+        self.actions_cache
+    ])
+    grouping_actions = grouping_actions[child_id]
+    actions = actions[child_id]
+    if verbose:
+      global_step = sess.run(self.global_step)
+      if global_step % 100 == 0:
+        log_string = "op group assignments: "
+        for a in grouping_actions:
+          log_string += "{} ".format(a)
+        print(log_string[:-1])
+        log_string = "group device assignments: "
+        for a in actions:
+          log_string += "{} ".format(a)
+        print(log_string[:-1])
+
+    for op in self.important_ops:
+      topo_order_index = self.name_to_topo_order_index[op.name]
+      group_index = grouping_actions[topo_order_index]
+      op.device = self.devices[actions[group_index]].name
+    try:
+      _, run_time, _ = self.cluster.MeasureCosts(self.item)
+    except errors.ResourceExhaustedError:
+      run_time = self.hparams.failing_signal
+    return run_time
+
+  def update_reward(self,
+                    sess,
+                    run_time,
+                    child_id=0,
+                    verbose=False):
+    reward = self.compute_reward(run_time)
+    controller_ops = self.ops["controller"]
+    _, best_reward = sess.run(
+        [
+            controller_ops["reward"]["update"][child_id],
+            controller_ops["best_reward"]["update"][child_id]
+        ],
+        feed_dict={
+            controller_ops["reward"]["ph"][child_id]: reward,
+        })
+    if verbose:
+      print(("run_time={:<.5f} reward={:<.5f} "
+             "best_reward={:<.5f}").format(run_time, reward, best_reward))
+
+    # Reward is a double, best_reward a float: allow for some slack in the
+    # comparison.
+    updated = abs(best_reward - reward) < 1e-6
+    return updated
+
+  def generate_grouping(self, sess):
+    controller_ops = self.ops["controller"]
+    grouping_actions = sess.run(controller_ops["grouping_y_preds"]["sample"])
+    return grouping_actions
+
+  def generate_placement(self, grouping, sess):
+    controller_ops = self.ops["controller"]
+    feed_seq2seq_input_dict = {}
+    feed_seq2seq_input_dict[self.seq2seq_input_layer] = grouping
+    sess.run(
+        controller_ops["y_preds"]["sample"], feed_dict=feed_seq2seq_input_dict)
+
+  def process_reward(self, sess):
+    controller_ops = self.ops["controller"]
+    run_ops = [
+        controller_ops["loss"], controller_ops["lr"],
+        controller_ops["grad_norm"], controller_ops["grad_norms"],
+        controller_ops["train_op"]
+    ]
+    sess.run(run_ops)
+    sess.run(controller_ops["baseline_update"])
+
+  def _get_train_ops(self,
+                     loss,
+                     tf_variables,
+                     global_step,
+                     grad_bound=1.25,
+                     lr_init=1e-3,
+                     lr_dec=0.9,
+                     start_decay_step=10000,
+                     decay_steps=100,
+                     optimizer_type="adam"):
+    """Loss optimizer.
+
+    Args:
+      loss: scalar tf tensor
+      tf_variables: list of training variables, typically
+        tf.trainable_variables()
+      global_step: global_step
+      grad_bound: max gradient norm
+      lr_init: initial learning rate
+      lr_dec: learning rate decay coefficient
+      start_decay_step: start decaying learning rate after this many steps
+      decay_steps: apply the decay factor at intervals of this many steps
+      optimizer_type: optimizer type should be either adam or sgd
+
+    Returns:
+      train_op: training op
+      learning_rate: scalar learning rate tensor
+      grad_norm: l2 norm of the gradient vector
+      all_grad_norms: l2 norm of each component
+    """
+    lr_gstep = global_step - start_decay_step
+
+    def f1():
+      return constant_op.constant(lr_init)
+
+    def f2():
+      return learning_rate_decay.exponential_decay(lr_init, lr_gstep,
+                                                   decay_steps, lr_dec, True)
+
+    learning_rate = control_flow_ops.cond(
+        math_ops.less(global_step, start_decay_step),
+        f1,
+        f2,
+        name="learning_rate")
+
+    if optimizer_type == "adam":
+      opt = adam.AdamOptimizer(learning_rate)
+    elif optimizer_type == "sgd":
+      opt = gradient_descent.GradientDescentOptimizer(learning_rate)
+    grads_and_vars = opt.compute_gradients(loss, tf_variables)
+    grad_norm = clip_ops.global_norm([g for g, v in grads_and_vars])
+    all_grad_norms = {}
+    clipped_grads = []
+    clipped_rate = math_ops.maximum(grad_norm / grad_bound, 1.0)
+    for g, v in grads_and_vars:
+      if g is not None:
+        if isinstance(g, tf_ops.IndexedSlices):
+          clipped = g.values / clipped_rate
+          norm_square = math_ops.reduce_sum(clipped * clipped)
+          clipped = tf_ops.IndexedSlices(clipped, g.indices)
+        else:
+          clipped = g / clipped_rate
+          norm_square = math_ops.reduce_sum(clipped * clipped)
+        all_grad_norms[v.name] = math_ops.sqrt(norm_square)
+        clipped_grads.append((clipped, v))
+
+    train_op = opt.apply_gradients(clipped_grads, global_step)
+    return train_op, learning_rate, grad_norm, all_grad_norms
+
+
+def lstm(x, prev_c, prev_h, w_lstm, forget_bias):
+  """LSTM cell.
+
+  Args:
+    x: tensors of size [num_children, hidden_size].
+    prev_c: tensors of size [num_children, hidden_size].
+    prev_h: same as prev_c.
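+    w_lstm: weights applied to [x, prev_h], yielding the i, f, o, g gates.
+    forget_bias: bias added to the forget gate before the sigmoid.
+
+  Returns:
+    next_c: the next cell state.
+    next_h: the next hidden state.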
+  """
+  ifog = math_ops.matmul(array_ops.concat([x, prev_h], axis=1), w_lstm)
+  i, f, o, g = array_ops.split(ifog, 4, axis=1)
+  i = math_ops.sigmoid(i)
+  f = math_ops.sigmoid(f + forget_bias)
+  o = math_ops.sigmoid(o)
+  g = math_ops.tanh(g)
+  next_c = i * g + f * prev_c
+  next_h = o * math_ops.tanh(next_c)
+  return next_c, next_h
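Reviewer note on the `lstm` helper above: it computes all four gates with a single matmul over the concatenation of `x` and `prev_h`, then splits the result into `i, f, o, g`. The sketch below mirrors that arithmetic in plain NumPy so the gate layout can be checked without building a TF graph. It is only an illustration under stated assumptions: `lstm_cell_np` and `_sigmoid` are hypothetical names that are not part of this change, and `w_lstm` is assumed to have shape `[input_size + hidden_size, 4 * hidden_size]`.

```python
import numpy as np


def _sigmoid(z):
  # Hypothetical helper: elementwise logistic function.
  return 1.0 / (1.0 + np.exp(-z))


def lstm_cell_np(x, prev_c, prev_h, w_lstm, forget_bias):
  """NumPy mirror of the lstm() helper (illustrative sketch only).

  Assumes w_lstm has shape [input_size + hidden_size, 4 * hidden_size].
  """
  # One matmul produces the pre-activations of all four gates.
  ifog = np.concatenate([x, prev_h], axis=1) @ w_lstm
  i, f, o, g = np.split(ifog, 4, axis=1)
  i = _sigmoid(i)                # input gate
  f = _sigmoid(f + forget_bias)  # forget gate, shifted by forget_bias
  o = _sigmoid(o)                # output gate
  g = np.tanh(g)                 # candidate cell update
  next_c = i * g + f * prev_c
  next_h = o * np.tanh(next_c)
  return next_c, next_h


if __name__ == "__main__":
  num_children, input_size, hidden_size = 2, 3, 4
  rng = np.random.RandomState(0)
  x = rng.randn(num_children, input_size)
  c = np.zeros((num_children, hidden_size))
  h = np.zeros((num_children, hidden_size))
  w = rng.randn(input_size + hidden_size, 4 * hidden_size)
  c, h = lstm_cell_np(x, c, h, w, forget_bias=1.0)
  print(c.shape, h.shape)
```

With all-zero weights every gate pre-activation is zero, so the cell state is scaled only by the forget gate; that makes a convenient sanity check for the gate ordering.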
diff --git a/tensorflow/python/grappler/item.i b/tensorflow/python/grappler/item.i
index d0fc1a04f220e0a053257e0206bb07b25f3767c6..9a84c60b04029a64ed35a01f045a6eec5e492504 100644
--- a/tensorflow/python/grappler/item.i
+++ b/tensorflow/python/grappler/item.i
@@ -96,10 +96,10 @@ static GItem TF_NewItem(
   return GItem(item.release());
 }
 
-static std::vector<string> TF_IdentifyImportantOps(GItem item, bool sort_topologically,
+static PyObject* TF_IdentifyImportantOps(GItem item, bool sort_topologically,
                                                    TF_Status* status) {
   if (item.is_none()) {
-    return {};
+    Py_RETURN_NONE;
   }
 
   std::vector<const tensorflow::NodeDef*> main_ops = item->MainOpsFanin();
@@ -132,7 +132,13 @@ static std::vector<string> TF_IdentifyImportantOps(GItem item, bool sort_topolog
     }
   }
 
-  return ops;
+  PyGILState_STATE gstate = PyGILState_Ensure();
+  PyObject* result = PyList_New(ops.size());
+  for (int i = 0; i < ops.size(); ++i) {
+    PyList_SetItem(result, i, PyString_FromString(ops[i].c_str()));
+  }
+  PyGILState_Release(gstate);
+  return result;
 }
 
 static PyObject* TF_GetOpProperties(GItem item) {
@@ -305,7 +311,7 @@ static PyObject* TF_GetColocationGroups(GItem item) {
 static GItem TF_NewItem(
     const tensorflow::MetaGraphDef& meta_graph, bool ignore_colocation,
     bool ignore_user_placement, TF_Status* out_status);
-static std::vector<string> TF_IdentifyImportantOps(GItem item, bool sort_topologically,
-                                                   TF_Status* status);
+static PyObject* TF_IdentifyImportantOps(GItem item, bool sort_topologically,
+                                         TF_Status* status);
 static PyObject* TF_GetOpProperties(GItem item);
 static PyObject* TF_GetColocationGroups(GItem item);
diff --git a/tensorflow/python/grappler/item_test.py b/tensorflow/python/grappler/item_test.py
index cd70e2fdecc74f9d99240ac566f3c28e900a06c2..7c3efd6249cbdaa2675632f7fc8e25fb88658a24 100644
--- a/tensorflow/python/grappler/item_test.py
+++ b/tensorflow/python/grappler/item_test.py
@@ -56,7 +56,7 @@ class ItemTest(test.TestCase):
       mg = meta_graph.create_meta_graph_def(graph=g)
       grappler_item = item.Item(mg)
       op_list = grappler_item.IdentifyImportantOps()
-      self.assertItemsEqual([b'Const', b'Const_1', b'add'], op_list)
+      self.assertItemsEqual(['Const', 'Const_1', 'add'], op_list)
 
   def testOpProperties(self):
     with ops.Graph().as_default() as g:
diff --git a/tensorflow/python/grappler/layout_optimizer_test.py b/tensorflow/python/grappler/layout_optimizer_test.py
index 0f5150174049250e86bbac0a49eb998339058326..5a84b16a23f567fba6d08aaefd3b816a76907735 100644
--- a/tensorflow/python/grappler/layout_optimizer_test.py
+++ b/tensorflow/python/grappler/layout_optimizer_test.py
@@ -321,7 +321,7 @@ class LayoutOptimizerTest(test.TestCase):
       conv = _two_layer_model(x)
       dim = array_ops.placeholder(dtype='int32')
       sizes = constant_op.constant([50, 10, 4], shape=[3])
-      split = gen_array_ops._split_v(
+      split = gen_array_ops.split_v(
           value=conv, size_splits=sizes, axis=dim, num_split=3)
       output = math_ops.reduce_sum(split[0])
 
@@ -896,7 +896,7 @@ class LayoutOptimizerTest(test.TestCase):
       add = math_ops.add(conv, conv)
       mean = math_ops.reduce_mean(conv)
       condition = math_ops.less(conv, mean)
-      select = gen_math_ops._select(condition, conv, add)
+      select = gen_math_ops.select(condition, conv, add)
       output = array_ops.identity(select)
 
       with session.Session(config=_get_config(False)) as sess:
@@ -926,7 +926,7 @@ class LayoutOptimizerTest(test.TestCase):
       conv = _two_layer_model(x)
       add = math_ops.add(conv, conv)
       condition = array_ops.placeholder(dtype='bool')
-      select = gen_math_ops._select(condition, conv, add)
+      select = gen_math_ops.select(condition, conv, add)
       output = array_ops.identity(select)
 
       condition_val = np.zeros((1, 7, 7, 64))
@@ -957,7 +957,7 @@ class LayoutOptimizerTest(test.TestCase):
       conv = _two_layer_model(x)
       add = math_ops.add(conv, conv)
       condition = constant_op.constant(True)
-      select = gen_math_ops._select(condition, conv, add)
+      select = gen_math_ops.select(condition, conv, add)
       output = array_ops.identity(select)
 
       with session.Session(config=_get_config(False)) as sess:
@@ -1023,7 +1023,7 @@ class LayoutOptimizerTest(test.TestCase):
       conv = _two_layer_model(x)
       ksize = constant_op.constant([1, 2, 3, 1], shape=[4])
       strides = array_ops.placeholder(dtype='int32', shape=[4])
-      max_pool = gen_nn_ops._max_pool_v2(conv, ksize, strides, 'VALID')
+      max_pool = gen_nn_ops.max_pool_v2(conv, ksize, strides, 'VALID')
       output = array_ops.identity(max_pool)
 
       strides_val = [1, 3, 2, 1]
diff --git a/tensorflow/python/grappler/memory_optimizer_test.py b/tensorflow/python/grappler/memory_optimizer_test.py
index 948911f099674af4c6dd19bfdac75e5fc1f75c78..4df959ce04169395589aeebaef9e3e7839e2300c 100644
--- a/tensorflow/python/grappler/memory_optimizer_test.py
+++ b/tensorflow/python/grappler/memory_optimizer_test.py
@@ -162,7 +162,8 @@ class MemoryOptimizerRecomputeTest(test.TestCase):
             arithmetic_optimization=rewriter_config_pb2.RewriterConfig.OFF,
             memory_optimization=rewriter_config_pb2.RewriterConfig.
             RECOMPUTATION_HEURISTICS,
-            memory_optimizer_target_node_name_prefix='optimizer/gradients/'),
+            # Checks that name scope "gradients/" also matches sub-scopes.
+            memory_optimizer_target_node_name_scope='gradients/'),
         original_metagraph)
     self.assertGreater(
         len(rewritten_graph_def.node),
@@ -176,6 +177,35 @@ class MemoryOptimizerRecomputeTest(test.TestCase):
         len([node for node in rewritten_graph_def.node
              if 'Recomputed/' in node.name]))
 
+  def testRewritingNameScopedGradientNamesScope(self):
+    """Tests that no rewriting occurs when the name scope does not match."""
+    (original_metagraph, _, _,
+     _) = self._GetMetaGraph(optimizer_scope_name='foo/bar')
+    rewritten_graph_def = tf_optimizer.OptimizeGraph(
+        rewriter_config_pb2.RewriterConfig(
+            disable_model_pruning=True,
+            constant_folding=rewriter_config_pb2.RewriterConfig.OFF,
+            dependency_optimization=rewriter_config_pb2.RewriterConfig.OFF,
+            layout_optimizer=rewriter_config_pb2.RewriterConfig.OFF,
+            arithmetic_optimization=rewriter_config_pb2.RewriterConfig.OFF,
+            memory_optimization=rewriter_config_pb2.RewriterConfig.
+            RECOMPUTATION_HEURISTICS,
+            # This should not match anything.
+            memory_optimizer_target_node_name_scope='r/gradients/'),
+        original_metagraph)
+    self.assertEqual(
+        len(rewritten_graph_def.node), len(original_metagraph.graph_def.node))
+    self.assertEqual(0,
+                     len([
+                         node for node in original_metagraph.graph_def.node
+                         if 'Recomputed/' in node.name
+                     ]))
+    self.assertEqual(0,
+                     len([
+                         node for node in rewritten_graph_def.node
+                         if 'Recomputed/' in node.name
+                     ]))
+
   def _GetMemoryOptimizerSessionConfig(self):
     rewrite_options = rewriter_config_pb2.RewriterConfig(
         disable_model_pruning=True,
diff --git a/tensorflow/python/grappler/tf_optimizer.i b/tensorflow/python/grappler/tf_optimizer.i
index 1b657983a4690dd0ddb7f569ce514b08cb10400a..de9326ccfc1653c2afd0833dcdca2cc4bfdabed5 100644
--- a/tensorflow/python/grappler/tf_optimizer.i
+++ b/tensorflow/python/grappler/tf_optimizer.i
@@ -100,6 +100,7 @@ PyObject* TF_OptimizeGraph(
     tensorflow::grappler::ItemConfig item_config;
     item_config.inline_functions = false;
     item_config.apply_optimizations = false;
+    item_config.ignore_user_placement = false;
     std::unique_ptr<tensorflow::grappler::GrapplerItem> grappler_item =
         tensorflow::grappler::GrapplerItemFromMetaGraphDef(graph_id, metagraph, item_config);
 
diff --git a/tensorflow/python/grappler/tf_optimizer_test.py b/tensorflow/python/grappler/tf_optimizer_test.py
index 55dcbe2071f74204e0bbdd141560f33cefdf174d..3ee4d7807ea5677a742514eb56267b94c6b92bba 100644
--- a/tensorflow/python/grappler/tf_optimizer_test.py
+++ b/tensorflow/python/grappler/tf_optimizer_test.py
@@ -24,6 +24,7 @@ from tensorflow.python.framework import meta_graph
 from tensorflow.python.framework import ops
 from tensorflow.python.grappler import tf_optimizer
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 
 
@@ -48,6 +49,31 @@ class PyWrapOptimizeGraphTest(test.TestCase):
     self.assertEqual(len(graph.node), 1)
     self.assertItemsEqual([node.name for node in graph.node], ['d'])
 
+  def testKeepNodes(self):
+    g = ops.Graph()
+    with g.as_default():
+      a1 = variables.Variable(
+          1.0)  # Must be preserved since it's in the collection 'variables'.
+      a2 = constant_op.constant(0, shape=[50, 50], name='keep')
+      ops.add_to_collection('a2', a2)  # Explicitly add to collection.
+      b = constant_op.constant(1, shape=[100, 10])
+      c = constant_op.constant(0, shape=[10, 30])
+      d = math_ops.matmul(b, c)
+      ops.add_to_collection('train_op', d)  # d is the fetch node.
+
+    # Optimize the graph.
+    mg = meta_graph.create_meta_graph_def(graph=g)
+    rewriter_config = rewriter_config_pb2.RewriterConfig()
+    optimized_graph = tf_optimizer.OptimizeGraph(rewriter_config, mg)
+
+    # Check that the nodes referenced in various collections have been preserved.
+    self.assertEqual(len(optimized_graph.node), 5)
+    self.assertEqual(d.op.name, optimized_graph.node[0].name)
+    self.assertEqual(a1.op.name, optimized_graph.node[1].name)
+    self.assertEqual('Variable/initial_value', optimized_graph.node[2].name)
+    self.assertEqual(a2.op.name, optimized_graph.node[3].name)
+    self.assertEqual('Variable/Assign', optimized_graph.node[4].name)
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/keras/BUILD b/tensorflow/python/keras/BUILD
index 1956478f39a6b5d4dea720436d9b87f66ca20426..bd1aac5eae7271f08c6092491d65bf9fba96aef7 100755
--- a/tensorflow/python/keras/BUILD
+++ b/tensorflow/python/keras/BUILD
@@ -39,9 +39,16 @@ py_library(
         "_impl/keras/datasets/mnist.py",
         "_impl/keras/datasets/reuters.py",
         "_impl/keras/engine/__init__.py",
-        "_impl/keras/engine/topology.py",
+        "_impl/keras/engine/base_layer.py",
+        "_impl/keras/engine/input_layer.py",
+        "_impl/keras/engine/network.py",
+        "_impl/keras/engine/saving.py",
+        "_impl/keras/engine/sequential.py",
         "_impl/keras/engine/training.py",
+        "_impl/keras/engine/training_arrays.py",
         "_impl/keras/engine/training_eager.py",
+        "_impl/keras/engine/training_generator.py",
+        "_impl/keras/engine/training_utils.py",
         "_impl/keras/estimator.py",
         "_impl/keras/initializers.py",
         "_impl/keras/layers/__init__.py",
@@ -74,8 +81,8 @@ py_library(
         "_impl/keras/utils/generic_utils.py",
         "_impl/keras/utils/io_utils.py",
         "_impl/keras/utils/layer_utils.py",
+        "_impl/keras/utils/multi_gpu_utils.py",
         "_impl/keras/utils/np_utils.py",
-        "_impl/keras/utils/training_utils.py",
         "_impl/keras/utils/vis_utils.py",
         "_impl/keras/wrappers/__init__.py",
         "_impl/keras/wrappers/scikit_learn.py",
@@ -256,6 +263,11 @@ py_test(
     size = "small",
     srcs = ["_impl/keras/metrics_test.py"],
     srcs_version = "PY2AND3",
+    tags = [
+        "manual",
+        "no_oss",
+        "notap",
+    ],
     deps = [
         ":keras",
         "//tensorflow/python:client_testlib",
@@ -637,9 +649,9 @@ py_test(
 )
 
 py_test(
-    name = "training_utils_test",
+    name = "multi_gpu_utils_test",
     size = "medium",
-    srcs = ["_impl/keras/utils/training_utils_test.py"],
+    srcs = ["_impl/keras/utils/multi_gpu_utils_test.py"],
     srcs_version = "PY2AND3",
     tags = ["multi_gpu"],
     deps = [
@@ -756,9 +768,31 @@ py_test(
     srcs_version = "PY2AND3",
     deps = [
         ":keras",
-        "//tensorflow/python:array_ops",
         "//tensorflow/python:client_testlib",
-        "//tensorflow/python:dtypes",
+        "//third_party/py/numpy",
+    ],
+)
+
+py_test(
+    name = "saving_test",
+    size = "small",
+    srcs = ["_impl/keras/engine/saving_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":keras",
+        "//tensorflow/python:client_testlib",
+        "//third_party/py/numpy",
+    ],
+)
+
+py_test(
+    name = "sequential_test",
+    size = "small",
+    srcs = ["_impl/keras/engine/sequential_test.py"],
+    srcs_version = "PY2AND3",
+    deps = [
+        ":keras",
+        "//tensorflow/python:client_testlib",
         "//third_party/py/numpy",
     ],
 )
@@ -779,7 +813,7 @@ py_test(
 
 py_test(
     name = "estimator_test",
-    size = "medium",
+    size = "large",
     srcs = ["_impl/keras/estimator_test.py"],
     srcs_version = "PY2AND3",
     tags = ["notsan"],
diff --git a/tensorflow/python/keras/_impl/keras/applications/densenet.py b/tensorflow/python/keras/_impl/keras/applications/densenet.py
index 6521f8410435fd13393b9991d3ee9a6342a912d0..ca83e8691237216e799f2ca738dcb6822506e2cb 100644
--- a/tensorflow/python/keras/_impl/keras/applications/densenet.py
+++ b/tensorflow/python/keras/_impl/keras/applications/densenet.py
@@ -31,7 +31,7 @@ from tensorflow.python.keras._impl.keras import backend as K
 from tensorflow.python.keras._impl.keras.applications import imagenet_utils
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import _obtain_input_shape
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import decode_predictions
-from tensorflow.python.keras._impl.keras.engine.topology import get_source_inputs
+from tensorflow.python.keras._impl.keras.engine.network import get_source_inputs
 from tensorflow.python.keras._impl.keras.layers import Activation
 from tensorflow.python.keras._impl.keras.layers import AveragePooling2D
 from tensorflow.python.keras._impl.keras.layers import BatchNormalization
diff --git a/tensorflow/python/keras/_impl/keras/applications/inception_resnet_v2.py b/tensorflow/python/keras/_impl/keras/applications/inception_resnet_v2.py
index bf3901fc54419c2b401bf9c4d6311b39a18f1aba..17e407dd58460e6d6802a3e137a96faf38a6f576 100644
--- a/tensorflow/python/keras/_impl/keras/applications/inception_resnet_v2.py
+++ b/tensorflow/python/keras/_impl/keras/applications/inception_resnet_v2.py
@@ -31,7 +31,7 @@ from tensorflow.python.keras._impl.keras import backend as K
 from tensorflow.python.keras._impl.keras.applications import imagenet_utils
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import _obtain_input_shape
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import decode_predictions
-from tensorflow.python.keras._impl.keras.engine.topology import get_source_inputs
+from tensorflow.python.keras._impl.keras.engine.network import get_source_inputs
 from tensorflow.python.keras._impl.keras.layers import Activation
 from tensorflow.python.keras._impl.keras.layers import AveragePooling2D
 from tensorflow.python.keras._impl.keras.layers import BatchNormalization
diff --git a/tensorflow/python/keras/_impl/keras/applications/inception_v3.py b/tensorflow/python/keras/_impl/keras/applications/inception_v3.py
index e268e97bc663773a218f01b958b08f8e43c74ee2..2897c6058eb445ceacc34084b53dc89f556e3e9c 100644
--- a/tensorflow/python/keras/_impl/keras/applications/inception_v3.py
+++ b/tensorflow/python/keras/_impl/keras/applications/inception_v3.py
@@ -37,7 +37,7 @@ from tensorflow.python.keras._impl.keras import layers
 from tensorflow.python.keras._impl.keras.applications import imagenet_utils
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import _obtain_input_shape
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import decode_predictions
-from tensorflow.python.keras._impl.keras.engine.topology import get_source_inputs
+from tensorflow.python.keras._impl.keras.engine.network import get_source_inputs
 from tensorflow.python.keras._impl.keras.layers import Activation
 from tensorflow.python.keras._impl.keras.layers import AveragePooling2D
 from tensorflow.python.keras._impl.keras.layers import BatchNormalization
diff --git a/tensorflow/python/keras/_impl/keras/applications/mobilenet.py b/tensorflow/python/keras/_impl/keras/applications/mobilenet.py
index 1bbbedb85e47902b9e6d3dd741e9d52ab9209080..ad96b53a4528d99a014a0214b52a78d6a60076f8 100644
--- a/tensorflow/python/keras/_impl/keras/applications/mobilenet.py
+++ b/tensorflow/python/keras/_impl/keras/applications/mobilenet.py
@@ -79,8 +79,8 @@ from tensorflow.python.keras._impl.keras.applications import imagenet_utils
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import _obtain_input_shape
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import decode_predictions
 from tensorflow.python.keras._impl.keras.engine import InputSpec
-from tensorflow.python.keras._impl.keras.engine.topology import get_source_inputs
-from tensorflow.python.keras._impl.keras.engine.topology import shape_type_conversion
+from tensorflow.python.keras._impl.keras.engine.base_layer import shape_type_conversion
+from tensorflow.python.keras._impl.keras.engine.network import get_source_inputs
 from tensorflow.python.keras._impl.keras.layers import Activation
 from tensorflow.python.keras._impl.keras.layers import BatchNormalization
 from tensorflow.python.keras._impl.keras.layers import Conv2D
diff --git a/tensorflow/python/keras/_impl/keras/applications/nasnet.py b/tensorflow/python/keras/_impl/keras/applications/nasnet.py
index 08dae57f006c64021cbca26404770cd89b1ce176..dd33230a7eb9272f8fc60daee63e1f92574cf5e3 100644
--- a/tensorflow/python/keras/_impl/keras/applications/nasnet.py
+++ b/tensorflow/python/keras/_impl/keras/applications/nasnet.py
@@ -49,7 +49,7 @@ from tensorflow.python.keras._impl.keras import backend as K
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import _obtain_input_shape
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import decode_predictions
 from tensorflow.python.keras._impl.keras.applications.inception_v3 import preprocess_input
-from tensorflow.python.keras._impl.keras.engine.topology import get_source_inputs
+from tensorflow.python.keras._impl.keras.engine.network import get_source_inputs
 from tensorflow.python.keras._impl.keras.layers import Activation
 from tensorflow.python.keras._impl.keras.layers import add
 from tensorflow.python.keras._impl.keras.layers import AveragePooling2D
diff --git a/tensorflow/python/keras/_impl/keras/applications/resnet50.py b/tensorflow/python/keras/_impl/keras/applications/resnet50.py
index a47dd657bb9ea0627d82831b7ee5d0b33788b5b7..46c0e635578c7f4707b027247943d75b16d703ad 100644
--- a/tensorflow/python/keras/_impl/keras/applications/resnet50.py
+++ b/tensorflow/python/keras/_impl/keras/applications/resnet50.py
@@ -34,7 +34,7 @@ from tensorflow.python.keras._impl.keras import layers
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import _obtain_input_shape
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import decode_predictions
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import preprocess_input
-from tensorflow.python.keras._impl.keras.engine.topology import get_source_inputs
+from tensorflow.python.keras._impl.keras.engine.network import get_source_inputs
 from tensorflow.python.keras._impl.keras.layers import Activation
 from tensorflow.python.keras._impl.keras.layers import AveragePooling2D
 from tensorflow.python.keras._impl.keras.layers import BatchNormalization
diff --git a/tensorflow/python/keras/_impl/keras/applications/vgg16.py b/tensorflow/python/keras/_impl/keras/applications/vgg16.py
index 9da74253abc2124844ab89b7727ddda4f754d8e2..cefb25063e30505c9c34b49fd2df6eb7210d7ca8 100644
--- a/tensorflow/python/keras/_impl/keras/applications/vgg16.py
+++ b/tensorflow/python/keras/_impl/keras/applications/vgg16.py
@@ -32,7 +32,7 @@ from tensorflow.python.keras._impl.keras import backend as K
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import _obtain_input_shape
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import decode_predictions
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import preprocess_input
-from tensorflow.python.keras._impl.keras.engine.topology import get_source_inputs
+from tensorflow.python.keras._impl.keras.engine.network import get_source_inputs
 from tensorflow.python.keras._impl.keras.layers import Conv2D
 from tensorflow.python.keras._impl.keras.layers import Dense
 from tensorflow.python.keras._impl.keras.layers import Flatten
diff --git a/tensorflow/python/keras/_impl/keras/applications/vgg19.py b/tensorflow/python/keras/_impl/keras/applications/vgg19.py
index 961c1f991893dbc0df858e9f72b61202c9fee500..dadaf4fdf0cc5922752c6867720c5d8cdbcab19a 100644
--- a/tensorflow/python/keras/_impl/keras/applications/vgg19.py
+++ b/tensorflow/python/keras/_impl/keras/applications/vgg19.py
@@ -32,7 +32,7 @@ from tensorflow.python.keras._impl.keras import backend as K
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import _obtain_input_shape
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import decode_predictions
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import preprocess_input
-from tensorflow.python.keras._impl.keras.engine.topology import get_source_inputs
+from tensorflow.python.keras._impl.keras.engine.network import get_source_inputs
 from tensorflow.python.keras._impl.keras.layers import Conv2D
 from tensorflow.python.keras._impl.keras.layers import Dense
 from tensorflow.python.keras._impl.keras.layers import Flatten
diff --git a/tensorflow/python/keras/_impl/keras/applications/xception.py b/tensorflow/python/keras/_impl/keras/applications/xception.py
index 7e7ca5a18a31622ac79d61ab01ce65341a4a46c5..971063a16d1f5ba0e25189f1ef2f6c24eb5f8d61 100644
--- a/tensorflow/python/keras/_impl/keras/applications/xception.py
+++ b/tensorflow/python/keras/_impl/keras/applications/xception.py
@@ -44,7 +44,7 @@ from tensorflow.python.keras._impl.keras import layers
 from tensorflow.python.keras._impl.keras.applications import imagenet_utils
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import _obtain_input_shape
 from tensorflow.python.keras._impl.keras.applications.imagenet_utils import decode_predictions
-from tensorflow.python.keras._impl.keras.engine.topology import get_source_inputs
+from tensorflow.python.keras._impl.keras.engine.network import get_source_inputs
 from tensorflow.python.keras._impl.keras.layers import Activation
 from tensorflow.python.keras._impl.keras.layers import BatchNormalization
 from tensorflow.python.keras._impl.keras.layers import Conv2D
diff --git a/tensorflow/python/keras/_impl/keras/backend.py b/tensorflow/python/keras/_impl/keras/backend.py
index a238a3f7483685ca0b08c746b55bbc9e868cb2d3..2b75666b9e61baea635a312c005fdbd955f6cab6 100644
--- a/tensorflow/python/keras/_impl/keras/backend.py
+++ b/tensorflow/python/keras/_impl/keras/backend.py
@@ -2749,7 +2749,7 @@ class Function(object):
       self.updates_op = control_flow_ops.group(*updates_ops)
     self.name = name
     # additional tensor substitutions
-    self.feed_dict = session_kwargs.pop('feed_dict', {})
+    self.feed_dict = session_kwargs.pop('feed_dict', None)
     # additional operations
     self.fetches = session_kwargs.pop('fetches', [])
     if not isinstance(self.fetches, list):
@@ -2759,8 +2759,15 @@ class Function(object):
   def __call__(self, inputs):
     if not isinstance(inputs, (list, tuple)):
       raise TypeError('`inputs` should be a list or tuple.')
-    feed_dict = self.feed_dict.copy()
+
+    if self.feed_dict:
+      feed_dict = self.feed_dict.copy()
+    else:
+      feed_dict = {}
+
     for tensor, value in zip(self.inputs, inputs):
+      if value is None:
+        continue
       if is_sparse(tensor):
         sparse_coo = value.tocoo()
         indices = np.concatenate((np.expand_dims(sparse_coo.row, 1),
@@ -3087,7 +3094,8 @@ def rnn(step_function,
   outputs_shape[1] = inputs_shape[1]
   outputs.set_shape(outputs_shape)
 
-  last_output._uses_learning_phase = uses_learning_phase
+  if not context.in_eager_mode():
+    last_output._uses_learning_phase = uses_learning_phase
   return last_output, outputs, new_states
 
 
diff --git a/tensorflow/python/keras/_impl/keras/callbacks.py b/tensorflow/python/keras/_impl/keras/callbacks.py
index f6c466142522927135d66f73f9f5c697671649ec..deb1e8867dba3d52816ebda02bd9a3bf2ec7bc09 100644
--- a/tensorflow/python/keras/_impl/keras/callbacks.py
+++ b/tensorflow/python/keras/_impl/keras/callbacks.py
@@ -778,16 +778,24 @@ class TensorBoard(Callback):
         while i < val_size:
           step = min(self.batch_size, val_size - i)
           batch_val = []
-          batch_val.append(val_data[0][i:i + step])
-          batch_val.append(val_data[1][i:i + step])
-          batch_val.append(val_data[2][i:i + step])
+          batch_val.append(val_data[0][i:i + step]
+                           if val_data[0] is not None else None)
+          batch_val.append(val_data[1][i:i + step]
+                           if val_data[1] is not None else None)
+          batch_val.append(val_data[2][i:i + step]
+                           if val_data[2] is not None else None)
           if self.model.uses_learning_phase:
             # do not slice the learning phase
-            batch_val = [x[i:i + step] for x in val_data[:-1]]
+            batch_val = [x[i:i + step] if x is not None else None
+                         for x in val_data[:-1]]
             batch_val.append(val_data[-1])
           else:
-            batch_val = [x[i:i + step] for x in val_data]
-          feed_dict = dict(zip(tensors, batch_val))
+            batch_val = [x[i:i + step] if x is not None else None
+                         for x in val_data]
+          feed_dict = {}
+          for key, val in zip(tensors, batch_val):
+            if val is not None:
+              feed_dict[key] = val
           result = self.sess.run([self.merged], feed_dict=feed_dict)
           summary_str = result[0]
           self.writer.add_summary(summary_str, epoch)
diff --git a/tensorflow/python/keras/_impl/keras/engine/__init__.py b/tensorflow/python/keras/_impl/keras/engine/__init__.py
index 31f624f9af65cac60b6466d4eb5753cbdee984c6..1bc533ab8f7ba37948d82bc69fe1c9bfe00d6834 100644
--- a/tensorflow/python/keras/_impl/keras/engine/__init__.py
+++ b/tensorflow/python/keras/_impl/keras/engine/__init__.py
@@ -18,13 +18,10 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-from tensorflow.python.keras._impl.keras.engine.topology import get_source_inputs
-from tensorflow.python.keras._impl.keras.engine.topology import Input
-from tensorflow.python.keras._impl.keras.engine.topology import InputLayer
-from tensorflow.python.keras._impl.keras.engine.topology import InputSpec
-from tensorflow.python.keras._impl.keras.engine.topology import Layer
+from tensorflow.python.keras._impl.keras.engine.base_layer import InputSpec
+from tensorflow.python.keras._impl.keras.engine.base_layer import Layer
+from tensorflow.python.keras._impl.keras.engine.input_layer import Input
+from tensorflow.python.keras._impl.keras.engine.input_layer import InputLayer
+from tensorflow.python.keras._impl.keras.engine.network import get_source_inputs
+from tensorflow.python.keras._impl.keras.engine.network import Network
 from tensorflow.python.keras._impl.keras.engine.training import Model
-
-
-# Note: topology.Node is an internal class,
-# it isn't meant to be used by Keras users.
diff --git a/tensorflow/python/keras/_impl/keras/engine/base_layer.py b/tensorflow/python/keras/_impl/keras/engine/base_layer.py
new file mode 100644
index 0000000000000000000000000000000000000000..7f215f5645c250af75008e447aa1c779e3ace2c0
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/base_layer.py
@@ -0,0 +1,505 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# pylint: disable=protected-access
+"""Base layer code (`Layer`).
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from six.moves import zip  # pylint: disable=redefined-builtin
+
+from tensorflow.python.eager import context
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras import constraints
+from tensorflow.python.keras._impl.keras import initializers
+from tensorflow.python.keras._impl.keras import regularizers
+from tensorflow.python.keras._impl.keras.utils import generic_utils
+from tensorflow.python.layers import base as tf_base_layers
+from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.util.tf_export import tf_export
+
+
+# pylint: disable=invalid-name
+InputSpec = tf_base_layers.InputSpec
+Node = tf_base_layers.Node
+TFBaseLayer = tf_base_layers.Layer
+# pylint: enable=invalid-name
+
+
+@tf_export('keras.layers.Layer')
+class Layer(tf_base_layers.Layer):
+  """Abstract base layer class.
+
+  # Properties
+      name: String, must be unique within a model.
+      input_spec: List of InputSpec class instances;
+          each entry describes one required input:
+              - ndim
+              - dtype
+          A layer with `n` input tensors must have
+          an `input_spec` of length `n`.
+      trainable: Boolean, whether the layer weights
+          will be updated during training.
+      uses_learning_phase: Whether any operation
+          of the layer uses `K.in_training_phase()`
+          or `K.in_test_phase()`.
+      input_shape: Shape tuple. Provided for convenience,
+          but note that there may be cases in which this
+          attribute is ill-defined (e.g. a shared layer
+          with multiple input shapes), in which case
+          requesting `input_shape` will raise an Exception.
+          Prefer using
+          `layer.get_input_shape_at(node_index)`.
+      output_shape: Shape tuple. See above.
+      inbound_nodes: List of nodes.
+      outbound_nodes: List of nodes.
+      input, output: Input/output tensor(s). Note that if the layer is used
+          more than once (shared layer), this is ill-defined
+          and will raise an exception. In such cases, use
+          `layer.get_input_at(node_index)`.
+      input_mask, output_mask: Same as above, for masks.
+      trainable_weights: List of variables.
+      non_trainable_weights: List of variables.
+      weights: The concatenation of the lists trainable_weights and
+          non_trainable_weights (in this order).
+
+  # Methods
+      call(x, mask=None): Where the layer's logic lives.
+      __call__(x, mask=None): Wrapper around the layer logic (`call`).
+          If x is a Keras tensor:
+              - Connect current layer with last layer from tensor:
+                  `self._add_inbound_node(last_layer)`
+              - Add layer to tensor history
+          If layer is not built:
+              - Build from inputs shape
+      get_weights()
+      set_weights(weights)
+      get_config()
+      count_params()
+      compute_output_shape(input_shape)
+      compute_mask(x, mask)
+      get_input_at(node_index)
+      get_output_at(node_index)
+      get_input_shape_at(node_index)
+      get_output_shape_at(node_index)
+      get_input_mask_at(node_index)
+      get_output_mask_at(node_index)
+
+  # Class Methods
+      from_config(config)
+
+  # Internal methods:
+      build(input_shape)
+      _add_inbound_node(layer, index=0)
+  """
+
+  def __init__(self, **kwargs):
+    # These properties should be set by the user via keyword arguments.
+    # Note that 'dtype', 'input_shape' and 'batch_input_shape'
+    # are only applicable to input layers: do not pass these keywords
+    # to non-input layers.
+    allowed_kwargs = {
+        'activity_regularizer',
+        'input_shape',
+        'batch_input_shape',
+        'batch_size',
+        'dtype',
+        'name',
+        'trainable',
+        'weights',
+    }
+    # Validate optional keyword arguments.
+    for kwarg in kwargs:
+      if kwarg not in allowed_kwargs:
+        raise TypeError('Keyword argument not understood: ' + str(kwarg))
+
+    # Get layer name.
+    name = kwargs.get('name')
+
+    # Get `trainable` status.
+    trainable = kwargs.get('trainable', True)
+
+    # Get `dtype`.
+    dtype = kwargs.get('dtype')
+    if dtype is None:
+      dtype = K.floatx()
+
+    # Call super, which will set all properties common to Keras layers
+    # and core TF layers.
+    super(Layer, self).__init__(
+        name=name, dtype=dtype, trainable=trainable,
+        activity_regularizer=kwargs.get('activity_regularizer'))
+
+    # Add properties that are Keras-only for now.
+    self.supports_masking = False
+
+    # Manage input shape information if passed.
+    if 'input_shape' in kwargs or 'batch_input_shape' in kwargs:
+      # In this case we will later create an input layer
+      # to insert before the current layer
+      if 'batch_input_shape' in kwargs:
+        batch_input_shape = tuple(kwargs['batch_input_shape'])
+      elif 'input_shape' in kwargs:
+        if 'batch_size' in kwargs:
+          batch_size = kwargs['batch_size']
+        else:
+          batch_size = None
+        batch_input_shape = (batch_size,) + tuple(kwargs['input_shape'])
+      self._batch_input_shape = batch_input_shape
+
+    # Manage initial weight values if passed.
+    if 'weights' in kwargs:
+      self._initial_weights = kwargs['weights']
+    else:
+      self._initial_weights = None
+
+  def add_weight(self,
+                 name,
+                 shape,
+                 dtype=None,
+                 initializer=None,
+                 regularizer=None,
+                 trainable=True,
+                 constraint=None):
+    """Adds a weight variable to the layer.
+
+    Arguments:
+        name: String, the name for the weight variable.
+        shape: The shape tuple of the weight.
+        dtype: The dtype of the weight.
+        initializer: An Initializer instance (callable).
+        regularizer: An optional Regularizer instance.
+        trainable: A boolean, whether the weight should
+            be trained via backprop or not (assuming
+            that the layer itself is also trainable).
+        constraint: An optional Constraint instance.
+
+    Returns:
+        The created weight variable.
+    """
+    if dtype is None:
+      dtype = K.floatx()
+    weight = self.add_variable(name, shape,
+                               dtype=dtype,
+                               initializer=initializers.get(initializer),
+                               regularizer=regularizers.get(regularizer),
+                               constraint=constraints.get(constraint),
+                               trainable=trainable)
+    return weight
+
+  def call(self, inputs, **kwargs):  # pylint: disable=unused-argument
+    """This is where the layer's logic lives.
+
+    Arguments:
+        inputs: Input tensor, or list/tuple of input tensors.
+        **kwargs: Additional keyword arguments.
+
+    Returns:
+        A tensor or list/tuple of tensors.
+    """
+    return inputs
+
+  def __call__(self, inputs, **kwargs):
+    """Wrapper around self.call(), for handling internal references.
+
+    If a Keras tensor is passed:
+        - We call self._add_inbound_node().
+        - If necessary, we `build` the layer to match
+            the shape of the input(s).
+        - We update the _keras_history of the output tensor(s)
+            with the current layer.
+            This is done as part of _add_inbound_node().
+
+    Arguments:
+        inputs: Can be a tensor or list/tuple of tensors.
+        **kwargs: Additional keyword arguments to be passed to `call()`.
+
+    Returns:
+        Output of the layer's `call` method.
+
+    Raises:
+        ValueError: in case the layer is missing shape information
+            for its `build` call.
+    """
+    # Actually call the layer (optionally building it).
+    output = super(Layer, self).__call__(inputs, **kwargs)
+    if context.in_eager_mode():
+      return output
+
+    if hasattr(self, '_symbolic_set_inputs') and not self.inputs:
+      # Subclassed network: explicitly set metadata normally set by a call to
+      # self._set_inputs().
+      self._symbolic_set_inputs(inputs, output)
+
+    # Update learning phase info.
+    output_tensors = generic_utils.to_list(output)
+    uses_lp = any(
+        [getattr(x, '_uses_learning_phase', False)
+         for x in generic_utils.to_list(inputs)])
+    uses_lp = getattr(self, 'uses_learning_phase', False) or uses_lp
+    for i in range(len(output_tensors)):
+      output_tensors[i]._uses_learning_phase = getattr(
+          output_tensors[i], '_uses_learning_phase', False) or uses_lp
+
+    # Optionally load weight values that were specified at layer instantiation.
+    if hasattr(self, '_initial_weights') and self._initial_weights is not None:
+      self.set_weights(self._initial_weights)
+      del self._initial_weights
+    return output
+
+  def compute_output_shape(self, input_shape):
+    """Computes the output shape of the layer.
+
+    Assumes that the layer will be built
+    to match the input shape provided.
+
+    Arguments:
+        input_shape: Shape tuple (tuple of integers)
+            or list of shape tuples (one per input tensor of the layer).
+            Shape tuples can include None for free dimensions,
+            instead of an integer.
+
+    Returns:
+        An output shape tuple.
+    """
+    logging.warning(
+        'All custom layers should implement the '
+        '`compute_output_shape` method. This layer (' + self.name + ') '
+        'is relying on the base `Layer.compute_output_shape` implementation, '
+        'which will start raising a `NotImplementedError` '
+        'as of July 1st, 2018.')
+    return input_shape
+
+  def compute_mask(self, inputs, mask=None):  # pylint: disable=unused-argument
+    """Computes an output mask tensor.
+
+    Arguments:
+        inputs: Tensor or list of tensors.
+        mask: Tensor or list of tensors.
+
+    Returns:
+        None or a tensor (or list of tensors,
+            one per output tensor of the layer).
+    """
+    if not self.supports_masking:
+      if mask is not None:
+        if isinstance(mask, list):
+          if any(m is not None for m in mask):
+            raise TypeError('Layer ' + self.name + ' does not support masking, '
+                            'but was passed an input_mask: ' + str(mask))
+        else:
+          raise TypeError('Layer ' + self.name + ' does not support masking, '
+                          'but was passed an input_mask: ' + str(mask))
+      # masking not explicitly supported: return None as mask
+      return None
+    # if masking is explicitly supported, by default
+    # carry over the input mask
+    return mask
+
+  def get_input_mask_at(self, node_index):
+    """Retrieves the input mask tensor(s) of a layer at a given node.
+
+    Arguments:
+        node_index: Integer, index of the node
+            from which to retrieve the attribute.
+            E.g. `node_index=0` will correspond to the
+            first time the layer was called.
+
+    Returns:
+        A mask tensor
+        (or list of tensors if the layer has multiple inputs).
+    """
+    inputs = self.get_input_at(node_index)
+    if isinstance(inputs, list):
+      return [getattr(x, '_keras_mask', None) for x in inputs]
+    else:
+      return getattr(inputs, '_keras_mask', None)
+
+  def get_output_mask_at(self, node_index):
+    """Retrieves the output mask tensor(s) of a layer at a given node.
+
+    Arguments:
+        node_index: Integer, index of the node
+            from which to retrieve the attribute.
+            E.g. `node_index=0` will correspond to the
+            first time the layer was called.
+
+    Returns:
+        A mask tensor
+        (or list of tensors if the layer has multiple outputs).
+    """
+    output = self.get_output_at(node_index)
+    if isinstance(output, list):
+      return [getattr(x, '_keras_mask', None) for x in output]
+    else:
+      return getattr(output, '_keras_mask', None)
+
+  @property
+  def input_mask(self):
+    """Retrieves the input mask tensor(s) of a layer.
+
+    Only applicable if the layer has exactly one inbound node,
+    i.e. if it is connected to one incoming layer.
+
+    Returns:
+        Input mask tensor (potentially None) or list of input
+        mask tensors.
+
+    Raises:
+        AttributeError: if the layer is connected to
+        more than one incoming layer.
+    """
+    inputs = self.input
+    if isinstance(inputs, list):
+      return [getattr(x, '_keras_mask', None) for x in inputs]
+    else:
+      return getattr(inputs, '_keras_mask', None)
+
+  @property
+  def output_mask(self):
+    """Retrieves the output mask tensor(s) of a layer.
+
+    Only applicable if the layer has exactly one inbound node,
+    i.e. if it is connected to one incoming layer.
+
+    Returns:
+        Output mask tensor (potentially None) or list of output
+        mask tensors.
+
+    Raises:
+        AttributeError: if the layer is connected to
+        more than one incoming layer.
+    """
+    output = self.output
+    if isinstance(output, list):
+      return [getattr(x, '_keras_mask', None) for x in output]
+    else:
+      return getattr(output, '_keras_mask', None)
+
+  def set_weights(self, weights):
+    """Sets the weights of the layer, from Numpy arrays.
+
+    Arguments:
+        weights: a list of Numpy arrays. The number
+            of arrays and their shapes must match
+            the number and shapes of the weights
+            of the layer (i.e. it should match the
+            output of `get_weights`).
+
+    Raises:
+        ValueError: If the provided weights list does not match the
+            layer's specifications.
+    """
+    params = self.weights
+    if len(params) != len(weights):
+      raise ValueError('You called `set_weights(weights)` on layer "' +
+                       self.name + '" with a weight list of length ' +
+                       str(len(weights)) + ', but the layer was expecting ' +
+                       str(len(params)) + ' weights. Provided weights: ' +
+                       str(weights)[:50] + '...')
+    if not params:
+      return
+    weight_value_tuples = []
+    param_values = K.batch_get_value(params)
+    for pv, p, w in zip(param_values, params, weights):
+      if pv.shape != w.shape:
+        raise ValueError('Layer weight shape ' + str(pv.shape) +
+                         ' not compatible with '
+                         'provided weight shape ' + str(w.shape))
+      weight_value_tuples.append((p, w))
+    K.batch_set_value(weight_value_tuples)
+
+  def get_weights(self):
+    """Returns the current weights of the layer.
+
+    Returns:
+        Weight values as a list of Numpy arrays.
+    """
+    params = self.weights
+    return K.batch_get_value(params)
+
+  def get_config(self):
+    """Returns the config of the layer.
+
+    A layer config is a Python dictionary (serializable)
+    containing the configuration of a layer.
+    The same layer can be reinstantiated later
+    (without its trained weights) from this configuration.
+
+    The config of a layer does not include connectivity
+    information, nor the layer class name. These are handled
+    by `Network` (one layer of abstraction above).
+
+    Returns:
+        Python dictionary.
+    """
+    config = {'name': self.name, 'trainable': self.trainable}
+    if hasattr(self, '_batch_input_shape'):
+      config['batch_input_shape'] = self._batch_input_shape
+    if hasattr(self, 'dtype'):
+      config['dtype'] = self.dtype
+    return config
+
+  @classmethod
+  def from_config(cls, config):
+    """Creates a layer from its config.
+
+    This method is the reverse of `get_config`,
+    capable of instantiating the same layer from the config
+    dictionary. It does not handle layer connectivity
+    (handled by Network), nor weights (handled by `set_weights`).
+
+    Arguments:
+        config: A Python dictionary, typically the
+            output of `get_config`.
+
+    Returns:
+        A layer instance.
+    """
+    return cls(**config)
+
+  @tf_base_layers.Layer.activity_regularizer.setter
+  def activity_regularizer(self, activity_regularizer):
+    self._activity_regularizer = activity_regularizer
+
+
+def shape_type_conversion(fn):
+  """Decorator that handles tuple/TensorShape conversion.
+
+  Used in `compute_output_shape` and `build`.
+
+  Arguments:
+    fn: function to wrap.
+
+  Returns:
+    Wrapped function.
+  """
+
+  def wrapper(instance, input_shape):
+    if input_shape is not None:
+      if isinstance(input_shape, list):
+        input_shape = [
+            tuple(tensor_shape.TensorShape(x).as_list()) for x in input_shape]
+      else:
+        input_shape = tuple(tensor_shape.TensorShape(input_shape).as_list())
+    output_shape = fn(instance, input_shape)
+    if output_shape is not None:
+      if isinstance(output_shape, list):
+        return [tensor_shape.TensorShape(x) for x in output_shape]
+      return tensor_shape.TensorShape(output_shape)
+
+  return wrapper
diff --git a/tensorflow/python/keras/_impl/keras/engine/input_layer.py b/tensorflow/python/keras/_impl/keras/engine/input_layer.py
new file mode 100644
index 0000000000000000000000000000000000000000..8f9ea6f7a40e49ec45dfaeb14f807cd9c7db65c9
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/input_layer.py
@@ -0,0 +1,230 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# pylint: disable=protected-access
+"""Input layer code (`Input` and `InputLayer`).
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.eager import context
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras.engine import base_layer
+from tensorflow.python.layers import base as tf_base_layers
+from tensorflow.python.ops import array_ops
+from tensorflow.python.util.tf_export import tf_export
+
+
+class InputLayer(base_layer.Layer):
+  """Layer to be used as an entry point into a Network (a graph of layers).
+
+  It can either wrap an existing tensor (pass an `input_tensor` argument)
+  or create a placeholder tensor (pass arguments `input_shape`, and
+  optionally, `dtype`).
+
+  It is generally recommended to use the functional layer API via `Input`,
+  (which creates an `InputLayer`) without directly using `InputLayer`.
+
+  Arguments:
+      input_shape: Shape tuple (not including the batch axis), or `TensorShape`
+        instance (not including the batch axis).
+      batch_size: Optional input batch size (integer or None).
+      dtype: Datatype of the input.
+      input_tensor: Optional tensor to use as layer input
+          instead of creating a placeholder.
+      sparse: Boolean, whether the placeholder created
+          is meant to be sparse.
+      name: Name of the layer (string).
+  """
+
+  def __init__(self,
+               input_shape=None,
+               batch_size=None,
+               dtype=None,
+               input_tensor=None,
+               sparse=False,
+               name=None,
+               **kwargs):
+    if 'batch_input_shape' in kwargs:
+      batch_input_shape = kwargs.pop('batch_input_shape')
+      if input_shape and batch_input_shape:
+        raise ValueError('Only provide the input_shape OR '
+                         'batch_input_shape argument to '
+                         'InputLayer, not both at the same time.')
+      batch_size = batch_input_shape[0]
+      input_shape = batch_input_shape[1:]
+    if kwargs:
+      raise ValueError('Unrecognized keyword arguments:', kwargs.keys())
+
+    if not name:
+      prefix = 'input'
+      name = prefix + '_' + str(K.get_uid(prefix))
+
+    if not dtype:
+      if input_tensor is None:
+        dtype = K.floatx()
+      else:
+        dtype = K.dtype(input_tensor)
+    super(InputLayer, self).__init__(dtype=dtype, name=name)
+    self.built = True
+    self.sparse = sparse
+    self.batch_size = batch_size
+
+    if isinstance(input_shape, tensor_shape.TensorShape):
+      input_shape = tuple(input_shape.as_list())
+
+    if input_tensor is None:
+      if input_shape is not None:
+        batch_input_shape = (batch_size,) + tuple(input_shape)
+      else:
+        batch_input_shape = None
+
+      if context.in_eager_mode():
+        # In eager mode, create a temporary placeholder to call the layer on.
+        input_tensor = tf_base_layers._DeferredTensor(  # pylint: disable=protected-access
+            shape=batch_input_shape,
+            dtype=dtype,
+            name=self.name)
+      else:
+        # In graph mode, create a graph placeholder to call the layer on.
+        if sparse:
+          input_tensor = array_ops.sparse_placeholder(
+              shape=batch_input_shape,
+              dtype=dtype,
+              name=self.name)
+        else:
+          input_tensor = array_ops.placeholder(
+              shape=batch_input_shape,
+              dtype=dtype,
+              name=self.name)
+
+      # For compatibility with Keras API.
+      self.is_placeholder = True
+      self._batch_input_shape = batch_input_shape
+    else:
+      # For compatibility with Keras API.
+      self.is_placeholder = False
+      self._batch_input_shape = tuple(input_tensor.get_shape().as_list())
+
+    # Create an input node to add to self._inbound_nodes
+    # and set output_tensors' _keras_history.
+    input_tensor._keras_history = (self, 0, 0)  # pylint: disable=protected-access
+    tf_base_layers.Node(
+        self,
+        inbound_layers=[],
+        node_indices=[],
+        tensor_indices=[],
+        input_tensors=[input_tensor],
+        output_tensors=[input_tensor])
+
+  def get_config(self):
+    config = {
+        'batch_input_shape': self._batch_input_shape,
+        'dtype': self.dtype,
+        'sparse': self.sparse,
+        'name': self.name
+    }
+    return config
+
+
+@tf_export('keras.layers.Input', 'keras.Input')
+def Input(  # pylint: disable=invalid-name
+    shape=None,
+    batch_size=None,
+    name=None,
+    dtype=None,
+    sparse=False,
+    tensor=None,
+    **kwargs):
+  """`Input()` is used to instantiate a Keras tensor.
+
+  A Keras tensor is a tensor object from the underlying backend
+  (Theano or TensorFlow), which we augment with certain
+  attributes that allow us to build a Keras model
+  just by knowing the inputs and outputs of the model.
+
+  For instance, if a, b and c are Keras tensors,
+  it becomes possible to do:
+  `model = Model(inputs=[a, b], outputs=c)`
+
+  The added Keras attribute is:
+      `_keras_history`: Last layer applied to the tensor.
+          The entire layer graph is retrievable from that layer,
+          recursively.
+
+  Arguments:
+      shape: A shape tuple (integers), not including the batch size.
+          For instance, `shape=(32,)` indicates that the expected input
+          will be batches of 32-dimensional vectors.
+      batch_size: optional static batch size (integer).
+      name: An optional name string for the layer.
+          Should be unique in a model (do not reuse the same name twice).
+          It will be autogenerated if it isn't provided.
+      dtype: The data type expected by the input, as a string
+          (`float32`, `float64`, `int32`...)
+      sparse: A boolean specifying whether the placeholder
+          to be created is sparse.
+      tensor: Optional existing tensor to wrap into the `Input` layer.
+          If set, the layer will not create a placeholder tensor.
+      **kwargs: deprecated arguments support.
+
+  Returns:
+      A tensor.
+
+  Example:
+
+      ```python
+      # this is a logistic regression in Keras
+      x = Input(shape=(32,))
+      y = Dense(16, activation='softmax')(x)
+      model = Model(x, y)
+      ```
+
+  Raises:
+    ValueError: in case of invalid arguments.
+  """
+  if 'batch_shape' in kwargs:
+    batch_shape = kwargs.pop('batch_shape')
+    if shape and batch_shape:
+      raise ValueError('Only provide the shape OR '
+                       'batch_shape argument to '
+                       'Input, not both at the same time.')
+    batch_size = batch_shape[0]
+    shape = batch_shape[1:]
+  if kwargs:
+    raise ValueError('Unrecognized keyword arguments:', kwargs.keys())
+
+  if dtype is None:
+    dtype = K.floatx()
+  if not shape and tensor is None:
+    raise ValueError('Please provide to Input either a `shape`'
+                     ' or a `tensor` argument. Note that '
+                     '`shape` does not include the batch '
+                     'dimension.')
+  input_layer = InputLayer(
+      input_shape=shape,
+      batch_size=batch_size,
+      name=name,
+      dtype=dtype,
+      sparse=sparse,
+      input_tensor=tensor)
+  # Return tensor including `_keras_history`.
+  # Note that in this case train_output and test_output are the same pointer.
+  outputs = input_layer._inbound_nodes[0].output_tensors
+  if len(outputs) == 1:
+    return outputs[0]
+  else:
+    return outputs
diff --git a/tensorflow/python/keras/_impl/keras/engine/network.py b/tensorflow/python/keras/_impl/keras/engine/network.py
new file mode 100644
index 0000000000000000000000000000000000000000..e47bba9267dd2b6a7394778cd9c55dc7b74e047f
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/network.py
@@ -0,0 +1,1509 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# pylint: disable=protected-access
+"""A `Network` is a way to compose layers: the topological form of a `Model`.
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import copy
+import json
+import os
+
+import numpy as np
+from six.moves import zip  # pylint: disable=redefined-builtin
+
+from tensorflow.python.eager import context
+from tensorflow.python.framework import ops
+from tensorflow.python.framework import tensor_shape
+from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras.engine import base_layer
+from tensorflow.python.keras._impl.keras.engine import saving
+from tensorflow.python.keras._impl.keras.utils import generic_utils
+from tensorflow.python.keras._impl.keras.utils.io_utils import ask_to_proceed_with_overwrite
+from tensorflow.python.keras._impl.keras.utils.layer_utils import print_summary as print_layer_summary
+from tensorflow.python.layers import base as tf_base_layers
+from tensorflow.python.layers import utils as tf_layers_util
+from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpointable
+from tensorflow.python.util import nest
+from tensorflow.python.util import tf_inspect
+
+
+# pylint: disable=g-import-not-at-top
+try:
+  import h5py
+except ImportError:
+  h5py = None
+
+try:
+  import yaml
+except ImportError:
+  yaml = None
+# pylint: enable=g-import-not-at-top
+
+
+class Network(base_layer.Layer):
+  """A `Network` is a composition of layers.
+
+  It is the topological form of a "model". A `Model`
+  is simply a `Network` with added training routines.
+  """
+
+  def __init__(self, *args, **kwargs):  # pylint: disable=super-init-not-called
+    # Signature detection
+    if (len(args) == 2 or
+        len(args) == 1 and 'outputs' in kwargs or
+        'inputs' in kwargs and 'outputs' in kwargs):
+      # Graph network
+      self._init_graph_network(*args, **kwargs)
+    else:
+      # Subclassed network
+      self._init_subclassed_network(**kwargs)
+
+  def _base_init(self, name=None):
+    # The following are implemented as property functions:
+    # self.trainable_weights
+    # self.non_trainable_weights
+    # self.input_spec
+    # self.losses
+    # self.updates
+
+    self._init_set_name(name)
+    self._activity_regularizer = None
+    # This acts just like the `trainable` attribute of any layer instance.
+    # It does not affect users of the underlying layers, only users of the
+    # Network instance.
+    self.trainable = True
+    self._is_compiled = False
+    self._expects_training_arg = False
+
+    self.supports_masking = False
+    self.optimizer = None
+
+    # Private attributes to implement compatibility with Layer.
+    self._updates = []  # Used in symbolic mode only.
+    self._losses = []   # Used in symbolic mode only.
+    self._scope = None  # Never used.
+    self._reuse = None  # Never used.
+    if context.in_eager_mode():
+      self._graph = None
+    else:
+      self._graph = ops.get_default_graph()  # Used in symbolic mode only.
+    # A Network does not create weights of its own, thus has no dtype.
+    self._dtype = None
+
+    # All layers in order of horizontal graph traversal.
+    # Entries are unique. Includes input and output layers.
+    self._layers = []
+
+    # Used in symbolic mode only, only in conjunction with graph-networks
+    self._outbound_nodes = []
+    self._inbound_nodes = []
+
+  def _init_graph_network(self, inputs, outputs, name=None):
+    # Normalize and set self.inputs, self.outputs.
+    if isinstance(inputs, (list, tuple)):
+      self.inputs = list(inputs)  # Tensor or list of tensors.
+    else:
+      self.inputs = [inputs]
+    if isinstance(outputs, (list, tuple)):
+      self.outputs = list(outputs)
+    else:
+      self.outputs = [outputs]
+
+    # User-provided argument validation.
+    if context.in_eager_mode():
+      # Check that all inputs/outputs are DeferredTensors.
+      for tensor in self.inputs:
+        if not isinstance(tensor, tf_base_layers._DeferredTensor):  # pylint: disable=protected-access
+          raise TypeError('When eager execution is enabled, '
+                          'inputs must come from a call to '
+                          '`tf.keras.Input` (called after '
+                          'tfe.enable_eager_execution()). '
+                          'Received invalid input: ' + str(tensor))
+      for tensor in self.outputs:
+        if not isinstance(tensor, tf_base_layers._DeferredTensor):  # pylint: disable=protected-access
+          raise TypeError('When eager execution is enabled, '
+                          'outputs must come from a call to '
+                          'a layer (called after '
+                          'tfe.enable_eager_execution()). '
+                          'Received invalid output: ' + str(tensor))
+    # Check for redundancy in inputs.
+    if len(set(self.inputs)) != len(self.inputs):
+      raise ValueError('The list of inputs passed to the model '
+                       'is redundant. '
+                       'All inputs should only appear once.'
+                       ' Found: ' + str(self.inputs))
+    for x in self.inputs:
+      # Check that x has appropriate `_keras_history` metadata.
+      if not hasattr(x, '_keras_history'):
+        cls_name = self.__class__.__name__
+        raise ValueError('Input tensors to a ' + cls_name + ' ' +
+                         'must come from `tf.layers.Input`. '
+                         'Received: ' + str(x) +
+                         ' (missing previous layer metadata).')
+      # Check that x is an input tensor.
+      # pylint: disable=protected-access
+      layer, node_index, tensor_index = x._keras_history
+      if len(layer._inbound_nodes) > 1 or (
+          layer._inbound_nodes and layer._inbound_nodes[0].inbound_layers):
+        cls_name = self.__class__.__name__
+        logging.warning(cls_name + ' inputs must come from '
+                        '`tf.layers.Input` (thus holding past layer metadata), '
+                        'they cannot be the output of '
+                        'a previous non-Input layer. '
+                        'Here, a tensor specified as '
+                        'input to "' + self.name + '" was not an Input tensor, '
+                        'it was generated by layer ' + layer.name + '.\n'
+                        'Note that input tensors are '
+                        'instantiated via `tensor = tf.layers.Input(shape)`.\n'
+                        'The tensor that caused the issue was: ' + str(x.name))
+    for x in self.outputs:
+      if not hasattr(x, '_keras_history'):
+        cls_name = self.__class__.__name__
+        raise ValueError('Output tensors to a ' + cls_name + ' must be '
+                         'the output of a TensorFlow `Layer` '
+                         '(thus holding past layer metadata). Found: ' + str(x))
+
+    self._base_init(name=name)
+    self._compute_previous_mask = (
+        'mask' in tf_inspect.getargspec(self.call).args or
+        hasattr(self, 'compute_mask'))
+    # A Network does not create weights of its own, thus it is already
+    # built.
+    self.built = True
+    self._is_graph_network = True
+
+    # # List of initial layers (1 to 1 mapping with self.inputs,
+    # # hence the same layer might appear twice)
+    # self._input_layers = []
+    # self._input_layers_node_indices = []
+    # self._input_layers_tensor_indices = []
+    # # list of layers (1 to 1 mapping with self.outputs,
+    # # hence the same layer might appear twice)
+    # self._output_layers = []
+    # self._output_layers_node_indices = []
+    # self._output_layers_tensor_indices = []
+
+    self._input_layers = []
+    self._output_layers = []
+    self._input_coordinates = []
+    self._output_coordinates = []
+
+    # This is for performance optimization when calling the Network on new
+    # inputs. Every time the Network is called on a set of input tensors,
+    # we compute the output tensors, output masks and output shapes in one pass,
+    # then cache them here. When any of these outputs is queried later, we
+    # retrieve it from there instead of recomputing it.
+    self._output_mask_cache = {}
+    self._output_tensor_cache = {}
+    self._output_shape_cache = {}
+
+    # Build self._output_layers:
+    for x in self.outputs:
+      layer, node_index, tensor_index = x._keras_history  # pylint: disable=protected-access
+      self._output_layers.append(layer)
+      self._output_coordinates.append((layer, node_index, tensor_index))
+
+    # Build self._input_layers:
+    for x in self.inputs:
+      layer, node_index, tensor_index = x._keras_history  # pylint: disable=protected-access
+      # It's supposed to be an input layer, so only one node
+      # and one tensor output.
+      assert node_index == 0
+      assert tensor_index == 0
+      self._input_layers.append(layer)
+      self._input_coordinates.append((layer, node_index, tensor_index))
+
+    # Keep track of the network's nodes and layers.
+    nodes, nodes_by_depth, layers, layers_by_depth = _map_graph_network(
+        self.inputs, self.outputs)
+    self._network_nodes = nodes
+    self._nodes_by_depth = nodes_by_depth
+    self._layers = layers
+    self._layers_by_depth = layers_by_depth
+
+    # Create the node linking internal inputs to internal outputs.
+    tf_base_layers.Node(
+        outbound_layer=self,
+        inbound_layers=[],
+        node_indices=[],
+        tensor_indices=[],
+        input_tensors=self.inputs,
+        output_tensors=self.outputs)
+
+    # Fill in the output mask cache.
+    masks = []
+    for x in self.inputs:
+      mask = x._keras_mask if hasattr(x, '_keras_mask') else None  # pylint: disable=protected-access
+      masks.append(mask)
+    mask_cache_key = (tf_layers_util.object_list_uid(self.inputs) + '_' +
+                      tf_layers_util.object_list_uid(masks))
+    masks = []
+    for x in self.outputs:
+      mask = x._keras_mask if hasattr(x, '_keras_mask') else None  # pylint: disable=protected-access
+      masks.append(mask)
+    if len(masks) == 1:
+      mask = masks[0]
+    else:
+      mask = masks
+    self._output_mask_cache[mask_cache_key] = mask
+
+    # Build self.input_names and self.output_names.
+    self.input_names = []
+    self.output_names = []
+    self._feed_input_names = []
+    self._feed_inputs = []
+    self._feed_input_shapes = []
+    for i, layer in enumerate(self._input_layers):
+      self.input_names.append(layer.name)
+      if layer.is_placeholder:
+        self._feed_input_names.append(layer.name)
+        self._feed_input_shapes.append(K.int_shape(self.inputs[i]))
+        # layer.input gives an error in eager mode
+        if context.in_graph_mode():
+          self._feed_inputs.append(layer.input)
+    for layer in self._output_layers:
+      self.output_names.append(layer.name)
+
+  def _init_subclassed_network(self, name=None):
+    self._base_init(name=name)
+    self._is_graph_network = False
+    if 'training' in tf_inspect.getargspec(self.call).args:
+      self._expects_training_arg = True
+    else:
+      self._expects_training_arg = False
+
+    self.outputs = None
+    self.inputs = None
+    self.built = False
+
+  def __setattr__(self, name, value):
+    if isinstance(value, (tf_base_layers.Layer, Network)):
+      try:
+        is_graph_network = self._is_graph_network
+      except AttributeError:
+        raise RuntimeError('It looks like you are subclassing `Model` and you '
+                           'forgot to call `super(YourClass, self).__init__()`.'
+                           ' Always start with this line.')
+      if not is_graph_network:
+        if value not in self._layers:
+          self._layers.append(value)
+    if isinstance(value, checkpointable.CheckpointableBase):
+      # Layer (and therefore Network/Model) inherits from CheckpointableBase
+      # rather than Checkpointable, which means there is no Checkpointable
+      # __setattr__ override (it would be a performance issue for functional
+      # layers). Therefore Model tracks Checkpointable objects itself.
+      self._track_checkpointable(
+          checkpointable=value, name=name, overwrite=True)
+    super(Network, self).__setattr__(name, value)
+
+  def add_variable(self, name, shape, dtype=None, initializer=None,
+                   regularizer=None, trainable=True, constraint=None):
+    raise NotImplementedError('`add_variable` is not supported on Networks.')
+
+  def add_loss(self, *args, **kwargs):
+    if context.in_eager_mode():
+      raise NotImplementedError('`add_loss` is not supported on Networks '
+                                'when eager execution is enabled.')
+    super(Network, self).add_loss(*args, **kwargs)
+
+  @property
+  def uses_learning_phase(self):
+    return any(
+        [getattr(x, '_uses_learning_phase', False) for x in self.outputs])
+
+  @property
+  def stateful(self):
+    return any([(hasattr(layer, 'stateful') and layer.stateful)
+                for layer in self.layers])
+
+  def reset_states(self):
+    for layer in self.layers:
+      if hasattr(layer, 'reset_states') and getattr(layer, 'stateful', False):
+        layer.reset_states()
+
+  @property
+  def state_updates(self):
+    """Returns the `updates` from all layers that are stateful.
+
+    This is useful for separating training updates and
+    state updates, e.g. when we need to update a layer's internal state
+    during prediction.
+
+    Returns:
+        A list of update ops.
+    """
+    state_updates = []
+    for layer in self.layers:
+      if getattr(layer, 'stateful', False):
+        if hasattr(layer, 'updates'):
+          state_updates += layer.updates
+    return state_updates
+
+  def get_weights(self):
+    """Retrieves the weights of the model.
+
+    Returns:
+        A flat list of Numpy arrays.
+    """
+    weights = []
+    for layer in self.layers:
+      weights += layer.weights
+    return K.batch_get_value(weights)
+
+  def set_weights(self, weights):
+    """Sets the weights of the model.
+
+    Arguments:
+        weights: A list of Numpy arrays with shapes and types matching
+            the output of `model.get_weights()`.
+    """
+    tuples = []
+    for layer in self.layers:
+      num_param = len(layer.weights)
+      layer_weights = weights[:num_param]
+      for sw, w in zip(layer.weights, layer_weights):
+        tuples.append((sw, w))
+      weights = weights[num_param:]
+    K.batch_set_value(tuples)
+
+  def compute_mask(self, inputs, mask):
+    if not self._is_graph_network:
+      return None
+
+    inputs = generic_utils.to_list(inputs)
+    if mask is None:
+      masks = [None for _ in range(len(inputs))]
+    else:
+      masks = generic_utils.to_list(mask)
+    cache_key = (tf_layers_util.object_list_uid(inputs)
+                 + '_' + tf_layers_util.object_list_uid(masks))
+    if cache_key in self._output_mask_cache:
+      return self._output_mask_cache[cache_key]
+    else:
+      _, output_masks = self._run_internal_graph(inputs, masks)
+      return output_masks
+
+  @property
+  def layers(self):
+    return self._layers
+
+  def get_layer(self, name=None, index=None):
+    """Retrieves a layer based on either its name (unique) or index.
+
+    Indices are based on order of horizontal graph traversal (bottom-up).
+
+    Arguments:
+        name: String, name of layer.
+        index: Integer, index of layer.
+
+    Returns:
+        A layer instance.
+
+    Raises:
+        ValueError: In case of invalid layer name or index.
+    """
+    # TODO(fchollet): We could build a dictionary based on layer names
+    # since they are constant, but we have not done that yet.
+    if index is not None:
+      if len(self.layers) <= index:
+        raise ValueError('Was asked to retrieve layer at index ' + str(index) +
+                         ' but model only has ' + str(len(self.layers)) +
+                         ' layers.')
+      else:
+        return self.layers[index]
+    else:
+      if not name:
+        raise ValueError('Provide either a layer name or layer index.')
+    for layer in self.layers:
+      if layer.name == name:
+        return layer
+    raise ValueError('No such layer: ' + name)
+
+  @property
+  def updates(self):
+    """Retrieve the network's updates.
+
+    Will only include updates that are either
+    unconditional, or conditional on inputs to this model
+    (e.g. will not include updates that were created by layers of this model
+    outside of the model).
+
+    Effectively, `network.updates` behaves like `layer.updates`.
+
+    Concrete example:
+
+    ```python
+      bn = keras.layers.BatchNormalization()
+      x1 = keras.layers.Input(shape=(10,))
+      _ = bn(x1)  # This creates 2 updates.
+
+      x2 = keras.layers.Input(shape=(10,))
+      y2 = bn(x2)  # This creates 2 more updates.
+
+      # The BN layer now has 4 updates.
+      self.assertEqual(len(bn.updates), 4)
+
+      # Let's create a model from x2 to y2.
+      model = keras.models.Model(x2, y2)
+
+      # The model does not list all updates from its underlying layers,
+      # but only the updates that are relevant to it. Updates created by layers
+      # outside of the model are discarded.
+      self.assertEqual(len(model.updates), 2)
+
+      # If you keep calling the model, you append to its updates, just like
+      # what happens for a layer.
+      x3 = keras.layers.Input(shape=(10,))
+      y3 = model(x3)
+      self.assertEqual(len(model.updates), 4)
+
+      # But if you call the inner BN layer independently, you don't affect
+      # the model's updates.
+      x4 = keras.layers.Input(shape=(10,))
+      _ = bn(x4)
+      self.assertEqual(len(model.updates), 4)
+    ```
+
+    Returns:
+        A list of update ops.
+    """
+    if context.in_eager_mode():
+      return []
+
+    if not self.trainable and not self.stateful:
+      return []
+
+    updates = []
+    for layer in self.layers:
+      updates += layer.updates
+
+    # `updates` might contain irrelevant updates, so it needs to be filtered
+    # with respect to inputs the model has been called on.
+    # Copy so that the `+=`/`append` below do not mutate `self.inputs`.
+    relevant_inputs = list(self.inputs or [])
+    for i in range(1, len(self._inbound_nodes)):
+      inputs = self.get_input_at(i)
+      if isinstance(inputs, list):
+        relevant_inputs += inputs
+      else:
+        relevant_inputs.append(inputs)
+    reachable = tf_layers_util.get_reachable_from_inputs(relevant_inputs,
+                                                         updates)
+    relevant_conditional_updates = [x for x in updates if x in reachable]
+    unconditional_updates = [
+        x for x in updates if x._unconditional_update]  # pylint: disable=protected-access
+    # A layer could be used multiple times in a nested structure,
+    # so the updates list must be de-duped.
+    return list(set(
+        relevant_conditional_updates + unconditional_updates + self._updates))
+
+  @property
+  def losses(self):
+    """Retrieve the network's losses.
+
+    Will only include losses that are either
+    unconditional, or conditional on inputs to this model
+    (e.g. will not include losses that depend on tensors
+    that aren't inputs to this model).
+
+    Returns:
+        A list of loss tensors.
+    """
+    losses = []
+    for layer in self.layers:
+      losses += layer.losses
+    if context.in_eager_mode():
+      return losses
+
+    # Copy so that the `+=`/`append` below do not mutate `self.inputs`.
+    relevant_inputs = list(self.inputs or [])
+    for i in range(1, len(self._inbound_nodes)):
+      inputs = self.get_input_at(i)
+      if isinstance(inputs, list):
+        relevant_inputs += inputs
+      else:
+        relevant_inputs.append(inputs)
+    reachable = tf_layers_util.get_reachable_from_inputs(relevant_inputs,
+                                                         losses)
+    relevant_conditional_losses = [x for x in losses if x in reachable]
+    unconditional_losses = [
+        x for x in losses if x._unconditional_loss]  # pylint: disable=protected-access
+    return list(set(
+        relevant_conditional_losses + unconditional_losses + self._losses))
+
+  @property
+  def trainable_weights(self):
+    if not self.trainable:
+      return []
+    weights = []
+    for layer in self.layers:
+      weights += layer.trainable_weights
+    return weights
+
+  @property
+  def non_trainable_weights(self):
+    weights = []
+    for layer in self.layers:
+      weights += layer.non_trainable_weights
+    if not self.trainable:
+      trainable_weights = []
+      for layer in self.layers:
+        trainable_weights += layer.trainable_weights
+      return trainable_weights + weights
+    return weights
+
+  @property
+  def input_spec(self):
+    """Gets the network's input specs.
+
+    Returns:
+        A list of `InputSpec` instances (one per input to the model)
+            or a single instance if the model has only one input.
+    """
+    # If not a graph network, can't assume anything.
+    if not self._is_graph_network:
+      return None
+
+    specs = []
+    for layer in self._input_layers:
+      if layer.input_spec is None:
+        specs.append(None)
+      else:
+        if not isinstance(layer.input_spec, list):
+          raise TypeError('Layer ' + layer.name +
+                          ' has an input_spec attribute that '
+                          'is not a list. We expect a list. '
+                          'Found input_spec = ' + str(layer.input_spec))
+        specs += layer.input_spec
+    if len(specs) == 1:
+      return specs[0]
+    return specs
+
+  def call(self, inputs, training=None, mask=None):
+    """Call the model on new inputs.
+
+    In this case `call` just reapplies
+    all ops in the graph to the new inputs
+    (i.e. builds a new computational graph from the provided inputs).
+
+    Arguments:
+        inputs: A tensor or list of tensors.
+        training: Boolean or boolean scalar tensor, indicating whether to run
+          the `Network` in training mode or inference mode.
+        mask: A mask or list of masks. A mask can be
+            either a tensor or None (no mask).
+
+    Returns:
+        A tensor if there is a single output, or
+        a list of tensors if there is more than one output.
+    """
+    inputs = nest.flatten(inputs)
+    if mask is None:
+      masks = [None for _ in range(len(inputs))]
+    else:
+      masks = nest.flatten(mask)
+
+    if context.in_graph_mode():
+      # Try to retrieve cached outputs if the layer has already been called
+      # on these exact inputs.
+      cache_key = (tf_layers_util.object_list_uid(inputs)
+                   + '_' + tf_layers_util.object_list_uid(masks))
+      if cache_key in self._output_tensor_cache:
+        # Cache hit.
+        return self._output_tensor_cache[cache_key]
+    # Actually apply the network graph to the new inputs.
+    outputs, _ = self._run_internal_graph(inputs,
+                                          training=training,
+                                          mask=masks)
+    return outputs
+
+  def compute_output_shape(self, input_shape):
+    if not self._is_graph_network:
+      raise NotImplementedError
+
+    if isinstance(input_shape, list):
+      input_shapes = []
+      for shape in input_shape:
+        if shape is not None:
+          input_shapes.append(tuple(tensor_shape.TensorShape(shape).as_list()))
+        else:
+          input_shapes.append(None)
+    else:
+      if input_shape is not None:
+        input_shapes = [tuple(tensor_shape.TensorShape(input_shape).as_list())]
+      else:
+        input_shapes = [None]
+
+    if len(input_shapes) != len(self._input_layers):
+      raise ValueError('Invalid input_shape argument ' + str(input_shape) +
+                       ': model has ' + str(len(self._input_layers)) +
+                       ' tensor inputs.')
+
+    cache_key = tf_layers_util.object_list_uid(input_shapes)
+    if cache_key not in self._output_shape_cache:
+      # Cache miss. We have to run the network graph manually (recursive calls
+      # to `compute_output_shape`).
+      layers_to_output_shapes = {}
+      for i in range(len(input_shapes)):
+        layer = self._input_layers[i]
+        input_shape = input_shapes[i]
+        # It's an input layer: then `compute_output_shape` is identity,
+        # and there is only one node and one tensor output.
+        shape_key = layer.name + '_0_0'
+        layers_to_output_shapes[shape_key] = input_shape
+
+      depth_keys = list(self._nodes_by_depth.keys())
+      depth_keys.sort(reverse=True)
+      # Iterate over nodes, by depth level.
+      if len(depth_keys) > 1:
+        for depth in depth_keys:
+          nodes = self._nodes_by_depth[depth]
+          for node in nodes:
+            # This is always a single layer, never a list.
+            layer = node.outbound_layer
+            if layer in self._input_layers:
+              # We've already covered the input layers
+              # a few lines above.
+              continue
+            # Potentially redundant list,
+            # same size as node.input_tensors.
+            input_shapes = []
+            for j in range(len(node.inbound_layers)):
+              inbound_layer = node.inbound_layers[j]
+              node_index = node.node_indices[j]
+              tensor_index = node.tensor_indices[j]
+              shape_key = inbound_layer.name + '_%s_%s' % (node_index,
+                                                           tensor_index)
+              input_shape = layers_to_output_shapes[shape_key]
+              input_shapes.append(input_shape)
+
+            if len(input_shapes) == 1:
+              output_shape = layer.compute_output_shape(input_shapes[0])
+            else:
+              output_shape = layer.compute_output_shape(input_shapes)
+            if isinstance(output_shape, list):
+              output_shapes = [
+                  tuple(tensor_shape.TensorShape(shape).as_list())
+                  for shape in output_shape
+              ]
+            else:
+              output_shapes = [
+                  tuple(tensor_shape.TensorShape(output_shape).as_list())
+              ]
+
+            node_index = layer._inbound_nodes.index(node)  # pylint: disable=protected-access
+            for j in range(len(output_shapes)):
+              shape_key = layer.name + '_%s_%s' % (node_index, j)
+              layers_to_output_shapes[shape_key] = output_shapes[j]
+
+        # Read final output shapes from layers_to_output_shapes.
+        output_shapes = []
+        for i in range(len(self._output_layers)):
+          layer, node_index, tensor_index = self._output_coordinates[i]
+          shape_key = layer.name + '_%s_%s' % (node_index, tensor_index)
+          output_shapes.append(layers_to_output_shapes[shape_key])
+        # Store in cache.
+        self._output_shape_cache[cache_key] = output_shapes
+    else:
+      # Cache hit.
+      output_shapes = self._output_shape_cache[cache_key]
+
+    if isinstance(output_shapes, list):
+      if len(output_shapes) == 1:
+        return tensor_shape.TensorShape(output_shapes[0])
+      else:
+        return [tensor_shape.TensorShape(shape) for shape in output_shapes]
+    else:
+      return tensor_shape.TensorShape(output_shapes)
+
+  def _run_internal_graph(self, inputs, training=None, mask=None):
+    """Computes output tensors for new inputs.
+
+    # Note:
+        - Expects `inputs` to be a list (potentially with 1 element).
+        - Can be run on non-Keras tensors.
+
+    Arguments:
+        inputs: List of tensors
+        training: Boolean learning phase.
+        mask: List of masks (tensors or None).
+
+    Returns:
+        A tuple `(output_tensors, output_masks)`. Each element is a single
+        item rather than a list when the network has exactly one output.
+    """
+    # Note: masking support is relevant mainly for Keras.
+    # It cannot be factored out without having to fully reimplement the network
+    # calling logic on the Keras side. We choose to incorporate it in
+    # Network because 1) it may be useful to fully support in tf.layers in
+    # the future and 2) Keras is a major user of Network.  If you don't
+    # use masking, it does not interfere with regular behavior at all and you
+    # can ignore it.
+    if mask is None:
+      masks = [None for _ in range(len(inputs))]
+    else:
+      masks = mask
+
+    # Dictionary mapping reference tensors to tuples
+    # (computed tensor, computed mask).
+    # We assume a 1:1 mapping from tensor to mask.
+    # TODO(fchollet): raise exception when a `.compute_mask()` call
+    # does not return a list the same size as `call`
+    tensor_map = {}
+    for x, y, mask in zip(self.inputs, inputs, masks):
+      tensor_map[str(id(x))] = (y, mask)
+
+    depth_keys = list(self._nodes_by_depth.keys())
+    depth_keys.sort(reverse=True)
+    for depth in depth_keys:
+      nodes = self._nodes_by_depth[depth]
+      for node in nodes:
+        # This is always a single layer, never a list.
+        layer = node.outbound_layer
+        reference_input_tensors = node.input_tensors
+        reference_output_tensors = node.output_tensors
+
+        # If all previous input tensors are available in tensor_map,
+        # then call node.outbound_layer on them.
+        computed_data = []  # List of tuples (input, mask).
+        for x in reference_input_tensors:
+          if str(id(x)) in tensor_map:
+            computed_data.append(tensor_map[str(id(x))])
+
+        if len(computed_data) == len(reference_input_tensors):
+          # Call layer (reapplying ops to new inputs).
+          with ops.name_scope(layer.name):
+            if node.arguments:
+              kwargs = node.arguments
+            else:
+              kwargs = {}
+            if len(computed_data) == 1:
+              computed_tensor, computed_mask = computed_data[0]
+              # Ensure mask propagation if applicable.
+              if 'mask' in tf_inspect.getargspec(layer.call).args:
+                kwargs.setdefault('mask', computed_mask)
+              if 'training' in tf_inspect.getargspec(layer.call).args:
+                kwargs.setdefault('training', training)
+
+              output_tensors = nest.flatten(
+                  layer.call(computed_tensor, **kwargs))
+              if hasattr(layer, 'compute_mask'):
+                output_masks = nest.flatten(
+                    layer.compute_mask(computed_tensor, computed_mask))
+              else:
+                output_masks = [None for _ in range(len(output_tensors))]
+              computed_tensors = [computed_tensor]
+              computed_masks = [computed_mask]
+            else:
+              computed_tensors = [x[0] for x in computed_data]
+              computed_masks = [x[1] for x in computed_data]
+              if 'mask' in tf_inspect.getargspec(layer.call).args:
+                kwargs.setdefault('mask', computed_masks)
+              if 'training' in tf_inspect.getargspec(layer.call).args:
+                kwargs.setdefault('training', training)
+
+              output_tensors = nest.flatten(
+                  layer.call(computed_tensors, **kwargs))
+              if hasattr(layer, 'compute_mask'):
+                output_masks = nest.flatten(
+                    layer.compute_mask(computed_tensors, computed_masks))
+              else:
+                output_masks = [None for _ in range(len(output_tensors))]
+
+            if context.in_graph_mode():
+              if layer.activity_regularizer is not None:
+                regularization_losses = [
+                    layer.activity_regularizer(x) for x in output_tensors
+                ]
+                # Apply activity regularizer if any:
+                layer.add_loss(regularization_losses, computed_tensors)
+
+          # Update tensor_map.
+          for x, y, mask in zip(reference_output_tensors, output_tensors,
+                                output_masks):
+            tensor_map[str(id(x))] = (y, mask)
+
+    output_tensors = []
+    output_masks = []
+    output_shapes = []
+    for x in self.outputs:
+      assert str(id(x)) in tensor_map, 'Could not compute output ' + str(x)
+      tensor, mask = tensor_map[str(id(x))]
+      output_shapes.append(tf_layers_util.static_shape(x))
+      output_tensors.append(tensor)
+      output_masks.append(mask)
+
+    if len(output_tensors) == 1:
+      output_tensors = output_tensors[0]
+      if output_shapes is not None:
+        output_shapes = output_shapes[0]
+      if output_masks is not None:
+        output_masks = output_masks[0]
+
+    if context.in_graph_mode():
+      # Update cache;
+      # keys are based on ids of input tensors and input masks.
+      cache_key = (tf_layers_util.object_list_uid(inputs)
+                   + '_' + tf_layers_util.object_list_uid(masks))
+      self._output_tensor_cache[cache_key] = output_tensors
+      self._output_mask_cache[cache_key] = output_masks
+
+      if output_shapes is not None:
+        input_shapes = [tf_layers_util.static_shape(x) for x in inputs]
+        cache_key = tf_layers_util.object_list_uid(input_shapes)
+        self._output_shape_cache[cache_key] = output_shapes
+
+    return output_tensors, output_masks
+
+  def get_config(self):
+    if not self._is_graph_network:
+      raise NotImplementedError
+
+    config = {
+        'name': self.name,
+    }
+    node_conversion_map = {}
+    for layer in self.layers:
+      if issubclass(layer.__class__, Network):
+        # Networks start with a pre-existing node
+        # linking their input to output.
+        kept_nodes = 1
+      else:
+        kept_nodes = 0
+      for original_node_index, node in enumerate(layer._inbound_nodes):
+        node_key = _make_node_key(layer.name, original_node_index)
+        if node_key in self._network_nodes:
+          node_conversion_map[node_key] = kept_nodes
+          kept_nodes += 1
+    layer_configs = []
+    for layer in self.layers:  # From the earliest layers on.
+      layer_class_name = layer.__class__.__name__
+      layer_config = layer.get_config()
+      filtered_inbound_nodes = []
+      for original_node_index, node in enumerate(layer._inbound_nodes):
+        node_key = _make_node_key(layer.name, original_node_index)
+        if node_key in self._network_nodes:
+          # The node is relevant to the model:
+          # add to filtered_inbound_nodes.
+          if node.arguments:
+            try:
+              json.dumps(node.arguments)
+              kwargs = node.arguments
+            except TypeError:
+              logging.warning(
+                  'Layer ' + layer.name +
+                  ' was passed non-serializable keyword arguments: ' +
+                  str(node.arguments) + '. They will not be included '
+                  'in the serialized model (and thus will be missing '
+                  'at deserialization time).')
+              kwargs = {}
+          else:
+            kwargs = {}
+          if node.inbound_layers:
+            node_data = []
+            for i in range(len(node.inbound_layers)):
+              inbound_layer = node.inbound_layers[i]
+              node_index = node.node_indices[i]
+              tensor_index = node.tensor_indices[i]
+              node_key = _make_node_key(inbound_layer.name, node_index)
+              new_node_index = node_conversion_map.get(node_key, 0)
+              node_data.append(
+                  [inbound_layer.name, new_node_index, tensor_index, kwargs])
+            filtered_inbound_nodes.append(node_data)
+      layer_configs.append({
+          'name': layer.name,
+          'class_name': layer_class_name,
+          'config': layer_config,
+          'inbound_nodes': filtered_inbound_nodes,
+      })
+    config['layers'] = layer_configs
+
+    # Gather info about inputs and outputs.
+    model_inputs = []
+    for i in range(len(self._input_layers)):
+      layer, node_index, tensor_index = self._input_coordinates[i]
+      node_key = _make_node_key(layer.name, node_index)
+      if node_key not in self._network_nodes:
+        continue
+      new_node_index = node_conversion_map[node_key]
+      model_inputs.append([layer.name, new_node_index, tensor_index])
+    config['input_layers'] = model_inputs
+    model_outputs = []
+    for i in range(len(self._output_layers)):
+      layer, node_index, tensor_index = self._output_coordinates[i]
+      node_key = _make_node_key(layer.name, node_index)
+      if node_key not in self._network_nodes:
+        continue
+      new_node_index = node_conversion_map[node_key]
+      model_outputs.append([layer.name, new_node_index, tensor_index])
+    config['output_layers'] = model_outputs
+    return copy.deepcopy(config)
+
+  @classmethod
+  def from_config(cls, config, custom_objects=None):
+    """Instantiates a Model from its config (output of `get_config()`).
+
+    Arguments:
+        config: Model config dictionary.
+        custom_objects: Optional dictionary mapping names
+            (strings) to custom classes or functions to be
+            considered during deserialization.
+
+    Returns:
+        A model instance.
+
+    Raises:
+        ValueError: In case of improperly formatted config dict.
+    """
+    # Layer instances created during
+    # the graph reconstruction process
+    created_layers = {}
+
+    # Dictionary mapping layer instances to
+    # node data that specifies a layer call.
+    # It acts as a queue that maintains any unprocessed
+    # layer call until it becomes possible to process it
+    # (i.e. until the input tensors to the call all exist).
+    unprocessed_nodes = {}
+
+    def add_unprocessed_node(layer, node_data):
+      if layer not in unprocessed_nodes:
+        unprocessed_nodes[layer] = [node_data]
+      else:
+        unprocessed_nodes[layer].append(node_data)
+
+    def process_node(layer, node_data):
+      """Deserialize a node.
+
+      Arguments:
+          layer: layer instance.
+          node_data: node config dict.
+
+      Raises:
+          ValueError: In case of improperly formatted `node_data` dict.
+      """
+      input_tensors = []
+      for input_data in node_data:
+        inbound_layer_name = input_data[0]
+        inbound_node_index = input_data[1]
+        inbound_tensor_index = input_data[2]
+        if len(input_data) == 3:
+          kwargs = {}
+        elif len(input_data) == 4:
+          kwargs = input_data[3]
+        else:
+          raise ValueError('Improperly formatted model config.')
+        if inbound_layer_name not in created_layers:
+          add_unprocessed_node(layer, node_data)
+          return
+        inbound_layer = created_layers[inbound_layer_name]
+        if len(inbound_layer._inbound_nodes) <= inbound_node_index:
+          add_unprocessed_node(layer, node_data)
+          return
+        inbound_node = inbound_layer._inbound_nodes[inbound_node_index]
+        input_tensors.append(inbound_node.output_tensors[inbound_tensor_index])
+      # Call layer on its inputs, thus creating the node
+      # and building the layer if needed.
+      if input_tensors:
+        if len(input_tensors) == 1:
+          layer(input_tensors[0], **kwargs)
+        else:
+          layer(input_tensors, **kwargs)
+
+    def process_layer(layer_data):
+      """Deserialize a layer, then call it on appropriate inputs.
+
+      Arguments:
+          layer_data: layer config dict.
+
+      Raises:
+          ValueError: In case of improperly formatted `layer_data` dict.
+      """
+      layer_name = layer_data['name']
+
+      # Instantiate layer.
+      from tensorflow.python.keras._impl.keras.layers import deserialize as deserialize_layer  # pylint: disable=g-import-not-at-top
+
+      layer = deserialize_layer(layer_data, custom_objects=custom_objects)
+      created_layers[layer_name] = layer
+
+      # Gather layer inputs.
+      inbound_nodes_data = layer_data['inbound_nodes']
+      for node_data in inbound_nodes_data:
+        # We don't process nodes (i.e. make layer calls)
+        # on the fly because the inbound node may not yet exist,
+        # in case of layer shared at different topological depths
+        # (e.g. a model such as A(B(A(B(x)))))
+        add_unprocessed_node(layer, node_data)
+
+    # First, we create all layers and enqueue nodes to be processed
+    for layer_data in config['layers']:
+      process_layer(layer_data)
+    # Then we process nodes in order of layer depth.
+    # Nodes that cannot yet be processed (if the inbound node
+    # does not yet exist) are re-enqueued, and the process
+    # is repeated until all nodes are processed.
+    while unprocessed_nodes:
+      for layer_data in config['layers']:
+        layer = created_layers[layer_data['name']]
+        if layer in unprocessed_nodes:
+          for node_data in unprocessed_nodes.pop(layer):
+            process_node(layer, node_data)
+
+    name = config.get('name')
+    input_tensors = []
+    output_tensors = []
+    for layer_data in config['input_layers']:
+      layer_name, node_index, tensor_index = layer_data
+      assert layer_name in created_layers
+      layer = created_layers[layer_name]
+      layer_output_tensors = layer._inbound_nodes[node_index].output_tensors
+      input_tensors.append(layer_output_tensors[tensor_index])
+    for layer_data in config['output_layers']:
+      layer_name, node_index, tensor_index = layer_data
+      assert layer_name in created_layers
+      layer = created_layers[layer_name]
+      layer_output_tensors = layer._inbound_nodes[node_index].output_tensors
+      output_tensors.append(layer_output_tensors[tensor_index])
+    return cls(inputs=input_tensors, outputs=output_tensors, name=name)
+
+  def save(self, filepath, overwrite=True, include_optimizer=True):
+    """Save the model to a single HDF5 file.
+
+    The savefile includes:
+        - The model architecture, allowing the model to be re-instantiated.
+        - The model weights.
+        - The state of the optimizer, allowing to resume training
+            exactly where you left off.
+
+    This allows you to save the entirety of the state of a model
+    in a single file.
+
+    Saved models can be reinstantiated via `keras.models.load_model`.
+    The model returned by `load_model`
+    is a compiled model ready to be used (unless the saved model
+    was never compiled in the first place).
+
+    Arguments:
+        filepath: String, path to the file to save the model to.
+        overwrite: Whether to silently overwrite any existing file at the
+            target location, or provide the user with a manual prompt.
+        include_optimizer: If True, save the optimizer's state along with
+            the model.
+
+    Example:
+
+    ```python
+    from keras.models import load_model
+
+    model.save('my_model.h5')  # creates a HDF5 file 'my_model.h5'
+    del model  # deletes the existing model
+
+    # returns a compiled model
+    # identical to the previous one
+    model = load_model('my_model.h5')
+    ```
+    """
+    if not self._is_graph_network:
+      raise NotImplementedError
+
+    from tensorflow.python.keras._impl.keras.models import save_model  # pylint: disable=g-import-not-at-top
+    save_model(self, filepath, overwrite, include_optimizer)
+
+  def save_weights(self, filepath, overwrite=True):
+    """Dumps all layer weights to a HDF5 file.
+
+    The weight file has:
+        - `layer_names` (attribute), a list of strings
+            (ordered names of model layers).
+        - For every layer, a `group` named `layer.name`
+            - For every such layer group, a group attribute `weight_names`,
+                a list of strings
+                (ordered names of the layer's weight tensors).
+            - For every weight in the layer, a dataset
+                storing the weight value, named after the weight tensor.
+
+    Arguments:
+        filepath: String, path to the file to save the weights to.
+        overwrite: Whether to silently overwrite any existing file at the
+            target location, or provide the user with a manual prompt.
+
+    Raises:
+        ImportError: If h5py is not available.
+    """
+    if h5py is None:
+      raise ImportError('`save_weights` requires h5py.')
+    # If file exists and should not be overwritten:
+    if not overwrite and os.path.isfile(filepath):
+      proceed = ask_to_proceed_with_overwrite(filepath)
+      if not proceed:
+        return
+    with h5py.File(filepath, 'w') as f:
+      saving.save_weights_to_hdf5_group(f, self.layers)
+
+  def load_weights(self, filepath, by_name=False):
+    """Loads all layer weights from a HDF5 save file.
+
+    If `by_name` is False (default) weights are loaded
+    based on the network's topology, meaning the architecture
+    should be the same as when the weights were saved.
+    Note that layers that don't have weights are not taken
+    into account in the topological ordering, so adding or
+    removing layers is fine as long as they don't have weights.
+
+    If `by_name` is True, weights are loaded into layers
+    only if they share the same name. This is useful
+    for fine-tuning or transfer-learning models where
+    some of the layers have changed.
+
+    Arguments:
+        filepath: String, path to the weights file to load.
+        by_name: Boolean, whether to load weights by name
+            or by topological order.
+
+    Raises:
+        ImportError: If h5py is not available.
+    """
+    if h5py is None:
+      raise ImportError('`load_weights` requires h5py.')
+    with h5py.File(filepath, 'r') as f:
+      if 'layer_names' not in f.attrs and 'model_weights' in f:
+        f = f['model_weights']
+      if by_name:
+        saving.load_weights_from_hdf5_group_by_name(f, self.layers)
+      else:
+        saving.load_weights_from_hdf5_group(f, self.layers)
+
+  def _updated_config(self):
+    """Util shared between different serialization methods.
+
+    Returns:
+        Model config with Keras version information added.
+    """
+    from tensorflow.python.keras._impl.keras import __version__ as keras_version  # pylint: disable=g-import-not-at-top
+
+    config = self.get_config()
+    model_config = {
+        'class_name': self.__class__.__name__,
+        'config': config,
+        'keras_version': keras_version,
+        'backend': K.backend()
+    }
+    return model_config
+
+  def to_json(self, **kwargs):
+    """Returns a JSON string containing the network configuration.
+
+    To load a network from a JSON save file, use
+    `keras.models.model_from_json(json_string, custom_objects={})`.
+
+    Arguments:
+        **kwargs: Additional keyword arguments
+            to be passed to `json.dumps()`.
+
+    Returns:
+        A JSON string.
+    """
+    if not self._is_graph_network:
+      raise NotImplementedError
+
+    def get_json_type(obj):
+      # If obj is any numpy type
+      if type(obj).__module__ == np.__name__:
+        return obj.item()
+
+      # If obj is a python 'type'
+      if type(obj).__name__ == type.__name__:
+        return obj.__name__
+
+      raise TypeError('Not JSON Serializable:', obj)
+
+    model_config = self._updated_config()
+    return json.dumps(model_config, default=get_json_type, **kwargs)
+
+  def to_yaml(self, **kwargs):
+    """Returns a YAML string containing the network configuration.
+
+    To load a network from a YAML save file, use
+    `keras.models.model_from_yaml(yaml_string, custom_objects={})`.
+
+    `custom_objects` should be a dictionary mapping
+    the names of custom losses / layers / etc to the corresponding
+    functions / classes.
+
+    Arguments:
+        **kwargs: Additional keyword arguments
+            to be passed to `yaml.dump()`.
+
+    Returns:
+        A YAML string.
+
+    Raises:
+        ImportError: if yaml module is not found.
+    """
+    if not self._is_graph_network:
+      raise NotImplementedError
+
+    if yaml is None:
+      raise ImportError('Requires yaml module installed.')
+    return yaml.dump(self._updated_config(), **kwargs)
+
+  def summary(self, line_length=None, positions=None, print_fn=None):
+    """Prints a string summary of the network.
+
+    Arguments:
+        line_length: Total length of printed lines
+            (e.g. set this to adapt the display to different
+            terminal window sizes).
+        positions: Relative or absolute positions of log elements
+            in each line. If not provided,
+            defaults to `[.33, .55, .67, 1.]`.
+        print_fn: Print function to use. Defaults to `print`.
+            It will be called on each line of the summary.
+            You can set it to a custom function
+            in order to capture the string summary.
+    """
+    print_layer_summary(self,
+                        line_length=line_length,
+                        positions=positions,
+                        print_fn=print_fn)
+
+
+def get_source_inputs(tensor, layer=None, node_index=None):
+  """Returns the list of input tensors necessary to compute `tensor`.
+
+  Output will always be a list of tensors
+  (potentially with 1 element).
+
+  Arguments:
+      tensor: The tensor to start from.
+      layer: Origin layer of the tensor. Will be
+          determined via tensor._keras_history if not provided.
+      node_index: Origin node index of the tensor.
+
+  Returns:
+      List of input tensors.
+  """
+  if not hasattr(tensor, '_keras_history'):
+    return tensor
+
+  if layer is None or node_index:
+    layer, node_index, _ = tensor._keras_history
+  if not layer._inbound_nodes:
+    return [tensor]
+  else:
+    node = layer._inbound_nodes[node_index]
+    if not node.inbound_layers:
+      # Reached an Input layer, stop recursion.
+      return node.input_tensors
+    else:
+      source_tensors = []
+      for i in range(len(node.inbound_layers)):
+        x = node.input_tensors[i]
+        layer = node.inbound_layers[i]
+        node_index = node.node_indices[i]
+        previous_sources = get_source_inputs(x, layer, node_index)
+        # Avoid input redundancy.
+        for x in previous_sources:
+          if x not in source_tensors:
+            source_tensors.append(x)
+      return source_tensors
+
+
+def _make_node_key(layer_name, node_index):
+  return layer_name + '_ib-' + str(node_index)
+
+
+def _map_graph_network(inputs, outputs):
+  """Validate a network's topology and gather its layers and nodes.
+
+  Arguments:
+    inputs: List of input tensors.
+    outputs: List of output tensors.
+
+  Returns:
+    A tuple `(nodes, nodes_by_depth, layers, layers_by_depth)`.
+    - nodes: list of Node instances.
+    - nodes_by_depth: dict mapping ints (depth) to lists of node instances.
+    - layers: list of Layer instances.
+    - layers_by_depth: dict mapping ints (depth) to lists of layer instances.
+
+  Raises:
+    ValueError: In case the network is not valid (e.g. disconnected graph).
+  """
+  # Network_nodes: set of nodes included in the graph of layers
+  # (not all nodes included in the layers are relevant to the current graph).
+  network_nodes = set()  # ids of all nodes relevant to the Network
+  nodes_depths = {}  # dict {node: depth value}
+  layers_depths = {}  # dict {layer: depth value}
+  layer_indices = {}  # dict {layer: index in traversal}
+  nodes_in_decreasing_depth = []
+
+  def build_map(tensor,
+                finished_nodes,
+                nodes_in_progress,
+                layer,
+                node_index,
+                tensor_index):
+    """Builds a map of the graph of layers.
+
+    This recursively updates the map `layer_indices`,
+    the list `nodes_in_decreasing_depth` and the set `network_nodes`.
+
+    Arguments:
+        tensor: Some tensor in a graph.
+        finished_nodes: Set of nodes whose subgraphs have been traversed
+            completely. Useful to prevent duplicated work.
+        nodes_in_progress: Set of nodes that are currently active on the
+            recursion stack. Useful to detect cycles.
+        layer: Layer that `tensor` comes from. If not provided,
+            will be obtained from `tensor._keras_history`.
+        node_index: Index of the node that `tensor` comes from.
+        tensor_index: Index of `tensor` among the outputs of that node.
+
+    Raises:
+        ValueError: if a cycle is detected.
+    """
+    node = layer._inbound_nodes[node_index]  # pylint: disable=protected-access
+
+    # Prevent cycles.
+    if node in nodes_in_progress:
+      raise ValueError('The tensor ' + str(tensor) + ' at layer "' +
+                       layer.name + '" is part of a cycle.')
+
+    # Don't repeat work for shared subgraphs
+    if node in finished_nodes:
+      return
+
+    node_key = _make_node_key(layer.name, node_index)
+    # Update network_nodes.
+    network_nodes.add(node_key)
+
+    # Store the traversal order for layer sorting.
+    if layer not in layer_indices:
+      layer_indices[layer] = len(layer_indices)
+
+    nodes_in_progress.add(node)
+
+    # Propagate to all previous tensors connected to this node.
+    for i in range(len(node.inbound_layers)):
+      x = node.input_tensors[i]
+      layer = node.inbound_layers[i]
+      node_index = node.node_indices[i]
+      tensor_index = node.tensor_indices[i]
+      build_map(x, finished_nodes, nodes_in_progress, layer,
+                node_index, tensor_index)
+
+    finished_nodes.add(node)
+    nodes_in_progress.remove(node)
+    nodes_in_decreasing_depth.append(node)
+
+  finished_nodes = set()
+  nodes_in_progress = set()
+  for x in outputs:
+    layer, node_index, tensor_index = x._keras_history  # pylint: disable=protected-access
+    build_map(x, finished_nodes, nodes_in_progress,
+              layer=layer,
+              node_index=node_index,
+              tensor_index=tensor_index)
+
+  for node in reversed(nodes_in_decreasing_depth):
+    # If the depth is not set, the node has no outbound nodes (depth 0).
+    depth = nodes_depths.setdefault(node, 0)
+
+    # Update the depth of the corresponding layer
+    previous_depth = layers_depths.get(node.outbound_layer, 0)
+    # If we've seen this layer before at a higher depth,
+    # we should use that depth instead of the node depth.
+    # This is necessary for shared layers that have inputs at different
+    # depth levels in the graph.
+    depth = max(depth, previous_depth)
+    layers_depths[node.outbound_layer] = depth
+    nodes_depths[node] = depth
+
+    # Update the depth of inbound nodes.
+    # The "depth" of a node is the max of the depths
+    # of all layers it is connected to.
+    for i in range(len(node.inbound_layers)):
+      inbound_layer = node.inbound_layers[i]
+      node_index = node.node_indices[i]
+      inbound_node = inbound_layer._inbound_nodes[node_index]  # pylint: disable=protected-access
+      previous_depth = nodes_depths.get(inbound_node, 0)
+      nodes_depths[inbound_node] = max(depth + 1, previous_depth)
+
+  # Build a dict {depth: list of nodes with this depth}
+  nodes_by_depth = {}
+  for node, depth in nodes_depths.items():
+    if depth not in nodes_by_depth:
+      nodes_by_depth[depth] = []
+    nodes_by_depth[depth].append(node)
+
+  # Build a dict {depth: list of layers with this depth}
+  layers_by_depth = {}
+  for layer, depth in layers_depths.items():
+    if depth not in layers_by_depth:
+      layers_by_depth[depth] = []
+    layers_by_depth[depth].append(layer)
+
+  # Get sorted list of layer depths.
+  depth_keys = list(layers_by_depth.keys())
+  depth_keys.sort(reverse=True)
+
+  # Build the sorted `layers` list (deepest layers first).
+  layers = []
+  for depth in depth_keys:
+    layers_for_depth = layers_by_depth[depth]
+    # Network.layers needs to have a deterministic order:
+    # here we order them by traversal order.
+    layers_for_depth.sort(key=lambda x: layer_indices[x])
+    layers.extend(layers_for_depth)
+
+  # Get sorted list of node depths.
+  depth_keys = list(nodes_by_depth.keys())
+  depth_keys.sort(reverse=True)
+
+  # Check that all tensors required are computable.
+  # computable_tensors: all tensors in the graph
+  # that can be computed from the inputs provided.
+  computable_tensors = []
+  for x in inputs:
+    computable_tensors.append(x)
+
+  layers_with_complete_input = []  # To provide a better error msg.
+  for depth in depth_keys:
+    for node in nodes_by_depth[depth]:
+      layer = node.outbound_layer
+      if layer:
+        for x in node.input_tensors:
+          if x not in computable_tensors:
+            raise ValueError('Graph disconnected: '
+                             'cannot obtain value for tensor ' + str(x) +
+                             ' at layer "' + layer.name + '". '
+                             'The following previous layers '
+                             'were accessed without issue: ' +
+                             str(layers_with_complete_input))
+        for x in node.output_tensors:
+          computable_tensors.append(x)
+        layers_with_complete_input.append(layer.name)
+
+  # Ensure name uniqueness, which will be crucial for serialization
+  # (since serialized nodes refer to layers by their name).
+  all_names = [layer.name for layer in layers]
+  for name in all_names:
+    if all_names.count(name) != 1:
+      raise ValueError('The name "' + name + '" is used ' +
+                       str(all_names.count(name)) + ' times in the model. '
+                       'All layer names should be unique.')
+  return network_nodes, nodes_by_depth, layers, layers_by_depth
diff --git a/tensorflow/python/keras/_impl/keras/engine/saving.py b/tensorflow/python/keras/_impl/keras/engine/saving.py
new file mode 100644
index 0000000000000000000000000000000000000000..52522e693511b010d0501651e594d346984c41e3
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/saving.py
@@ -0,0 +1,671 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# pylint: disable=protected-access
+"""Model saving utilities.
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import json
+import os
+
+import numpy as np
+from six.moves import zip  # pylint: disable=redefined-builtin
+
+from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras import optimizers
+from tensorflow.python.keras._impl.keras.utils import conv_utils
+from tensorflow.python.keras._impl.keras.utils.io_utils import ask_to_proceed_with_overwrite
+from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.util.tf_export import tf_export
+
+# pylint: disable=g-import-not-at-top
+try:
+  import h5py
+except ImportError:
+  h5py = None
+
+try:
+  import yaml
+except ImportError:
+  yaml = None
+# pylint: enable=g-import-not-at-top
+
+
+@tf_export('keras.models.save_model')
+def save_model(model, filepath, overwrite=True, include_optimizer=True):
+  """Saves a model to an HDF5 file.
+
+  The saved model contains:
+      - the model's configuration (topology)
+      - the model's weights
+      - the model's optimizer's state (if any)
+
+  Thus the saved model can be reinstantiated in
+  the exact same state, without any of the code
+  used for model definition or training.
+
+  Arguments:
+      model: Keras model instance to be saved.
+      filepath: String, path where to save the model.
+      overwrite: Whether we should overwrite any existing
+          model at the target location, or instead
+          ask the user with a manual prompt.
+      include_optimizer: If True, save optimizer's state together.
+
+  Raises:
+      ImportError: if h5py is not available.
+  """
+
+  if h5py is None:
+    raise ImportError('`save_model` requires h5py.')
+
+  def get_json_type(obj):
+    """Serialize any object to a JSON-serializable structure.
+
+    Arguments:
+        obj: the object to serialize
+
+    Returns:
+        JSON-serializable structure representing `obj`.
+
+    Raises:
+        TypeError: if `obj` cannot be serialized.
+    """
+    # if obj is a serializable Keras class instance
+    # e.g. optimizer, layer
+    if hasattr(obj, 'get_config'):
+      return {'class_name': obj.__class__.__name__, 'config': obj.get_config()}
+
+    # if obj is any numpy type
+    if type(obj).__module__ == np.__name__:
+      if isinstance(obj, np.ndarray):
+        return {'type': type(obj), 'value': obj.tolist()}
+      else:
+        return obj.item()
+
+    # misc functions (e.g. loss function)
+    if callable(obj):
+      return obj.__name__
+
+    # if obj is a python 'type'
+    if type(obj).__name__ == type.__name__:
+      return obj.__name__
+
+    raise TypeError('Not JSON Serializable:', obj)
+
+  from tensorflow.python.keras._impl.keras import __version__ as keras_version  # pylint: disable=g-import-not-at-top
+
+  # If file exists and should not be overwritten.
+  if not overwrite and os.path.isfile(filepath):
+    proceed = ask_to_proceed_with_overwrite(filepath)
+    if not proceed:
+      return
+
+  with h5py.File(filepath, mode='w') as f:
+    f.attrs['keras_version'] = str(keras_version).encode('utf8')
+    f.attrs['backend'] = K.backend().encode('utf8')
+    f.attrs['model_config'] = json.dumps(
+        {
+            'class_name': model.__class__.__name__,
+            'config': model.get_config()
+        },
+        default=get_json_type).encode('utf8')
+
+    model_weights_group = f.create_group('model_weights')
+    model_layers = model.layers
+    save_weights_to_hdf5_group(model_weights_group, model_layers)
+
+    if include_optimizer and hasattr(model, 'optimizer'):
+      if isinstance(model.optimizer, optimizers.TFOptimizer):
+        logging.warning(
+            'TensorFlow optimizers do not '
+            'make it possible to access '
+            'optimizer attributes or optimizer state '
+            'after instantiation. '
+            'As a result, we cannot save the optimizer '
+            'as part of the model save file. '
+            'You will have to compile your model again after loading it. '
+            'Prefer using a Keras optimizer instead '
+            '(see keras.io/optimizers).')
+      else:
+        f.attrs['training_config'] = json.dumps(
+            {
+                'optimizer_config': {
+                    'class_name': model.optimizer.__class__.__name__,
+                    'config': model.optimizer.get_config()
+                },
+                'loss': model.loss,
+                'metrics': model.metrics,
+                'sample_weight_mode': model.sample_weight_mode,
+                'loss_weights': model.loss_weights,
+            },
+            default=get_json_type).encode('utf8')
+
+        # Save optimizer weights.
+        symbolic_weights = getattr(model.optimizer, 'weights')
+        if symbolic_weights:
+          optimizer_weights_group = f.create_group('optimizer_weights')
+          weight_values = K.batch_get_value(symbolic_weights)
+          weight_names = []
+          for w, val in zip(symbolic_weights, weight_values):
+            name = str(w.name)
+            weight_names.append(name.encode('utf8'))
+          optimizer_weights_group.attrs['weight_names'] = weight_names
+          for name, val in zip(weight_names, weight_values):
+            param_dset = optimizer_weights_group.create_dataset(
+                name, val.shape, dtype=val.dtype)
+            if not val.shape:
+              # scalar
+              param_dset[()] = val
+            else:
+              param_dset[:] = val
+    f.flush()
+
+
+@tf_export('keras.models.load_model')
+def load_model(filepath, custom_objects=None, compile=True):  # pylint: disable=redefined-builtin
+  """Loads a model saved via `save_model`.
+
+  Arguments:
+      filepath: String, path to the saved model.
+      custom_objects: Optional dictionary mapping names
+          (strings) to custom classes or functions to be
+          considered during deserialization.
+      compile: Boolean, whether to compile the model
+          after loading.
+
+  Returns:
+      A Keras model instance. If an optimizer was found
+      as part of the saved model, the model is already
+      compiled. Otherwise, the model is uncompiled and
+      a warning will be displayed. When `compile` is set
+      to False, the compilation is omitted without any
+      warning.
+
+  Raises:
+      ImportError: if h5py is not available.
+      ValueError: In case of an invalid savefile.
+  """
+  if h5py is None:
+    raise ImportError('`load_model` requires h5py.')
+
+  if not custom_objects:
+    custom_objects = {}
+
+  def convert_custom_objects(obj):
+    """Handles custom object lookup.
+
+    Arguments:
+        obj: object, dict, or list.
+
+    Returns:
+        The same structure, where occurrences
+            of a custom object name have been replaced
+            with the custom object.
+    """
+    if isinstance(obj, list):
+      deserialized = []
+      for value in obj:
+        deserialized.append(convert_custom_objects(value))
+      return deserialized
+    if isinstance(obj, dict):
+      deserialized = {}
+      for key, value in obj.items():
+        deserialized[key] = convert_custom_objects(value)
+      return deserialized
+    if obj in custom_objects:
+      return custom_objects[obj]
+    return obj
+
+  with h5py.File(filepath, mode='r') as f:
+    # instantiate model
+    model_config = f.attrs.get('model_config')
+    if model_config is None:
+      raise ValueError('No model found in config file.')
+    model_config = json.loads(model_config.decode('utf-8'))
+    model = model_from_config(model_config, custom_objects=custom_objects)
+
+    # set weights
+    load_weights_from_hdf5_group(f['model_weights'], model.layers)
+
+    # Early return if compilation is not required.
+    if not compile:
+      return model
+
+    # instantiate optimizer
+    training_config = f.attrs.get('training_config')
+    if training_config is None:
+      logging.warning('No training configuration found in save file: '
+                      'the model was *not* compiled. Compile it manually.')
+      return model
+    training_config = json.loads(training_config.decode('utf-8'))
+    optimizer_config = training_config['optimizer_config']
+    optimizer = optimizers.deserialize(
+        optimizer_config, custom_objects=custom_objects)
+
+    # Recover loss functions and metrics.
+    loss = convert_custom_objects(training_config['loss'])
+    metrics = convert_custom_objects(training_config['metrics'])
+    sample_weight_mode = training_config['sample_weight_mode']
+    loss_weights = training_config['loss_weights']
+
+    # Compile model.
+    model.compile(
+        optimizer=optimizer,
+        loss=loss,
+        metrics=metrics,
+        loss_weights=loss_weights,
+        sample_weight_mode=sample_weight_mode)
+
+    # Set optimizer weights.
+    if 'optimizer_weights' in f:
+      # Build train function (to get weight updates).
+      model._make_train_function()
+      optimizer_weights_group = f['optimizer_weights']
+      optimizer_weight_names = [
+          n.decode('utf8')
+          for n in optimizer_weights_group.attrs['weight_names']
+      ]
+      optimizer_weight_values = [
+          optimizer_weights_group[n] for n in optimizer_weight_names
+      ]
+      try:
+        model.optimizer.set_weights(optimizer_weight_values)
+      except ValueError:
+        logging.warning('Error in loading the saved optimizer '
+                        'state. As a result, your model is '
+                        'starting with a freshly initialized '
+                        'optimizer.')
+  return model
+
+
+@tf_export('keras.models.model_from_config')
+def model_from_config(config, custom_objects=None):
+  """Instantiates a Keras model from its config.
+
+  Arguments:
+      config: Configuration dictionary.
+      custom_objects: Optional dictionary mapping names
+          (strings) to custom classes or functions to be
+          considered during deserialization.
+
+  Returns:
+      A Keras model instance (uncompiled).
+
+  Raises:
+      TypeError: if `config` is not a dictionary.
+  """
+  if isinstance(config, list):
+    raise TypeError('`model_from_config` expects a dictionary, not a list. '
+                    'Maybe you meant to use '
+                    '`Sequential.from_config(config)`?')
+  from tensorflow.python.keras._impl.keras.layers import deserialize  # pylint: disable=g-import-not-at-top
+  return deserialize(config, custom_objects=custom_objects)
+
+
+@tf_export('keras.models.model_from_yaml')
+def model_from_yaml(yaml_string, custom_objects=None):
+  """Parses a yaml model configuration file and returns a model instance.
+
+  Arguments:
+      yaml_string: YAML string encoding a model configuration.
+      custom_objects: Optional dictionary mapping names
+          (strings) to custom classes or functions to be
+          considered during deserialization.
+
+  Returns:
+      A Keras model instance (uncompiled).
+
+  Raises:
+      ImportError: if yaml module is not found.
+  """
+  if yaml is None:
+    raise ImportError('Requires yaml module installed.')
+  config = yaml.load(yaml_string)
+  from tensorflow.python.keras._impl.keras.layers import deserialize  # pylint: disable=g-import-not-at-top
+  return deserialize(config, custom_objects=custom_objects)
+
+
+@tf_export('keras.models.model_from_json')
+def model_from_json(json_string, custom_objects=None):
+  """Parses a JSON model configuration file and returns a model instance.
+
+  Arguments:
+      json_string: JSON string encoding a model configuration.
+      custom_objects: Optional dictionary mapping names
+          (strings) to custom classes or functions to be
+          considered during deserialization.
+
+  Returns:
+      A Keras model instance (uncompiled).
+  """
+  config = json.loads(json_string)
+  from tensorflow.python.keras._impl.keras.layers import deserialize  # pylint: disable=g-import-not-at-top
+  return deserialize(config, custom_objects=custom_objects)
+
+
+def save_weights_to_hdf5_group(f, layers):
+  from tensorflow.python.keras._impl.keras import __version__ as keras_version  # pylint: disable=g-import-not-at-top
+
+  f.attrs['layer_names'] = [layer.name.encode('utf8') for layer in layers]
+  f.attrs['backend'] = K.backend().encode('utf8')
+  f.attrs['keras_version'] = str(keras_version).encode('utf8')
+
+  for layer in layers:
+    g = f.create_group(layer.name)
+    symbolic_weights = layer.weights
+    weight_values = K.batch_get_value(symbolic_weights)
+    weight_names = []
+    for i, (w, val) in enumerate(zip(symbolic_weights, weight_values)):
+      if hasattr(w, 'name') and w.name:
+        name = str(w.name)
+      else:
+        name = 'param_' + str(i)
+      weight_names.append(name.encode('utf8'))
+    g.attrs['weight_names'] = weight_names
+    for name, val in zip(weight_names, weight_values):
+      param_dset = g.create_dataset(name, val.shape, dtype=val.dtype)
+      if not val.shape:
+        # scalar
+        param_dset[()] = val
+      else:
+        param_dset[:] = val
+
+
+def preprocess_weights_for_loading(layer,
+                                   weights,
+                                   original_keras_version=None,
+                                   original_backend=None):
+  """Converts layer weights from Keras 1 format to Keras 2.
+
+  Arguments:
+      layer: Layer instance.
+      weights: List of weight values (Numpy arrays).
+      original_keras_version: Keras version for the weights, as a string.
+      original_backend: Keras backend the weights were trained with,
+          as a string.
+
+  Returns:
+      A list of weight values (Numpy arrays).
+  """
+  if layer.__class__.__name__ == 'Bidirectional':
+    num_weights_per_layer = len(weights) // 2
+    forward_weights = preprocess_weights_for_loading(
+        layer.forward_layer, weights[:num_weights_per_layer],
+        original_keras_version, original_backend)
+    backward_weights = preprocess_weights_for_loading(
+        layer.backward_layer, weights[num_weights_per_layer:],
+        original_keras_version, original_backend)
+    weights = forward_weights + backward_weights
+
+  if original_keras_version == '1':
+    if layer.__class__.__name__ == 'TimeDistributed':
+      weights = preprocess_weights_for_loading(
+          layer.layer, weights, original_keras_version, original_backend)
+
+    if layer.__class__.__name__ == 'Conv1D':
+      shape = weights[0].shape
+      # Handle Keras 1.1 format
+      if shape[:2] != (layer.kernel_size[0], 1) or shape[3] != layer.filters:
+        # Legacy shape:
+        # (filters, input_dim, filter_length, 1)
+        assert shape[0] == layer.filters and shape[2:] == (layer.kernel_size[0],
+                                                           1)
+        weights[0] = np.transpose(weights[0], (2, 3, 1, 0))
+      weights[0] = weights[0][:, 0, :, :]
+
+    if layer.__class__.__name__ == 'Conv2D':
+      if layer.data_format == 'channels_first':
+        # old: (filters, stack_size, kernel_rows, kernel_cols)
+        # new: (kernel_rows, kernel_cols, stack_size, filters)
+        weights[0] = np.transpose(weights[0], (2, 3, 1, 0))
+
+    if layer.__class__.__name__ == 'Conv2DTranspose':
+      if layer.data_format == 'channels_last':
+        # old: (kernel_rows, kernel_cols, stack_size, filters)
+        # new: (kernel_rows, kernel_cols, filters, stack_size)
+        weights[0] = np.transpose(weights[0], (0, 1, 3, 2))
+      if layer.data_format == 'channels_first':
+        # old: (filters, stack_size, kernel_rows, kernel_cols)
+        # new: (kernel_rows, kernel_cols, filters, stack_size)
+        weights[0] = np.transpose(weights[0], (2, 3, 0, 1))
+
+    if layer.__class__.__name__ == 'Conv3D':
+      if layer.data_format == 'channels_first':
+        # old: (filters, stack_size, ...)
+        # new: (..., stack_size, filters)
+        weights[0] = np.transpose(weights[0], (2, 3, 4, 1, 0))
+
+    if layer.__class__.__name__ == 'GRU':
+      if len(weights) == 9:
+        kernel = np.concatenate([weights[0], weights[3], weights[6]], axis=-1)
+        recurrent_kernel = np.concatenate(
+            [weights[1], weights[4], weights[7]], axis=-1)
+        bias = np.concatenate([weights[2], weights[5], weights[8]], axis=-1)
+        weights = [kernel, recurrent_kernel, bias]
+
+    if layer.__class__.__name__ == 'LSTM':
+      if len(weights) == 12:
+        # old: i, c, f, o
+        # new: i, f, c, o
+        kernel = np.concatenate(
+            [weights[0], weights[6], weights[3], weights[9]], axis=-1)
+        recurrent_kernel = np.concatenate(
+            [weights[1], weights[7], weights[4], weights[10]], axis=-1)
+        bias = np.concatenate(
+            [weights[2], weights[8], weights[5], weights[11]], axis=-1)
+        weights = [kernel, recurrent_kernel, bias]
+
+    if layer.__class__.__name__ == 'ConvLSTM2D':
+      if len(weights) == 12:
+        kernel = np.concatenate(
+            [weights[0], weights[6], weights[3], weights[9]], axis=-1)
+        recurrent_kernel = np.concatenate(
+            [weights[1], weights[7], weights[4], weights[10]], axis=-1)
+        bias = np.concatenate(
+            [weights[2], weights[8], weights[5], weights[11]], axis=-1)
+        if layer.data_format == 'channels_first':
+          # old: (filters, stack_size, kernel_rows, kernel_cols)
+          # new: (kernel_rows, kernel_cols, stack_size, filters)
+          kernel = np.transpose(kernel, (2, 3, 1, 0))
+          recurrent_kernel = np.transpose(recurrent_kernel, (2, 3, 1, 0))
+        weights = [kernel, recurrent_kernel, bias]
+
+    if layer.__class__.__name__ in ['Model', 'Sequential']:
+      new_weights = []
+      # trainable weights
+      for sublayer in layer.layers:
+        num_weights = len(sublayer.trainable_weights)
+        if num_weights > 0:
+          new_weights.extend(
+              preprocess_weights_for_loading(
+                  layer=sublayer,
+                  weights=weights[:num_weights],
+                  original_keras_version=original_keras_version,
+                  original_backend=original_backend))
+          weights = weights[num_weights:]
+
+      # non-trainable weights
+      for sublayer in layer.layers:
+        num_weights = len([
+            l for l in sublayer.weights if l not in sublayer.trainable_weights
+        ])
+        if num_weights > 0:
+          new_weights.extend(
+              preprocess_weights_for_loading(
+                  layer=sublayer,
+                  weights=weights[:num_weights],
+                  original_keras_version=original_keras_version,
+                  original_backend=original_backend))
+          weights = weights[num_weights:]
+      weights = new_weights
+
+  conv_layers = ['Conv1D', 'Conv2D', 'Conv3D', 'Conv2DTranspose', 'ConvLSTM2D']
+  if layer.__class__.__name__ in conv_layers:
+    if original_backend == 'theano':
+      weights[0] = conv_utils.convert_kernel(weights[0])
+      if layer.__class__.__name__ == 'ConvLSTM2D':
+        weights[1] = conv_utils.convert_kernel(weights[1])
+    if K.int_shape(layer.weights[0]) != weights[0].shape:
+      weights[0] = np.transpose(weights[0], (3, 2, 0, 1))
+      if layer.__class__.__name__ == 'ConvLSTM2D':
+        weights[1] = np.transpose(weights[1], (3, 2, 0, 1))
+
+  # Convert the weights of CuDNNLSTM so that they could be loaded into LSTM
+  if layer.__class__.__name__ == 'LSTM' and len(weights) == 3:
+    # Determine if loading a CuDNNLSTM layer from the number of bias weights:
+    # CuDNNLSTM has (units * 8) weights; while LSTM has (units * 4)
+    # if there's no bias weight in the file, skip this conversion
+    units = weights[1].shape[0]
+    bias = weights[2]
+    if len(bias) == units * 8:
+      # reshape the kernels
+      kernels = np.split(weights[0], 4, axis=1)
+      kernels = [
+          kernel.reshape(-1).reshape(kernel.shape, order='F')
+          for kernel in kernels
+      ]
+      weights[0] = np.concatenate(kernels, axis=1)
+
+      # transpose the recurrent kernels
+      recurrent_kernels = np.split(weights[1], 4, axis=1)
+      recurrent_kernels = [kernel.T for kernel in recurrent_kernels]
+      weights[1] = np.concatenate(recurrent_kernels, axis=1)
+
+      # split the bias in half and sum the two halves
+      weights[2] = bias[:units * 4] + bias[units * 4:]
+
+  return weights
+
+
+def load_weights_from_hdf5_group(f, layers):
+  """Implements topological (order-based) weight loading.
+
+  Arguments:
+      f: A pointer to an HDF5 group.
+      layers: a list of target layers.
+
+  Raises:
+      ValueError: in case of mismatch between provided layers
+          and weights file.
+  """
+  if 'keras_version' in f.attrs:
+    original_keras_version = f.attrs['keras_version'].decode('utf8')
+  else:
+    original_keras_version = '1'
+  if 'backend' in f.attrs:
+    original_backend = f.attrs['backend'].decode('utf8')
+  else:
+    original_backend = None
+
+  filtered_layers = []
+  for layer in layers:
+    weights = layer.weights
+    if weights:
+      filtered_layers.append(layer)
+
+  layer_names = [n.decode('utf8') for n in f.attrs['layer_names']]
+  filtered_layer_names = []
+  for name in layer_names:
+    g = f[name]
+    weight_names = [n.decode('utf8') for n in g.attrs['weight_names']]
+    if weight_names:
+      filtered_layer_names.append(name)
+  layer_names = filtered_layer_names
+  if len(layer_names) != len(filtered_layers):
+    raise ValueError('You are trying to load a weight file '
+                     'containing ' + str(len(layer_names)) +
+                     ' layers into a model with ' + str(len(filtered_layers)) +
+                     ' layers.')
+
+  # We batch weight value assignments in a single backend call
+  # which provides a speedup in TensorFlow.
+  weight_value_tuples = []
+  for k, name in enumerate(layer_names):
+    g = f[name]
+    weight_names = [n.decode('utf8') for n in g.attrs['weight_names']]
+    weight_values = [g[weight_name] for weight_name in weight_names]
+    layer = filtered_layers[k]
+    symbolic_weights = layer.weights
+    weight_values = preprocess_weights_for_loading(
+        layer, weight_values, original_keras_version, original_backend)
+    if len(weight_values) != len(symbolic_weights):
+      raise ValueError('Layer #' + str(k) + ' (named "' + layer.name +
+                       '" in the current model) was found to '
+                       'correspond to layer ' + name + ' in the save file. '
+                       'However the new layer ' + layer.name + ' expects ' +
+                       str(len(symbolic_weights)) +
+                       ' weights, but the saved weights have ' +
+                       str(len(weight_values)) + ' elements.')
+    weight_value_tuples += zip(symbolic_weights, weight_values)
+  K.batch_set_value(weight_value_tuples)
+
+
+def load_weights_from_hdf5_group_by_name(f, layers):
+  """Implements name-based weight loading.
+
+  (instead of topological weight loading).
+
+  Layers that have no matching name are skipped.
+
+  Arguments:
+      f: A pointer to a HDF5 group.
+      layers: a list of target layers.
+
+  Raises:
+      ValueError: if a layer matched by name expects a different number
+          of weights than the weights file provides.
+  """
+  if 'keras_version' in f.attrs:
+    original_keras_version = f.attrs['keras_version'].decode('utf8')
+  else:
+    original_keras_version = '1'
+  if 'backend' in f.attrs:
+    original_backend = f.attrs['backend'].decode('utf8')
+  else:
+    original_backend = None
+
+  # New file format.
+  layer_names = [n.decode('utf8') for n in f.attrs['layer_names']]
+
+  # Reverse index of layer name to list of layers with name.
+  index = {}
+  for layer in layers:
+    if layer.name:
+      index.setdefault(layer.name, []).append(layer)
+
+  # We batch weight value assignments in a single backend call
+  # which provides a speedup in TensorFlow.
+  weight_value_tuples = []
+  for k, name in enumerate(layer_names):
+    g = f[name]
+    weight_names = [n.decode('utf8') for n in g.attrs['weight_names']]
+    weight_values = [g[weight_name] for weight_name in weight_names]
+
+    for layer in index.get(name, []):
+      symbolic_weights = layer.weights
+      weight_values = preprocess_weights_for_loading(
+          layer, weight_values, original_keras_version, original_backend)
+      if len(weight_values) != len(symbolic_weights):
+        raise ValueError('Layer #' + str(k) + ' (named "' + layer.name +
+                         '") expects ' + str(len(symbolic_weights)) +
+                         ' weight(s), but the saved weights' + ' have ' +
+                         str(len(weight_values)) + ' element(s).')
+      # Set values.
+      for i in range(len(weight_values)):
+        weight_value_tuples.append((symbolic_weights[i], weight_values[i]))
+  K.batch_set_value(weight_value_tuples)
diff --git a/tensorflow/python/keras/_impl/keras/engine/saving_test.py b/tensorflow/python/keras/_impl/keras/engine/saving_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..bdb17641b0d26bc227b142d9302dc1da9637c506
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/saving_test.py
@@ -0,0 +1,375 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for model saving."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import shutil
+import tempfile
+
+import numpy as np
+
+from tensorflow.python.keras._impl import keras
+from tensorflow.python.platform import test
+from tensorflow.python.training import training as training_module
+
+try:
+  import h5py  # pylint:disable=g-import-not-at-top
+except ImportError:
+  h5py = None
+
+
+class TestWeightSavingAndLoading(test.TestCase):
+
+  def test_weight_loading(self):
+    with self.test_session():
+      a = keras.layers.Input(shape=(2,))
+      x = keras.layers.Dense(3)(a)
+      b = keras.layers.Dense(1)(x)
+      model = keras.models.Model(a, b)
+
+      x = np.random.random((3, 2))
+      ref_y = model.predict(x)
+      weights = model.get_weights()
+      model.set_weights(weights)
+      y = model.predict(x)
+      self.assertAllClose(ref_y, y)
+
+      with self.assertRaises(ValueError):
+        model.set_weights(weights[1:])
+      with self.assertRaises(ValueError):
+        model.set_weights(weights[::-1])
+
+      if h5py is None:
+        return  # Skip rest of test if h5py isn't available.
+
+      temp_dir = self.get_temp_dir()
+      self.addCleanup(shutil.rmtree, temp_dir)
+
+      h5_path = os.path.join(temp_dir, 'test.h5')
+      model.save_weights(h5_path)
+      model.load_weights(h5_path)
+      y = model.predict(x)
+      self.assertAllClose(ref_y, y)
+
+      model.load_weights(h5_path, by_name=True)
+      y = model.predict(x)
+      self.assertAllClose(ref_y, y)
+
+  def test_weight_preprocessing(self):
+    input_dim = 3
+    output_dim = 3
+    size = 2
+    cases = [
+        [
+            (keras.layers.Bidirectional(keras.layers.SimpleRNN(2))),
+            [np.random.random((2, 1)), np.random.random((2, 1))],
+            (None, 3, 2),
+        ],
+        [
+            (keras.layers.TimeDistributed(keras.layers.Dense(1))),
+            [np.random.random((2, 1)), np.random.random((1,))],
+            (None, 3, 2),
+        ],
+        [
+            (keras.layers.Conv1D(output_dim, size, use_bias=False)),
+            [np.random.random((output_dim, input_dim, size, 1))],
+            (None, 4, input_dim),
+        ],
+        [
+            (keras.layers.Conv2D(output_dim, size,
+                                 use_bias=False, data_format='channels_first')),
+            [np.random.random((output_dim, input_dim, size, size))],
+            (None, input_dim, 4, 4),
+        ],
+        [
+            (keras.layers.Conv2DTranspose(output_dim, size,
+                                          use_bias=False,
+                                          data_format='channels_first')),
+            [np.random.random((output_dim, input_dim, size, size))],
+            (None, input_dim, 4, 4),
+        ],
+        [
+            (keras.layers.Conv2DTranspose(output_dim, size,
+                                          use_bias=False,
+                                          data_format='channels_last')),
+            [np.random.random((size, size, input_dim, output_dim))],
+            (None, 4, 4, input_dim),
+        ],
+        [
+            (keras.layers.Conv3D(output_dim, size,
+                                 use_bias=False, data_format='channels_first')),
+            [np.random.random((output_dim, input_dim, size, size, size))],
+            (None, input_dim, 4, 4, 4),
+        ],
+        [
+            (keras.layers.GRU(output_dim)),
+            [np.random.random((input_dim, output_dim)),
+             np.random.random((output_dim, output_dim)),
+             np.random.random((output_dim,)),
+             np.random.random((input_dim, output_dim)),
+             np.random.random((output_dim, output_dim)),
+             np.random.random((output_dim,)),
+             np.random.random((input_dim, output_dim)),
+             np.random.random((output_dim, output_dim)),
+             np.random.random((output_dim,))],
+            (None, 4, input_dim),
+        ],
+        [
+            (keras.layers.LSTM(output_dim)),
+            [np.random.random((input_dim, output_dim)),
+             np.random.random((output_dim, output_dim)),
+             np.random.random((output_dim,)),
+             np.random.random((input_dim, output_dim)),
+             np.random.random((output_dim, output_dim)),
+             np.random.random((output_dim,)),
+             np.random.random((input_dim, output_dim)),
+             np.random.random((output_dim, output_dim)),
+             np.random.random((output_dim,)),
+             np.random.random((input_dim, output_dim)),
+             np.random.random((output_dim, output_dim)),
+             np.random.random((output_dim,))],
+            (None, 4, input_dim),
+        ],
+    ]
+    for layer, weights, input_shape in cases:
+      layer.build(input_shape)
+      _ = keras.engine.saving.preprocess_weights_for_loading(
+          layer, weights, original_keras_version='1')
+
+    model = keras.models.Sequential([keras.layers.Dense(2, input_dim=2)])
+    _ = keras.engine.saving.preprocess_weights_for_loading(
+        model, model.weights, original_keras_version='1')
+
+    x = keras.Input((2,))
+    y = keras.layers.Dense(2)(x)
+    model = keras.models.Model(x, y)
+    _ = keras.engine.saving.preprocess_weights_for_loading(
+        model, model.weights, original_keras_version='1')
+
+  def test_sequential_weight_loading(self):
+    if h5py is None:
+      return
+
+    temp_dir = self.get_temp_dir()
+    self.addCleanup(shutil.rmtree, temp_dir)
+    h5_path = os.path.join(temp_dir, 'test.h5')
+
+    num_hidden = 5
+    input_dim = 3
+    batch_size = 5
+    num_classes = 2
+
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(num_hidden, input_dim=input_dim))
+      model.add(keras.layers.Dense(num_classes))
+
+      x = np.random.random((batch_size, input_dim))
+      ref_y = model.predict(x)
+
+      model.save_weights(h5_path)
+
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(num_hidden, input_dim=input_dim))
+      model.add(keras.layers.Dense(num_classes))
+      model.load_weights(h5_path)
+      y = model.predict(x)
+
+      self.assertAllClose(y, ref_y)
+
+
+class TestWholeModelSaving(test.TestCase):
+
+  def test_sequential_model_saving(self):
+    if h5py is None:
+      return  # Skip test if models cannot be saved.
+
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(2, input_shape=(3,)))
+      model.add(keras.layers.RepeatVector(3))
+      model.add(keras.layers.TimeDistributed(keras.layers.Dense(3)))
+      model.compile(loss=keras.losses.MSE,
+                    optimizer=keras.optimizers.RMSprop(lr=0.0001),
+                    metrics=[keras.metrics.categorical_accuracy],
+                    sample_weight_mode='temporal')
+      x = np.random.random((1, 3))
+      y = np.random.random((1, 3, 3))
+      model.train_on_batch(x, y)
+
+      out = model.predict(x)
+      fd, fname = tempfile.mkstemp('.h5')
+      keras.models.save_model(model, fname)
+
+      new_model = keras.models.load_model(fname)
+      os.close(fd)
+      os.remove(fname)
+
+      out2 = new_model.predict(x)
+      self.assertAllClose(out, out2, atol=1e-05)
+
+      # test that new updates are the same with both models
+      x = np.random.random((1, 3))
+      y = np.random.random((1, 3, 3))
+      model.train_on_batch(x, y)
+      new_model.train_on_batch(x, y)
+      out = model.predict(x)
+      out2 = new_model.predict(x)
+      self.assertAllClose(out, out2, atol=1e-05)
+
+  def test_sequential_model_saving_2(self):
+    if h5py is None:
+      return  # Skip test if models cannot be saved.
+
+    with self.test_session():
+      # test with custom optimizer, loss
+
+      class CustomOp(keras.optimizers.RMSprop):
+        pass
+
+      def custom_loss(y_true, y_pred):
+        return keras.losses.mse(y_true, y_pred)
+
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(2, input_shape=(3,)))
+      model.add(keras.layers.Dense(3))
+      model.compile(loss=custom_loss, optimizer=CustomOp(), metrics=['acc'])
+
+      x = np.random.random((1, 3))
+      y = np.random.random((1, 3))
+      model.train_on_batch(x, y)
+
+      out = model.predict(x)
+      fd, fname = tempfile.mkstemp('.h5')
+      keras.models.save_model(model, fname)
+
+      model = keras.models.load_model(
+          fname,
+          custom_objects={'CustomOp': CustomOp,
+                          'custom_loss': custom_loss})
+      os.close(fd)
+      os.remove(fname)
+
+      out2 = model.predict(x)
+      self.assertAllClose(out, out2, atol=1e-05)
+
+  def test_functional_model_saving(self):
+    if h5py is None:
+      return  # Skip test if models cannot be saved.
+
+    with self.test_session():
+      inputs = keras.layers.Input(shape=(3,))
+      x = keras.layers.Dense(2)(inputs)
+      output = keras.layers.Dense(3)(x)
+
+      model = keras.models.Model(inputs, output)
+      model.compile(loss=keras.losses.MSE,
+                    optimizer=keras.optimizers.RMSprop(lr=0.0001),
+                    metrics=[keras.metrics.categorical_accuracy])
+      x = np.random.random((1, 3))
+      y = np.random.random((1, 3))
+      model.train_on_batch(x, y)
+
+      out = model.predict(x)
+      fd, fname = tempfile.mkstemp('.h5')
+      keras.models.save_model(model, fname)
+
+      model = keras.models.load_model(fname)
+      os.close(fd)
+      os.remove(fname)
+
+      out2 = model.predict(x)
+      self.assertAllClose(out, out2, atol=1e-05)
+
+  def test_saving_without_compilation(self):
+    if h5py is None:
+      return  # Skip test if models cannot be saved.
+
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(2, input_shape=(3,)))
+      model.add(keras.layers.Dense(3))
+      model.compile(loss='mse', optimizer='sgd', metrics=['acc'])
+
+      fd, fname = tempfile.mkstemp('.h5')
+      keras.models.save_model(model, fname)
+      model = keras.models.load_model(fname)
+      os.close(fd)
+      os.remove(fname)
+
+  def test_saving_with_tf_optimizer(self):
+    if h5py is None:
+      return  # Skip test if models cannot be saved.
+
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(2, input_shape=(3,)))
+      model.add(keras.layers.Dense(3))
+      model.compile(loss='mse',
+                    optimizer=training_module.AdadeltaOptimizer(0.1),
+                    metrics=['acc'])
+
+      fd, fname = tempfile.mkstemp('.h5')
+      keras.models.save_model(model, fname)
+      model = keras.models.load_model(fname)
+      os.close(fd)
+      os.remove(fname)
+
+  def test_saving_right_after_compilation(self):
+    if h5py is None:
+      return  # Skip test if models cannot be saved.
+
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(2, input_shape=(3,)))
+      model.add(keras.layers.Dense(3))
+      model.compile(loss='mse', optimizer='sgd', metrics=['acc'])
+      model.model._make_train_function()
+
+      fd, fname = tempfile.mkstemp('.h5')
+      keras.models.save_model(model, fname)
+      model = keras.models.load_model(fname)
+      os.close(fd)
+      os.remove(fname)
+
+  def test_saving_lambda_numpy_array_arguments(self):
+    if h5py is None:
+      return  # Skip test if models cannot be saved.
+
+    mean = np.random.random((4, 2, 3))
+    std = np.abs(np.random.random((4, 2, 3))) + 1e-5
+    inputs = keras.layers.Input(shape=(4, 2, 3))
+    output = keras.layers.Lambda(lambda image, mu, std: (image - mu) / std,
+                                 arguments={'mu': mean, 'std': std})(inputs)
+    model = keras.models.Model(inputs, output)
+    model.compile(loss='mse', optimizer='sgd', metrics=['acc'])
+
+    fd, fname = tempfile.mkstemp('.h5')
+    keras.models.save_model(model, fname)
+
+    model = keras.models.load_model(fname)
+    os.close(fd)
+    os.remove(fname)
+
+    self.assertAllClose(mean, model.layers[1].arguments['mu'])
+    self.assertAllClose(std, model.layers[1].arguments['std'])
+
+
+if __name__ == '__main__':
+  test.main()
diff --git a/tensorflow/python/keras/_impl/keras/engine/sequential.py b/tensorflow/python/keras/_impl/keras/engine/sequential.py
new file mode 100644
index 0000000000000000000000000000000000000000..db5e7754bc22ba360dbf635f1bd80334f58e8509
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/sequential.py
@@ -0,0 +1,997 @@
+# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# pylint: disable=protected-access
+"""Home of the `Sequential` model.
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import copy
+import os
+
+from tensorflow.python.framework import ops
+from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras import layers as layer_module
+from tensorflow.python.keras._impl.keras.engine import base_layer
+from tensorflow.python.keras._impl.keras.engine import network
+from tensorflow.python.keras._impl.keras.engine import saving
+from tensorflow.python.keras._impl.keras.engine.input_layer import Input
+from tensorflow.python.keras._impl.keras.engine.input_layer import InputLayer
+from tensorflow.python.keras._impl.keras.engine.training import Model
+from tensorflow.python.keras._impl.keras.utils.io_utils import ask_to_proceed_with_overwrite
+from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.util.tf_export import tf_export
+
+try:
+  import h5py  # pylint: disable=g-import-not-at-top
+except ImportError:
+  h5py = None
+
+
+@tf_export('keras.models.Sequential', 'keras.Sequential')
+class Sequential(Model):
+  """Linear stack of layers.
+
+  Arguments:
+      layers: list of layers to add to the model.
+
+  # Note
+      The first layer passed to a Sequential model
+      should have a defined input shape. What that
+      means is that it should have received an `input_shape`
+      or `batch_input_shape` argument,
+      or for some type of layers (recurrent, Dense...)
+      an `input_dim` argument.
+
+  Example:
+
+      ```python
+          model = Sequential()
+          # first layer must have a defined input shape
+          model.add(Dense(32, input_dim=500))
+          # afterwards, Keras does automatic shape inference
+          model.add(Dense(32))
+
+          # also possible (equivalent to the above):
+          model = Sequential()
+          model.add(Dense(32, input_shape=(500,)))
+          model.add(Dense(32))
+
+          # also possible (equivalent to the above):
+          model = Sequential()
+          # here the batch dimension is None,
+          # which means any batch size will be accepted by the model.
+          model.add(Dense(32, batch_input_shape=(None, 500)))
+          model.add(Dense(32))
+      ```
+  """
+
+  def __init__(self, layers=None, name=None):
+    self._is_graph_network = True
+    self._is_compiled = False
+    self._layers = []  # Stack of layers.
+    self.model = None  # Internal Model instance.
+    self.inputs = []  # List of input tensors
+    self.outputs = []  # List of length 1: the output tensor (unique).
+    self._trainable = True
+    self._initial_weights = None
+    self._input_layers = []
+
+    # Model attributes.
+    self._inbound_nodes = []
+    self._outbound_nodes = []
+    self.built = False
+
+    # Set model name.
+    if not name:
+      prefix = 'sequential_'
+      name = prefix + str(K.get_uid(prefix))
+    self._name = name
+
+    # Used by Layer base class.
+    self._dtype = None
+    self._activity_regularizer = None
+
+    # The following properties are not actually used by Keras;
+    # they exist for compatibility with TF's variable scoping mechanism.
+    self._updates = []
+    self._losses = []
+    self._scope = None
+    self._reuse = None
+    self._base_name = name
+    self._graph = ops.get_default_graph()
+
+    # Add to the model any layers passed to the constructor.
+    if layers:
+      for layer in layers:
+        self.add(layer)
+
+  def add(self, layer):
+    """Adds a layer instance on top of the layer stack.
+
+    Arguments:
+        layer: layer instance.
+
+    Raises:
+        TypeError: If `layer` is not a layer instance.
+        ValueError: In case the `layer` argument does not
+            know its input shape.
+        ValueError: In case the `layer` argument has
+            multiple output tensors, or is already connected
+            somewhere else (forbidden in `Sequential` models).
+    """
+    if not isinstance(layer, (base_layer.Layer, base_layer.TFBaseLayer)):
+      raise TypeError('The added layer must be '
+                      'an instance of class Layer. '
+                      'Found: ' + str(layer))
+    if not self.outputs:
+      # First layer in model: check that it is an input layer.
+      if not isinstance(layer, InputLayer):
+        # Create an input layer.
+        # First, we need to infer its expected input shape and dtype.
+        if isinstance(layer, (Model, Sequential)):
+          # We were passed a model as first layer.
+          # This requires a specific way to figure out the
+          # input shape and dtype.
+          if not layer.layers:
+            raise ValueError('Cannot add an empty model '
+                             'to a `Sequential` model.')
+          # In case of nested models: recover the first layer
+          # of the deepest model to infer input shape and dtype.
+          first_layer = layer.layers[0]
+          while isinstance(first_layer, (Model, Sequential)):
+            first_layer = first_layer.layers[0]
+          batch_shape = first_layer._batch_input_shape
+          dtype = first_layer.dtype
+        else:
+          # We were passed a regular layer, and it should
+          # know about its input shape. Otherwise, that's an error.
+          if not hasattr(layer, '_batch_input_shape'):
+            raise ValueError('The first layer in a '
+                             'Sequential model must '
+                             'get an `input_shape` argument.')
+          batch_shape = layer._batch_input_shape
+          dtype = layer.dtype
+        # Instantiate the input layer.
+        x = Input(
+            batch_shape=batch_shape, dtype=dtype, name=layer.name + '_input')
+        # This will build the current layer
+        # and create the node connecting the current layer
+        # to the input layer we just created.
+        layer(x)
+
+      if len(layer._inbound_nodes[-1].output_tensors) != 1:
+        raise ValueError('All layers in a Sequential model '
+                         'should have a single output tensor. '
+                         'For multi-output layers, '
+                         'use the functional API.')
+
+      self.outputs = [layer._inbound_nodes[-1].output_tensors[0]]
+      self.inputs = network.get_source_inputs(self.outputs[0])
+
+      # We create an input node, which we will keep updated
+      # as we add more layers
+      base_layer.Node(
+          outbound_layer=self,
+          inbound_layers=[],
+          node_indices=[],
+          tensor_indices=[],
+          input_tensors=self.inputs,
+          output_tensors=self.outputs)
+    else:
+      output_tensor = layer(self.outputs[0])
+      if isinstance(output_tensor, list):
+        raise TypeError('All layers in a Sequential model '
+                        'should have a single output tensor. '
+                        'For multi-output layers, '
+                        'use the functional API.')
+      self.outputs = [output_tensor]
+      # update self._inbound_nodes
+      self._inbound_nodes[0].output_tensors = self.outputs
+      self._inbound_nodes[0].output_shapes = [K.int_shape(self.outputs[0])]
+
+    self._layers.append(layer)
+    self.built = False
+
+  def pop(self):
+    """Removes the last layer in the model.
+
+    Raises:
+        TypeError: if there are no layers in the model.
+    """
+    if not self.layers:
+      raise TypeError('There are no layers in the model.')
+
+    self.layers.pop()
+    if not self.layers:
+      self.outputs = []
+      self._inbound_nodes = []
+      self._outbound_nodes = []
+    else:
+      self.layers[-1]._outbound_nodes = []
+      self.outputs = [self.layers[-1].output]
+      # update self._inbound_nodes
+      self._inbound_nodes[0].output_tensors = self.outputs
+      self._inbound_nodes[0].output_shapes = [K.int_shape(self.outputs[0])]
+    self.built = False
+
+  def get_layer(self, name=None, index=None):
+    """Retrieve a layer that is part of the model.
+
+    Returns a layer based on either its name (unique)
+    or its index in the graph. Indices are based on
+    order of horizontal graph traversal (bottom-up).
+
+    Arguments:
+        name: string, name of layer.
+        index: integer, index of layer.
+
+    Returns:
+        A layer instance.
+    """
+    if not self.built:
+      self.build()
+    return self.model.get_layer(name, index)
+
+  def call(self, inputs, **kwargs):
+    if not self.built:
+      self.build()
+    return self.model.call(inputs, **kwargs)
+
+  def build(self, input_shape=None):
+    if not self.inputs or not self.outputs:
+      raise TypeError('Sequential model cannot be built: model is empty.'
+                      ' Add some layers first.')
+    # actually create the model
+    self.model = Model(self.inputs, self.outputs[0], name=self.name + '_model')
+    self.model.trainable = self.trainable
+
+    # mirror model attributes
+    self.supports_masking = self.model.supports_masking
+    self._output_mask_cache = self.model._output_mask_cache
+    self._output_tensor_cache = self.model._output_tensor_cache
+    self._output_shape_cache = self.model._output_shape_cache
+    self._input_layers = self.model._input_layers
+    self._output_layers = self.model._output_layers
+    self._input_coordinates = self.model._input_coordinates
+    self._output_coordinates = self.model._output_coordinates
+    self._nodes_by_depth = self.model._nodes_by_depth
+    self._network_nodes = self.model._network_nodes
+    self.output_names = self.model.output_names
+    self.input_names = self.model.input_names
+    self._feed_input_names = self.model._feed_input_names
+    self._feed_inputs = self.model._feed_inputs
+
+    # Make sure child model callbacks
+    # will call the parent Sequential model.
+    self.model.callback_model = self
+
+    self.built = True
+
+  @property
+  def uses_learning_phase(self):
+    if not self.built:
+      self.build()
+    return self.model.uses_learning_phase
+
+  def _gather_list_attr(self, attr):
+    all_attrs = []
+    for layer in self.layers:
+      all_attrs += getattr(layer, attr, [])
+    return all_attrs
+
+  def _make_train_function(self):
+    self.model._make_train_function()
+
+  def _make_test_function(self):
+    self.model._make_test_function()
+
+  def _make_predict_function(self):
+    self.model._make_predict_function()
+
+  @property
+  def trainable(self):
+    return self._trainable
+
+  @trainable.setter
+  def trainable(self, value):
+    if self.model:
+      self.model.trainable = value
+    self._trainable = value
+
+  @property
+  def trainable_weights(self):
+    if not self.trainable:
+      return []
+    return self._gather_list_attr('trainable_weights')
+
+  @property
+  def non_trainable_weights(self):
+    weights = self._gather_list_attr('non_trainable_weights')
+    if not self.trainable:
+      trainable_weights = self._gather_list_attr('trainable_weights')
+      return trainable_weights + weights
+    return weights
+
+  @property
+  def regularizers(self):
+    if not self.built:
+      self.build()
+    return self.model.regularizers
+
+  def get_weights(self):
+    """Retrieves the weights of the model.
+
+    Returns:
+        A flat list of Numpy arrays
+        (one array per model weight).
+    """
+    if not self.built:
+      self.build()
+    return self.model.get_weights()
+
+  def set_weights(self, weights):
+    """Sets the weights of the model.
+
+    Arguments:
+        weights: Should be a list
+            of Numpy arrays with shapes and types matching
+            the output of `model.get_weights()`.
+    """
+    if not self.built:
+      self.build()
+    self.model.set_weights(weights)
+
+  def load_weights(self, filepath, by_name=False):
+    if h5py is None:
+      raise ImportError('`load_weights` requires h5py.')
+    f = h5py.File(filepath, mode='r')
+    if 'layer_names' not in f.attrs and 'model_weights' in f:
+      f = f['model_weights']
+    layers = self.layers
+    if by_name:
+      saving.load_weights_from_hdf5_group_by_name(f, layers)
+    else:
+      saving.load_weights_from_hdf5_group(f, layers)
+    if hasattr(f, 'close'):
+      f.close()
+
+  def save_weights(self, filepath, overwrite=True):
+    if h5py is None:
+      raise ImportError('`save_weights` requires h5py.')
+    # If file exists and should not be overwritten:
+    if not overwrite and os.path.isfile(filepath):
+      proceed = ask_to_proceed_with_overwrite(filepath)
+      if not proceed:
+        return
+    layers = self.layers
+    f = h5py.File(filepath, 'w')
+    saving.save_weights_to_hdf5_group(f, layers)
+    f.flush()
+    f.close()
+
+  def compile(self,
+              optimizer,
+              loss,
+              metrics=None,
+              sample_weight_mode=None,
+              weighted_metrics=None,
+              target_tensors=None,
+              **kwargs):
+    """Configures the model for training.
+
+    Arguments:
+        optimizer: String (name of optimizer) or optimizer object.
+            See [optimizers](/optimizers).
+        loss: String (name of objective function) or objective function.
+            See [losses](/losses).
+            If the model has multiple outputs, you can use a different loss
+            on each output by passing a dictionary or a list of losses.
+            The loss value that will be minimized by the model
+            will then be the sum of all individual losses.
+        metrics: List of metrics to be evaluated by the model
+            during training and testing.
+            Typically you will use `metrics=['accuracy']`.
+            To specify different metrics for different outputs of a
+            multi-output model, you could also pass a dictionary,
+            such as `metrics={'output_a': 'accuracy'}`.
+        sample_weight_mode: If you need to do timestep-wise
+            sample weighting (2D weights), set this to `"temporal"`.
+            `None` defaults to sample-wise weights (1D).
+            If the model has multiple outputs, you can use a different
+            `sample_weight_mode` on each output by passing a
+            dictionary or a list of modes.
+        weighted_metrics: List of metrics to be evaluated and weighted
+            by `sample_weight` or `class_weight` during training and testing.
+        target_tensors: By default, Keras will create a placeholder for the
+            model's target, which will be fed with the target data during
+            training. If instead you would like to use your own
+            target tensor (in turn, Keras will not expect external
+            Numpy data for these targets at training time), you
+            can specify them via the `target_tensors` argument.
+            It should be a single tensor
+            (for a single-output `Sequential` model).
+        **kwargs: These arguments are passed into `tf.Session.run`.
+
+    Example:
+        ```python
+            model = Sequential()
+            model.add(Dense(32, input_shape=(500,)))
+            model.add(Dense(10, activation='softmax'))
+            model.compile(optimizer='rmsprop',
+                          loss='categorical_crossentropy',
+                          metrics=['accuracy'])
+        ```
+    """
+    # create the underlying model
+    self.build()
+    # call compile method of Model class
+    self.model.compile(
+        optimizer,
+        loss,
+        metrics=metrics,
+        sample_weight_mode=sample_weight_mode,
+        weighted_metrics=weighted_metrics,
+        target_tensors=target_tensors,
+        **kwargs)
+    self.optimizer = self.model.optimizer
+    self.loss = self.model.loss
+    self.metrics = self.model.metrics
+    self.loss_weights = self.model.loss_weights
+    self.sample_weight_mode = self.model.sample_weight_mode
+    self.weighted_metrics = self.model.weighted_metrics
+    self.targets = self.model.targets
+    self.metrics_tensors = self.model.metrics_tensors
+    self.metrics_names = self.model.metrics_names
+    self.sample_weights = self.model.sample_weights
+    self.total_loss = self.model.total_loss
+
+  def fit(self,
+          x=None,
+          y=None,
+          batch_size=None,
+          epochs=1,
+          verbose=1,
+          callbacks=None,
+          validation_split=0.,
+          validation_data=None,
+          shuffle=True,
+          class_weight=None,
+          sample_weight=None,
+          initial_epoch=0,
+          steps_per_epoch=None,
+          validation_steps=None,
+          **kwargs):
+    """Trains the model for a fixed number of epochs.
+
+    Arguments:
+        x: Numpy array of training data.
+            If the input layer in the model is named, you can also pass a
+            dictionary mapping the input name to a Numpy array.
+            `x` can be `None` (default) if feeding from
+            TensorFlow data tensors.
+        y: Numpy array of target (label) data.
+            If the output layer in the model is named, you can also pass a
+            dictionary mapping the output name to a Numpy array.
+            `y` can be `None` (default) if feeding from
+            TensorFlow data tensors.
+        batch_size: Integer or `None`.
+            Number of samples per gradient update.
+            If unspecified, it will default to 32.
+        epochs: Integer. Number of epochs to train the model.
+            An epoch is an iteration over the entire `x` and `y`
+            data provided.
+            Note that in conjunction with `initial_epoch`,
+            `epochs` is to be understood as "final epoch".
+            The model is not trained for a number of iterations
+            given by `epochs`, but merely until the epoch
+            of index `epochs` is reached.
+        verbose: 0, 1, or 2. Verbosity mode.
+            0 = silent, 1 = progress bar, 2 = one line per epoch.
+        callbacks: List of `keras.callbacks.Callback` instances.
+            List of callbacks to apply during training.
+            See [callbacks](/callbacks).
+        validation_split: Float between 0 and 1:
+            Fraction of the training data to be used as validation data.
+            The model will set apart this fraction of the training data,
+            will not train on it, and will evaluate
+            the loss and any model metrics
+            on this data at the end of each epoch.
+            The validation data is selected from the last samples
+            in the `x` and `y` data provided, before shuffling.
+        validation_data: tuple `(x_val, y_val)` or tuple
+            `(x_val, y_val, val_sample_weights)` on which to evaluate
+            the loss and any model metrics at the end of each epoch.
+            The model will not be trained on this data.
+            This will override `validation_split`.
+        shuffle: Boolean (whether to shuffle the training data
+            before each epoch) or str (for 'batch').
+            'batch' is a special option for dealing with the
+            limitations of HDF5 data; it shuffles in batch-sized chunks.
+            Has no effect when `steps_per_epoch` is not `None`.
+        class_weight: Optional dictionary mapping class indices (integers)
+            to a weight (float) value, used for weighting the loss function
+            (during training only).
+            This can be useful to tell the model to
+            "pay more attention" to samples from
+            an under-represented class.
+        sample_weight: Optional Numpy array of weights for
+            the training samples, used for weighting the loss function
+            (during training only). You can either pass a flat (1D)
+            Numpy array with the same length as the input samples
+            (1:1 mapping between weights and samples),
+            or in the case of temporal data,
+            you can pass a 2D array with shape
+            `(samples, sequence_length)`,
+            to apply a different weight to every timestep of every sample.
+            In this case you should make sure to specify
+            `sample_weight_mode="temporal"` in `compile()`.
+        initial_epoch: Epoch at which to start training
+            (useful for resuming a previous training run).
+        steps_per_epoch: Total number of steps (batches of samples)
+            before declaring one epoch finished and starting the
+            next epoch. When training with input tensors such as
+            TensorFlow data tensors, the default `None` is equal to
+            the number of unique samples in your dataset divided by
+            the batch size, or 1 if that cannot be determined.
+        validation_steps: Only relevant if `steps_per_epoch`
+            is specified. Total number of steps (batches of samples)
+            to validate before stopping.
+        **kwargs: Used for backwards compatibility support.
+
+    Returns:
+        A `History` object. Its `History.history` attribute is
+        a record of training loss values and metrics values
+        at successive epochs, as well as validation loss values
+        and validation metrics values (if applicable).
+
+    Raises:
+        RuntimeError: If the model was never compiled.
+        ValueError: In case of mismatch between the provided input data
+            and what the model expects.
+    """
+    if not self.built:
+      raise RuntimeError('The model needs to be compiled before being used.')
+    return self.model.fit(
+        x,
+        y,
+        batch_size=batch_size,
+        epochs=epochs,
+        verbose=verbose,
+        callbacks=callbacks,
+        validation_split=validation_split,
+        validation_data=validation_data,
+        shuffle=shuffle,
+        class_weight=class_weight,
+        sample_weight=sample_weight,
+        initial_epoch=initial_epoch,
+        steps_per_epoch=steps_per_epoch,
+        validation_steps=validation_steps)
+
+  def evaluate(self, x, y, batch_size=32, verbose=1, sample_weight=None):
+    """Computes the loss on some input data, batch by batch.
+
+    Arguments:
+        x: input data, as a Numpy array or list of Numpy arrays
+            (if the model has multiple inputs).
+        y: labels, as a Numpy array.
+        batch_size: integer. Number of samples per evaluation batch.
+        verbose: verbosity mode, 0 or 1.
+        sample_weight: sample weights, as a Numpy array.
+
+    Returns:
+        Scalar test loss (if the model has no metrics)
+        or list of scalars (if the model computes other metrics).
+        The attribute `model.metrics_names` will give you
+        the display labels for the scalar outputs.
+
+    Raises:
+        RuntimeError: if the model was never compiled.
+    """
+    if not self.built:
+      raise RuntimeError('The model needs to be compiled before being used.')
+    return self.model.evaluate(
+        x,
+        y,
+        batch_size=batch_size,
+        verbose=verbose,
+        sample_weight=sample_weight)
+
+  def predict(self, x, batch_size=32, verbose=0):
+    """Generates output predictions for the input samples.
+
+    The input samples are processed batch by batch.
+
+    Arguments:
+        x: the input data, as a Numpy array.
+        batch_size: integer.
+        verbose: verbosity mode, 0 or 1.
+
+    Returns:
+        A Numpy array of predictions.
+    """
+    if not self.built:
+      self.build()
+    return self.model.predict(x, batch_size=batch_size, verbose=verbose)
+
+  def predict_on_batch(self, x):
+    """Returns predictions for a single batch of samples.
+
+    Arguments:
+        x: input data, as a Numpy array or list of Numpy arrays
+            (if the model has multiple inputs).
+
+    Returns:
+        A Numpy array of predictions.
+    """
+    if not self.built:
+      self.build()
+    return self.model.predict_on_batch(x)
+
+  def train_on_batch(self, x, y, class_weight=None, sample_weight=None):
+    """Single gradient update over one batch of samples.
+
+    Arguments:
+        x: input data, as a Numpy array or list of Numpy arrays
+            (if the model has multiple inputs).
+        y: labels, as a Numpy array.
+        class_weight: dictionary mapping classes to a weight value,
+            used for scaling the loss function (during training only).
+        sample_weight: sample weights, as a Numpy array.
+
+    Returns:
+        Scalar training loss (if the model has no metrics)
+        or list of scalars (if the model computes other metrics).
+        The attribute `model.metrics_names` will give you
+        the display labels for the scalar outputs.
+
+    Raises:
+        RuntimeError: if the model was never compiled.
+    """
+    if not self.built:
+      raise RuntimeError('The model needs to be compiled before being used.')
+    return self.model.train_on_batch(
+        x, y, sample_weight=sample_weight, class_weight=class_weight)
+
+  def test_on_batch(self, x, y, sample_weight=None):
+    """Evaluates the model over a single batch of samples.
+
+    Arguments:
+        x: input data, as a Numpy array or list of Numpy arrays
+            (if the model has multiple inputs).
+        y: labels, as a Numpy array.
+        sample_weight: sample weights, as a Numpy array.
+
+    Returns:
+        Scalar test loss (if the model has no metrics)
+        or list of scalars (if the model computes other metrics).
+        The attribute `model.metrics_names` will give you
+        the display labels for the scalar outputs.
+
+    Raises:
+        RuntimeError: if the model was never compiled.
+    """
+    if not self.built:
+      raise RuntimeError('The model needs to be compiled before being used.')
+    return self.model.test_on_batch(x, y, sample_weight=sample_weight)
+
+  def predict_proba(self, x, batch_size=32, verbose=0):
+    """Generates class probability predictions for the input samples.
+
+    The input samples are processed batch by batch.
+
+    Arguments:
+        x: input data, as a Numpy array or list of Numpy arrays
+            (if the model has multiple inputs).
+        batch_size: integer.
+        verbose: verbosity mode, 0 or 1.
+
+    Returns:
+        A Numpy array of probability predictions.
+    """
+    preds = self.predict(x, batch_size, verbose)
+    if preds.min() < 0. or preds.max() > 1.:
+      logging.warning('Network returning invalid probability values. '
+                      'The last layer might not normalize predictions '
+                      'into probabilities '
+                      '(like softmax or sigmoid would).')
+    return preds
+
+  def predict_classes(self, x, batch_size=32, verbose=0):
+    """Generates class predictions for the input samples.
+
+    The input samples are processed batch by batch.
+
+    Arguments:
+        x: input data, as a Numpy array or list of Numpy arrays
+            (if the model has multiple inputs).
+        batch_size: integer.
+        verbose: verbosity mode, 0 or 1.
+
+    Returns:
+        A Numpy array of class predictions.
+    """
+    proba = self.predict(x, batch_size=batch_size, verbose=verbose)
+    if proba.shape[-1] > 1:
+      return proba.argmax(axis=-1)
+    else:
+      return (proba > 0.5).astype('int32')
+
+  def fit_generator(self,
+                    generator,
+                    steps_per_epoch=None,
+                    epochs=1,
+                    verbose=1,
+                    callbacks=None,
+                    validation_data=None,
+                    validation_steps=None,
+                    class_weight=None,
+                    max_queue_size=10,
+                    workers=1,
+                    use_multiprocessing=False,
+                    shuffle=True,
+                    initial_epoch=0,
+                    **kwargs):
+    """Fits the model on data generated batch-by-batch by a Python generator.
+
+    The generator is run in parallel to the model, for efficiency.
+    For instance, this allows you to do real-time data augmentation
+    on images on CPU in parallel to training your model on GPU.
+
+    Arguments:
+        generator: A generator.
+            The output of the generator must be either
+            - a tuple (inputs, targets)
+            - a tuple (inputs, targets, sample_weights).
+            All arrays should contain the same number of samples.
+            The generator is expected to loop over its data
+            indefinitely. An epoch finishes when `steps_per_epoch`
+            batches have been seen by the model.
+        steps_per_epoch: Total number of steps (batches of samples)
+            to yield from `generator` before declaring one epoch
+            finished and starting the next epoch. It should typically
+            be equal to the number of samples of your dataset
+            divided by the batch size.
+            Optional for `Sequence`: if unspecified, will use
+            the `len(generator)` as a number of steps.
+        epochs: Integer, total number of iterations on the data.
+            Note that in conjunction with `initial_epoch`, the parameter
+            `epochs` is to be understood as "final epoch". The model is
+            not trained for a number of epochs given by `epochs`, but
+            until the epoch of index `epochs` is reached.
+        verbose: Verbosity mode, 0, 1, or 2.
+        callbacks: List of callbacks to be called during training.
+        validation_data: This can be either
+            - A generator for the validation data
+            - A tuple (inputs, targets)
+            - A tuple (inputs, targets, sample_weights).
+        validation_steps: Only relevant if `validation_data`
+            is a generator.
+            Number of steps to yield from validation generator
+            at the end of every epoch. It should typically
+            be equal to the number of samples of your
+            validation dataset divided by the batch size.
+            Optional for `Sequence`: if unspecified, will use
+            the `len(validation_data)` as a number of steps.
+        class_weight: Dictionary mapping class indices to a weight
+            for the class.
+        max_queue_size: Maximum size for the generator queue.
+        workers: Maximum number of processes to spin up.
+        use_multiprocessing: If True, use process-based threading.
+            Note that because
+            this implementation relies on multiprocessing,
+            you should not pass
+            non-picklable arguments to the generator
+            as they can't be passed
+            easily to child processes.
+        shuffle: Whether to shuffle the order of the batches at
+            the beginning of each epoch. Only used with instances
+            of `Sequence` (keras.utils.Sequence).
+        initial_epoch: Epoch at which to start training
+            (useful for resuming a previous training run).
+        **kwargs: support for legacy arguments.
+
+    Returns:
+        A `History` object.
+
+    Raises:
+        RuntimeError: if the model was never compiled.
+        ValueError: In case the generator yields
+            data in an invalid format.
+
+    Example:
+
+    ```python
+        def generate_arrays_from_file(path):
+            while True:
+                f = open(path)
+                for line in f:
+                    # create Numpy arrays of input data
+                    # and labels, from each line in the file
+                    x, y = process_line(line)
+                    yield (x, y)
+                f.close()
+
+        model.fit_generator(generate_arrays_from_file('/my_file.txt'),
+                            steps_per_epoch=1000, epochs=10)
+    ```
+    """
+    # Legacy support
+    if 'max_q_size' in kwargs:
+      max_queue_size = kwargs.pop('max_q_size')
+      logging.warning('The argument `max_q_size` has been renamed '
+                      '`max_queue_size`. Update your method calls accordingly.')
+    if 'pickle_safe' in kwargs:
+      use_multiprocessing = kwargs.pop('pickle_safe')
+      logging.warning('The argument `pickle_safe` has been renamed '
+                      '`use_multiprocessing`. '
+                      'Update your method calls accordingly.')
+    if kwargs:
+      raise ValueError('Unrecognized keyword arguments: ' + str(kwargs))
+
+    if not self.built:
+      raise RuntimeError('The model needs to be compiled before being used.')
+    return self.model.fit_generator(
+        generator,
+        steps_per_epoch,
+        epochs,
+        verbose=verbose,
+        callbacks=callbacks,
+        validation_data=validation_data,
+        validation_steps=validation_steps,
+        class_weight=class_weight,
+        max_queue_size=max_queue_size,
+        workers=workers,
+        use_multiprocessing=use_multiprocessing,
+        shuffle=shuffle,
+        initial_epoch=initial_epoch)
+
+  def evaluate_generator(self,
+                         generator,
+                         steps=None,
+                         max_queue_size=10,
+                         workers=1,
+                         use_multiprocessing=False,
+                         **kwargs):
+    """Evaluates the model on a data generator.
+
+    The generator should return the same kind of data
+    as accepted by `test_on_batch`.
+
+    Arguments:
+        generator: Generator yielding tuples (inputs, targets)
+            or (inputs, targets, sample_weights)
+        steps: Total number of steps (batches of samples)
+            to yield from `generator` before stopping.
+            Optional for `Sequence`: if unspecified, will use
+            the `len(generator)` as a number of steps.
+        max_queue_size: maximum size for the generator queue.
+        workers: maximum number of processes to spin up.
+        use_multiprocessing: if True, use process-based threading.
+            Note that because this implementation
+            relies on multiprocessing, you should not pass
+            non-picklable arguments to the generator
+            as they can't be passed easily to child processes.
+        **kwargs: support for legacy arguments.
+
+    Returns:
+        Scalar test loss (if the model has no metrics)
+        or list of scalars (if the model computes other metrics).
+        The attribute `model.metrics_names` will give you
+        the display labels for the scalar outputs.
+
+    Raises:
+        RuntimeError: if the model was never compiled.
+        ValueError: In case the generator yields
+            data in an invalid format.
+    """
+    # Legacy support
+    if 'max_q_size' in kwargs:
+      max_queue_size = kwargs.pop('max_q_size')
+      logging.warning('The argument `max_q_size` has been renamed '
+                      '`max_queue_size`. Update your method calls accordingly.')
+    if 'pickle_safe' in kwargs:
+      use_multiprocessing = kwargs.pop('pickle_safe')
+      logging.warning('The argument `pickle_safe` has been renamed '
+                      '`use_multiprocessing`. '
+                      'Update your method calls accordingly.')
+    if kwargs:
+      raise ValueError('Unrecognized keyword arguments: ' + str(kwargs))
+
+    if not self.built:
+      raise RuntimeError('The model needs to be compiled before being used.')
+    return self.model.evaluate_generator(
+        generator,
+        steps,
+        max_queue_size=max_queue_size,
+        workers=workers,
+        use_multiprocessing=use_multiprocessing)
+
+  def predict_generator(self,
+                        generator,
+                        steps=None,
+                        max_queue_size=10,
+                        workers=1,
+                        use_multiprocessing=False,
+                        verbose=0,
+                        **kwargs):
+    """Generates predictions for the input samples from a data generator.
+
+    The generator should return the same kind of data as accepted by
+    `predict_on_batch`.
+
+    Arguments:
+        generator: generator yielding batches of input samples.
+        steps: Total number of steps (batches of samples)
+            to yield from `generator` before stopping.
+            Optional for `Sequence`: if unspecified, will use
+            the `len(generator)` as a number of steps.
+        max_queue_size: maximum size for the generator queue.
+        workers: maximum number of processes to spin up.
+        use_multiprocessing: if True, use process-based threading.
+            Note that because this implementation
+            relies on multiprocessing, you should not pass
+            non-picklable arguments to the generator
+            as they can't be passed easily to child processes.
+        verbose: verbosity mode, 0 or 1.
+        **kwargs: support for legacy arguments.
+
+    Returns:
+        A Numpy array of predictions.
+
+    Raises:
+        ValueError: In case the generator yields
+            data in an invalid format.
+    """
+    # Legacy support
+    if 'max_q_size' in kwargs:
+      max_queue_size = kwargs.pop('max_q_size')
+      logging.warning('The argument `max_q_size` has been renamed '
+                      '`max_queue_size`. Update your method calls accordingly.')
+    if 'pickle_safe' in kwargs:
+      use_multiprocessing = kwargs.pop('pickle_safe')
+      logging.warning('The argument `pickle_safe` has been renamed '
+                      '`use_multiprocessing`. '
+                      'Update your method calls accordingly.')
+    if kwargs:
+      raise ValueError('Unrecognized keyword arguments: ' + str(kwargs))
+
+    if not self.built:
+      self.build()
+    return self.model.predict_generator(
+        generator,
+        steps,
+        max_queue_size=max_queue_size,
+        workers=workers,
+        use_multiprocessing=use_multiprocessing,
+        verbose=verbose)
+
+  def get_config(self):
+    config = []
+    for layer in self.layers:
+      config.append({
+          'class_name': layer.__class__.__name__,
+          'config': layer.get_config()
+      })
+    return copy.deepcopy(config)
+
+  @classmethod
+  def from_config(cls, config, custom_objects=None):
+    model = cls()
+    for conf in config:
+      layer = layer_module.deserialize(conf, custom_objects=custom_objects)
+      model.add(layer)
+    return model
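
Note on the `fit_generator` docstring in `sequential.py`: it requires a generator that loops over its data indefinitely. The sketch below is one way to satisfy that contract with only the standard library. It is illustrative rather than part of this change: `process_line`, the comma-separated line format, and `demo` are assumptions made for the example. The file is reopened on each pass and closed by the `with` block once the pass ends.

```python
import itertools
import os
import tempfile


def process_line(line):
    # Hypothetical parser (an assumption): "x1,x2,y" -> ([x1, x2], y).
    *features, label = [float(v) for v in line.strip().split(",")]
    return features, label


def generate_arrays_from_file(path):
    # Loop forever, as fit_generator expects; one pass over the file per cycle.
    while True:
        # The context manager closes the file only after the pass completes,
        # or when the generator itself is closed.
        with open(path) as f:
            for line in f:
                yield process_line(line)


def demo():
    # Write a two-line file, then draw five samples to show the wrap-around.
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
        tmp.write("1,2,0\n3,4,1\n")
    try:
        gen = generate_arrays_from_file(tmp.name)
        samples = list(itertools.islice(gen, 5))
        gen.close()  # releases the open file handle before the file is removed
    finally:
        os.remove(tmp.name)
    return samples
```

Drawing five samples from a two-line file returns to the first line on the third draw, which is the behaviour `steps_per_epoch` relies on when it counts batches rather than passes over the data.
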
diff --git a/tensorflow/python/keras/_impl/keras/engine/sequential_test.py b/tensorflow/python/keras/_impl/keras/engine/sequential_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..166634bd8219b831ce212ba983a4ab695b00c3b7
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/sequential_test.py
@@ -0,0 +1,152 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests specific to `Sequential` model."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.keras._impl import keras
+from tensorflow.python.platform import test
+
+
+class TestSequential(test.TestCase):
+  """Most Sequential model API tests are covered in `training_test.py`.
+  """
+
+  def test_basic_methods(self):
+    model = keras.models.Sequential()
+    model.add(keras.layers.Dense(1, input_dim=2))
+    model.add(keras.layers.Dropout(0.3, name='dp'))
+    model.add(keras.layers.Dense(2, kernel_regularizer='l2',
+                                 kernel_constraint='max_norm'))
+    model.build()
+    self.assertEqual(model.state_updates, model.model.state_updates)
+    self.assertEqual(model.get_layer(name='dp').name, 'dp')
+
+  def test_sequential_pop(self):
+    num_hidden = 5
+    input_dim = 3
+    batch_size = 5
+    num_classes = 2
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(num_hidden, input_dim=input_dim))
+      model.add(keras.layers.Dense(num_classes))
+      model.compile(loss='mse', optimizer='sgd')
+      x = np.random.random((batch_size, input_dim))
+      y = np.random.random((batch_size, num_classes))
+      model.fit(x, y, epochs=1)
+      model.pop()
+      self.assertEqual(len(model.layers), 1)
+      self.assertEqual(model.output_shape, (None, num_hidden))
+      model.compile(loss='mse', optimizer='sgd')
+      y = np.random.random((batch_size, num_hidden))
+      model.fit(x, y, epochs=1)
+
+      # Test popping single-layer model
+      model = keras.models.Sequential()
+      model.add(keras.layers.Dense(num_hidden, input_dim=input_dim))
+      model.pop()
+      self.assertEqual(len(model.layers), 0)
+      self.assertEqual(len(model.outputs), 0)
+
+      # Invalid use case
+      model = keras.models.Sequential()
+      with self.assertRaises(TypeError):
+        model.pop()
+
+  def test_invalid_use_cases(self):
+    with self.test_session():
+      # Added objects must be layer instances
+      with self.assertRaises(TypeError):
+        model = keras.models.Sequential()
+        model.add(None)
+
+      # Added layers must have an input shape
+      with self.assertRaises(ValueError):
+        model = keras.models.Sequential()
+        model.add(keras.layers.Dense(1))
+
+      # Added layers cannot have multiple outputs
+      class MyLayer(keras.layers.Layer):
+
+        def call(self, inputs):
+          return [3 * inputs, 2 * inputs]
+
+        def compute_output_shape(self, input_shape):
+          return [input_shape, input_shape]
+
+      with self.assertRaises(ValueError):
+        model = keras.models.Sequential()
+        model.add(MyLayer(input_shape=(3,)))
+      with self.assertRaises(TypeError):
+        model = keras.models.Sequential()
+        model.add(keras.layers.Dense(1, input_dim=1))
+        model.add(MyLayer())
+
+      # Building empty model
+      model = keras.models.Sequential()
+      with self.assertRaises(TypeError):
+        model.build()
+
+  def test_nested_sequential_trainability(self):
+    input_dim = 20
+    num_units = 10
+    num_classes = 2
+
+    inner_model = keras.models.Sequential()
+    inner_model.add(keras.layers.Dense(num_units, input_shape=(input_dim,)))
+
+    model = keras.models.Sequential()
+    model.add(inner_model)
+    model.add(keras.layers.Dense(num_classes))
+
+    self.assertEqual(len(model.trainable_weights), 4)
+    inner_model.trainable = False
+    self.assertEqual(len(model.trainable_weights), 2)
+    inner_model.trainable = True
+    self.assertEqual(len(model.trainable_weights), 4)
+
+  def test_sequential_update_disabling(self):
+    val_a = np.random.random((10, 4))
+    val_out = np.random.random((10, 4))
+
+    with self.test_session():
+      model = keras.models.Sequential()
+      model.add(keras.layers.BatchNormalization(input_shape=(4,)))
+
+      model.trainable = False
+      assert not model.updates
+
+      model.compile('sgd', 'mse')
+      assert not model.updates
+      assert not model.model.updates
+
+      x1 = model.predict(val_a)
+      model.train_on_batch(val_a, val_out)
+      x2 = model.predict(val_a)
+      self.assertAllClose(x1, x2, atol=1e-7)
+
+      model.trainable = True
+      model.compile('sgd', 'mse')
+      assert model.updates
+      assert model.model.updates
+
+      model.train_on_batch(val_a, val_out)
+      x2 = model.predict(val_a)
+      assert np.abs(np.sum(x1 - x2)) > 1e-5
diff --git a/tensorflow/python/keras/_impl/keras/engine/topology.py b/tensorflow/python/keras/_impl/keras/engine/topology.py
deleted file mode 100644
index dd7436e3d00f5dfa736b8d058316918cb5ef51e4..0000000000000000000000000000000000000000
--- a/tensorflow/python/keras/_impl/keras/engine/topology.py
+++ /dev/null
@@ -1,1684 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-# pylint: disable=protected-access
-"""Base layer code and base model (Network) code.
-"""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import copy
-import json
-import os
-
-import numpy as np
-from six.moves import zip  # pylint: disable=redefined-builtin
-
-from tensorflow.python.eager import context
-from tensorflow.python.framework import tensor_shape
-from tensorflow.python.keras._impl.keras import backend as K
-from tensorflow.python.keras._impl.keras import constraints
-from tensorflow.python.keras._impl.keras import initializers
-from tensorflow.python.keras._impl.keras import regularizers
-from tensorflow.python.keras._impl.keras.utils import conv_utils
-from tensorflow.python.keras._impl.keras.utils.io_utils import ask_to_proceed_with_overwrite
-from tensorflow.python.keras._impl.keras.utils.layer_utils import print_summary as print_layer_summary
-from tensorflow.python.layers import base as tf_base_layers
-from tensorflow.python.layers import network as tf_network
-from tensorflow.python.layers import utils as tf_layers_util
-from tensorflow.python.platform import tf_logging as logging
-from tensorflow.python.util.tf_export import tf_export
-
-
-# pylint: disable=g-import-not-at-top
-try:
-  import h5py
-except ImportError:
-  h5py = None
-
-try:
-  import yaml
-except ImportError:
-  yaml = None
-# pylint: enable=g-import-not-at-top
-
-# pylint: disable=invalid-name
-InputSpec = tf_base_layers.InputSpec
-Node = tf_base_layers.Node
-TFBaseLayer = tf_base_layers.Layer
-# pylint: enable=invalid-name
-
-
-@tf_export('keras.layers.Layer')
-class Layer(tf_base_layers.Layer):
-  """Abstract base layer class.
-
-  # Properties
-      name: String, must be unique within a model.
-      input_spec: List of InputSpec class instances;
-          each entry describes one required input:
-              - ndim
-              - dtype
-          A layer with `n` input tensors must have
-          an `input_spec` of length `n`.
-      trainable: Boolean, whether the layer weights
-          will be updated during training.
-      uses_learning_phase: Whether any operation
-          of the layer uses `K.in_training_phase()`
-          or `K.in_test_phase()`.
-      input_shape: Shape tuple. Provided for convenience,
-          but note that there may be cases in which this
-          attribute is ill-defined (e.g. a shared layer
-          with multiple input shapes), in which case
-          requesting `input_shape` will raise an Exception.
-          Prefer using `layer.get_input_shape_for(input_shape)`,
-          or `layer.get_input_shape_at(node_index)`.
-      output_shape: Shape tuple. See above.
-      inbound_nodes: List of nodes.
-      outbound_nodes: List of nodes.
-      input, output: Input/output tensor(s). Note that if the layer is used
-          more than once (shared layer), this is ill-defined
-          and will raise an exception. In such cases, use
-          `layer.get_input_at(node_index)`.
-      input_mask, output_mask: Same as above, for masks.
-      trainable_weights: List of variables.
-      non_trainable_weights: List of variables.
-      weights: The concatenation of the lists trainable_weights and
-          non_trainable_weights (in this order).
-
-  # Methods
-      call(x, mask=None): Where the layer's logic lives.
-      __call__(x, mask=None): Wrapper around the layer logic (`call`).
-          If x is a Keras tensor:
-              - Connect current layer with last layer from tensor:
-                  `self._add_inbound_node(last_layer)`
-              - Add layer to tensor history
-          If layer is not built:
-              - Build from inputs shape
-      get_weights()
-      set_weights(weights)
-      get_config()
-      count_params()
-      compute_output_shape(input_shape)
-      compute_mask(x, mask)
-      get_input_at(node_index)
-      get_output_at(node_index)
-      get_input_shape_at(node_index)
-      get_output_shape_at(node_index)
-      get_input_mask_at(node_index)
-      get_output_mask_at(node_index)
-
-  # Class Methods
-      from_config(config)
-
-  # Internal methods:
-      build(input_shape)
-      _add_inbound_node(layer, index=0)
-  """
-
-  def __init__(self, **kwargs):
-    # These properties should be set by the user via keyword arguments.
-    # note that 'dtype', 'input_shape' and 'batch_input_shape'
-    # are only applicable to input layers: do not pass these keywords
-    # to non-input layers.
-    allowed_kwargs = {
-        'activity_regularizer',
-        'input_shape',
-        'batch_input_shape',
-        'batch_size',
-        'dtype',
-        'name',
-        'trainable',
-        'weights',
-    }
-    # Validate optional keyword arguments.
-    for kwarg in kwargs:
-      if kwarg not in allowed_kwargs:
-        raise TypeError('Keyword argument not understood:', kwarg)
-
-    # Get layer name.
-    name = kwargs.get('name')
-
-    # Get `trainable` status.
-    trainable = kwargs.get('trainable', True)
-
-    # Get `dtype`.
-    dtype = kwargs.get('dtype')
-    if dtype is None:
-      dtype = K.floatx()
-
-    # Call super, which will set all properties common to Keras layers
-    # and core TF layers.
-    super(Layer, self).__init__(
-        name=name, dtype=dtype, trainable=trainable,
-        activity_regularizer=kwargs.get('activity_regularizer'))
-
-    # Add properties that are Keras-only for now.
-    self.supports_masking = False
-
-    # Manage input shape information if passed.
-    if 'input_shape' in kwargs or 'batch_input_shape' in kwargs:
-      # In this case we will later create an input layer
-      # to insert before the current layer
-      if 'batch_input_shape' in kwargs:
-        batch_input_shape = tuple(kwargs['batch_input_shape'])
-      elif 'input_shape' in kwargs:
-        if 'batch_size' in kwargs:
-          batch_size = kwargs['batch_size']
-        else:
-          batch_size = None
-        batch_input_shape = (batch_size,) + tuple(kwargs['input_shape'])
-      self._batch_input_shape = batch_input_shape
-
-    # Manage initial weight values if passed.
-    if 'weights' in kwargs:
-      self._initial_weights = kwargs['weights']
-    else:
-      self._initial_weights = None
-
-  def add_weight(self,
-                 name,
-                 shape,
-                 dtype=None,
-                 initializer=None,
-                 regularizer=None,
-                 trainable=True,
-                 constraint=None):
-    """Adds a weight variable to the layer.
-
-    Arguments:
-        name: String, the name for the weight variable.
-        shape: The shape tuple of the weight.
-        dtype: The dtype of the weight.
-        initializer: An Initializer instance (callable).
-        regularizer: An optional Regularizer instance.
-        trainable: A boolean, whether the weight should
-            be trained via backprop or not (assuming
-            that the layer itself is also trainable).
-        constraint: An optional Constraint instance.
-
-    Returns:
-        The created weight variable.
-    """
-    if dtype is None:
-      dtype = K.floatx()
-    weight = self.add_variable(name, shape,
-                               dtype=dtype,
-                               initializer=initializers.get(initializer),
-                               regularizer=regularizers.get(regularizer),
-                               constraint=constraints.get(constraint),
-                               trainable=trainable)
-    return weight
-
-  def call(self, inputs, **kwargs):  # pylint: disable=unused-argument
-    """This is where the layer's logic lives.
-
-    Arguments:
-        inputs: Input tensor, or list/tuple of input tensors.
-        **kwargs: Additional keyword arguments.
-
-    Returns:
-        A tensor or list/tuple of tensors.
-    """
-    return inputs
-
-  def __call__(self, inputs, **kwargs):
-    """Wrapper around self.call(), for handling internal references.
-
-    If a Keras tensor is passed:
-        - We call self._add_inbound_node().
-        - If necessary, we `build` the layer to match
-            the shape of the input(s).
-        - We update the _keras_history of the output tensor(s)
-            with the current layer.
-            This is done as part of _add_inbound_node().
-
-    Arguments:
-        inputs: Can be a tensor or list/tuple of tensors.
-        **kwargs: Additional keyword arguments to be passed to `call()`.
-
-    Returns:
-        Output of the layer's `call` method.
-
-    Raises:
-        ValueError: in case the layer is missing shape information
-            for its `build` call.
-    """
-    # Actually call the layer (optionally building it).
-    output = super(Layer, self).__call__(inputs, **kwargs)
-    if context.in_eager_mode():
-      return output
-
-    # Un-built subclassed network: build it
-    if isinstance(self, Network) and not self.inputs:
-      self._set_inputs(inputs)
-
-    # Update learning phase info.
-    output_tensors = _to_list(output)
-    uses_lp = any(
-        [getattr(x, '_uses_learning_phase', False) for x in _to_list(inputs)])
-    uses_lp = getattr(self, 'uses_learning_phase', False) or uses_lp
-    for i in range(len(output_tensors)):
-      output_tensors[i]._uses_learning_phase = getattr(
-          output_tensors[i], '_uses_learning_phase', False) or uses_lp
-
-    # Optionally load weight values that were specified at layer instantiation.
-    if hasattr(self, '_initial_weights') and self._initial_weights is not None:
-      self.set_weights(self._initial_weights)
-      del self._initial_weights
-    return output
-
-  def compute_output_shape(self, input_shape):
-    """Computes the output shape of the layer.
-
-    Assumes that the layer will be built
-    to match the input shape provided.
-
-    Arguments:
-        input_shape: Shape tuple (tuple of integers)
-            or list of shape tuples (one per input tensor of the layer).
-            Shape tuples can include None for free dimensions,
-            instead of an integer.
-
-    Returns:
-        An output shape tuple.
-    """
-    logging.warning(
-        'All custom layers should implement the '
-        '`compute_output_shape` method. This layer (' + self.name + ') '
-        'is relying on the base `Layer.compute_output_shape` implementation, '
-        'which will start raising a `NotImplementedError` '
-        'as of July 1st, 2018.')
-    return input_shape
-
-  def compute_mask(self, inputs, mask=None):  # pylint: disable=unused-argument
-    """Computes an output mask tensor.
-
-    Arguments:
-        inputs: Tensor or list of tensors.
-        mask: Tensor or list of tensors.
-
-    Returns:
-        None or a tensor (or list of tensors,
-            one per output tensor of the layer).
-    """
-    if not self.supports_masking:
-      if mask is not None:
-        if isinstance(mask, list):
-          if any(m is not None for m in mask):
-            raise TypeError('Layer ' + self.name + ' does not support masking, '
-                            'but was passed an input_mask: ' + str(mask))
-        else:
-          raise TypeError('Layer ' + self.name + ' does not support masking, '
-                          'but was passed an input_mask: ' + str(mask))
-      # masking not explicitly supported: return None as mask
-      return None
-    # if masking is explicitly supported, by default
-    # carry over the input mask
-    return mask
-
-  def get_input_mask_at(self, node_index):
-    """Retrieves the input mask tensor(s) of a layer at a given node.
-
-    Arguments:
-        node_index: Integer, index of the node
-            from which to retrieve the attribute.
-            E.g. `node_index=0` will correspond to the
-            first time the layer was called.
-
-    Returns:
-        A mask tensor
-        (or list of tensors if the layer has multiple inputs).
-    """
-    inputs = self.get_input_at(node_index)
-    if isinstance(inputs, list):
-      return [getattr(x, '_keras_mask', None) for x in inputs]
-    else:
-      return getattr(inputs, '_keras_mask', None)
-
-  def get_output_mask_at(self, node_index):
-    """Retrieves the output mask tensor(s) of a layer at a given node.
-
-    Arguments:
-        node_index: Integer, index of the node
-            from which to retrieve the attribute.
-            E.g. `node_index=0` will correspond to the
-            first time the layer was called.
-
-    Returns:
-        A mask tensor
-        (or list of tensors if the layer has multiple outputs).
-    """
-    output = self.get_output_at(node_index)
-    if isinstance(output, list):
-      return [getattr(x, '_keras_mask', None) for x in output]
-    else:
-      return getattr(output, '_keras_mask', None)
-
-  @property
-  def input_mask(self):
-    """Retrieves the input mask tensor(s) of a layer.
-
-    Only applicable if the layer has exactly one inbound node,
-    i.e. if it is connected to one incoming layer.
-
-    Returns:
-        Input mask tensor (potentially None) or list of input
-        mask tensors.
-
-    Raises:
-        AttributeError: if the layer is connected to
-        more than one incoming layer.
-    """
-    inputs = self.input
-    if isinstance(inputs, list):
-      return [getattr(x, '_keras_mask', None) for x in inputs]
-    else:
-      return getattr(inputs, '_keras_mask', None)
-
-  @property
-  def output_mask(self):
-    """Retrieves the output mask tensor(s) of a layer.
-
-    Only applicable if the layer has exactly one inbound node,
-    i.e. if it is connected to one incoming layer.
-
-    Returns:
-        Output mask tensor (potentially None) or list of output
-        mask tensors.
-
-    Raises:
-        AttributeError: if the layer is connected to
-        more than one incoming layer.
-    """
-    output = self.output
-    if isinstance(output, list):
-      return [getattr(x, '_keras_mask', None) for x in output]
-    else:
-      return getattr(output, '_keras_mask', None)
-
-  def set_weights(self, weights):
-    """Sets the weights of the layer, from Numpy arrays.
-
-    Arguments:
-        weights: A list of Numpy arrays. The number
-            of arrays and their shapes must match
-            the number and shapes of the weights
-            of the layer (i.e. it should match the
-            output of `get_weights`).
-
-    Raises:
-        ValueError: If the provided weights list does not match the
-            layer's specifications.
-    """
-    params = self.weights
-    if len(params) != len(weights):
-      raise ValueError('You called `set_weights(weights)` on layer "' +
-                       self.name + '" with a weight list of length ' +
-                       str(len(weights)) + ', but the layer was expecting ' +
-                       str(len(params)) + ' weights. Provided weights: ' +
-                       str(weights)[:50] + '...')
-    if not params:
-      return
-    weight_value_tuples = []
-    param_values = K.batch_get_value(params)
-    for pv, p, w in zip(param_values, params, weights):
-      if pv.shape != w.shape:
-        raise ValueError('Layer weight shape ' + str(pv.shape) +
-                         ' not compatible with '
-                         'provided weight shape ' + str(w.shape))
-      weight_value_tuples.append((p, w))
-    K.batch_set_value(weight_value_tuples)
-
-  def get_weights(self):
-    """Returns the current weights of the layer.
-
-    Returns:
-        Weights values as a list of numpy arrays.
-    """
-    params = self.weights
-    return K.batch_get_value(params)
-
-  def get_config(self):
-    """Returns the config of the layer.
-
-    A layer config is a Python dictionary (serializable)
-    containing the configuration of a layer.
-    The same layer can be reinstantiated later
-    (without its trained weights) from this configuration.
-
-    The config of a layer does not include connectivity
-    information, nor the layer class name. These are handled
-    by `Network` (one layer of abstraction above).
-
-    Returns:
-        Python dictionary.
-    """
-    config = {'name': self.name, 'trainable': self.trainable}
-    if hasattr(self, '_batch_input_shape'):
-      config['batch_input_shape'] = self._batch_input_shape
-    if hasattr(self, 'dtype'):
-      config['dtype'] = self.dtype
-    return config
-
-  @classmethod
-  def from_config(cls, config):
-    """Creates a layer from its config.
-
-    This method is the reverse of `get_config`,
-    capable of instantiating the same layer from the config
-    dictionary. It does not handle layer connectivity
-    (handled by Network), nor weights (handled by `set_weights`).
-
-    Arguments:
-        config: A Python dictionary, typically the
-            output of get_config.
-
-    Returns:
-        A layer instance.
-    """
-    return cls(**config)
-
-  @tf_base_layers.Layer.activity_regularizer.setter
-  def activity_regularizer(self, activity_regularizer):
-    self._activity_regularizer = activity_regularizer
-
-
-@tf_export('keras.layers.InputLayer')
-class InputLayer(tf_network.InputLayer, Layer):
-  """Layer to be used as an entry point into a graph.
-
-  It can either wrap an existing tensor (pass an `input_tensor` argument)
-  or create a placeholder tensor (pass argument `input_shape`).
-
-  Arguments:
-      input_shape: Shape tuple, not including the batch axis.
-      batch_size: Optional input batch size (integer or None).
-      dtype: Datatype of the input.
-      input_tensor: Optional tensor to use as layer input
-          instead of creating a placeholder.
-      sparse: Boolean, whether the placeholder created
-          is meant to be sparse.
-      name: Name of the layer (string).
-  """
-
-  def __init__(self,
-               input_shape=None,
-               batch_size=None,
-               dtype=None,
-               input_tensor=None,
-               sparse=False,
-               name=None,
-               **kwargs):
-    if 'batch_input_shape' in kwargs:
-      batch_input_shape = kwargs.pop('batch_input_shape')
-      if input_shape and batch_input_shape:
-        raise ValueError('Only provide the input_shape OR '
-                         'batch_input_shape argument to '
-                         'InputLayer, not both at the same time.')
-      batch_size = batch_input_shape[0]
-      input_shape = batch_input_shape[1:]
-    if kwargs:
-      raise ValueError('Unrecognized keyword arguments:', kwargs.keys())
-
-    if not name:
-      prefix = 'input'
-      name = prefix + '_' + str(K.get_uid(prefix))
-
-    if not dtype:
-      if input_tensor is None:
-        dtype = K.floatx()
-      else:
-        dtype = K.dtype(input_tensor)
-    super(InputLayer, self).__init__(input_shape=input_shape,
-                                     batch_size=batch_size,
-                                     dtype=dtype,
-                                     input_tensor=input_tensor,
-                                     sparse=sparse,
-                                     name=name)
-
-  def get_config(self):
-    config = {
-        'batch_input_shape': self._batch_input_shape,
-        'dtype': self.dtype,
-        'sparse': self.sparse,
-        'name': self.name
-    }
-    return config
-
-
-@tf_export('keras.layers.Input', 'keras.Input')
-def Input(  # pylint: disable=invalid-name
-    shape=None,
-    batch_size=None,
-    name=None,
-    dtype=None,
-    sparse=False,
-    tensor=None,
-    **kwargs):
-  """`Input()` is used to instantiate a Keras tensor.
-
-  A Keras tensor is a tensor object from the underlying backend
-  (Theano or TensorFlow), which we augment with certain
-  attributes that allow us to build a Keras model
-  just by knowing the inputs and outputs of the model.
-
-  For instance, if a, b and c are Keras tensors,
-  it becomes possible to do:
-  `model = Model(inputs=[a, b], outputs=c)`
-
-  The added Keras attribute is:
-      `_keras_history`: Last layer applied to the tensor.
-          The entire layer graph is retrievable from that layer,
-          recursively.
-
-  Arguments:
-      shape: A shape tuple (integers), not including the batch size.
-          For instance, `shape=(32,)` indicates that the expected input
-          will be batches of 32-dimensional vectors.
-      batch_size: optional static batch size (integer).
-      name: An optional name string for the layer.
-          Should be unique in a model (do not reuse the same name twice).
-          It will be autogenerated if it isn't provided.
-      dtype: The data type expected by the input, as a string
-          (`float32`, `float64`, `int32`...)
-      sparse: A boolean specifying whether the placeholder
-          to be created is sparse.
-      tensor: Optional existing tensor to wrap into the `Input` layer.
-          If set, the layer will not create a placeholder tensor.
-      **kwargs: deprecated arguments support.
-
-  Returns:
-      A tensor.
-
-  Example:
-
-      ```python
-      # this is a logistic regression in Keras
-      x = Input(shape=(32,))
-      y = Dense(16, activation='softmax')(x)
-      model = Model(x, y)
-      ```
-
-  Raises:
-    ValueError: in case of invalid arguments.
-  """
-  if 'batch_shape' in kwargs:
-    batch_shape = kwargs.pop('batch_shape')
-    if shape and batch_shape:
-      raise ValueError('Only provide the shape OR '
-                       'batch_shape argument to '
-                       'Input, not both at the same time.')
-    batch_size = batch_shape[0]
-    shape = batch_shape[1:]
-  if kwargs:
-    raise ValueError('Unrecognized keyword arguments:', kwargs.keys())
-
-  if dtype is None:
-    dtype = K.floatx()
-  if not shape and tensor is None:
-    raise ValueError('Please provide to Input either a `shape`'
-                     ' or a `tensor` argument. Note that '
-                     '`shape` does not include the batch '
-                     'dimension.')
-  input_layer = InputLayer(
-      input_shape=shape,
-      batch_size=batch_size,
-      name=name,
-      dtype=dtype,
-      sparse=sparse,
-      input_tensor=tensor)
-  # Return tensor including `_keras_history`.
-  # Note that in this case train_output and test_output are the same pointer.
-  outputs = input_layer._inbound_nodes[0].output_tensors
-  if len(outputs) == 1:
-    return outputs[0]
-  else:
-    return outputs
-
-
-class Network(tf_network.GraphNetwork, Layer):
-  """A Network is a directed acyclic graph of layers.
-
-  It is the topological form of a "model". A Model
-  is simply a Network with added training routines.
-
-  # Properties
-      name
-      inputs
-      outputs
-      input_layers
-      output_layers
-      input_spec (list of class instances)
-          each entry describes one required input:
-              - ndim
-              - dtype
-      trainable (boolean)
-      input_shape
-      output_shape
-      inbound_nodes: list of nodes
-      outbound_nodes: list of nodes
-      trainable_weights (list of variables)
-      non_trainable_weights (list of variables)
-
-  # Methods
-      summary
-      get_layer
-      get_weights
-      set_weights
-      get_config
-      compute_output_shape
-
-  # Class Methods
-      from_config
-  """
-
-  def __init__(self, *args, **kwargs):  # pylint: disable=super-init-not-called
-    # Signature detection
-    if (len(args) == 2 or
-        len(args) == 1 and 'outputs' in kwargs or
-        'inputs' in kwargs and 'outputs' in kwargs):
-      # Graph network
-      self._init_graph_network(*args, **kwargs)
-    else:
-      # Subclassed network
-      self._init_subclassed_network(**kwargs)
-
-  def _init_graph_network(self, inputs, outputs, name=None):
-    # TODO(fchollet): merge back tf.layers.Network and tf.keras.Network
-    # into a single class tf.keras.Network
-    super(Network, self).__init__(inputs, outputs, name=name)
-
-    self._is_compiled = False
-    self.supports_masking = False
-    self.optimizer = None
-
-    # Fill in the output mask cache.
-    masks = []
-    for x in self.inputs:
-      mask = x._keras_mask if hasattr(x, '_keras_mask') else None
-      masks.append(mask)
-    mask_cache_key = (tf_layers_util.object_list_uid(self.inputs) + '_' +
-                      tf_layers_util.object_list_uid(masks))
-    masks = []
-    for x in self.outputs:
-      mask = x._keras_mask if hasattr(x, '_keras_mask') else None
-      masks.append(mask)
-    if len(masks) == 1:
-      mask = masks[0]
-    else:
-      mask = masks
-    self._output_mask_cache[mask_cache_key] = mask
-
-    # Build self.input_names and self.output_names.
-    self.input_names = []
-    self.output_names = []
-    self._feed_input_names = []
-    self._feed_inputs = []
-    self._feed_input_shapes = []
-    for i, layer in enumerate(self._input_layers):
-      self.input_names.append(layer.name)
-      if layer.is_placeholder:
-        self._feed_input_names.append(layer.name)
-        self._feed_input_shapes.append(K.int_shape(self.inputs[i]))
-        # layer.input gives an error in eager mode
-        if context.in_graph_mode():
-          self._feed_inputs.append(layer.input)
-    for layer in self._output_layers:
-      self.output_names.append(layer.name)
-
-  def _init_subclassed_network(self, name=None):
-    self._init_set_name(name)
-    self._layers = []
-    self._is_graph_network = False
-    self._is_compiled = False
-    self.outputs = None
-    self.inputs = None
-    self.trainable = True
-    self.supports_masking = False
-    self.built = False
-    self.optimizer = None
-
-    # Not used, exists for compatibility purposes due to implementation of
-    # the base layer tf.layers.Layer - TODO(fchollet): clean up when refactoring
-    self._scope = None
-    self._reuse = None
-    self._dtype = None
-    self._graph = None
-    self._activity_regularizer = None
-
-    # Used in symbolic mode only
-    self._updates = []
-    self._losses = []
-
-    # Used in symbolic mode only, only in conjunction with graph-networks
-    self._outbound_nodes = []
-    self._inbound_nodes = []
-
-  def __setattr__(self, name, value):
-    if isinstance(value, (tf_base_layers.Layer, Network)):
-      try:
-        is_graph_network = self._is_graph_network
-      except AttributeError:
-        raise RuntimeError('It looks like you are subclassing `Model` and you '
-                           'forgot to call `super(YourClass, self).__init__()`.'
-                           ' Always start with this line.')
-      if not is_graph_network:
-        if value not in self._layers:
-          self._layers.append(value)
-    super(Network, self).__setattr__(name, value)
-
-  def add_variable(self, name, shape, dtype=None, initializer=None,
-                   regularizer=None, trainable=True, constraint=None):
-    raise NotImplementedError('`add_variable` is not supported on Networks')
-
-  def add_loss(self, *args, **kwargs):
-    if context.in_eager_mode():
-      raise NotImplementedError('`add_loss` is not supported in eager-mode '
-                                'on Networks')
-    super(Network, self).add_loss(*args, **kwargs)
-
-  @property
-  def uses_learning_phase(self):
-    return any(
-        [getattr(x, '_uses_learning_phase', False) for x in self.outputs])
-
-  @property
-  def stateful(self):
-    return any([(hasattr(layer, 'stateful') and layer.stateful)
-                for layer in self.layers])
-
-  def reset_states(self):
-    for layer in self.layers:
-      if hasattr(layer, 'reset_states') and getattr(layer, 'stateful', False):
-        layer.reset_states()
-
-  @property
-  def state_updates(self):
-    """Returns the `updates` from all layers that are stateful.
-
-    This is useful for separating training updates and
-    state updates, e.g. when we need to update a layer's internal state
-    during prediction.
-
-    Returns:
-        A list of update ops.
-    """
-    state_updates = []
-    for layer in self.layers:
-      if getattr(layer, 'stateful', False):
-        if hasattr(layer, 'updates'):
-          state_updates += layer.updates
-    return state_updates
-
-  def get_weights(self):
-    """Retrieves the weights of the model.
-
-    Returns:
-        A flat list of Numpy arrays.
-    """
-    weights = []
-    for layer in self.layers:
-      weights += layer.weights
-    return K.batch_get_value(weights)
-
-  def set_weights(self, weights):
-    """Sets the weights of the model.
-
-    Arguments:
-        weights: A list of Numpy arrays with shapes and types matching
-            the output of `model.get_weights()`.
-    """
-    tuples = []
-    for layer in self.layers:
-      num_param = len(layer.weights)
-      layer_weights = weights[:num_param]
-      for sw, w in zip(layer.weights, layer_weights):
-        tuples.append((sw, w))
-      weights = weights[num_param:]
-    K.batch_set_value(tuples)
-
-  def compute_mask(self, inputs, mask):
-    if not self._is_graph_network:
-      return None
-
-    inputs = _to_list(inputs)
-    if mask is None:
-      masks = [None for _ in range(len(inputs))]
-    else:
-      masks = _to_list(mask)
-    cache_key = (tf_layers_util.object_list_uid(inputs)
-                 + '_' + tf_layers_util.object_list_uid(masks))
-    if cache_key in self._output_mask_cache:
-      return self._output_mask_cache[cache_key]
-    else:
-      _, output_masks = self._run_internal_graph(inputs, masks)
-      return output_masks
-
-  def get_config(self):
-    if not self._is_graph_network:
-      raise NotImplementedError
-
-    config = {
-        'name': self.name,
-    }
-    node_conversion_map = {}
-    for layer in self.layers:
-      if issubclass(layer.__class__, Network):
-        # Networks start with a pre-existing node
-        # linking their input to output.
-        kept_nodes = 1
-      else:
-        kept_nodes = 0
-      for original_node_index, node in enumerate(layer._inbound_nodes):
-        node_key = tf_network._make_node_key(layer.name,
-                                             original_node_index)
-        if node_key in self._network_nodes:
-          node_conversion_map[node_key] = kept_nodes
-          kept_nodes += 1
-    layer_configs = []
-    for layer in self.layers:  # From the earliest layers on.
-      layer_class_name = layer.__class__.__name__
-      layer_config = layer.get_config()
-      filtered_inbound_nodes = []
-      for original_node_index, node in enumerate(layer._inbound_nodes):
-        node_key = tf_network._make_node_key(layer.name,
-                                             original_node_index)
-        if node_key in self._network_nodes:
-          # The node is relevant to the model:
-          # add to filtered_inbound_nodes.
-          if node.arguments:
-            try:
-              json.dumps(node.arguments)
-              kwargs = node.arguments
-            except TypeError:
-              logging.warning(
-                  'Layer ' + layer.name +
-                  ' was passed non-serializable keyword arguments: ' +
-                  str(node.arguments) + '. They will not be included '
-                  'in the serialized model (and thus will be missing '
-                  'at deserialization time).')
-              kwargs = {}
-          else:
-            kwargs = {}
-          if node.inbound_layers:
-            node_data = []
-            for i in range(len(node.inbound_layers)):
-              inbound_layer = node.inbound_layers[i]
-              node_index = node.node_indices[i]
-              tensor_index = node.tensor_indices[i]
-              node_key = tf_network._make_node_key(inbound_layer.name,
-                                                   node_index)
-              new_node_index = node_conversion_map.get(node_key, 0)
-              node_data.append(
-                  [inbound_layer.name, new_node_index, tensor_index, kwargs])
-            filtered_inbound_nodes.append(node_data)
-      layer_configs.append({
-          'name': layer.name,
-          'class_name': layer_class_name,
-          'config': layer_config,
-          'inbound_nodes': filtered_inbound_nodes,
-      })
-    config['layers'] = layer_configs
-
-    # Gather info about inputs and outputs.
-    model_inputs = []
-    for i in range(len(self._input_layers)):
-      layer, node_index, tensor_index = self._input_coordinates[i]
-      node_key = tf_network._make_node_key(layer.name,
-                                           node_index)
-      if node_key not in self._network_nodes:
-        continue
-      new_node_index = node_conversion_map[node_key]
-      model_inputs.append([layer.name, new_node_index, tensor_index])
-    config['input_layers'] = model_inputs
-    model_outputs = []
-    for i in range(len(self._output_layers)):
-      layer, node_index, tensor_index = self._output_coordinates[i]
-      node_key = tf_network._make_node_key(layer.name,
-                                           node_index)
-      if node_key not in self._network_nodes:
-        continue
-      new_node_index = node_conversion_map[node_key]
-      model_outputs.append([layer.name, new_node_index, tensor_index])
-    config['output_layers'] = model_outputs
-    return copy.deepcopy(config)
-
-  @classmethod
-  def from_config(cls, config, custom_objects=None):
-    """Instantiates a Model from its config (output of `get_config()`).
-
-    Arguments:
-        config: Model config dictionary.
-        custom_objects: Optional dictionary mapping names
-            (strings) to custom classes or functions to be
-            considered during deserialization.
-
-    Returns:
-        A model instance.
-
-    Raises:
-        ValueError: In case of improperly formatted config dict.
-    """
-    # Layer instances created during
-    # the graph reconstruction process
-    created_layers = {}
-
-    # Dictionary mapping layer instances to
-    # node data that specifies a layer call.
-    # It acts as a queue that maintains any unprocessed
-    # layer call until it becomes possible to process it
-    # (i.e. until the input tensors to the call all exist).
-    unprocessed_nodes = {}
-
-    def add_unprocessed_node(layer, node_data):
-      if layer not in unprocessed_nodes:
-        unprocessed_nodes[layer] = [node_data]
-      else:
-        unprocessed_nodes[layer].append(node_data)
-
-    def process_node(layer, node_data):
-      """Deserialize a node.
-
-      Arguments:
-          layer: layer instance.
-          node_data: node config dict.
-
-      Raises:
-          ValueError: In case of improperly formatted `node_data` dict.
-      """
-      input_tensors = []
-      for input_data in node_data:
-        inbound_layer_name = input_data[0]
-        inbound_node_index = input_data[1]
-        inbound_tensor_index = input_data[2]
-        if len(input_data) == 3:
-          kwargs = {}
-        elif len(input_data) == 4:
-          kwargs = input_data[3]
-        else:
-          raise ValueError('Improperly formatted model config.')
-        if inbound_layer_name not in created_layers:
-          add_unprocessed_node(layer, node_data)
-          return
-        inbound_layer = created_layers[inbound_layer_name]
-        if len(inbound_layer._inbound_nodes) <= inbound_node_index:
-          add_unprocessed_node(layer, node_data)
-          return
-        inbound_node = inbound_layer._inbound_nodes[inbound_node_index]
-        input_tensors.append(inbound_node.output_tensors[inbound_tensor_index])
-      # Call layer on its inputs, thus creating the node
-      # and building the layer if needed.
-      if input_tensors:
-        if len(input_tensors) == 1:
-          layer(input_tensors[0], **kwargs)
-        else:
-          layer(input_tensors, **kwargs)
-
-    def process_layer(layer_data):
-      """Deserialize a layer, then call it on appropriate inputs.
-
-      Arguments:
-          layer_data: layer config dict.
-
-      Raises:
-          ValueError: In case of improperly formatted `layer_data` dict.
-      """
-      layer_name = layer_data['name']
-
-      # Instantiate layer.
-      from tensorflow.python.keras._impl.keras.layers import deserialize as deserialize_layer  # pylint: disable=g-import-not-at-top
-
-      layer = deserialize_layer(layer_data, custom_objects=custom_objects)
-      created_layers[layer_name] = layer
-
-      # Gather layer inputs.
-      inbound_nodes_data = layer_data['inbound_nodes']
-      for node_data in inbound_nodes_data:
-        # We don't process nodes (i.e. make layer calls)
-        # on the fly because the inbound node may not yet exist,
-        # in case of layer shared at different topological depths
-        # (e.g. a model such as A(B(A(B(x)))))
-        add_unprocessed_node(layer, node_data)
-
-    # First, we create all layers and enqueue nodes to be processed
-    for layer_data in config['layers']:
-      process_layer(layer_data)
-    # Then we process nodes in order of layer depth.
-    # Nodes that cannot yet be processed (if the inbound node
-    # does not yet exist) are re-enqueued, and the process
-    # is repeated until all nodes are processed.
-    while unprocessed_nodes:
-      for layer_data in config['layers']:
-        layer = created_layers[layer_data['name']]
-        if layer in unprocessed_nodes:
-          for node_data in unprocessed_nodes.pop(layer):
-            process_node(layer, node_data)
-
-    name = config.get('name')
-    input_tensors = []
-    output_tensors = []
-    for layer_data in config['input_layers']:
-      layer_name, node_index, tensor_index = layer_data
-      assert layer_name in created_layers
-      layer = created_layers[layer_name]
-      layer_output_tensors = layer._inbound_nodes[node_index].output_tensors
-      input_tensors.append(layer_output_tensors[tensor_index])
-    for layer_data in config['output_layers']:
-      layer_name, node_index, tensor_index = layer_data
-      assert layer_name in created_layers
-      layer = created_layers[layer_name]
-      layer_output_tensors = layer._inbound_nodes[node_index].output_tensors
-      output_tensors.append(layer_output_tensors[tensor_index])
-    return cls(inputs=input_tensors, outputs=output_tensors, name=name)
-
-  def save(self, filepath, overwrite=True, include_optimizer=True):
-    """Save the model to a single HDF5 file.
-
-    The savefile includes:
-        - The model architecture, allowing to re-instantiate the model.
-        - The model weights.
-        - The state of the optimizer, allowing to resume training
-            exactly where you left off.
-
-    This allows you to save the entirety of the state of a model
-    in a single file.
-
-    Saved models can be reinstantiated via `keras.models.load_model`.
-    The model returned by `load_model`
-    is a compiled model ready to be used (unless the saved model
-    was never compiled in the first place).
-
-    Arguments:
-        filepath: String, path to the file to save the model to.
-        overwrite: Whether to silently overwrite any existing file at the
-            target location, or provide the user with a manual prompt.
-        include_optimizer: If True, save optimizer's state together.
-
-    Example:
-
-    ```python
-    from keras.models import load_model
-
-    model.save('my_model.h5')  # creates a HDF5 file 'my_model.h5'
-    del model  # deletes the existing model
-
-    # returns a compiled model
-    # identical to the previous one
-    model = load_model('my_model.h5')
-    ```
-    """
-    if not self._is_graph_network:
-      raise NotImplementedError
-
-    from tensorflow.python.keras._impl.keras.models import save_model  # pylint: disable=g-import-not-at-top
-    save_model(self, filepath, overwrite, include_optimizer)
-
-  def save_weights(self, filepath, overwrite=True):
-    """Dumps all layer weights to a HDF5 file.
-
-    The weight file has:
-        - `layer_names` (attribute), a list of strings
-            (ordered names of model layers).
-        - For every layer, a `group` named `layer.name`
-            - For every such layer group, a group attribute `weight_names`,
-                a list of strings
-                (ordered names of weights tensor of the layer).
-            - For every weight in the layer, a dataset
-                storing the weight value, named after the weight tensor.
-
-    Arguments:
-        filepath: String, path to the file to save the weights to.
-        overwrite: Whether to silently overwrite any existing file at the
-            target location, or provide the user with a manual prompt.
-
-    Raises:
-        ImportError: If h5py is not available.
-    """
-    if h5py is None:
-      raise ImportError('`save_weights` requires h5py.')
-    # If file exists and should not be overwritten:
-    if not overwrite and os.path.isfile(filepath):
-      proceed = ask_to_proceed_with_overwrite(filepath)
-      if not proceed:
-        return
-    with h5py.File(filepath, 'w') as f:
-      save_weights_to_hdf5_group(f, self.layers)
-
-  def load_weights(self, filepath, by_name=False):
-    """Loads all layer weights from a HDF5 save file.
-
-    If `by_name` is False (default) weights are loaded
-    based on the network's topology, meaning the architecture
-    should be the same as when the weights were saved.
-    Note that layers that don't have weights are not taken
-    into account in the topological ordering, so adding or
-    removing layers is fine as long as they don't have weights.
-
-    If `by_name` is True, weights are loaded into layers
-    only if they share the same name. This is useful
-    for fine-tuning or transfer-learning models where
-    some of the layers have changed.
-
-    Arguments:
-        filepath: String, path to the weights file to load.
-        by_name: Boolean, whether to load weights by name
-            or by topological order.
-
-    Raises:
-        ImportError: If h5py is not available.
-    """
-    if h5py is None:
-      raise ImportError('`load_weights` requires h5py.')
-    with h5py.File(filepath, 'r') as f:
-      if 'layer_names' not in f.attrs and 'model_weights' in f:
-        f = f['model_weights']
-      if by_name:
-        load_weights_from_hdf5_group_by_name(f, self.layers)
-      else:
-        load_weights_from_hdf5_group(f, self.layers)
-
-  def _updated_config(self):
-    """Util shared between different serialization methods.
-
-    Returns:
-        Model config with Keras version information added.
-    """
-    from tensorflow.python.keras._impl.keras import __version__ as keras_version  # pylint: disable=g-import-not-at-top
-
-    config = self.get_config()
-    model_config = {
-        'class_name': self.__class__.__name__,
-        'config': config,
-        'keras_version': keras_version,
-        'backend': K.backend()
-    }
-    return model_config
-
-  def to_json(self, **kwargs):
-    """Returns a JSON string containing the network configuration.
-
-    To load a network from a JSON save file, use
-    `keras.models.model_from_json(json_string, custom_objects={})`.
-
-    Arguments:
-        **kwargs: Additional keyword arguments
-            to be passed to `json.dumps()`.
-
-    Returns:
-        A JSON string.
-    """
-    if not self._is_graph_network:
-      raise NotImplementedError
-
-    def get_json_type(obj):
-      # If obj is any numpy type
-      if type(obj).__module__ == np.__name__:
-        return obj.item()
-
-      # If obj is a python 'type'
-      if type(obj).__name__ == type.__name__:
-        return obj.__name__
-
-      raise TypeError('Not JSON Serializable:', obj)
-
-    model_config = self._updated_config()
-    return json.dumps(model_config, default=get_json_type, **kwargs)
-
-  def to_yaml(self, **kwargs):
-    """Returns a yaml string containing the network configuration.
-
-    To load a network from a yaml save file, use
-    `keras.models.model_from_yaml(yaml_string, custom_objects={})`.
-
-    `custom_objects` should be a dictionary mapping
-    the names of custom losses / layers / etc to the corresponding
-    functions / classes.
-
-    Arguments:
-        **kwargs: Additional keyword arguments
-            to be passed to `yaml.dump()`.
-
-    Returns:
-        A YAML string.
-
-    Raises:
-        ImportError: if yaml module is not found.
-    """
-    if not self._is_graph_network:
-      raise NotImplementedError
-
-    if yaml is None:
-      raise ImportError('Requires yaml module installed.')
-    return yaml.dump(self._updated_config(), **kwargs)
-
-  def summary(self, line_length=None, positions=None, print_fn=None):
-    """Prints a string summary of the network.
-
-    Arguments:
-        line_length: Total length of printed lines
-            (e.g. set this to adapt the display to different
-            terminal window sizes).
-        positions: Relative or absolute positions of log elements
-            in each line. If not provided,
-            defaults to `[.33, .55, .67, 1.]`.
-        print_fn: Print function to use. Defaults to `print`.
-            It will be called on each line of the summary.
-            You can set it to a custom function
-            in order to capture the string summary.
-    """
-    print_layer_summary(self,
-                        line_length=line_length,
-                        positions=positions,
-                        print_fn=print_fn)
-
-
-def get_source_inputs(tensor, layer=None, node_index=None):
-  """Returns the list of input tensors necessary to compute `tensor`.
-
-  Output will always be a list of tensors
-  (potentially with 1 element).
-
-  Arguments:
-      tensor: The tensor to start from.
-      layer: Origin layer of the tensor. Will be
-          determined via tensor._keras_history if not provided.
-      node_index: Origin node index of the tensor.
-
-  Returns:
-      List of input tensors.
-  """
-  if not hasattr(tensor, '_keras_history'):
-    return tensor
-
-  if layer is None or node_index:
-    layer, node_index, _ = tensor._keras_history
-  if not layer._inbound_nodes:
-    return [tensor]
-  else:
-    node = layer._inbound_nodes[node_index]
-    if not node.inbound_layers:
-      # Reached an Input layer, stop recursion.
-      return node.input_tensors
-    else:
-      source_tensors = []
-      for i in range(len(node.inbound_layers)):
-        x = node.input_tensors[i]
-        layer = node.inbound_layers[i]
-        node_index = node.node_indices[i]
-        previous_sources = get_source_inputs(x, layer, node_index)
-        # Avoid input redundancy.
-        for x in previous_sources:
-          if x not in source_tensors:
-            source_tensors.append(x)
-      return source_tensors
-
-
-def _to_list(x):
-  """Normalizes a list/tensor into a list.
-
-  If a tensor is passed, we return
-  a list of size 1 containing the tensor.
-
-  Arguments:
-      x: target object to be normalized.
-
-  Returns:
-      A list.
-  """
-  if isinstance(x, list):
-    return x
-  return [x]
-
-
-def save_weights_to_hdf5_group(f, layers):
-  from tensorflow.python.keras._impl.keras import __version__ as keras_version  # pylint: disable=g-import-not-at-top
-
-  f.attrs['layer_names'] = [layer.name.encode('utf8') for layer in layers]
-  f.attrs['backend'] = K.backend().encode('utf8')
-  f.attrs['keras_version'] = str(keras_version).encode('utf8')
-
-  for layer in layers:
-    g = f.create_group(layer.name)
-    symbolic_weights = layer.weights
-    weight_values = K.batch_get_value(symbolic_weights)
-    weight_names = []
-    for i, (w, val) in enumerate(zip(symbolic_weights, weight_values)):
-      if hasattr(w, 'name') and w.name:
-        name = str(w.name)
-      else:
-        name = 'param_' + str(i)
-      weight_names.append(name.encode('utf8'))
-    g.attrs['weight_names'] = weight_names
-    for name, val in zip(weight_names, weight_values):
-      param_dset = g.create_dataset(name, val.shape, dtype=val.dtype)
-      if not val.shape:
-        # scalar
-        param_dset[()] = val
-      else:
-        param_dset[:] = val
-
-
-def preprocess_weights_for_loading(layer,
-                                   weights,
-                                   original_keras_version=None,
-                                   original_backend=None):
-  """Converts layers weights from Keras 1 format to Keras 2.
-
-  Arguments:
-      layer: Layer instance.
-      weights: List of weights values (Numpy arrays).
-      original_keras_version: Keras version for the weights, as a string.
-      original_backend: Keras backend the weights were trained with,
-          as a string.
-
-  Returns:
-      A list of weights values (Numpy arrays).
-  """
-  if layer.__class__.__name__ == 'Bidirectional':
-    num_weights_per_layer = len(weights) // 2
-    forward_weights = preprocess_weights_for_loading(
-        layer.forward_layer, weights[:num_weights_per_layer],
-        original_keras_version, original_backend)
-    backward_weights = preprocess_weights_for_loading(
-        layer.backward_layer, weights[num_weights_per_layer:],
-        original_keras_version, original_backend)
-    weights = forward_weights + backward_weights
-
-  if original_keras_version == '1':
-    if layer.__class__.__name__ == 'TimeDistributed':
-      weights = preprocess_weights_for_loading(
-          layer.layer, weights, original_keras_version, original_backend)
-
-    if layer.__class__.__name__ == 'Conv1D':
-      shape = weights[0].shape
-      # Handle Keras 1.1 format
-      if shape[:2] != (layer.kernel_size[0], 1) or shape[3] != layer.filters:
-        # Legacy shape:
-        # (filters, input_dim, filter_length, 1)
-        assert shape[0] == layer.filters and shape[2:] == (layer.kernel_size[0],
-                                                           1)
-        weights[0] = np.transpose(weights[0], (2, 3, 1, 0))
-      weights[0] = weights[0][:, 0, :, :]
-
-    if layer.__class__.__name__ == 'Conv2D':
-      if layer.data_format == 'channels_first':
-        # old: (filters, stack_size, kernel_rows, kernel_cols)
-        # new: (kernel_rows, kernel_cols, stack_size, filters)
-        weights[0] = np.transpose(weights[0], (2, 3, 1, 0))
-
-    if layer.__class__.__name__ == 'Conv2DTranspose':
-      if layer.data_format == 'channels_last':
-        # old: (kernel_rows, kernel_cols, stack_size, filters)
-        # new: (kernel_rows, kernel_cols, filters, stack_size)
-        weights[0] = np.transpose(weights[0], (0, 1, 3, 2))
-      if layer.data_format == 'channels_first':
-        # old: (filters, stack_size, kernel_rows, kernel_cols)
-        # new: (kernel_rows, kernel_cols, filters, stack_size)
-        weights[0] = np.transpose(weights[0], (2, 3, 0, 1))
-
-    if layer.__class__.__name__ == 'Conv3D':
-      if layer.data_format == 'channels_first':
-        # old: (filters, stack_size, ...)
-        # new: (..., stack_size, filters)
-        weights[0] = np.transpose(weights[0], (2, 3, 4, 1, 0))
-
-    if layer.__class__.__name__ == 'GRU':
-      if len(weights) == 9:
-        kernel = np.concatenate([weights[0], weights[3], weights[6]], axis=-1)
-        recurrent_kernel = np.concatenate(
-            [weights[1], weights[4], weights[7]], axis=-1)
-        bias = np.concatenate([weights[2], weights[5], weights[8]], axis=-1)
-        weights = [kernel, recurrent_kernel, bias]
-
-    if layer.__class__.__name__ == 'LSTM':
-      if len(weights) == 12:
-        # old: i, c, f, o
-        # new: i, f, c, o
-        kernel = np.concatenate(
-            [weights[0], weights[6], weights[3], weights[9]], axis=-1)
-        recurrent_kernel = np.concatenate(
-            [weights[1], weights[7], weights[4], weights[10]], axis=-1)
-        bias = np.concatenate(
-            [weights[2], weights[8], weights[5], weights[11]], axis=-1)
-        weights = [kernel, recurrent_kernel, bias]
-
-    if layer.__class__.__name__ == 'ConvLSTM2D':
-      if len(weights) == 12:
-        kernel = np.concatenate(
-            [weights[0], weights[6], weights[3], weights[9]], axis=-1)
-        recurrent_kernel = np.concatenate(
-            [weights[1], weights[7], weights[4], weights[10]], axis=-1)
-        bias = np.concatenate(
-            [weights[2], weights[8], weights[5], weights[11]], axis=-1)
-        if layer.data_format == 'channels_first':
-          # old: (filters, stack_size, kernel_rows, kernel_cols)
-          # new: (kernel_rows, kernel_cols, stack_size, filters)
-          kernel = np.transpose(kernel, (2, 3, 1, 0))
-          recurrent_kernel = np.transpose(recurrent_kernel, (2, 3, 1, 0))
-        weights = [kernel, recurrent_kernel, bias]
-
-    if layer.__class__.__name__ in ['Model', 'Sequential']:
-      new_weights = []
-      # trainable weights
-      for sublayer in layer.layers:
-        num_weights = len(sublayer.trainable_weights)
-        if num_weights > 0:
-          new_weights.extend(
-              preprocess_weights_for_loading(
-                  layer=sublayer,
-                  weights=weights[:num_weights],
-                  original_keras_version=original_keras_version,
-                  original_backend=original_backend))
-          weights = weights[num_weights:]
-
-      # non-trainable weights
-      for sublayer in layer.layers:
-        num_weights = len([
-            l for l in sublayer.weights if l not in sublayer.trainable_weights
-        ])
-        if num_weights > 0:
-          new_weights.extend(
-              preprocess_weights_for_loading(
-                  layer=sublayer,
-                  weights=weights[:num_weights],
-                  original_keras_version=original_keras_version,
-                  original_backend=original_backend))
-          weights = weights[num_weights:]
-      weights = new_weights
-
-  conv_layers = ['Conv1D', 'Conv2D', 'Conv3D', 'Conv2DTranspose', 'ConvLSTM2D']
-  if layer.__class__.__name__ in conv_layers:
-    if original_backend == 'theano':
-      weights[0] = conv_utils.convert_kernel(weights[0])
-      if layer.__class__.__name__ == 'ConvLSTM2D':
-        weights[1] = conv_utils.convert_kernel(weights[1])
-    if K.int_shape(layer.weights[0]) != weights[0].shape:
-      weights[0] = np.transpose(weights[0], (3, 2, 0, 1))
-      if layer.__class__.__name__ == 'ConvLSTM2D':
-        weights[1] = np.transpose(weights[1], (3, 2, 0, 1))
-
-  # Convert the weights of CuDNNLSTM so that they could be loaded into LSTM
-  if layer.__class__.__name__ == 'LSTM' and len(weights) == 3:
-    # Determine if loading a CuDNNLSTM layer from the number of bias weights:
-    # CuDNNLSTM has (units * 8) weights; while LSTM has (units * 4)
-    # if there's no bias weight in the file, skip this conversion
-    units = weights[1].shape[0]
-    bias = weights[2]
-    if len(bias) == units * 8:
-      # reshape the kernels
-      kernels = np.split(weights[0], 4, axis=1)
-      kernels = [
-          kernel.reshape(-1).reshape(kernel.shape, order='F')
-          for kernel in kernels
-      ]
-      weights[0] = np.concatenate(kernels, axis=1)
-
-      # transpose the recurrent kernels
-      recurrent_kernels = np.split(weights[1], 4, axis=1)
-      recurrent_kernels = [kernel.T for kernel in recurrent_kernels]
-      weights[1] = np.concatenate(recurrent_kernels, axis=1)
-
-      # split the bias into half and merge
-      weights[2] = bias[:units * 4] + bias[units * 4:]
-
-  return weights
-
-
-def load_weights_from_hdf5_group(f, layers):
-  """Implements topological (order-based) weight loading.
-
-  Arguments:
-      f: A pointer to a HDF5 group.
-      layers: a list of target layers.
-
-  Raises:
-      ValueError: in case of mismatch between provided layers
-          and weights file.
-  """
-  if 'keras_version' in f.attrs:
-    original_keras_version = f.attrs['keras_version'].decode('utf8')
-  else:
-    original_keras_version = '1'
-  if 'backend' in f.attrs:
-    original_backend = f.attrs['backend'].decode('utf8')
-  else:
-    original_backend = None
-
-  filtered_layers = []
-  for layer in layers:
-    weights = layer.weights
-    if weights:
-      filtered_layers.append(layer)
-
-  layer_names = [n.decode('utf8') for n in f.attrs['layer_names']]
-  filtered_layer_names = []
-  for name in layer_names:
-    g = f[name]
-    weight_names = [n.decode('utf8') for n in g.attrs['weight_names']]
-    if weight_names:
-      filtered_layer_names.append(name)
-  layer_names = filtered_layer_names
-  if len(layer_names) != len(filtered_layers):
-    raise ValueError('You are trying to load a weight file '
-                     'containing ' + str(len(layer_names)) +
-                     ' layers into a model with ' + str(len(filtered_layers)) +
-                     ' layers.')
-
-  # We batch weight value assignments in a single backend call
-  # which provides a speedup in TensorFlow.
-  weight_value_tuples = []
-  for k, name in enumerate(layer_names):
-    g = f[name]
-    weight_names = [n.decode('utf8') for n in g.attrs['weight_names']]
-    weight_values = [g[weight_name] for weight_name in weight_names]
-    layer = filtered_layers[k]
-    symbolic_weights = layer.weights
-    weight_values = preprocess_weights_for_loading(
-        layer, weight_values, original_keras_version, original_backend)
-    if len(weight_values) != len(symbolic_weights):
-      raise ValueError('Layer #' + str(k) + ' (named "' + layer.name +
-                       '" in the current model) was found to '
-                       'correspond to layer ' + name + ' in the save file. '
-                       'However the new layer ' + layer.name + ' expects ' +
-                       str(len(symbolic_weights)) +
-                       ' weights, but the saved weights have ' +
-                       str(len(weight_values)) + ' elements.')
-    weight_value_tuples += zip(symbolic_weights, weight_values)
-  K.batch_set_value(weight_value_tuples)
-
-
-def load_weights_from_hdf5_group_by_name(f, layers):
-  """Implements name-based weight loading.
-
-  (instead of topological weight loading).
-
-  Layers that have no matching name are skipped.
-
-  Arguments:
-      f: A pointer to a HDF5 group.
-      layers: a list of target layers.
-
-  Raises:
-      ValueError: in case of mismatch between provided layers
-          and weights file.
-  """
-  if 'keras_version' in f.attrs:
-    original_keras_version = f.attrs['keras_version'].decode('utf8')
-  else:
-    original_keras_version = '1'
-  if 'backend' in f.attrs:
-    original_backend = f.attrs['backend'].decode('utf8')
-  else:
-    original_backend = None
-
-  # New file format.
-  layer_names = [n.decode('utf8') for n in f.attrs['layer_names']]
-
-  # Reverse index of layer name to list of layers with name.
-  index = {}
-  for layer in layers:
-    if layer.name:
-      index.setdefault(layer.name, []).append(layer)
-
-  # We batch weight value assignments in a single backend call
-  # which provides a speedup in TensorFlow.
-  weight_value_tuples = []
-  for k, name in enumerate(layer_names):
-    g = f[name]
-    weight_names = [n.decode('utf8') for n in g.attrs['weight_names']]
-    weight_values = [g[weight_name] for weight_name in weight_names]
-
-    for layer in index.get(name, []):
-      symbolic_weights = layer.weights
-      weight_values = preprocess_weights_for_loading(
-          layer, weight_values, original_keras_version, original_backend)
-      if len(weight_values) != len(symbolic_weights):
-        raise ValueError('Layer #' + str(k) + ' (named "' + layer.name +
-                         '") expects ' + str(len(symbolic_weights)) +
-                         ' weight(s), but the saved weights' + ' have ' +
-                         str(len(weight_values)) + ' element(s).')
-      # Set values.
-      for i in range(len(weight_values)):
-        weight_value_tuples.append((symbolic_weights[i], weight_values[i]))
-  K.batch_set_value(weight_value_tuples)
-
-
-def shape_type_conversion(fn):
-  """Decorator that handles tuple/TensorShape conversion.
-
-  Used in `compute_output_shape` and `build`.
-
-  Arguments:
-    fn: function to wrap.
-
-  Returns:
-    Wrapped function.
-  """
-
-  def wrapper(instance, input_shape):
-    if input_shape is not None:
-      if isinstance(input_shape, list):
-        input_shape = [
-            tuple(tensor_shape.TensorShape(x).as_list()) for x in input_shape]
-      else:
-        input_shape = tuple(tensor_shape.TensorShape(input_shape).as_list())
-    output_shape = fn(instance, input_shape)
-    if output_shape is not None:
-      if isinstance(output_shape, list):
-        return [tensor_shape.TensorShape(x) for x in output_shape]
-      return tensor_shape.TensorShape(output_shape)
-
-  return wrapper
diff --git a/tensorflow/python/keras/_impl/keras/engine/topology_test.py b/tensorflow/python/keras/_impl/keras/engine/topology_test.py
index 28ddc094ee585ca4011d0cdaf190cfe826a2f0ce..04434323d6a9f8e12ad8f45c1e83819dfa8b3b96 100644
--- a/tensorflow/python/keras/_impl/keras/engine/topology_test.py
+++ b/tensorflow/python/keras/_impl/keras/engine/topology_test.py
@@ -18,13 +18,14 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import os
-import shutil
-
 import numpy as np
 
+from tensorflow.python.eager import context
+from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import test_util
 from tensorflow.python.keras._impl import keras
+from tensorflow.python.layers import base as tf_base_layers
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import state_ops
@@ -35,37 +36,255 @@ try:
 except ImportError:
   yaml = None
 
-try:
-  import h5py  # pylint:disable=g-import-not-at-top
-except ImportError:
-  h5py = None
-
 
 class TopologyConstructionTest(test.TestCase):
 
-  def test_get_updates_for(self):
-    a = keras.layers.Input(shape=(1,))
-    dense_layer = keras.layers.Dense(1)
-    dense_layer.build((None, 1))
-    update_1 = state_ops.assign_add(dense_layer.kernel, a)
-    update_2 = state_ops.assign_add(dense_layer.kernel, [[1.]])
-    dense_layer.add_update(update_1, inputs=a)
-    dense_layer.add_update(update_2, inputs=None)
-
-    self.assertListEqual(dense_layer.get_updates_for(a), [update_1])
-    self.assertListEqual(dense_layer.get_updates_for(None), [update_2])
-
-  def test_get_losses_for(self):
-    a = keras.layers.Input(shape=(1,))
-    dense_layer = keras.layers.Dense(1)
-    dense_layer.build((None, 1))
-    loss_1 = math_ops.reduce_sum(a)
-    loss_2 = math_ops.reduce_sum(dense_layer.kernel)
-    dense_layer.add_loss(loss_1, inputs=a)
-    dense_layer.add_loss(loss_2, inputs=None)
-
-    self.assertListEqual(dense_layer.get_losses_for(a), [loss_1])
-    self.assertListEqual(dense_layer.get_losses_for(None), [loss_2])
+  def test_get_updates(self):
+
+    class MyLayer(keras.layers.Layer):
+
+      def build(self, input_shape):
+        self.a = self.add_variable('a',
+                                   (1, 1),
+                                   'float32',
+                                   trainable=False)
+        self.b = self.add_variable('b',
+                                   (1, 1),
+                                   'float32',
+                                   trainable=False)
+        self.add_update(state_ops.assign_add(self.a, [[1.]]))
+        self.built = True
+
+      def call(self, inputs):
+        self.add_update(state_ops.assign_add(self.a, inputs),
+                        inputs=True)
+        return inputs + 1
+
+    x1 = keras.Input(shape=(1,))
+    layer = MyLayer()
+    _ = layer.apply(x1)
+
+    self.assertEqual(len(layer.updates), 2)
+    self.assertEqual(len(layer.get_updates_for(x1)), 1)
+    self.assertEqual(len(layer.get_updates_for(None)), 1)
+
+    x2 = keras.Input(shape=(1,))
+    y2 = layer.apply(x2)
+
+    self.assertEqual(len(layer.updates), 3)
+    self.assertEqual(len(layer.get_updates_for(x1)), 1)
+    self.assertEqual(len(layer.get_updates_for(x2)), 1)
+    self.assertEqual(len(layer.get_updates_for(None)), 1)
+
+    network = keras.engine.Network(x2, y2)
+    self.assertEqual(len(network.updates), 2)
+    self.assertEqual(len(network.get_updates_for(x1)), 0)
+    self.assertEqual(len(network.get_updates_for(x2)), 1)
+    self.assertEqual(len(network.get_updates_for(None)), 1)
+
+    x3 = keras.Input(shape=(1,))
+    _ = layer.apply(x3)
+    self.assertEqual(len(network.updates), 2)
+
+    x4 = keras.Input(shape=(1,))
+    _ = network(x4)
+    self.assertEqual(len(network.updates), 3)
+    self.assertEqual(len(network.get_updates_for(x2)), 1)
+    self.assertEqual(len(network.get_updates_for(x4)), 1)
+    self.assertEqual(len(network.get_updates_for(None)), 1)
+
+    network.add_update(state_ops.assign_add(layer.a, [[1]]))
+    self.assertEqual(len(network.updates), 4)
+    self.assertEqual(len(network.get_updates_for(None)), 2)
+
+    network.add_update(state_ops.assign_add(layer.a, x4), inputs=True)
+    self.assertEqual(len(network.updates), 5)
+    self.assertEqual(len(network.get_updates_for(x4)), 2)
+
+  def test_get_losses(self):
+
+    class MyLayer(keras.layers.Layer):
+
+      def build(self, input_shape):
+        self.a = self.add_variable('a',
+                                   (1, 1),
+                                   'float32',
+                                   trainable=False)
+        self.b = self.add_variable('b',
+                                   (1, 1),
+                                   'float32',
+                                   trainable=False)
+        self.add_loss(math_ops.reduce_sum(self.a))
+        self.built = True
+
+      def call(self, inputs):
+        self.add_loss(math_ops.reduce_sum(inputs),
+                      inputs=True)
+        return inputs + 1
+
+    x1 = keras.Input(shape=(1,))
+    layer = MyLayer()
+    _ = layer.apply(x1)
+
+    self.assertEqual(len(layer.losses), 2)
+    self.assertEqual(len(layer.get_losses_for(x1)), 1)
+    self.assertEqual(len(layer.get_losses_for(None)), 1)
+
+    x2 = keras.Input(shape=(1,))
+    y2 = layer.apply(x2)
+
+    self.assertEqual(len(layer.losses), 3)
+    self.assertEqual(len(layer.get_losses_for(x1)), 1)
+    self.assertEqual(len(layer.get_losses_for(x2)), 1)
+    self.assertEqual(len(layer.get_losses_for(None)), 1)
+
+    network = keras.engine.Network(x2, y2)
+    self.assertEqual(len(network.losses), 2)
+    self.assertEqual(len(network.get_losses_for(x1)), 0)
+    self.assertEqual(len(network.get_losses_for(x2)), 1)
+    self.assertEqual(len(network.get_losses_for(None)), 1)
+
+    x3 = keras.Input(shape=(1,))
+    _ = layer.apply(x3)
+    self.assertEqual(len(network.losses), 2)
+
+    x4 = keras.Input(shape=(1,))
+    _ = network(x4)
+    self.assertEqual(len(network.losses), 3)
+    self.assertEqual(len(network.get_losses_for(x2)), 1)
+    self.assertEqual(len(network.get_losses_for(x4)), 1)
+    self.assertEqual(len(network.get_losses_for(None)), 1)
+
+    network.add_loss(math_ops.reduce_sum(layer.a))
+    self.assertEqual(len(network.losses), 4)
+    self.assertEqual(len(network.get_losses_for(None)), 2)
+
+    network.add_loss(math_ops.reduce_sum(x4), inputs=True)
+    self.assertEqual(len(network.losses), 5)
+    self.assertEqual(len(network.get_losses_for(x4)), 2)
+
+  def testTopologicalAttributes(self):
+    # test layer attributes / methods related to cross-layer connectivity.
+    a = keras.Input(shape=(32,), name='input_a')
+    b = keras.Input(shape=(32,), name='input_b')
+
+    # test input, output, input_shape, output_shape
+    test_layer = keras.layers.Dense(16, name='test_layer')
+    a_test = test_layer(a)
+    self.assertEqual(test_layer.input, a)
+    self.assertEqual(test_layer.output, a_test)
+    self.assertEqual(test_layer.input_shape, (None, 32))
+    self.assertEqual(test_layer.output_shape, (None, 16))
+
+    # test `get_*_at` methods
+    dense = keras.layers.Dense(16, name='dense_1')
+    a_2 = dense(a)
+    b_2 = dense(b)
+
+    self.assertEqual(dense.get_input_at(0), a)
+    self.assertEqual(dense.get_input_at(1), b)
+    self.assertEqual(dense.get_output_at(0), a_2)
+    self.assertEqual(dense.get_output_at(1), b_2)
+    self.assertEqual(dense.get_input_shape_at(0), (None, 32))
+    self.assertEqual(dense.get_input_shape_at(1), (None, 32))
+    self.assertEqual(dense.get_output_shape_at(0), (None, 16))
+    self.assertEqual(dense.get_output_shape_at(1), (None, 16))
+
+    # Test invalid value for attribute retrieval.
+    with self.assertRaises(ValueError):
+      dense.get_input_at(2)
+    with self.assertRaises(AttributeError):
+      new_dense = keras.layers.Dense(16)
+      _ = new_dense.input
+    with self.assertRaises(AttributeError):
+      new_dense = keras.layers.Dense(16)
+      _ = new_dense.output
+    with self.assertRaises(AttributeError):
+      new_dense = keras.layers.Dense(16)
+      _ = new_dense.output_shape
+    with self.assertRaises(AttributeError):
+      new_dense = keras.layers.Dense(16)
+      _ = new_dense.input_shape
+    with self.assertRaises(AttributeError):
+      new_dense = keras.layers.Dense(16)
+      a = keras.Input(shape=(3, 32))
+      b = keras.Input(shape=(5, 32))
+      a_2 = new_dense(a)
+      b_2 = new_dense(b)
+      _ = new_dense.input_shape
+    with self.assertRaises(AttributeError):
+      new_dense = keras.layers.Dense(16)
+      a = keras.Input(shape=(3, 32))
+      b = keras.Input(shape=(5, 32))
+      a_2 = new_dense(a)
+      b_2 = new_dense(b)
+      _ = new_dense.output_shape
+
+  def testTopologicalAttributesMultiOutputLayer(self):
+
+    class PowersLayer(keras.layers.Layer):
+
+      def call(self, inputs):
+        return [inputs**2, inputs**3]
+
+    x = keras.Input(shape=(32,))
+    test_layer = PowersLayer()
+    p1, p2 = test_layer(x)  # pylint: disable=not-callable
+
+    self.assertEqual(test_layer.input, x)
+    self.assertEqual(test_layer.output, [p1, p2])
+    self.assertEqual(test_layer.input_shape, (None, 32))
+    self.assertEqual(test_layer.output_shape, [(None, 32), (None, 32)])
+
+  def testTopologicalAttributesMultiInputLayer(self):
+
+    class AddLayer(keras.layers.Layer):
+
+      def call(self, inputs):
+        assert len(inputs) == 2
+        return inputs[0] + inputs[1]
+
+    a = keras.Input(shape=(32,))
+    b = keras.Input(shape=(32,))
+    test_layer = AddLayer()
+    y = test_layer([a, b])  # pylint: disable=not-callable
+
+    self.assertEqual(test_layer.input, [a, b])
+    self.assertEqual(test_layer.output, y)
+    self.assertEqual(test_layer.input_shape, [(None, 32), (None, 32)])
+    self.assertEqual(test_layer.output_shape, (None, 32))
+
+  def testBasicNetwork(self):
+    # minimum viable network
+    x = keras.Input(shape=(32,))
+    dense = keras.layers.Dense(2)
+    y = dense(x)
+    network = keras.engine.Network(x, y, name='dense_network')
+
+    # test basic attributes
+    self.assertEqual(network.name, 'dense_network')
+    self.assertEqual(len(network.layers), 2)  # InputLayer + Dense
+    self.assertEqual(network.layers[1], dense)
+    self.assertEqual(network.weights, dense.weights)
+    self.assertEqual(network.trainable_weights, dense.trainable_weights)
+    self.assertEqual(network.non_trainable_weights, dense.non_trainable_weights)
+
+    # test callability on Input
+    x_2 = keras.Input(shape=(32,))
+    y_2 = network(x_2)
+    self.assertEqual(y_2.get_shape().as_list(), [None, 2])
+
+    # test callability on regular tensor
+    x_2 = array_ops.placeholder(dtype='float32', shape=(None, 32))
+    y_2 = network(x_2)
+    self.assertEqual(y_2.get_shape().as_list(), [None, 2])
+
+    # test network `trainable` attribute
+    network.trainable = False
+    self.assertEqual(network.weights, dense.weights)
+    self.assertEqual(network.trainable_weights, [])
+    self.assertEqual(network.non_trainable_weights,
+                     dense.trainable_weights + dense.non_trainable_weights)
 
   def test_trainable_weights(self):
     a = keras.layers.Input(shape=(2,))
@@ -108,41 +327,6 @@ class TopologyConstructionTest(test.TestCase):
     self.assertListEqual(model.trainable_weights, [])
     self.assertListEqual(model.non_trainable_weights, weights)
 
-  def test_weight_loading(self):
-    with self.test_session():
-      a = keras.layers.Input(shape=(2,))
-      x = keras.layers.Dense(3)(a)
-      b = keras.layers.Dense(1)(x)
-      model = keras.models.Model(a, b)
-
-      x = np.random.random((3, 2))
-      ref_y = model.predict(x)
-      weights = model.get_weights()
-      model.set_weights(weights)
-      y = model.predict(x)
-      self.assertAllClose(ref_y, y)
-
-      with self.assertRaises(ValueError):
-        model.set_weights(weights[1:])
-      with self.assertRaises(ValueError):
-        model.set_weights(weights[::-1])
-
-      if h5py is None:
-        return  # Skip rest of test if H5py isn't available.
-
-      temp_dir = self.get_temp_dir()
-      self.addCleanup(shutil.rmtree, temp_dir)
-
-      h5_path = os.path.join(temp_dir, 'test.h5')
-      model.save_weights(h5_path)
-      model.load_weights(h5_path)
-      y = model.predict(x)
-      self.assertAllClose(ref_y, y)
-
-      model.load_weights(h5_path, by_name=True)
-      y = model.predict(x)
-      self.assertAllClose(ref_y, y)
-
   def test_learning_phase(self):
     with self.test_session():
       a = keras.layers.Input(shape=(32,), name='input_a')
@@ -310,7 +494,7 @@ class TopologyConstructionTest(test.TestCase):
       self.assertListEqual([x.shape for x in fn_outputs], [(10, 64), (10, 5)])
 
       # test get_source_inputs
-      self.assertListEqual(keras.engine.topology.get_source_inputs(c), [a, b])
+      self.assertListEqual(keras.engine.network.get_source_inputs(c), [a, b])
 
       # serialization / deserialization
       json_config = model.to_json()
@@ -348,7 +532,7 @@ class TopologyConstructionTest(test.TestCase):
       e = keras.layers.Input(shape=(32,), name='input_e')
       f = keras.layers.Input(shape=(32,), name='input_f')
       g, h = model([e, f])
-      self.assertEqual(g.name, 'model_1/dense_2/BiasAdd:0')
+      self.assertEqual(g.name, 'model/dense_2/BiasAdd:0')
 
       self.assertListEqual(g.get_shape().as_list(), c.get_shape().as_list())
       self.assertListEqual(h.get_shape().as_list(), d.get_shape().as_list())
@@ -555,6 +739,42 @@ class TopologyConstructionTest(test.TestCase):
     model = keras.models.Model(a, b)
     self.assertEqual(model.output_mask.get_shape().as_list(), [None, 10])
 
+  def testMaskingSingleInput(self):
+
+    class MaskedLayer(keras.layers.Layer):
+
+      def call(self, inputs, mask=None):
+        if mask is not None:
+          return inputs * mask
+        return inputs
+
+      def compute_mask(self, inputs, mask=None):
+        return array_ops.ones_like(inputs)
+
+    if context.in_graph_mode():
+      x = keras.Input(shape=(32,))
+      y = MaskedLayer()(x)  # pylint: disable=not-callable
+      network = keras.engine.Network(x, y)
+
+      # test callability on Input
+      x_2 = keras.Input(shape=(32,))
+      y_2 = network(x_2)
+      self.assertEqual(y_2.get_shape().as_list(), [None, 32])
+
+      # test callability on regular tensor
+      x_2 = array_ops.placeholder(dtype='float32', shape=(None, 32))
+      y_2 = network(x_2)
+      self.assertEqual(y_2.get_shape().as_list(), [None, 32])
+    else:
+      a = constant_op.constant([2] * 32)
+      mask = constant_op.constant([0, 1] * 16)
+      a._keras_mask = mask
+      b = MaskedLayer().apply(a)
+      self.assertTrue(hasattr(b, '_keras_mask'))
+      self.assertAllEqual(self.evaluate(array_ops.ones_like(mask)),
+                          self.evaluate(getattr(b, '_keras_mask')))
+      self.assertAllEqual(self.evaluate(a * mask), self.evaluate(b))
+
   def test_activity_regularization_with_model_composition(self):
 
     def reg(x):
@@ -576,97 +796,6 @@ class TopologyConstructionTest(test.TestCase):
     loss = model_b.evaluate(x)
     self.assertEqual(loss, 4.)
 
-  def test_weight_preprocessing(self):
-    input_dim = 3
-    output_dim = 3
-    size = 2
-    cases = [
-        [
-            (keras.layers.Bidirectional(keras.layers.SimpleRNN(2))),
-            [np.random.random((2, 1)), np.random.random((2, 1))],
-            (None, 3, 2),
-        ],
-        [
-            (keras.layers.TimeDistributed(keras.layers.Dense(1))),
-            [np.random.random((2, 1)), np.random.random((1,))],
-            (None, 3, 2),
-        ],
-        [
-            (keras.layers.Conv1D(output_dim, size, use_bias=False)),
-            [np.random.random((output_dim, input_dim, size, 1))],
-            (None, 4, input_dim),
-        ],
-        [
-            (keras.layers.Conv2D(output_dim, size,
-                                 use_bias=False, data_format='channels_first')),
-            [np.random.random((output_dim, input_dim, size, size))],
-            (None, input_dim, 4, 4),
-        ],
-        [
-            (keras.layers.Conv2DTranspose(output_dim, size,
-                                          use_bias=False,
-                                          data_format='channels_first')),
-            [np.random.random((output_dim, input_dim, size, size))],
-            (None, input_dim, 4, 4),
-        ],
-        [
-            (keras.layers.Conv2DTranspose(output_dim, size,
-                                          use_bias=False,
-                                          data_format='channels_last')),
-            [np.random.random((size, size, input_dim, output_dim))],
-            (None, 4, 4, input_dim),
-        ],
-        [
-            (keras.layers.Conv3D(output_dim, size,
-                                 use_bias=False, data_format='channels_first')),
-            [np.random.random((output_dim, input_dim, size, size, size))],
-            (None, input_dim, 4, 4, 4),
-        ],
-        [
-            (keras.layers.GRU(output_dim)),
-            [np.random.random((input_dim, output_dim)),
-             np.random.random((output_dim, output_dim)),
-             np.random.random((output_dim,)),
-             np.random.random((input_dim, output_dim)),
-             np.random.random((output_dim, output_dim)),
-             np.random.random((output_dim,)),
-             np.random.random((input_dim, output_dim)),
-             np.random.random((output_dim, output_dim)),
-             np.random.random((output_dim,))],
-            (None, 4, input_dim),
-        ],
-        [
-            (keras.layers.LSTM(output_dim)),
-            [np.random.random((input_dim, output_dim)),
-             np.random.random((output_dim, output_dim)),
-             np.random.random((output_dim,)),
-             np.random.random((input_dim, output_dim)),
-             np.random.random((output_dim, output_dim)),
-             np.random.random((output_dim,)),
-             np.random.random((input_dim, output_dim)),
-             np.random.random((output_dim, output_dim)),
-             np.random.random((output_dim,)),
-             np.random.random((input_dim, output_dim)),
-             np.random.random((output_dim, output_dim)),
-             np.random.random((output_dim,))],
-            (None, 4, input_dim),
-        ],
-    ]
-    for layer, weights, input_shape in cases:
-      layer.build(input_shape)
-      _ = keras.engine.topology.preprocess_weights_for_loading(
-          layer, weights, original_keras_version='1')
-
-    model = keras.models.Sequential([keras.layers.Dense(2, input_dim=2)])
-    _ = keras.engine.topology.preprocess_weights_for_loading(
-        model, model.weights, original_keras_version='1')
-
-    x = keras.Input((2,))
-    y = keras.layers.Dense(2)(x)
-    model = keras.models.Model(x, y)
-    _ = keras.engine.topology.preprocess_weights_for_loading(
-        model, model.weights, original_keras_version='1')
-
   def test_layer_sharing_at_heterogenous_depth(self):
     with self.test_session():
       x_val = np.random.random((10, 5))
@@ -715,5 +844,92 @@ class TopologyConstructionTest(test.TestCase):
       output_val_2 = m2.predict(x_val)
       self.assertAllClose(output_val, output_val_2, atol=1e-6)
 
+  def test_explicit_training_argument(self):
+    with self.test_session():
+      a = keras.layers.Input(shape=(2,))
+      b = keras.layers.Dropout(0.5)(a)
+      base_model = keras.models.Model(a, b)
+
+      a = keras.layers.Input(shape=(2,))
+      b = base_model(a, training=False)
+      model = keras.models.Model(a, b)
+
+      x = np.ones((100, 2))
+      y = np.ones((100, 2))
+      model.compile(optimizer='sgd', loss='mse')
+      loss = model.train_on_batch(x, y)
+      self.assertEqual(loss, 0)  # In inference mode, output is equal to input.
+
+      a = keras.layers.Input(shape=(2,))
+      b = base_model(a, training=True)
+      model = keras.models.Model(a, b)
+      preds = model.predict(x)
+      self.assertEqual(np.min(preds), 0.)  # At least one unit was dropped.
+
+
+class DeferredModeTest(test.TestCase):
+
+  def testDeferredTensorAttributes(self):
+    x = tf_base_layers._DeferredTensor(shape=(None, 2),
+                                       dtype='float32',
+                                       name='x')
+    self.assertEqual(str(x),
+                     'DeferredTensor(\'x\', shape=(?, 2), dtype=float32)')
+    self.assertEqual(repr(x),
+                     '<_DeferredTensor \'x\' shape=(?, 2) dtype=float32>')
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testSimpleNetworkBuilding(self):
+    inputs = keras.engine.Input(shape=(32,))
+    if context.in_eager_mode():
+      self.assertIsInstance(inputs, tf_base_layers._DeferredTensor)
+      self.assertEqual(inputs.dtype.name, 'float32')
+      self.assertEqual(inputs.shape.as_list(), [None, 32])
+
+    x = keras.layers.Dense(2)(inputs)
+    if context.in_eager_mode():
+      self.assertIsInstance(x, tf_base_layers._DeferredTensor)
+      self.assertEqual(x.dtype.name, 'float32')
+      self.assertEqual(x.shape.as_list(), [None, 2])
+
+    outputs = keras.layers.Dense(4)(x)
+    network = keras.engine.Network(inputs, outputs)
+    self.assertIsInstance(network, keras.engine.Network)
+
+    if context.in_eager_mode():
+      # It should be possible to call such a network on EagerTensors.
+      inputs = constant_op.constant(
+          np.random.random((10, 32)).astype('float32'))
+      outputs = network(inputs)
+      self.assertEqual(outputs.shape.as_list(), [10, 4])
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testMultiIONetworkBuilding(self):
+    input_a = keras.engine.Input(shape=(32,))
+    input_b = keras.engine.Input(shape=(16,))
+    a = keras.layers.Dense(16)(input_a)
+
+    class AddLayer(keras.layers.Layer):
+
+      def call(self, inputs):
+        return inputs[0] + inputs[1]
+
+      def compute_output_shape(self, input_shape):
+        return input_shape[0]
+
+    c = AddLayer()([a, input_b])  # pylint: disable=not-callable
+    c = keras.layers.Dense(2)(c)
+
+    network = keras.engine.Network([input_a, input_b], [a, c])
+    if context.in_eager_mode():
+      a_val = constant_op.constant(
+          np.random.random((10, 32)).astype('float32'))
+      b_val = constant_op.constant(
+          np.random.random((10, 16)).astype('float32'))
+      outputs = network([a_val, b_val])
+      self.assertEqual(len(outputs), 2)
+      self.assertEqual(outputs[0].shape.as_list(), [10, 16])
+      self.assertEqual(outputs[1].shape.as_list(), [10, 2])
+
 if __name__ == '__main__':
   test.main()
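
Reviewer note on `test_get_updates` / `test_get_losses` above: the counts they assert follow from a simple rule. An update or loss registered with no inputs is unconditional and is returned by `get_updates_for(None)`; one registered during a call is tied to that call's inputs. The sketch below mimics only that bookkeeping in plain Python. `ToyLayer` is a hypothetical stand-in, not the Keras implementation, and it takes the input object directly where the real `call` passes `inputs=True`.

```python
class ToyLayer:
    """Hypothetical stand-in that mimics update bookkeeping only."""

    def __init__(self):
        self.built = False
        self._records = []  # (update, inputs) pairs; inputs is None if unconditional

    @property
    def updates(self):
        # Every registered update, conditional or not, in registration order.
        return [update for update, _ in self._records]

    def add_update(self, update, inputs=None):
        self._records.append((update, inputs))

    def get_updates_for(self, inputs):
        # Identity comparison: `None` selects the unconditional updates,
        # any other object selects the updates created for that input.
        return [update for update, key in self._records if key is inputs]

    def __call__(self, inputs):
        if not self.built:
            # Registered once, at build time, with no inputs: unconditional.
            self.add_update('a += 1')
            self.built = True
        # Registered on every call and tied to this call's inputs.
        self.add_update('a += inputs', inputs=inputs)
        return inputs


x1, x2 = object(), object()  # placeholders standing in for keras.Input tensors
layer = ToyLayer()
layer(x1)
counts_after_x1 = (len(layer.updates),
                   len(layer.get_updates_for(x1)),
                   len(layer.get_updates_for(None)))
layer(x2)
counts_after_x2 = (len(layer.updates),
                   len(layer.get_updates_for(x1)),
                   len(layer.get_updates_for(x2)),
                   len(layer.get_updates_for(None)))
```

Under these assumptions the two tuples line up with the first two assertion groups in `test_get_updates`: one unconditional update from `build`, plus one conditional update per call. The `Network`-level counts in the test additionally depend on which calls belong to the network's graph, which this sketch does not model.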
diff --git a/tensorflow/python/keras/_impl/keras/engine/training.py b/tensorflow/python/keras/_impl/keras/engine/training.py
index fd14bf3d05f13bba3b5cfdc15d3add3c0e48138f..81ab77094eac0ff535d08939e663bdca46b90565 100644
--- a/tensorflow/python/keras/_impl/keras/engine/training.py
+++ b/tensorflow/python/keras/_impl/keras/engine/training.py
@@ -18,504 +18,90 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import copy
-
 import numpy as np
 
 from tensorflow.python.eager import context
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.keras._impl.keras import backend as K
-from tensorflow.python.keras._impl.keras import callbacks as cbks
 from tensorflow.python.keras._impl.keras import losses
 from tensorflow.python.keras._impl.keras import metrics as metrics_module
 from tensorflow.python.keras._impl.keras import optimizers
+from tensorflow.python.keras._impl.keras.engine import training_arrays
 from tensorflow.python.keras._impl.keras.engine import training_eager
-from tensorflow.python.keras._impl.keras.engine.topology import Layer
-from tensorflow.python.keras._impl.keras.engine.topology import Network
-from tensorflow.python.keras._impl.keras.utils.data_utils import GeneratorEnqueuer
-from tensorflow.python.keras._impl.keras.utils.data_utils import OrderedEnqueuer
-from tensorflow.python.keras._impl.keras.utils.data_utils import Sequence
-from tensorflow.python.keras._impl.keras.utils.generic_utils import make_batches
-from tensorflow.python.keras._impl.keras.utils.generic_utils import Progbar
+from tensorflow.python.keras._impl.keras.engine import training_generator
+from tensorflow.python.keras._impl.keras.engine import training_utils
+from tensorflow.python.keras._impl.keras.engine.base_layer import Layer
+from tensorflow.python.keras._impl.keras.engine.network import Network
 from tensorflow.python.keras._impl.keras.utils.generic_utils import slice_arrays
 from tensorflow.python.layers.base import _DeferredTensor
+from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import optimizer as tf_optimizer_module
 from tensorflow.python.util.tf_export import tf_export
 
-try:
-  from scipy.sparse import issparse  # pylint: disable=g-import-not-at-top
-except ImportError:
-  issparse = None
-
-
-def _standardize_input_data(data,
-                            names,
-                            shapes=None,
-                            check_batch_axis=True,
-                            exception_prefix=''):
-  """Normalizes inputs and targets provided by users.
-
-  Users may pass data as a list of arrays, dictionary of arrays,
-  or as a single array. We normalize this to an ordered list of
-  arrays (same order as `names`), while checking that the provided
-  arrays have shapes that match the network's expectations.
-
-  Arguments:
-      data: User-provided input data (polymorphic).
-      names: List of expected array names.
-      shapes: Optional list of expected array shapes.
-      check_batch_axis: Boolean; whether to check that
-          the batch axis of the arrays matches the expected
-          value found in `shapes`.
-      exception_prefix: String prefix used for exception formatting.
-
-  Returns:
-      List of standardized input arrays (one array per model input).
-
-  Raises:
-      ValueError: in case of improperly formatted user-provided data.
-  """
-  if not names:
-    if data is not None and hasattr(data, '__len__') and len(data):
-      raise ValueError('Error when checking model ' + exception_prefix + ': '
-                       'expected no data, but got:', data)
-    return []
-  if data is None:
-    return [None for _ in range(len(names))]
-
-  if isinstance(data, dict):
-    try:
-      data = [
-          data[x].values
-          if data[x].__class__.__name__ == 'DataFrame' else data[x]
-          for x in names
-      ]
-    except KeyError as e:
-      raise ValueError('No data provided for "' + e.args[0] + '". Need data '
-                       'for each key in: ' + str(names))
-  elif isinstance(data, list):
-    if isinstance(data[0], list):
-      data = [np.asarray(d) for d in data]
-    elif len(names) == 1 and isinstance(data[0], (float, int)):
-      data = [np.asarray(data)]
-    else:
-      data = [
-          x.values if x.__class__.__name__ == 'DataFrame' else x for x in data
-      ]
-  else:
-    data = data.values if data.__class__.__name__ == 'DataFrame' else data
-    data = [data]
-  data = [
-      np.expand_dims(x, 1) if x is not None and x.ndim == 1 else x for x in data
-  ]
-
-  if len(data) != len(names):
-    if data and hasattr(data[0], 'shape'):
-      raise ValueError('Error when checking model ' + exception_prefix +
-                       ': the list of Numpy arrays that you are passing to '
-                       'your model is not the size the model expected. '
-                       'Expected to see ' + str(len(names)) + ' array(s), '
-                       'but instead got the following list of ' +
-                       str(len(data)) + ' arrays: ' + str(data)[:200] + '...')
-    elif len(names) > 1:
-      raise ValueError(
-          'Error when checking model ' + exception_prefix +
-          ': you are passing a list as input to your model, '
-          'but the model expects a list of ' + str(len(names)) +
-          ' Numpy arrays instead. The list you passed was: ' + str(data)[:200])
-    elif len(data) == 1 and not hasattr(data[0], 'shape'):
-      raise TypeError('Error when checking model ' + exception_prefix +
-                      ': data should be a Numpy array, or list/dict of '
-                      'Numpy arrays. Found: ' + str(data)[:200] + '...')
-    elif len(names) == 1:
-      data = [np.asarray(data)]
-
-  # Check shapes compatibility.
-  if shapes:
-    for i in range(len(names)):
-      if shapes[i] is not None:
-        data_shape = data[i].shape
-        shape = shapes[i]
-        if data[i].ndim != len(shape):
-          raise ValueError('Error when checking ' + exception_prefix +
-                           ': expected ' + names[i] + ' to have ' +
-                           str(len(shape)) + ' dimensions, but got array '
-                           'with shape ' + str(data_shape))
-        if not check_batch_axis:
-          data_shape = data_shape[1:]
-          shape = shape[1:]
-        for dim, ref_dim in zip(data_shape, shape):
-          if ref_dim != dim and ref_dim:
-            raise ValueError(
-                'Error when checking ' + exception_prefix + ': expected ' +
-                names[i] + ' to have shape ' + str(shape) +
-                ' but got array with shape ' + str(data_shape))
-  return data
 
+@tf_export('keras.models.Model', 'keras.Model')
+class Model(Network):
+  """`Model` groups layers into an object with training and inference features.
 
-def _standardize_sample_or_class_weights(x_weight, output_names, weight_type):
-  """Maps `sample_weight` or `class_weight` to model outputs.
+  There are two ways to instantiate a `Model`:
 
-  Arguments:
-      x_weight: User-provided `sample_weight` or `class_weight` argument.
-      output_names: List of output names (strings) in the model.
-      weight_type: A string used purely for exception printing.
+  1 - With the "functional API", where you start from `Input`,
+  you chain layer calls to specify the model's forward pass,
+  and finally you create your model from inputs and outputs:
 
-  Returns:
-      A list of `sample_weight` or `class_weight` where there are exactly
-          one element per model output.
+  ```python
+  import tensorflow as tf
 
-  Raises:
-      ValueError: In case of invalid user-provided argument.
-  """
-  if x_weight is None or len(x_weight) == 0:  # pylint: disable=g-explicit-length-test
-    return [None for _ in output_names]
-  if len(output_names) == 1:
-    if isinstance(x_weight, list) and len(x_weight) == 1:
-      return x_weight
-    if isinstance(x_weight, dict) and output_names[0] in x_weight:
-      return [x_weight[output_names[0]]]
-    else:
-      return [x_weight]
-  if isinstance(x_weight, list):
-    if len(x_weight) != len(output_names):
-      raise ValueError('Provided `' + weight_type + '` was a list of ' +
-                       str(len(x_weight)) + ' elements, but the model has ' +
-                       str(len(output_names)) + ' outputs. '
-                       'You should provide one `' + weight_type + '`'
-                       'array per model output.')
-    return x_weight
-  if isinstance(x_weight, dict):
-    x_weights = []
-    for name in output_names:
-      x_weights.append(x_weight.get(name))
-    return x_weights
-  else:
-    raise TypeError(
-        'The model has multiple outputs, so `' + weight_type + '` '
-        'should be either a list or a dict. '
-        'Provided `' + weight_type + '` type not understood: ' + str(x_weight))
-
-
-def _standardize_class_weights(class_weight, output_names):
-  return _standardize_sample_or_class_weights(class_weight, output_names,
-                                              'class_weight')
-
-
-def _standardize_sample_weights(sample_weight, output_names):
-  return _standardize_sample_or_class_weights(sample_weight, output_names,
-                                              'sample_weight')
-
-
-def _check_array_lengths(inputs, targets, weights=None):
-  """Does user input validation for numpy arrays.
-
-  Arguments:
-      inputs: list of Numpy arrays of inputs.
-      targets: list of Numpy arrays of targets.
-      weights: list of Numpy arrays of sample weights.
-
-  Raises:
-      ValueError: in case of incorrectly formatted data.
-  """
+  inputs = tf.keras.Input(shape=(3,))
+  x = tf.keras.layers.Dense(4, activation=tf.nn.relu)(inputs)
+  outputs = tf.keras.layers.Dense(5, activation=tf.nn.softmax)(x)
+  model = tf.keras.Model(inputs=inputs, outputs=outputs)
+  ```
 
-  def set_of_lengths(x):
-    # return a set with the variation between
-    # different shapes, with None => 0
-    if x is None:
-      return {0}
-    else:
-      return set([0 if y is None else y.shape[0] for y in x])
-
-  set_x = set_of_lengths(inputs)
-  set_y = set_of_lengths(targets)
-  set_w = set_of_lengths(weights)
-  if len(set_x) > 1:
-    raise ValueError('All input arrays (x) should have '
-                     'the same number of samples. Got array shapes: ' +
-                     str([x.shape for x in inputs]))
-  if len(set_y) > 1:
-    raise ValueError('All target arrays (y) should have '
-                     'the same number of samples. Got array shapes: ' +
-                     str([y.shape for y in targets]))
-  if set_x and set_y and list(set_x)[0] != list(set_y)[0]:
-    raise ValueError('Input arrays should have '
-                     'the same number of samples as target arrays. '
-                     'Found ' + str(list(set_x)[0]) + ' input samples '
-                     'and ' + str(list(set_y)[0]) + ' target samples.')
-  if len(set_w) > 1:
-    raise ValueError('All sample_weight arrays should have '
-                     'the same number of samples. Got array shapes: ' +
-                     str([w.shape for w in weights]))
-  if set_y and set_w and list(set_y)[0] != list(set_w)[0]:
-    raise ValueError('Sample_weight arrays should have '
-                     'the same number of samples as target arrays. Got ' +
-                     str(list(set_y)[0]) + ' input samples and ' +
-                     str(list(set_w)[0]) + ' target samples.')
-
-
-def _check_loss_and_target_compatibility(targets, loss_fns, output_shapes):
-  """Does validation on the compatibility of targets and loss functions.
-
-  This helps prevent users from using loss functions incorrectly.
-
-  Arguments:
-      targets: list of Numpy arrays of targets.
-      loss_fns: list of loss functions.
-      output_shapes: list of shapes of model outputs.
-
-  Raises:
-      ValueError: if a loss function or target array
-          is incompatible with an output.
-  """
-  key_losses = {
-      losses.mean_squared_error, losses.binary_crossentropy,
-      losses.categorical_crossentropy
-  }
-  for y, loss, shape in zip(targets, loss_fns, output_shapes):
-    if y is None or loss is None:
-      continue
-    if loss is losses.categorical_crossentropy:
-      if y.shape[-1] == 1:
-        raise ValueError('You are passing a target array of shape ' + str(
-            y.shape) + ' while using as loss `categorical_crossentropy`. '
-                         '`categorical_crossentropy` expects '
-                         'targets to be binary matrices (1s and 0s) '
-                         'of shape (samples, classes). '
-                         'If your targets are integer classes, '
-                         'you can convert them to the expected format via:\n'
-                         '```\n'
-                         'from keras.utils import to_categorical\n'
-                         'y_binary = to_categorical(y_int)\n'
-                         '```\n'
-                         '\n'
-                         'Alternatively, you can use the loss function '
-                         '`sparse_categorical_crossentropy` instead, '
-                         'which does expect integer targets.')
-    if loss in key_losses:
-      for target_dim, out_dim in zip(y.shape[1:], shape[1:]):
-        if out_dim is not None and target_dim != out_dim:
-          raise ValueError('A target array with shape ' + str(y.shape) +
-                           ' was passed for an output of shape ' + str(shape) +
-                           ' while using as loss `' + loss.__name__ + '`. '
-                           'This loss expects '
-                           'targets to have the same shape '
-                           'as the output.')
-
-
-def _collect_metrics(metrics, output_names):
-  """Maps metric functions to model outputs.
-
-  Arguments:
-      metrics: a list or dict of metric functions.
-      output_names: a list of the names (strings) of model outputs.
-
-  Returns:
-      A list (one entry per model output) of lists of metric functions.
-      For instance, if the model has 2 outputs, and for the first output
-      we want to compute "binary_accuracy" and "binary_crossentropy",
-      and just "binary_accuracy" for the second output,
-      the list would look like:
-          `[[binary_accuracy, binary_crossentropy], [binary_accuracy]]`
-
-  Raises:
-      TypeError: if an incorrect type is passed for the `metrics` argument.
-  """
-  if not metrics:
-    return [[] for _ in output_names]
-  if isinstance(metrics, list):
-    # we then apply all metrics to all outputs.
-    return [copy.copy(metrics) for _ in output_names]
-  elif isinstance(metrics, dict):
-    nested_metrics = []
-    for name in output_names:
-      output_metrics = metrics.get(name, [])
-      if not isinstance(output_metrics, list):
-        output_metrics = [output_metrics]
-      nested_metrics.append(output_metrics)
-    return nested_metrics
-  else:
-    raise TypeError('Type of `metrics` argument not understood. '
-                    'Expected a list or dictionary, found: ' + str(metrics))
-
-
-def _batch_shuffle(index_array, batch_size):
-  """Shuffles an array in a batch-wise fashion.
-
-  Useful for shuffling HDF5 arrays
-  (where one cannot access arbitrary indices).
-
-  Arguments:
-      index_array: array of indices to be shuffled.
-      batch_size: integer.
-
-  Returns:
-      The `index_array` array, shuffled in a batch-wise fashion.
-  """
-  batch_count = int(len(index_array) / batch_size)
-  # to reshape we need to be cleanly divisible by batch size
-  # we stash extra items and reappend them after shuffling
-  last_batch = index_array[batch_count * batch_size:]
-  index_array = index_array[:batch_count * batch_size]
-  index_array = index_array.reshape((batch_count, batch_size))
-  np.random.shuffle(index_array)
-  index_array = index_array.flatten()
-  return np.append(index_array, last_batch)
-
-
-def _weighted_masked_objective(fn):
-  """Adds support for masking and sample-weighting to an objective function.
-
-  It transforms an objective function `fn(y_true, y_pred)`
-  into a sample-weighted, cost-masked objective function
-  `fn(y_true, y_pred, weights, mask)`.
-
-  Arguments:
-      fn: The objective function to wrap,
-          with signature `fn(y_true, y_pred)`.
-
-  Returns:
-      A function with signature `fn(y_true, y_pred, weights, mask)`.
-  """
-  if fn is None:
-    return None
+  2 - By subclassing the `Model` class: in that case, you should define your
+  layers in `__init__` and you should implement the model's forward pass
+  in `call`.
 
-  def weighted(y_true, y_pred, weights, mask=None):
-    """Wrapper function.
+  ```python
+  import tensorflow as tf
 
-    Arguments:
-        y_true: `y_true` argument of `fn`.
-        y_pred: `y_pred` argument of `fn`.
-        weights: Weights tensor.
-        mask: Mask tensor.
+  class MyModel(tf.keras.Model):
 
-    Returns:
-        Scalar tensor.
-    """
-    # score_array has ndim >= 2
-    score_array = fn(y_true, y_pred)
-    if mask is not None:
-      # Cast the mask to floatX to avoid float64 upcasting in theano
-      mask = K.cast(mask, K.floatx())
-      # mask should have the same shape as score_array
-      score_array *= mask
-      #  the loss per batch should be proportional
-      #  to the number of unmasked samples.
-      score_array /= K.mean(mask)
-
-    # apply sample weighting
-    if weights is not None:
-      # reduce score_array to same ndim as weight array
-      ndim = K.ndim(score_array)
-      weight_ndim = K.ndim(weights)
-      score_array = K.mean(score_array, axis=list(range(weight_ndim, ndim)))
-      score_array *= weights
-      score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
-    return K.mean(score_array)
-
-  return weighted
-
-
-def _standardize_weights(y,
-                         sample_weight=None,
-                         class_weight=None,
-                         sample_weight_mode=None):
-  """Performs sample weight validation and standardization.
-
-  Everything gets normalized to a single sample-wise (or timestep-wise)
-  weight array.
-
-  Arguments:
-      y: Numpy array of model targets to be weighted.
-      sample_weight: User-provided `sample_weight` argument.
-      class_weight: User-provided `class_weight` argument.
-      sample_weight_mode: One of `None` or `"temporal"`.
-          `"temporal"` indicated that we expect 2D weight data
-          that will be applied to the last 2 dimensions of
-          the targets (i.e. we are weighting timesteps, not samples).
-
-  Returns:
-      A numpy array of target weights, one entry per sample to weight.
-
-  Raises:
-      ValueError: In case of invalid user-provided arguments.
-  """
-  if sample_weight_mode is not None:
-    if sample_weight_mode != 'temporal':
-      raise ValueError('"sample_weight_mode '
-                       'should be None or "temporal". '
-                       'Found: ' + str(sample_weight_mode))
-    if len(y.shape) < 3:
-      raise ValueError('Found a sample_weight array for '
-                       'an input with shape ' + str(y.shape) + '. '
-                       'Timestep-wise sample weighting (use of '
-                       'sample_weight_mode="temporal") is restricted to '
-                       'outputs that are at least 3D, i.e. that have '
-                       'a time dimension.')
-    if sample_weight is not None and len(sample_weight.shape) != 2:
-      raise ValueError('Found a sample_weight array with shape ' +
-                       str(sample_weight.shape) + '. '
-                       'In order to use timestep-wise sample weighting, '
-                       'you should pass a 2D sample_weight array.')
-  else:
-    if sample_weight is not None and len(sample_weight.shape) != 1:
-      raise ValueError('Found a sample_weight array with shape ' +
-                       str(sample_weight.shape) + '. '
-                       'In order to use timestep-wise sample weights, '
-                       'you should specify '
-                       'sample_weight_mode="temporal" '
-                       'in compile(). If you just mean to use '
-                       'sample-wise weights, make sure your '
-                       'sample_weight array is 1D.')
-
-  if sample_weight is not None:
-    if len(sample_weight.shape) > len(y.shape):
-      raise ValueError(
-          'Found a sample_weight with shape' + str(sample_weight.shape) + '.'
-          'Expected sample_weight with rank '
-          'less than or equal to ' + str(len(y.shape)))
-
-    if y.shape[:sample_weight.ndim] != sample_weight.shape:
-      raise ValueError(
-          'Found a sample_weight array with shape ' + str(sample_weight.shape) +
-          ' for an input with shape ' + str(y.shape) + '. '
-          'sample_weight cannot be broadcast.')
-    return sample_weight
-  elif isinstance(class_weight, dict):
-    if len(y.shape) > 2:
-      raise ValueError('`class_weight` not supported for '
-                       '3+ dimensional targets.')
-    if y.shape[1] > 1:
-      y_classes = np.argmax(y, axis=1)
-    elif y.shape[1] == 1:
-      y_classes = np.reshape(y, y.shape[0])
-    else:
-      y_classes = y
-
-    weights = np.asarray(
-        [class_weight[cls] for cls in y_classes if cls in class_weight])
-
-    if len(weights) != len(y_classes):
-      # subtract the sets to pick all missing classes
-      existing_classes = set(y_classes)
-      existing_class_weight = set(class_weight.keys())
-      raise ValueError('`class_weight` must contain all classes in the data.'
-                       ' The classes %s exist in the data but not in '
-                       '`class_weight`.' %
-                       (existing_classes - existing_class_weight))
-    return weights
-  else:
-    if sample_weight_mode is None:
-      return np.ones((y.shape[0],), dtype=K.floatx())
-    else:
-      return np.ones((y.shape[0], y.shape[1]), dtype=K.floatx())
+    def __init__(self):
+      super(MyModel, self).__init__()
+      self.dense1 = tf.keras.layers.Dense(4, activation=tf.nn.relu)
+      self.dense2 = tf.keras.layers.Dense(5, activation=tf.nn.softmax)
 
+    def call(self, inputs):
+      x = self.dense1(inputs)
+      return self.dense2(x)
 
-@tf_export('keras.models.Model', 'keras.Model')
-class Model(Network):
-  """The `Model` class adds training & evaluation routines to a `Network`.
+  model = MyModel()
+  ```
+
+  If you subclass `Model`, you can optionally have
+  a `training` argument (boolean) in `call`, which you can use to specify
+  a different behavior in training and inference:
+
+  ```python
+  import tensorflow as tf
+
+  class MyModel(tf.keras.Model):
+
+    def __init__(self):
+      super(MyModel, self).__init__()
+      self.dense1 = tf.keras.layers.Dense(4, activation=tf.nn.relu)
+      self.dense2 = tf.keras.layers.Dense(5, activation=tf.nn.softmax)
+      self.dropout = tf.keras.layers.Dropout(0.5)
+
+    def call(self, inputs, training=False):
+      x = self.dense1(inputs)
+      if training:
+        x = self.dropout(x, training=training)
+      return self.dense2(x)
+
+  model = MyModel()
+  ```
   """
 
   def compile(self,
@@ -630,7 +216,8 @@ class Model(Network):
       loss_functions = [loss_function for _ in range(len(self.outputs))]
     self.loss_functions = loss_functions
 
-    weighted_losses = [_weighted_masked_objective(fn) for fn in loss_functions]
+    weighted_losses = [training_utils.weighted_masked_objective(fn)
+                       for fn in loss_functions]
     skip_target_indices = []
     skip_target_weighing_indices = []
     self._feed_outputs = []
@@ -687,7 +274,8 @@ class Model(Network):
       for i in range(len(self.outputs)):
         if len(self.outputs) > 1:
           self.metrics_names.append(self.output_names[i] + '_loss')
-      self.nested_metrics = _collect_metrics(metrics, self.output_names)
+      self.nested_metrics = training_utils.collect_metrics(metrics,
+                                                           self.output_names)
       self._feed_sample_weight_modes = []
       for i in range(len(self.outputs)):
         self._feed_sample_weight_modes.append(None)
@@ -804,12 +392,12 @@ class Model(Network):
           sample_weights.append(None)
         else:
           if sample_weight_mode == 'temporal':
-            sample_weights.append(
-                K.placeholder(ndim=2, name=name + '_sample_weights'))
+            sample_weights.append(array_ops.placeholder_with_default(
+                [[1.]], shape=[None, None], name=name + '_sample_weights'))
             sample_weight_modes.append('temporal')
           else:
-            sample_weights.append(
-                K.placeholder(ndim=1, name=name + '_sample_weights'))
+            sample_weights.append(array_ops.placeholder_with_default(
+                [1.], shape=[None], name=name + '_sample_weights'))
             sample_weight_modes.append(None)
     self.sample_weight_modes = sample_weight_modes
     self._feed_sample_weight_modes = []
@@ -857,9 +445,9 @@ class Model(Network):
 
     # List of same size as output_names.
     # contains tuples (metrics for output, names of metrics).
-    nested_metrics = _collect_metrics(metrics, self.output_names)
-    nested_weighted_metrics = _collect_metrics(weighted_metrics,
-                                               self.output_names)
+    nested_metrics = training_utils.collect_metrics(metrics, self.output_names)
+    nested_weighted_metrics = training_utils.collect_metrics(weighted_metrics,
+                                                             self.output_names)
     self.metrics_updates = []
     self.stateful_metric_names = []
     with K.name_scope('metrics'):
@@ -905,11 +493,13 @@ class Model(Network):
                 suffix = 'acc'
               elif metric in ('crossentropy', 'ce'):
                 suffix = 'ce'
-              weighted_metric_fn = _weighted_masked_objective(metric_fn)
+              weighted_metric_fn = training_utils.weighted_masked_objective(
+                  metric_fn)
               metric_name = metric_name_prefix + suffix
             else:
               metric_fn = metrics_module.get(metric)
-              weighted_metric_fn = _weighted_masked_objective(metric_fn)
+              weighted_metric_fn = training_utils.weighted_masked_objective(
+                  metric_fn)
               # Get metric name as string
               if hasattr(metric_fn, 'name'):
                 metric_name = metric_fn.name
@@ -1047,451 +637,6 @@ class Model(Network):
           name='predict_function',
           **kwargs)
 
-  def _check_num_samples(self,
-                         ins,
-                         batch_size=None,
-                         steps=None,
-                         steps_name='steps'):
-    """Determine the number of samples provided for training and evaluation.
-
-    The number of samples is not defined when running with `steps`,
-    in which case the number of samples is set to `None`.
-
-    Arguments:
-        ins: List of tensors to be fed to the Keras function.
-        batch_size: Integer batch size or `None` if not defined.
-        steps: Total number of steps (batches of samples)
-            before declaring `_predict_loop` finished.
-            Ignored with the default value of `None`.
-        steps_name: The public API's parameter name for `steps`.
-
-    Raises:
-        ValueError: when `steps` is `None` and the attribute `ins.shape`
-        does not exist. Also raises ValueError when `steps` is not `None`
-        and `batch_size` is not `None` because they are mutually
-        exclusive.
-
-    Returns:
-        When steps is `None`, returns the number of samples to be
-        processed based on the size of the first dimension of the
-        first input numpy array. When steps is not `None` and
-        `batch_size` is `None`, returns `None`.
-
-    Raises:
-        ValueError: In case of invalid arguments.
-    """
-    if steps is not None:
-      num_samples = None
-      if batch_size is not None:
-        raise ValueError(
-            'If ' + steps_name + ' is set, the `batch_size` must be None.')
-    elif ins and hasattr(ins[0], 'shape'):
-      num_samples = ins[0].shape[0]
-    else:
-      raise ValueError(
-          'Either the input data should have '
-          'a defined shape, or ' + steps_name + ' should be specified.')
-    return num_samples
-
-  def _fit_loop(self,
-                f,
-                ins,
-                out_labels=None,
-                batch_size=None,
-                epochs=100,
-                verbose=1,
-                callbacks=None,
-                val_f=None,
-                val_ins=None,
-                shuffle=True,
-                callback_metrics=None,
-                initial_epoch=0,
-                steps_per_epoch=None,
-                validation_steps=None):
-    """Abstract fit function for `f(ins)`.
-
-    Assume that f returns a list, labeled by out_labels.
-
-    Arguments:
-        f: Keras function returning a list of tensors
-        ins: List of tensors to be fed to `f`
-        out_labels: List of strings, display names of
-            the outputs of `f`
-        batch_size: Integer batch size or None if unknown.
-        epochs: Number of times to iterate over the data
-        verbose: Verbosity mode, 0, 1 or 2
-        callbacks: List of callbacks to be called during training
-        val_f: Keras function to call for validation
-        val_ins: List of tensors to be fed to `val_f`
-        shuffle: Whether to shuffle the data at the beginning of each epoch
-        callback_metrics: List of strings, the display names of the metrics
-            passed to the callbacks. They should be the
-            concatenation of list the display names of the outputs of
-             `f` and the list of display names of the outputs of `f_val`.
-        initial_epoch: Epoch at which to start training
-            (useful for resuming a previous training run)
-        steps_per_epoch: Total number of steps (batches of samples)
-            before declaring one epoch finished and starting the
-            next epoch. Ignored with the default value of `None`.
-        validation_steps: Number of steps to run validation for
-            (only if doing validation from data tensors).
-            Ignored with the default value of `None`.
-
-    Returns:
-        `History` object.
-
-    Raises:
-        ValueError: in case of invalid arguments.
-    """
-    do_validation = False
-    if val_f and val_ins:
-      do_validation = True
-      if verbose and ins and hasattr(ins[0], 'shape') and hasattr(
-          val_ins[0], 'shape'):
-        print('Train on %d samples, validate on %d samples' %
-              (ins[0].shape[0], val_ins[0].shape[0]))
-    if validation_steps:
-      do_validation = True
-      if steps_per_epoch is None:
-        raise ValueError('Can only use `validation_steps` '
-                         'when doing step-wise '
-                         'training, i.e. `steps_per_epoch` '
-                         'must be set.')
-
-    num_train_samples = self._check_num_samples(
-        ins, batch_size, steps_per_epoch, 'steps_per_epoch')
-    if num_train_samples is not None:
-      index_array = np.arange(num_train_samples)
-
-    self.history = cbks.History()
-    all_callbacks = [cbks.BaseLogger(
-        stateful_metrics=self.stateful_metric_names)]
-    if verbose:
-      if steps_per_epoch is not None:
-        count_mode = 'steps'
-      else:
-        count_mode = 'samples'
-      all_callbacks.append(
-          cbks.ProgbarLogger(
-              count_mode, stateful_metrics=self.stateful_metric_names))
-    all_callbacks += (callbacks or []) + [self.history]
-    callbacks = cbks.CallbackList(all_callbacks)
-    out_labels = out_labels or []
-
-    # it's possible to callback a different model than self
-    # (used by Sequential models)
-    if hasattr(self, 'callback_model') and self.callback_model:
-      callback_model = self.callback_model
-    else:
-      callback_model = self
-
-    callbacks.set_model(callback_model)
-
-    callbacks.set_params({
-        'batch_size': batch_size,
-        'epochs': epochs,
-        'steps': steps_per_epoch,
-        'samples': num_train_samples,
-        'verbose': verbose,
-        'do_validation': do_validation,
-        'metrics': callback_metrics or [],
-    })
-    callbacks.on_train_begin()
-    callback_model.stop_training = False
-    for cbk in callbacks:
-      cbk.validation_data = val_ins
-
-    # To prevent a slowdown, we find beforehand the arrays that need conversion.
-    feed = self._feed_inputs + self._feed_targets + self._feed_sample_weights
-    indices_for_conversion_to_dense = []
-    for i in range(len(feed)):
-      if issparse is not None and issparse(ins[i]) and not K.is_sparse(feed[i]):
-        indices_for_conversion_to_dense.append(i)
-
-    for epoch in range(initial_epoch, epochs):
-      # Reset stateful metrics
-      for m in self.metrics:
-        if isinstance(m, Layer):
-          m.reset_states()
-      # Update callbacks
-      callbacks.on_epoch_begin(epoch)
-      epoch_logs = {}
-      if steps_per_epoch is not None:
-        for step_index in range(steps_per_epoch):
-          batch_logs = {}
-          batch_logs['batch'] = step_index
-          batch_logs['size'] = 1
-          callbacks.on_batch_begin(step_index, batch_logs)
-          outs = f(ins)
-
-          if not isinstance(outs, list):
-            outs = [outs]
-          for l, o in zip(out_labels, outs):
-            batch_logs[l] = o
-
-          callbacks.on_batch_end(step_index, batch_logs)
-          if callback_model.stop_training:
-            break
-
-        if do_validation:
-          val_outs = self._test_loop(
-              val_f,
-              val_ins,
-              batch_size=batch_size,
-              steps=validation_steps,
-              verbose=0)
-          if not isinstance(val_outs, list):
-            val_outs = [val_outs]
-          # Same labels assumed.
-          for l, o in zip(out_labels, val_outs):
-            epoch_logs['val_' + l] = o
-      else:
-        if shuffle == 'batch':
-          index_array = _batch_shuffle(index_array, batch_size)
-        elif shuffle:
-          np.random.shuffle(index_array)
-
-        batches = make_batches(num_train_samples, batch_size)
-
-        for batch_index, (batch_start, batch_end) in enumerate(batches):
-          batch_ids = index_array[batch_start:batch_end]
-          try:
-            if isinstance(ins[-1], float):
-              # Do not slice the training phase flag.
-              ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
-            else:
-              ins_batch = slice_arrays(ins, batch_ids)
-          except TypeError:
-            raise TypeError('TypeError while preparing batch. '
-                            'If using HDF5 input data, '
-                            'pass shuffle="batch".')
-          batch_logs = {}
-          batch_logs['batch'] = batch_index
-          batch_logs['size'] = len(batch_ids)
-          callbacks.on_batch_begin(batch_index, batch_logs)
-          for i in indices_for_conversion_to_dense:
-            ins_batch[i] = ins_batch[i].toarray()
-
-          outs = f(ins_batch)
-          if not isinstance(outs, list):
-            outs = [outs]
-          for l, o in zip(out_labels, outs):
-            batch_logs[l] = o
-
-          callbacks.on_batch_end(batch_index, batch_logs)
-          if callback_model.stop_training:
-            break
-
-          if batch_index == len(batches) - 1:  # Last batch.
-            if do_validation:
-              val_outs = self._test_loop(
-                  val_f, val_ins, batch_size=batch_size, verbose=0)
-              if not isinstance(val_outs, list):
-                val_outs = [val_outs]
-              # Same labels assumed.
-              for l, o in zip(out_labels, val_outs):
-                epoch_logs['val_' + l] = o
-      callbacks.on_epoch_end(epoch, epoch_logs)
-      if callback_model.stop_training:
-        break
-    callbacks.on_train_end()
-    return self.history
-
-  def _predict_loop(self, f, ins, batch_size=32, verbose=0, steps=None):
-    """Abstract method to loop over some data in batches.
-
-    Arguments:
-        f: Keras function returning a list of tensors.
-        ins: list of tensors to be fed to `f`.
-        batch_size: integer batch size.
-        verbose: verbosity mode.
-        steps: Total number of steps (batches of samples)
-            before declaring `_predict_loop` finished.
-            Ignored with the default value of `None`.
-
-    Returns:
-        Array of predictions (if the model has a single output)
-        or list of arrays of predictions
-        (if the model has multiple outputs).
-    """
-    if hasattr(self, 'metrics'):
-      for m in self.metrics:
-        if isinstance(m, Layer):
-          m.reset_states()
-
-    num_samples = self._check_num_samples(ins, batch_size, steps, 'steps')
-    if verbose == 1:
-      if steps is not None:
-        progbar = Progbar(target=steps,
-                          stateful_metrics=self.stateful_metric_names)
-      else:
-        progbar = Progbar(target=num_samples,
-                          stateful_metrics=self.stateful_metric_names)
-
-    indices_for_conversion_to_dense = []
-    for i in range(len(self._feed_inputs)):
-      if (issparse is not None and issparse(ins[i]) and
-          not K.is_sparse(self._feed_inputs[i])):
-        indices_for_conversion_to_dense.append(i)
-
-    if steps is not None:
-      # Step-based predictions.
-      # Since we do not know how many samples
-      # we will see, we cannot pre-allocate
-      # the returned Numpy arrays.
-      # Instead, we store one array per batch seen
-      # and concatenate them upon returning.
-      unconcatenated_outs = []
-      for step in range(steps):
-        batch_outs = f(ins)
-        if not isinstance(batch_outs, list):
-          batch_outs = [batch_outs]
-        if step == 0:
-          for batch_out in batch_outs:
-            unconcatenated_outs.append([])
-        for i, batch_out in enumerate(batch_outs):
-          unconcatenated_outs[i].append(batch_out)
-        if verbose == 1:
-          progbar.update(step + 1)
-      if len(unconcatenated_outs) == 1:
-        return np.concatenate(unconcatenated_outs[0], axis=0)
-      return [
-          np.concatenate(unconcatenated_outs[i], axis=0)
-          for i in range(len(unconcatenated_outs))
-      ]
-    else:
-      # Sample-based predictions.
-      outs = []
-      batches = make_batches(num_samples, batch_size)
-      index_array = np.arange(num_samples)
-      for batch_index, (batch_start, batch_end) in enumerate(batches):
-        batch_ids = index_array[batch_start:batch_end]
-        if ins and isinstance(ins[-1], float):
-          # Do not slice the training phase flag.
-          ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
-        else:
-          ins_batch = slice_arrays(ins, batch_ids)
-        for i in indices_for_conversion_to_dense:
-          ins_batch[i] = ins_batch[i].toarray()
-
-        batch_outs = f(ins_batch)
-        if not isinstance(batch_outs, list):
-          batch_outs = [batch_outs]
-        if batch_index == 0:
-          # Pre-allocate the results arrays.
-          for batch_out in batch_outs:
-            shape = (num_samples,) + batch_out.shape[1:]
-            outs.append(np.zeros(shape, dtype=batch_out.dtype))
-        for i, batch_out in enumerate(batch_outs):
-          outs[i][batch_start:batch_end] = batch_out
-        if verbose == 1:
-          progbar.update(batch_end)
-      if len(outs) == 1:
-        return outs[0]
-      return outs
-
-  def _test_loop(self, f, ins, batch_size=None, verbose=0, steps=None):
-    """Abstract method to loop over some data in batches.
-
-    Arguments:
-        f: Keras function returning a list of tensors.
-        ins: list of tensors to be fed to `f`.
-        batch_size: integer batch size or `None`.
-        verbose: verbosity mode.
-        steps: Total number of steps (batches of samples)
-            before declaring predictions finished.
-            Ignored with the default value of `None`.
-
-    Returns:
-        Scalar loss (if the model has a single output and no metrics)
-        or list of scalars (if the model has multiple outputs
-        and/or metrics). The attribute `model.metrics_names` will give you
-        the display labels for the scalar outputs.
-    """
-    if hasattr(self, 'metrics'):
-      for m in self.metrics:
-        if isinstance(m, Layer):
-          m.reset_states()
-      stateful_metric_indices = [
-          i for i, name in enumerate(self.metrics_names)
-          if str(name) in self.stateful_metric_names
-      ]
-    else:
-      stateful_metric_indices = []
-
-    num_samples = self._check_num_samples(ins, batch_size, steps, 'steps')
-    outs = []
-    if verbose == 1:
-      if steps is not None:
-        progbar = Progbar(target=steps)
-      else:
-        progbar = Progbar(target=num_samples)
-
-    # To prevent a slowdown, we find beforehand the arrays that need conversion.
-    feed = self._feed_inputs + self._feed_targets + self._feed_sample_weights
-    indices_for_conversion_to_dense = []
-    for i in range(len(feed)):
-      if issparse is not None and issparse(ins[i]) and not K.is_sparse(feed[i]):
-        indices_for_conversion_to_dense.append(i)
-
-    if steps is not None:
-      for step in range(steps):
-        batch_outs = f(ins)
-        if isinstance(batch_outs, list):
-          if step == 0:
-            for _ in enumerate(batch_outs):
-              outs.append(0.)
-          for i, batch_out in enumerate(batch_outs):
-            if i in stateful_metric_indices:
-              outs[i] = batch_out
-            else:
-              outs[i] += batch_out
-        else:
-          if step == 0:
-            outs.append(0.)
-          outs[0] += batch_outs
-        if verbose == 1:
-          progbar.update(step + 1)
-      for i in range(len(outs)):
-        if i not in stateful_metric_indices:
-          outs[i] /= steps
-    else:
-      batches = make_batches(num_samples, batch_size)
-      index_array = np.arange(num_samples)
-      for batch_index, (batch_start, batch_end) in enumerate(batches):
-        batch_ids = index_array[batch_start:batch_end]
-        if isinstance(ins[-1], float):
-          # Do not slice the training phase flag.
-          ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
-        else:
-          ins_batch = slice_arrays(ins, batch_ids)
-        for i in indices_for_conversion_to_dense:
-          ins_batch[i] = ins_batch[i].toarray()
-
-        batch_outs = f(ins_batch)
-
-        if isinstance(batch_outs, list):
-          if batch_index == 0:
-            for batch_out in enumerate(batch_outs):
-              outs.append(0.)
-          for i, batch_out in enumerate(batch_outs):
-            if i in stateful_metric_indices:
-              outs[i] = batch_out
-            else:
-              outs[i] += batch_out * len(batch_ids)
-        else:
-          if batch_index == 0:
-            outs.append(0.)
-          outs[0] += batch_outs * len(batch_ids)
-        if verbose == 1:
-          progbar.update(batch_end)
-      for i in range(len(outs)):
-        if i not in stateful_metric_indices:
-          outs[i] /= num_samples
-    if len(outs) == 1:
-      return outs[0]
-    return outs
-
   def _standardize_user_data(self,
                              x,
                              y=None,
@@ -1631,7 +776,7 @@ class Model(Network):
       feed_input_shapes = self._feed_input_shapes
 
     # Standardize the inputs.
-    x = _standardize_input_data(
+    x = training_utils.standardize_input_data(
         x,
         feed_input_names,
         feed_input_shapes,
@@ -1670,7 +815,7 @@ class Model(Network):
             feed_output_shapes.append(output_shape)
 
       # Standardize the outputs.
-      y = _standardize_input_data(
+      y = training_utils.standardize_input_data(
           y,
           feed_output_names,
           feed_output_shapes,
@@ -1679,21 +824,21 @@ class Model(Network):
 
       # Generate sample-wise weight values given the `sample_weight` and
       # `class_weight` arguments.
-      sample_weights = _standardize_sample_weights(sample_weight,
-                                                   feed_output_names)
-      class_weights = _standardize_class_weights(class_weight,
-                                                 feed_output_names)
+      sample_weights = training_utils.standardize_sample_weights(
+          sample_weight, feed_output_names)
+      class_weights = training_utils.standardize_class_weights(
+          class_weight, feed_output_names)
       sample_weights = [
-          _standardize_weights(ref, sw, cw, mode)
+          training_utils.standardize_weights(ref, sw, cw, mode)
           for (ref, sw, cw, mode) in zip(y, sample_weights, class_weights,
                                          feed_sample_weight_modes)
       ]
       # Check that all arrays have the same length.
-      _check_array_lengths(x, y, sample_weights)
+      training_utils.check_array_lengths(x, y, sample_weights)
       if self._is_graph_network and not context.in_eager_mode():
         # Additional checks to avoid users mistakenly using improper loss fns.
-        _check_loss_and_target_compatibility(y, self._feed_loss_fns,
-                                             feed_output_shapes)
+        training_utils.check_loss_and_target_compatibility(
+            y, self._feed_loss_fns, feed_output_shapes)
     else:
       y = []
       sample_weights = []
@@ -1709,7 +854,7 @@ class Model(Network):
                          str(x[0].shape[0]) + ' samples')
     return x, y, sample_weights
 
-  def _set_inputs(self, inputs):
+  def _set_inputs(self, inputs, training=None):
     """Set model's input and output specs based on the input data received.
 
     This is to be used for Model subclasses, which do not know at instantiation
@@ -1725,11 +870,23 @@ class Model(Network):
           when calling `fit`/etc.
         - if data tensors: the model is built on top of these tensors.
           We do not expect any Numpy data to be provided when calling `fit`/etc.
+      training: Boolean or None. Only relevant in symbolic mode. Specifies
+        whether to build the model's graph in inference mode (False), training
+        mode (True), or using the Keras learning phase (None).
     """
     if context.in_eager_mode():
       self._eager_set_inputs(inputs)
     else:
-      self._symbolic_set_inputs(inputs)
+      self._symbolic_set_inputs(inputs, training=training)
+
+  def _set_scope(self, scope=None):
+    """Modify the Layer scope creation logic to create ResourceVariables."""
+    super(Model, self)._set_scope(scope=scope)
+    # Subclassed Models create ResourceVariables by default. This makes it
+    # easier to use Models in an eager/graph agnostic way (since eager execution
+    # always uses ResourceVariables).
+    if not self._is_graph_network:
+      self._scope.set_use_resource(True)
 
   def _eager_set_inputs(self, inputs):
     """Set model's input and output specs based on the input data received.
@@ -1775,14 +932,20 @@ class Model(Network):
         'output_%d' % (i + 1) for i in range(len(dummy_output_values))]
     self.built = True
 
-  def _symbolic_set_inputs(self, inputs):
-    """Set model's inputs based on the input data received from the user.
+  def _symbolic_set_inputs(self, inputs, outputs=None, training=None):
+    """Set model's inputs and output specs based on the input data received.
 
     This is to be used for Model subclasses, which do not know at instantiation
     time what their inputs look like.
 
     Args:
       inputs: Argument `x` (input data) passed by the user upon first model use.
+      outputs: None, a data tensor, or a list of data tensors. If None, the
+        outputs will be determined by invoking `self.call()`; otherwise, the
+        provided value will be used.
+      training: Boolean or None. Only relevant in symbolic mode. Specifies
+        whether to build the model's graph in inference mode (False), training
+        mode (True), or using the Keras learning phase (None).
 
     Raises:
       ValueError: If the model's inputs are already set.
@@ -1829,11 +992,18 @@ class Model(Network):
           self._feed_input_names.append(name)
           self._feed_input_shapes.append(K.int_shape(v))
 
-    # Obtain symbolic outputs by calling the model.
-    if len(self.inputs) == 1:
-      outputs = self.call(self.inputs[0])
-    else:
-      outputs = self.call(self.inputs)
+    if outputs is None:
+      # Obtain symbolic outputs by calling the model.
+      if len(self.inputs) == 1:
+        if self._expects_training_arg:
+          outputs = self.call(self.inputs[0], training=training)
+        else:
+          outputs = self.call(self.inputs[0])
+      else:
+        if self._expects_training_arg:
+          outputs = self.call(self.inputs, training=training)
+        else:
+          outputs = self.call(self.inputs)
     if isinstance(outputs, (list, tuple)):
       outputs = list(outputs)
     else:
@@ -1979,10 +1149,7 @@ class Model(Network):
         class_weight=class_weight,
         batch_size=batch_size)
     # Prepare validation data.
-    do_validation = False
-    val_ins = []
     if validation_data:
-      do_validation = True
       if len(validation_data) == 2:
         val_x, val_y = validation_data  # pylint: disable=unpacking-non-sequence
         val_sample_weight = None
@@ -2000,13 +1167,8 @@ class Model(Network):
           val_y,
           sample_weight=val_sample_weight,
           batch_size=batch_size)
-      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-        val_ins = val_x + val_y + val_sample_weights + [0.]
-      else:
-        val_ins = val_x + val_y + val_sample_weights
 
     elif validation_split and 0. < validation_split < 1.:
-      do_validation = True
       if hasattr(x[0], 'shape'):
         split_at = int(x[0].shape[0] * (1. - validation_split))
       else:
@@ -2015,77 +1177,44 @@ class Model(Network):
       y, val_y = (slice_arrays(y, 0, split_at), slice_arrays(y, split_at))
       sample_weights, val_sample_weights = (slice_arrays(
           sample_weights, 0, split_at), slice_arrays(sample_weights, split_at))
-      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-        val_ins = val_x + val_y + val_sample_weights + [0.]
-      else:
-        val_ins = val_x + val_y + val_sample_weights
-
     elif validation_steps:
-      do_validation = True
-      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-        val_ins = [0.]
-
-    # Prepare input arrays and training function.
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + y + sample_weights + [1.]
+      val_x = []
+      val_y = []
+      val_sample_weights = []
     else:
-      ins = x + y + sample_weights
-
-    # Prepare display labels.
-    out_labels = self.metrics_names
+      val_x = None
+      val_y = None
+      val_sample_weights = None
 
     if context.in_eager_mode():
-      if do_validation:
-        callback_metrics = copy.copy(out_labels) + [
-            'val_' + n for n in out_labels
-        ]
-      else:
-        callback_metrics = copy.copy(out_labels)
-
       return training_eager.fit_loop(
           self,
-          ins,
-          out_labels=out_labels,
+          inputs=x,
+          targets=y,
+          sample_weights=sample_weights,
           batch_size=batch_size,
           epochs=epochs,
           verbose=verbose,
           callbacks=callbacks,
-          val_ins=val_ins,
+          val_inputs=val_x,
+          val_targets=val_y,
+          val_sample_weights=val_sample_weights,
           shuffle=shuffle,
-          callback_metrics=callback_metrics,
           initial_epoch=initial_epoch,
           steps_per_epoch=steps_per_epoch,
           validation_steps=validation_steps)
     else:
-      self._make_train_function()
-      f = self.train_function
-
-      if do_validation:
-        if context.in_graph_mode():
-          self._make_test_function()
-          val_f = self.test_function
-        else:
-          val_f = None
-        callback_metrics = copy.copy(out_labels) + [
-            'val_' + n for n in out_labels
-        ]
-      else:
-        val_f = None
-        callback_metrics = copy.copy(out_labels)
-
-      # Delegate logic to `_fit_loop`.
-      return self._fit_loop(
-          f,
-          ins,
-          out_labels=out_labels,
+      return training_arrays.fit_loop(
+          self, x, y,
+          sample_weights=sample_weights,
           batch_size=batch_size,
           epochs=epochs,
           verbose=verbose,
           callbacks=callbacks,
-          val_f=val_f,
-          val_ins=val_ins,
+          val_inputs=val_x,
+          val_targets=val_y,
+          val_sample_weights=val_sample_weights,
           shuffle=shuffle,
-          callback_metrics=callback_metrics,
           initial_epoch=initial_epoch,
           steps_per_epoch=steps_per_epoch,
           validation_steps=validation_steps)
@@ -2159,20 +1288,15 @@ class Model(Network):
         y,
         sample_weight=sample_weight,
         batch_size=batch_size)
-    # Prepare inputs, delegate logic to `_test_loop`.
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + y + sample_weights + [0.]
-    else:
-      ins = x + y + sample_weights
 
     if context.in_eager_mode():
       return training_eager.test_loop(
-          self, ins, batch_size=batch_size, verbose=verbose, steps=steps)
+          self, inputs=x, targets=y, sample_weights=sample_weights,
+          batch_size=batch_size, verbose=verbose, steps=steps)
     else:
-      self._make_test_function()
-      f = self.test_function
-      return self._test_loop(
-          f, ins, batch_size=batch_size, verbose=verbose, steps=steps)
+      return training_arrays.test_loop(
+          self, inputs=x, targets=y, sample_weights=sample_weights,
+          batch_size=batch_size, verbose=verbose, steps=steps)
 
   def predict(self, x, batch_size=None, verbose=0, steps=None):
     """Generates output predictions for the input samples.
@@ -2206,21 +1330,12 @@ class Model(Network):
                        'argument.')
     x, _, _ = self._standardize_user_data(x)
 
-    # Prepare inputs, delegate logic to `_predict_loop`.
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + [0.]
-    else:
-      ins = x
-
     if context.in_eager_mode():
       return training_eager.predict_loop(
-          self, ins, batch_size=batch_size, verbose=verbose, steps=steps)
+          self, x, batch_size=batch_size, verbose=verbose, steps=steps)
     else:
-      self._make_predict_function()
-      f = self.predict_function
-
-      return self._predict_loop(
-          f, ins, batch_size=batch_size, verbose=verbose, steps=steps)
+      return training_arrays.predict_loop(
+          self, x, batch_size=batch_size, verbose=verbose, steps=steps)
 
   def train_on_batch(self, x, y, sample_weight=None, class_weight=None):
     """Runs a single gradient update on a single batch of data.
@@ -2257,20 +1372,24 @@ class Model(Network):
         and/or metrics). The attribute `model.metrics_names` will give you
         the display labels for the scalar outputs.
 
+    Raises:
+      ValueError: In case of invalid user-provided arguments.
     """
     x, y, sample_weights = self._standardize_user_data(
         x,
         y,
         sample_weight=sample_weight,
         class_weight=class_weight)
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + y + sample_weights + [1.]
-    else:
-      ins = x + y + sample_weights
 
     if context.in_eager_mode():
-      outputs = training_eager.train_on_batch(self, ins)
+      outputs = training_eager.train_on_batch(
+          self, x, y, sample_weights=sample_weights)
     else:
+      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
+        ins = x + y + sample_weights + [1]
+      else:
+        ins = x + y + sample_weights
+
       self._make_train_function()
       outputs = self.train_function(ins)
 
@@ -2307,18 +1426,19 @@ class Model(Network):
         the display labels for the scalar outputs.
 
     Raises:
-        ValueError: in case of invalid arguments.
+        ValueError: In case of invalid user-provided arguments.
     """
     x, y, sample_weights = self._standardize_user_data(
         x, y, sample_weight=sample_weight)
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + y + sample_weights + [0.]
-    else:
-      ins = x + y + sample_weights
 
     if context.in_eager_mode():
-      outputs = training_eager.test_on_batch(self, ins)
+      outputs = training_eager.test_on_batch(
+          self, x, y, sample_weights=sample_weights)
     else:
+      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
+        ins = x + y + sample_weights + [0]
+      else:
+        ins = x + y + sample_weights
       self._make_test_function()
       outputs = self.test_function(ins)
 
@@ -2338,26 +1458,19 @@ class Model(Network):
     """
     x, _, _ = self._standardize_user_data(x)
 
-    if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
-      ins = x + [0.]
-    else:
-      ins = x
-
     if context.in_eager_mode():
-      ins_batch_converted = []
-      for ib in ins:
-        ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
-
-      eager_model_inputs = []
-      for i in range(len(self.inputs)):
-        eager_model_inputs.append(ins_batch_converted[i])
-
-      outs = self(eager_model_inputs)  # pylint: disable=not-callable
-      return outs
+      inputs = [ops.convert_to_tensor(val, dtype=K.floatx()) for val in x]
+      return self(inputs)  # pylint: disable=not-callable
 
     if context.in_graph_mode():
+      if self.uses_learning_phase and not isinstance(K.learning_phase(), int):
+        ins = x + [0]
+      else:
+        ins = x
+
       self._make_predict_function()
       outputs = self.predict_function(ins)
+
       if len(outputs) == 1:
         return outputs[0]
       return outputs
@@ -2473,213 +1586,21 @@ class Model(Network):
       raise NotImplementedError(
           '`fit_generator` is not yet enabled for Model subclasses')
 
-    wait_time = 0.01  # in seconds
-    epoch = initial_epoch
-
-    do_validation = bool(validation_data)
-    self._make_train_function()
-    if do_validation:
-      self._make_test_function()
-
-    is_sequence = isinstance(generator, Sequence)
-    if not is_sequence and use_multiprocessing and workers > 1:
-      logging.warning(
-          UserWarning('Using a generator with `use_multiprocessing=True`'
-                      ' and multiple workers may duplicate your data.'
-                      ' Please consider using the`keras.utils.Sequence'
-                      ' class.'))
-    if steps_per_epoch is None:
-      if is_sequence:
-        steps_per_epoch = len(generator)
-      else:
-        raise ValueError('`steps_per_epoch=None` is only valid for a'
-                         ' generator based on the `keras.utils.Sequence`'
-                         ' class. Please specify `steps_per_epoch` or use'
-                         ' the `keras.utils.Sequence` class.')
-
-    # python 2 has 'next', 3 has '__next__'
-    # avoid any explicit version checks
-    val_gen = (
-        hasattr(validation_data, 'next') or
-        hasattr(validation_data, '__next__') or
-        isinstance(validation_data, Sequence))
-    if (val_gen and not isinstance(validation_data, Sequence) and
-        not validation_steps):
-      raise ValueError('`validation_steps=None` is only valid for a'
-                       ' generator based on the `keras.utils.Sequence`'
-                       ' class. Please specify `validation_steps` or use'
-                       ' the `keras.utils.Sequence` class.')
-
-    # Prepare display labels.
-    out_labels = self.metrics_names
-    callback_metrics = out_labels + ['val_%s' % n for n in out_labels]
-
-    # prepare callbacks
-    self.history = cbks.History()
-    callbacks = [cbks.BaseLogger()] + (callbacks or []) + [self.history]
-    if verbose:
-      callbacks += [cbks.ProgbarLogger(count_mode='steps')]
-    callbacks = cbks.CallbackList(callbacks)
-
-    # it's possible to callback a different model than self:
-    if hasattr(self, 'callback_model') and self.callback_model:
-      callback_model = self.callback_model
-    else:
-      callback_model = self
-    callbacks.set_model(callback_model)
-    callbacks.set_params({
-        'epochs': epochs,
-        'steps': steps_per_epoch,
-        'verbose': verbose,
-        'do_validation': do_validation,
-        'metrics': callback_metrics,
-    })
-    callbacks.on_train_begin()
-
-    enqueuer = None
-    val_enqueuer = None
-
-    try:
-      if do_validation:
-        if val_gen:
-          if workers > 0:
-            if isinstance(validation_data, Sequence):
-              val_enqueuer = OrderedEnqueuer(
-                  validation_data, use_multiprocessing=use_multiprocessing)
-              if validation_steps is None:
-                validation_steps = len(validation_data)
-            else:
-              val_enqueuer = GeneratorEnqueuer(
-                  validation_data,
-                  use_multiprocessing=use_multiprocessing,
-                  wait_time=wait_time)
-            val_enqueuer.start(workers=workers, max_queue_size=max_queue_size)
-            validation_generator = val_enqueuer.get()
-          else:
-            validation_generator = validation_data
-        else:
-          if len(validation_data) == 2:
-            val_x, val_y = validation_data  # pylint: disable=unpacking-non-sequence
-            val_sample_weight = None
-          elif len(validation_data) == 3:
-            val_x, val_y, val_sample_weight = validation_data  # pylint: disable=unpacking-non-sequence
-          else:
-            raise ValueError(
-                '`validation_data` should be a tuple '
-                '`(val_x, val_y, val_sample_weight)` '
-                'or `(val_x, val_y)`. Found: ' + str(validation_data))
-          val_x, val_y, val_sample_weights = self._standardize_user_data(
-              val_x, val_y, val_sample_weight)
-          val_data = val_x + val_y + val_sample_weights
-          if self.uses_learning_phase and not isinstance(
-              K.learning_phase(), int):
-            val_data += [0.]
-          for cbk in callbacks:
-            cbk.validation_data = val_data
-
-      if workers > 0:
-        if is_sequence:
-          enqueuer = OrderedEnqueuer(
-              generator,
-              use_multiprocessing=use_multiprocessing,
-              shuffle=shuffle)
-        else:
-          enqueuer = GeneratorEnqueuer(
-              generator,
-              use_multiprocessing=use_multiprocessing,
-              wait_time=wait_time)
-        enqueuer.start(workers=workers, max_queue_size=max_queue_size)
-        output_generator = enqueuer.get()
-      else:
-        output_generator = generator
-
-      callback_model.stop_training = False
-      # Construct epoch logs.
-      epoch_logs = {}
-      while epoch < epochs:
-        callbacks.on_epoch_begin(epoch)
-        steps_done = 0
-        batch_index = 0
-        while steps_done < steps_per_epoch:
-          generator_output = next(output_generator)
-
-          if not hasattr(generator_output, '__len__'):
-            raise ValueError('Output of generator should be '
-                             'a tuple `(x, y, sample_weight)` '
-                             'or `(x, y)`. Found: ' + str(generator_output))
-
-          if len(generator_output) == 2:
-            x, y = generator_output
-            sample_weight = None
-          elif len(generator_output) == 3:
-            x, y, sample_weight = generator_output
-          else:
-            raise ValueError('Output of generator should be '
-                             'a tuple `(x, y, sample_weight)` '
-                             'or `(x, y)`. Found: ' + str(generator_output))
-          # build batch logs
-          batch_logs = {}
-          if isinstance(x, list):
-            batch_size = x[0].shape[0]
-          elif isinstance(x, dict):
-            batch_size = list(x.values())[0].shape[0]
-          else:
-            batch_size = x.shape[0]
-          batch_logs['batch'] = batch_index
-          batch_logs['size'] = batch_size
-          callbacks.on_batch_begin(batch_index, batch_logs)
-
-          outs = self.train_on_batch(
-              x, y, sample_weight=sample_weight, class_weight=class_weight)
-
-          if not isinstance(outs, list):
-            outs = [outs]
-          for l, o in zip(out_labels, outs):
-            batch_logs[l] = o
-
-          callbacks.on_batch_end(batch_index, batch_logs)
-
-          batch_index += 1
-          steps_done += 1
-
-          # Epoch finished.
-          if steps_done >= steps_per_epoch and do_validation:
-            if val_gen:
-              val_outs = self.evaluate_generator(
-                  validation_generator, validation_steps, workers=0)
-            else:
-              # No need for try/except because
-              # data has already been validated.
-              val_outs = self.evaluate(
-                  val_x,
-                  val_y,
-                  batch_size=batch_size,
-                  sample_weight=val_sample_weights,
-                  verbose=0)
-            if not isinstance(val_outs, list):
-              val_outs = [val_outs]
-            # Same labels assumed.
-            for l, o in zip(out_labels, val_outs):
-              epoch_logs['val_' + l] = o
-
-          if callback_model.stop_training:
-            break
-
-        callbacks.on_epoch_end(epoch, epoch_logs)
-        epoch += 1
-        if callback_model.stop_training:
-          break
-
-    finally:
-      try:
-        if enqueuer is not None:
-          enqueuer.stop()
-      finally:
-        if val_enqueuer is not None:
-          val_enqueuer.stop()
-
-    callbacks.on_train_end()
-    return self.history
+    return training_generator.fit_generator(
+        self,
+        generator,
+        steps_per_epoch=steps_per_epoch,
+        epochs=epochs,
+        verbose=verbose,
+        callbacks=callbacks,
+        validation_data=validation_data,
+        validation_steps=validation_steps,
+        class_weight=class_weight,
+        max_queue_size=max_queue_size,
+        workers=workers,
+        use_multiprocessing=use_multiprocessing,
+        shuffle=shuffle,
+        initial_epoch=initial_epoch)
 
   def evaluate_generator(self,
                          generator,
@@ -2732,87 +1653,13 @@ class Model(Network):
       raise NotImplementedError(
           '`evaluate_generator` is not yet enabled for Model subclasses')
 
-    self._make_test_function()
-
-    steps_done = 0
-    wait_time = 0.01
-    all_outs = []
-    batch_sizes = []
-    is_sequence = isinstance(generator, Sequence)
-    if not is_sequence and use_multiprocessing and workers > 1:
-      logging.warning(
-          UserWarning('Using a generator with `use_multiprocessing=True`'
-                      ' and multiple workers may duplicate your data.'
-                      ' Please consider using the`keras.utils.Sequence'
-                      ' class.'))
-    if steps is None:
-      if is_sequence:
-        steps = len(generator)
-      else:
-        raise ValueError('`steps=None` is only valid for a generator'
-                         ' based on the `keras.utils.Sequence` class.'
-                         ' Please specify `steps` or use the'
-                         ' `keras.utils.Sequence` class.')
-    enqueuer = None
-
-    try:
-      if workers > 0:
-        if is_sequence:
-          enqueuer = OrderedEnqueuer(
-              generator, use_multiprocessing=use_multiprocessing)
-        else:
-          enqueuer = GeneratorEnqueuer(
-              generator,
-              use_multiprocessing=use_multiprocessing,
-              wait_time=wait_time)
-        enqueuer.start(workers=workers, max_queue_size=max_queue_size)
-        output_generator = enqueuer.get()
-      else:
-        output_generator = generator
-
-      while steps_done < steps:
-        generator_output = next(output_generator)
-        if not hasattr(generator_output, '__len__'):
-          raise ValueError('Output of generator should be a tuple '
-                           '(x, y, sample_weight) '
-                           'or (x, y). Found: ' + str(generator_output))
-        if len(generator_output) == 2:
-          x, y = generator_output
-          sample_weight = None
-        elif len(generator_output) == 3:
-          x, y, sample_weight = generator_output
-        else:
-          raise ValueError('Output of generator should be a tuple '
-                           '(x, y, sample_weight) '
-                           'or (x, y). Found: ' + str(generator_output))
-        outs = self.test_on_batch(x, y, sample_weight=sample_weight)
-
-        if isinstance(x, list):
-          batch_size = x[0].shape[0]
-        elif isinstance(x, dict):
-          batch_size = list(x.values())[0].shape[0]
-        else:
-          batch_size = x.shape[0]
-        if batch_size == 0:
-          raise ValueError('Received an empty batch. '
-                           'Batches should at least contain one item.')
-        all_outs.append(outs)
-
-        steps_done += 1
-        batch_sizes.append(batch_size)
-
-    finally:
-      if enqueuer is not None:
-        enqueuer.stop()
-
-    if not isinstance(outs, list):
-      return np.average(np.asarray(all_outs), weights=batch_sizes)
-    else:
-      averages = []
-      for i in range(len(outs)):
-        averages.append(
-            np.average([out[i] for out in all_outs], weights=batch_sizes))
-      return averages
+    return training_generator.evaluate_generator(
+        self,
+        generator,
+        steps=steps,
+        max_queue_size=max_queue_size,
+        workers=workers,
+        use_multiprocessing=use_multiprocessing)
 
   def predict_generator(self,
                         generator,
@@ -2860,88 +1707,11 @@ class Model(Network):
       raise NotImplementedError(
           '`predict_generator` is not yet enabled for Model subclasses')
 
-    self._make_predict_function()
-
-    steps_done = 0
-    wait_time = 0.01
-    all_outs = []
-    is_sequence = isinstance(generator, Sequence)
-    if not is_sequence and use_multiprocessing and workers > 1:
-      logging.warning(
-          UserWarning('Using a generator with `use_multiprocessing=True`'
-                      ' and multiple workers may duplicate your data.'
-                      ' Please consider using the`keras.utils.Sequence'
-                      ' class.'))
-    if steps is None:
-      if is_sequence:
-        steps = len(generator)
-      else:
-        raise ValueError('`steps=None` is only valid for a generator'
-                         ' based on the `keras.utils.Sequence` class.'
-                         ' Please specify `steps` or use the'
-                         ' `keras.utils.Sequence` class.')
-    enqueuer = None
-
-    try:
-      if workers > 0:
-        if is_sequence:
-          enqueuer = OrderedEnqueuer(
-              generator, use_multiprocessing=use_multiprocessing)
-        else:
-          enqueuer = GeneratorEnqueuer(
-              generator,
-              use_multiprocessing=use_multiprocessing,
-              wait_time=wait_time)
-        enqueuer.start(workers=workers, max_queue_size=max_queue_size)
-        output_generator = enqueuer.get()
-      else:
-        output_generator = generator
-
-      if verbose == 1:
-        progbar = Progbar(target=steps)
-
-      while steps_done < steps:
-        generator_output = next(output_generator)
-        if isinstance(generator_output, tuple):
-          # Compatibility with the generators
-          # used for training.
-          if len(generator_output) == 2:
-            x, _ = generator_output
-          elif len(generator_output) == 3:
-            x, _, _ = generator_output
-          else:
-            raise ValueError('Output of generator should be '
-                             'a tuple `(x, y, sample_weight)` '
-                             'or `(x, y)`. Found: ' + str(generator_output))
-        else:
-          # Assumes a generator that only
-          # yields inputs (not targets and sample weights).
-          x = generator_output
-
-        outs = self.predict_on_batch(x)
-        if not isinstance(outs, list):
-          outs = [outs]
-
-        if not all_outs:
-          for out in outs:
-            all_outs.append([])
-
-        for i, out in enumerate(outs):
-          all_outs[i].append(out)
-        steps_done += 1
-        if verbose == 1:
-          progbar.update(steps_done)
-
-    finally:
-      if enqueuer is not None:
-        enqueuer.stop()
-
-    if len(all_outs) == 1:
-      if steps_done == 1:
-        return all_outs[0][0]
-      else:
-        return np.concatenate(all_outs[0])
-    if steps_done == 1:
-      return [out[0] for out in all_outs]
-    else:
-      return [np.concatenate(out) for out in all_outs]
+    return training_generator.predict_generator(
+        self,
+        generator,
+        steps=steps,
+        max_queue_size=max_queue_size,
+        workers=workers,
+        use_multiprocessing=use_multiprocessing,
+        verbose=verbose)
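
Note on the hunks above: the generator-based evaluation code removed from `training.py` combined per-batch results with a batch-size-weighted average (via `np.average(..., weights=batch_sizes)`). The sketch below illustrates that averaging step in plain Python. It is an illustration only, not the code in `training_generator.py`; the function name `weighted_average_outputs` is hypothetical.

```python
def weighted_average_outputs(all_outs, batch_sizes):
    """Average per-batch outputs, weighting each batch by its sample count.

    `all_outs` has one entry per batch: either a scalar (single loss) or a
    list of scalars (loss plus metrics). Hypothetical helper for illustration.
    """
    if not all_outs:
        raise ValueError('Need at least one batch of outputs.')
    if len(all_outs) != len(batch_sizes):
        raise ValueError('Expected one batch size per batch of outputs.')
    total = float(sum(batch_sizes))
    if total == 0:
        raise ValueError('Batches should contain at least one item.')

    # Single scalar output per batch: return one weighted mean.
    if not isinstance(all_outs[0], list):
        return sum(out * n for out, n in zip(all_outs, batch_sizes)) / total

    # Several outputs per batch: average each output position separately.
    num_outputs = len(all_outs[0])
    return [
        sum(out[i] * n for out, n in zip(all_outs, batch_sizes)) / total
        for i in range(num_outputs)
    ]
```

Weighting by batch size matters when the last batch is smaller than the others: an unweighted mean of per-batch losses would over-count its samples.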
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_arrays.py b/tensorflow/python/keras/_impl/keras/engine/training_arrays.py
new file mode 100644
index 0000000000000000000000000000000000000000..9291ef5fe6005b616a6a8e038c586c043de595ca
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/training_arrays.py
@@ -0,0 +1,495 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Part of the Keras training engine related to plain array data.
+"""
+# pylint: disable=protected-access
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import copy
+
+import numpy as np
+
+from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras import callbacks as cbks
+from tensorflow.python.keras._impl.keras.engine import training_utils
+from tensorflow.python.keras._impl.keras.engine.base_layer import Layer
+from tensorflow.python.keras._impl.keras.utils.generic_utils import make_batches
+from tensorflow.python.keras._impl.keras.utils.generic_utils import Progbar
+from tensorflow.python.keras._impl.keras.utils.generic_utils import slice_arrays
+
+try:
+  from scipy.sparse import issparse  # pylint: disable=g-import-not-at-top
+except ImportError:
+  issparse = None
+
+
+def fit_loop(model,
+             inputs,
+             targets,
+             sample_weights=None,
+             batch_size=None,
+             epochs=100,
+             verbose=1,
+             callbacks=None,
+             val_inputs=None,
+             val_targets=None,
+             val_sample_weights=None,
+             shuffle=True,
+             callback_metrics=None,
+             initial_epoch=0,
+             steps_per_epoch=None,
+             validation_steps=None):
+  """Abstract fit function for arrays of data.
+
+  Arguments:
+      model: Keras Model instance.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
+      batch_size: Integer batch size or None if unknown.
+      epochs: Number of times to iterate over the data
+      verbose: Verbosity mode, 0, 1 or 2
+      callbacks: List of callbacks to be called during training
+      val_inputs: List of input arrays.
+      val_targets: List of target arrays.
+      val_sample_weights: Optional list of sample weight arrays.
+      shuffle: Whether to shuffle the data at the beginning of each epoch
+      callback_metrics: List of strings, the display names of the metrics
+          passed to the callbacks. They should be the
+          concatenation of the list of display names of the outputs of
+          `f` and the list of display names of the outputs of `f_val`.
+      initial_epoch: Epoch at which to start training
+          (useful for resuming a previous training run)
+      steps_per_epoch: Total number of steps (batches of samples)
+          before declaring one epoch finished and starting the
+          next epoch. Ignored with the default value of `None`.
+      validation_steps: Number of steps to run validation for
+          (only if doing validation from data tensors).
+          Ignored with the default value of `None`.
+
+  Returns:
+      `History` object.
+
+  Raises:
+      ValueError: in case of invalid arguments.
+  """
+  model._make_train_function()
+  f = model.train_function
+
+  sample_weights = sample_weights or []
+  val_sample_weights = val_sample_weights or []
+  if model.uses_learning_phase and not isinstance(K.learning_phase(), int):
+    ins = inputs + targets + sample_weights + [1]
+    if val_inputs:
+      val_ins = val_inputs + val_targets + val_sample_weights + [1]
+  else:
+    ins = inputs + targets + sample_weights
+    if val_inputs:
+      val_ins = val_inputs + val_targets + val_sample_weights
+  if not val_inputs:
+    val_ins = []
+
+  do_validation = False
+  if val_inputs:
+    do_validation = True
+    if verbose and inputs and hasattr(inputs[0], 'shape') and hasattr(
+        val_inputs[0], 'shape'):
+      print('Train on %d samples, validate on %d samples' %
+            (inputs[0].shape[0], val_inputs[0].shape[0]))
+  if validation_steps:
+    do_validation = True
+    if steps_per_epoch is None:
+      raise ValueError('Can only use `validation_steps` '
+                       'when doing step-wise '
+                       'training, i.e. `steps_per_epoch` '
+                       'must be set.')
+
+  out_labels = model.metrics_names
+  if do_validation:
+    callback_metrics = copy.copy(out_labels) + [
+        'val_' + n for n in out_labels
+    ]
+  else:
+    callback_metrics = copy.copy(out_labels)
+
+  num_train_samples = training_utils.check_num_samples(
+      ins, batch_size, steps_per_epoch, 'steps_per_epoch')
+  if num_train_samples is not None:
+    index_array = np.arange(num_train_samples)
+
+  model.history = cbks.History()
+  all_callbacks = [cbks.BaseLogger(
+      stateful_metrics=model.stateful_metric_names)]
+  if verbose:
+    if steps_per_epoch is not None:
+      count_mode = 'steps'
+    else:
+      count_mode = 'samples'
+    all_callbacks.append(
+        cbks.ProgbarLogger(
+            count_mode, stateful_metrics=model.stateful_metric_names))
+  all_callbacks += (callbacks or []) + [model.history]
+  callbacks = cbks.CallbackList(all_callbacks)
+  out_labels = out_labels or []
+
+  # Callbacks may be attached to a different model than `model`
+  # (used by Sequential models).
+  if hasattr(model, 'callback_model') and model.callback_model:
+    callback_model = model.callback_model
+  else:
+    callback_model = model
+
+  callbacks.set_model(callback_model)
+
+  callbacks.set_params({
+      'batch_size': batch_size,
+      'epochs': epochs,
+      'steps': steps_per_epoch,
+      'samples': num_train_samples,
+      'verbose': verbose,
+      'do_validation': do_validation,
+      'metrics': callback_metrics or [],
+  })
+  callbacks.on_train_begin()
+  callback_model.stop_training = False
+  for cbk in callbacks:
+    cbk.validation_data = val_ins
+
+  # To prevent a slowdown, we find beforehand the arrays that need conversion.
+  feed = model._feed_inputs + model._feed_targets + model._feed_sample_weights
+  indices_for_conversion_to_dense = []
+  for i in range(len(feed)):
+    if issparse is not None and issparse(ins[i]) and not K.is_sparse(feed[i]):
+      indices_for_conversion_to_dense.append(i)
+
+  for epoch in range(initial_epoch, epochs):
+    # Reset stateful metrics
+    for m in model.metrics:
+      if isinstance(m, Layer):
+        m.reset_states()
+    # Update callbacks
+    callbacks.on_epoch_begin(epoch)
+    epoch_logs = {}
+    if steps_per_epoch is not None:
+      for step_index in range(steps_per_epoch):
+        batch_logs = {}
+        batch_logs['batch'] = step_index
+        batch_logs['size'] = 1
+        callbacks.on_batch_begin(step_index, batch_logs)
+        outs = f(ins)
+
+        if not isinstance(outs, list):
+          outs = [outs]
+        for l, o in zip(out_labels, outs):
+          batch_logs[l] = o
+
+        callbacks.on_batch_end(step_index, batch_logs)
+        if callback_model.stop_training:
+          break
+
+      if do_validation:
+        val_outs = test_loop(
+            model,
+            val_inputs,
+            val_targets,
+            sample_weights=val_sample_weights,
+            batch_size=batch_size,
+            steps=validation_steps,
+            verbose=0)
+        if not isinstance(val_outs, list):
+          val_outs = [val_outs]
+        # Same labels assumed.
+        for l, o in zip(out_labels, val_outs):
+          epoch_logs['val_' + l] = o
+    else:
+      if shuffle == 'batch':
+        index_array = training_utils.batch_shuffle(index_array, batch_size)
+      elif shuffle:
+        np.random.shuffle(index_array)
+
+      batches = make_batches(num_train_samples, batch_size)
+
+      for batch_index, (batch_start, batch_end) in enumerate(batches):
+        batch_ids = index_array[batch_start:batch_end]
+        try:
+          if isinstance(ins[-1], int):
+            # Do not slice the training phase flag.
+            ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
+          else:
+            ins_batch = slice_arrays(ins, batch_ids)
+        except TypeError:
+          raise TypeError('TypeError while preparing batch. '
+                          'If using HDF5 input data, '
+                          'pass shuffle="batch".')
+        batch_logs = {}
+        batch_logs['batch'] = batch_index
+        batch_logs['size'] = len(batch_ids)
+        callbacks.on_batch_begin(batch_index, batch_logs)
+        for i in indices_for_conversion_to_dense:
+          ins_batch[i] = ins_batch[i].toarray()
+
+        outs = f(ins_batch)
+        if not isinstance(outs, list):
+          outs = [outs]
+        for l, o in zip(out_labels, outs):
+          batch_logs[l] = o
+
+        callbacks.on_batch_end(batch_index, batch_logs)
+        if callback_model.stop_training:
+          break
+
+        if batch_index == len(batches) - 1:  # Last batch.
+          if do_validation:
+            val_outs = test_loop(
+                model,
+                val_inputs,
+                val_targets,
+                sample_weights=val_sample_weights,
+                batch_size=batch_size,
+                verbose=0)
+            if not isinstance(val_outs, list):
+              val_outs = [val_outs]
+            # Same labels assumed.
+            for l, o in zip(out_labels, val_outs):
+              epoch_logs['val_' + l] = o
+    callbacks.on_epoch_end(epoch, epoch_logs)
+    if callback_model.stop_training:
+      break
+  callbacks.on_train_end()
+  return model.history
+
+
+def predict_loop(model, inputs, batch_size=32, verbose=0, steps=None):
+  """Abstract method to loop over some data in batches.
+
+  Arguments:
+      model: Keras Model instance.
+      inputs: list of tensors to be fed to `f`.
+      batch_size: integer batch size.
+      verbose: verbosity mode.
+      steps: Total number of steps (batches of samples)
+          before declaring `predict_loop` finished.
+          Ignored with the default value of `None`.
+
+  Returns:
+      Array of predictions (if the model has a single output)
+      or list of arrays of predictions
+      (if the model has multiple outputs).
+  """
+  model._make_predict_function()
+  f = model.predict_function
+
+  if model.uses_learning_phase and not isinstance(K.learning_phase(), int):
+    ins = inputs + [0]
+  else:
+    ins = inputs
+
+  if hasattr(model, 'metrics'):
+    for m in model.metrics:
+      if isinstance(m, Layer):
+        m.reset_states()
+
+  num_samples = training_utils.check_num_samples(
+      inputs, batch_size, steps, 'steps')
+  if verbose == 1:
+    if steps is not None:
+      progbar = Progbar(target=steps,
+                        stateful_metrics=model.stateful_metric_names)
+    else:
+      progbar = Progbar(target=num_samples,
+                        stateful_metrics=model.stateful_metric_names)
+
+  indices_for_conversion_to_dense = []
+  for i in range(len(model._feed_inputs)):
+    if (issparse is not None and issparse(inputs[i]) and
+        not K.is_sparse(model._feed_inputs[i])):
+      indices_for_conversion_to_dense.append(i)
+
+  if steps is not None:
+    # Step-based predictions.
+    # Since we do not know how many samples
+    # we will see, we cannot pre-allocate
+    # the returned Numpy arrays.
+    # Instead, we store one array per batch seen
+    # and concatenate them upon returning.
+    unconcatenated_outs = []
+    for step in range(steps):
+      batch_outs = f(ins)
+      if not isinstance(batch_outs, list):
+        batch_outs = [batch_outs]
+      if step == 0:
+        for batch_out in batch_outs:
+          unconcatenated_outs.append([])
+      for i, batch_out in enumerate(batch_outs):
+        unconcatenated_outs[i].append(batch_out)
+      if verbose == 1:
+        progbar.update(step + 1)
+    if len(unconcatenated_outs) == 1:
+      return np.concatenate(unconcatenated_outs[0], axis=0)
+    return [
+        np.concatenate(unconcatenated_outs[i], axis=0)
+        for i in range(len(unconcatenated_outs))
+    ]
+  else:
+    # Sample-based predictions.
+    outs = []
+    batches = make_batches(num_samples, batch_size)
+    index_array = np.arange(num_samples)
+    for batch_index, (batch_start, batch_end) in enumerate(batches):
+      batch_ids = index_array[batch_start:batch_end]
+      if ins and isinstance(ins[-1], int):
+        # Do not slice the training phase flag.
+        ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
+      else:
+        ins_batch = slice_arrays(ins, batch_ids)
+      for i in indices_for_conversion_to_dense:
+        ins_batch[i] = ins_batch[i].toarray()
+
+      batch_outs = f(ins_batch)
+      if not isinstance(batch_outs, list):
+        batch_outs = [batch_outs]
+      if batch_index == 0:
+        # Pre-allocate the results arrays.
+        for batch_out in batch_outs:
+          shape = (num_samples,) + batch_out.shape[1:]
+          outs.append(np.zeros(shape, dtype=batch_out.dtype))
+      for i, batch_out in enumerate(batch_outs):
+        outs[i][batch_start:batch_end] = batch_out
+      if verbose == 1:
+        progbar.update(batch_end)
+    if len(outs) == 1:
+      return outs[0]
+    return outs
+
+
+def test_loop(model, inputs, targets,
+              sample_weights=None,
+              batch_size=None,
+              verbose=0,
+              steps=None):
+  """Abstract method to loop over some data in batches.
+
+  Arguments:
+      model: Keras Model instance.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
+      batch_size: integer batch size or `None`.
+      verbose: verbosity mode.
+      steps: Total number of steps (batches of samples)
+          before declaring evaluation finished.
+          Ignored with the default value of `None`.
+
+  Returns:
+      Scalar loss (if the model has a single output and no metrics)
+      or list of scalars (if the model has multiple outputs
+      and/or metrics). The attribute `model.metrics_names` will give you
+      the display labels for the scalar outputs.
+  """
+  model._make_test_function()
+  f = model.test_function
+
+  sample_weights = sample_weights or []
+  if model.uses_learning_phase and not isinstance(K.learning_phase(), int):
+    ins = inputs + targets + sample_weights + [0]
+  else:
+    ins = inputs + targets + sample_weights
+
+  if hasattr(model, 'metrics'):
+    for m in model.metrics:
+      if isinstance(m, Layer):
+        m.reset_states()
+    stateful_metric_indices = [
+        i for i, name in enumerate(model.metrics_names)
+        if str(name) in model.stateful_metric_names
+    ]
+  else:
+    stateful_metric_indices = []
+
+  num_samples = training_utils.check_num_samples(
+      ins, batch_size, steps, 'steps')
+  outs = []
+  if verbose == 1:
+    if steps is not None:
+      progbar = Progbar(target=steps)
+    else:
+      progbar = Progbar(target=num_samples)
+
+  # To prevent a slowdown, we find beforehand the arrays that need conversion.
+  feed = model._feed_inputs + model._feed_targets + model._feed_sample_weights
+  indices_for_conversion_to_dense = []
+  for i in range(len(feed)):
+    if issparse is not None and issparse(ins[i]) and not K.is_sparse(feed[i]):
+      indices_for_conversion_to_dense.append(i)
+
+  if steps is not None:
+    for step in range(steps):
+      batch_outs = f(ins)
+      if isinstance(batch_outs, list):
+        if step == 0:
+          for _ in enumerate(batch_outs):
+            outs.append(0.)
+        for i, batch_out in enumerate(batch_outs):
+          if i in stateful_metric_indices:
+            outs[i] = batch_out
+          else:
+            outs[i] += batch_out
+      else:
+        if step == 0:
+          outs.append(0.)
+        outs[0] += batch_outs
+      if verbose == 1:
+        progbar.update(step + 1)
+    for i in range(len(outs)):
+      if i not in stateful_metric_indices:
+        outs[i] /= steps
+  else:
+    batches = make_batches(num_samples, batch_size)
+    index_array = np.arange(num_samples)
+    for batch_index, (batch_start, batch_end) in enumerate(batches):
+      batch_ids = index_array[batch_start:batch_end]
+      if isinstance(ins[-1], int):
+        # Do not slice the training phase flag.
+        ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
+      else:
+        ins_batch = slice_arrays(ins, batch_ids)
+      for i in indices_for_conversion_to_dense:
+        ins_batch[i] = ins_batch[i].toarray()
+
+      batch_outs = f(ins_batch)
+
+      if isinstance(batch_outs, list):
+        if batch_index == 0:
+          for _ in enumerate(batch_outs):
+            outs.append(0.)
+        for i, batch_out in enumerate(batch_outs):
+          if i in stateful_metric_indices:
+            outs[i] = batch_out
+          else:
+            outs[i] += batch_out * len(batch_ids)
+      else:
+        if batch_index == 0:
+          outs.append(0.)
+        outs[0] += batch_outs * len(batch_ids)
+      if verbose == 1:
+        progbar.update(batch_end)
+    for i in range(len(outs)):
+      if i not in stateful_metric_indices:
+        outs[i] /= num_samples
+  if len(outs) == 1:
+    return outs[0]
+  return outs
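Note on the sample-based branch of `test_loop` above: each non-stateful output is accumulated as `batch_out * len(batch_ids)` and divided by `num_samples` at the end, so the result is a sample-weighted mean across batches, while stateful metrics simply keep their most recent value. The following is a minimal standalone sketch of that aggregation, not the code from the patch; the helper name `aggregate_batch_outputs` and its arguments are hypothetical and exist only for illustration.

```python
def aggregate_batch_outputs(per_batch_outs, batch_sizes, stateful_indices=()):
  """Combines per-batch scalars like the sample-based branch of `test_loop`.

  Hypothetical helper for illustration only. Non-stateful outputs become a
  sample-weighted mean; stateful outputs keep the value from the last batch.
  """
  num_samples = sum(batch_sizes)
  outs = []
  for batch_index, (batch_outs, size) in enumerate(
      zip(per_batch_outs, batch_sizes)):
    if batch_index == 0:
      # One accumulator per model output.
      outs = [0.] * len(batch_outs)
    for i, value in enumerate(batch_outs):
      if i in stateful_indices:
        # Stateful metrics already aggregate internally; keep the latest.
        outs[i] = value
      else:
        # Weight by batch size so a short final batch counts for less.
        outs[i] += value * size
  for i in range(len(outs)):
    if i not in stateful_indices:
      outs[i] /= num_samples
  return outs


# Three batches of sizes 4, 4 and 2; output 0 is a loss, output 1 is
# treated as a stateful metric.
result = aggregate_batch_outputs(
    [[1.0, 0.5], [2.0, 0.6], [4.0, 0.7]], [4, 4, 2], stateful_indices={1})
```

Here the loss is (1.0 * 4 + 2.0 * 4 + 4.0 * 2) / 10, and the stateful metric is the value reported by the final batch. The step-based branch differs only in that every step has weight 1 and the sum is divided by `steps`.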
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_eager.py b/tensorflow/python/keras/_impl/keras/engine/training_eager.py
index 477bb2fe7ac44f1f52191a113c495360400b8d75..75c96e6916fd37511da75d48cf7587cddb040cce 100644
--- a/tensorflow/python/keras/_impl/keras/engine/training_eager.py
+++ b/tensorflow/python/keras/_impl/keras/engine/training_eager.py
@@ -12,13 +12,17 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Keras training and evaluation routines.
+"""Keras training and evaluation routines for eager execution.
 """
 # pylint: disable=protected-access
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
+
+import copy
+
 import numpy as np
+
 from tensorflow.python.eager.backprop import GradientTape
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import tensor_util
@@ -26,9 +30,11 @@ from tensorflow.python.keras._impl.keras import backend as K
 from tensorflow.python.keras._impl.keras import callbacks as cbks
 from tensorflow.python.keras._impl.keras import losses
 from tensorflow.python.keras._impl.keras import metrics as metrics_module
+from tensorflow.python.keras._impl.keras.engine import training_utils
 from tensorflow.python.keras._impl.keras.utils.generic_utils import make_batches
 from tensorflow.python.keras._impl.keras.utils.generic_utils import Progbar
 from tensorflow.python.keras._impl.keras.utils.generic_utils import slice_arrays
+from tensorflow.python.platform import tf_logging as logging
 
 
 def _get_metrics_info(metric, internal_output_shapes=None, loss_func=None):
@@ -98,14 +104,15 @@ def _eager_metrics_fn(model, outputs, targets):
   return metric_names, metric_results
 
 
-def _model_loss(model, inputs, targets):
+def _model_loss(model, inputs, targets, sample_weights=None, training=False):
   """Calculates the loss for a given model.
 
   Arguments:
-     model: The model on which metrics are being calculated.
-     inputs: The inputs of the given model. This is typically the mini batch of
-              data that is fed to the model.
-     targets: The predictions or targets of the given model.
+      model: The model on which metrics are being calculated.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
+      training: Whether the model should be run in inference or training mode.
 
   Returns:
      Returns the model output, total loss and loss value calculated using the
@@ -114,9 +121,15 @@ def _model_loss(model, inputs, targets):
   """
   total_loss = 0
   if len(inputs) == 1:
-    outs = model.call(inputs[0])
+    if model._expects_training_arg:
+      outs = model.call(inputs[0], training=training)
+    else:
+      outs = model.call(inputs[0])
   else:
-    outs = model.call(inputs)
+    if model._expects_training_arg:
+      outs = model.call(inputs, training=training)
+    else:
+      outs = model.call(inputs)
   if not isinstance(outs, list):
     outs = [outs]
 
@@ -126,31 +139,20 @@ def _model_loss(model, inputs, targets):
   loss_metrics = []
   with K.name_scope('loss'):
     for i, loss_fn in enumerate(model.loss_functions):
-      # compute the loss
-      output_loss = _eager_loss_fn(outs[i], targets[i], loss_fn,
-                                   model.output_names[i])
-      loss_metrics.append(K.mean(output_loss))
+      if sample_weights:
+        weights = sample_weights[i]
+      else:
+        weights = None
 
+      # TODO(fchollet): support masking; in practice `_keras_mask` is never
+      # set in this context currently.
       mask = outs[i]._keras_mask
-      # adapted from weighted_loss_fn
-      if mask is not None:
-        # mask should have the same shape as output_loss
-        output_loss *= mask
-        #  the loss per batch should be proportional
-        #  to the number of unmasked samples.
-        output_loss /= K.mean(mask)
-
-      # adapted from weighted_loss_fn
-      # apply sample weighting
-      if model.sample_weights:
-        # reduce score_array to same ndim as weight array
-        ndim = K.ndim(output_loss)
-        weight_ndim = K.ndim(model.sample_weights)
-        output_loss = K.mean(output_loss, axis=list(range(weight_ndim, ndim)))
-        output_loss *= model.sample_weights
-        output_loss /= K.mean(K.cast(K.not_equal(model.sample_weights, 0),
-                                     K.floatx()))
-        output_loss = K.mean(output_loss)
+
+      weighted_masked_fn = training_utils.weighted_masked_objective(loss_fn)
+      with K.name_scope(model.output_names[i] + '_loss'):
+        output_loss = weighted_masked_fn(
+            outs[i], targets[i], weights, mask=mask)
+      loss_metrics.append(K.mean(output_loss))
 
       loss_weight = model.loss_weights_list[i]
       if total_loss is None:
@@ -171,16 +173,20 @@ def _model_loss(model, inputs, targets):
   return outs, total_loss, loss_metrics
 
 
-def _process_single_batch(eager_model_inputs, eager_model_outputs, model,
-                          training=True):
+def _process_single_batch(model,
+                          inputs,
+                          targets,
+                          sample_weights=None,
+                          training=False):
   """Calculate the loss and gradient for one input batch.
 
      The model weights are updated if training is set to True.
 
   Arguments:
-      eager_model_inputs: Input batch data.
-      eager_model_outputs: Output batch data.
       model: Model whose loss has to be calculated.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
       training: The boolean represents if the weights of the model are updated.
               'fit' methods will set this to True while 'evaluate' methods will
               set this to False.
@@ -189,82 +195,83 @@ def _process_single_batch(eager_model_inputs, eager_model_outputs, model,
       output of the model, total loss and the loss associated with each output.
 
   Raises:
-      ValueError: If the model loss is 0 or if the trainable weights list is
-                  empty when the trainable parameter is set to True.
+      ValueError: If the model has no loss to optimize.
   """
   K.set_learning_phase(training)
   with GradientTape() as tape:
-    outs, loss, loss_metrics = _model_loss(model, eager_model_inputs,
-                                           eager_model_outputs)
+    outs, loss, loss_metrics = _model_loss(model, inputs, targets,
+                                           sample_weights=sample_weights,
+                                           training=training)
     if loss is None:
       raise ValueError('The model cannot be run '
                        'because it has no loss to optimize.')
   if training:
     if not model._collected_trainable_weights:
-      raise ValueError('The list of trainable weights is empty. Make sure that '
-                       'you are not setting model.trainable to False before '
-                       'compiling the model.')
-    grads = tape.gradient(loss, model._collected_trainable_weights)
-    model.optimizer.apply_gradients(zip(grads,
-                                        model._collected_trainable_weights))
+      logging.warning('The list of trainable weights is empty. Make sure that '
+                      'you are not setting model.trainable to False before '
+                      'compiling the model.')
+    else:
+      grads = tape.gradient(loss, model._collected_trainable_weights)
+      model.optimizer.apply_gradients(zip(grads,
+                                          model._collected_trainable_weights))
   return outs, loss, loss_metrics
 
 
-def train_on_batch(model, ins):
+def train_on_batch(model, inputs, targets, sample_weights=None):
   """Calculates the loss and gradient updates for one input batch.
 
   Arguments:
-      model: Given model on which loss and gradients are calculated.
-      ins: Input and output batch numpy arrays.
+      model: Model whose loss has to be calculated.
+      inputs: Input batch data.
+      targets: Target batch data.
+      sample_weights: Sample weight batch data.
 
   Returns:
       total loss and the loss associated with each output.
   """
-  ins_batch_converted = []
-  for ib in ins:
-    ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
-  eager_model_inputs = []
-  eager_model_outputs = []
-  for i in range(len(model.inputs)):
-    eager_model_inputs.append(ins_batch_converted[i])
-  for i in range(len(model.inputs), len(ins_batch_converted)):
-    eager_model_outputs.append(ins_batch_converted[i])
+  inputs = [
+      ops.convert_to_tensor(val, dtype=K.floatx()) for val in inputs]
+  targets = [
+      ops.convert_to_tensor(val, dtype=K.floatx()) for val in targets]
+  sample_weights = [
+      ops.convert_to_tensor(val, dtype=K.floatx())
+      if val is not None else None for val in sample_weights or []]
   outs, loss, _ = _process_single_batch(
-      eager_model_inputs, eager_model_outputs, model)
+      model, inputs, targets, sample_weights=sample_weights, training=True)
   if not isinstance(outs, list):
     outs = [outs]
   _, metrics_results = _eager_metrics_fn(
-      model, outs, eager_model_outputs)
+      model, outs, targets)
   if not isinstance(loss, list):
     loss = [loss]
   return loss + metrics_results
 
 
-def test_on_batch(model, ins):
+def test_on_batch(model, inputs, targets, sample_weights=None):
   """Calculates the loss for one input batch.
 
   Arguments:
-      model: Given model on which loss is calculated.
-      ins: Input and output batch numpy arrays.
+      model: Model whose loss has to be calculated.
+      inputs: Input batch data.
+      targets: Target batch data.
+      sample_weights: Sample weight batch data.
 
   Returns:
       total loss, loss and metrics associated with each output.
   """
-  ins_batch_converted = []
-  for ib in ins:
-    ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
-  eager_model_inputs = []
-  eager_model_outputs = []
-  for i in range(len(model.inputs)):
-    eager_model_inputs.append(ins_batch_converted[i])
-  for i in range(len(model.inputs), len(ins_batch_converted)):
-    eager_model_outputs.append(ins_batch_converted[i])
+  inputs = [
+      ops.convert_to_tensor(val, dtype=K.floatx()) for val in inputs]
+  targets = [
+      ops.convert_to_tensor(val, dtype=K.floatx()) for val in targets]
+  sample_weights = [
+      ops.convert_to_tensor(val, dtype=K.floatx())
+      if val is not None else None for val in sample_weights or []]
   outs, loss, loss_metrics = _process_single_batch(
-      eager_model_inputs, eager_model_outputs, model, training=False)
+      model, inputs, targets, sample_weights=sample_weights, training=False)
   if not isinstance(outs, list):
     outs = [outs]
   metric_names, metrics_results = _eager_metrics_fn(
-      model, outs, eager_model_outputs)
+      model, outs, targets)
   model.metrics_names.append(metric_names)
   if not isinstance(loss, list):
     loss = [loss]
@@ -273,32 +280,35 @@ def test_on_batch(model, ins):
 
 def fit_loop(
     model,
-    ins,
-    out_labels=None,
+    inputs,
+    targets,
+    sample_weights=None,
+    val_inputs=None,
+    val_targets=None,
+    val_sample_weights=None,
     batch_size=None,
     epochs=100,
     verbose=1,
     callbacks=None,
-    val_ins=None,
     shuffle=True,
     callback_metrics=None,
     initial_epoch=0,
     steps_per_epoch=None,
     validation_steps=None):
-  """Abstract fit function for `f(ins)`.
-
-  Assume that f returns a list, labeled by out_labels.
+  """Abstract fit function for eager execution.
 
   Arguments:
       model: Instance of the model that is being executed in Eager mode.
-      ins: List of tensors to be fed to `f`
-      out_labels: List of strings, display names of
-          the outputs of `f`
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
+      val_inputs: Input data for validation.
+      val_targets: Target data for validation.
+      val_sample_weights: Sample weight data for validation.
       batch_size: Integer batch size or None if unknown.
       epochs: Number of times to iterate over the data
       verbose: Verbosity mode, 0, 1 or 2
       callbacks: List of callbacks to be called during training
-      val_ins: List of tensors to be fed to `val_f`
       shuffle: Whether to shuffle the data at the beginning of each epoch
       callback_metrics: List of strings, the display names of the metrics
           passed to the callbacks. They should be the
@@ -322,20 +332,35 @@ def fit_loop(
   K.set_learning_phase(True)
 
   do_validation = False
-  if val_ins:
+  if val_inputs:
     do_validation = True
-    if (verbose and ins and hasattr(ins[0], 'shape') and
-        hasattr(val_ins[0], 'shape')):
+    if (verbose and inputs and hasattr(inputs[0], 'shape') and
+        hasattr(val_inputs[0], 'shape')):
       print('Train on %d samples, validate on %d samples' %
-            (ins[0].shape[0], val_ins[0].shape[0]))
+            (inputs[0].shape[0], val_inputs[0].shape[0]))
   if validation_steps:
     if steps_per_epoch is None:
       raise ValueError('Can only use `validation_steps` when doing step-wise '
                        'training, i.e. `steps_per_epoch` must be set.')
     do_validation = True
 
-  num_train_samples = model._check_num_samples(
-      ins, batch_size, steps_per_epoch, 'steps_per_epoch')
+  out_labels = model.metrics_names
+  if do_validation:
+    callback_metrics = copy.copy(out_labels) + [
+        'val_' + n for n in out_labels
+    ]
+  else:
+    callback_metrics = copy.copy(out_labels)
+
+  if sample_weights:
+    feed_data = inputs + targets + sample_weights
+  else:
+    feed_data = inputs + targets
+  num_train_samples = training_utils.check_num_samples(
+      feed_data,
+      batch_size=batch_size,
+      steps=steps_per_epoch,
+      steps_name='steps_per_epoch')
 
   if num_train_samples is not None:
     index_array = np.arange(num_train_samples)
@@ -349,7 +374,6 @@ def fit_loop(
       count_mode = 'samples'
     callbacks += [cbks.ProgbarLogger(count_mode)]
   callbacks = cbks.CallbackList(callbacks)
-  out_labels = out_labels or []
 
   # it's possible to callback a different model than self
   # (used by Sequential models)
@@ -372,7 +396,12 @@ def fit_loop(
   callbacks.on_train_begin()
   callback_model.stop_training = False
   for cbk in callbacks:
-    cbk.validation_data = val_ins
+    if not val_inputs:
+      cbk.validation_data = []
+    elif val_sample_weights:
+      cbk.validation_data = val_inputs + val_targets + val_sample_weights
+    else:
+      cbk.validation_data = val_inputs + val_targets
 
   for epoch in range(initial_epoch, epochs):
     callbacks.on_epoch_begin(epoch)
@@ -387,11 +416,12 @@ def fit_loop(
     for batch_index, (batch_start, batch_end) in enumerate(batches):
       batch_ids = index_array[batch_start:batch_end]
       try:
-        if isinstance(ins[-1], float):
-          # Do not slice the training phase flag.
-          ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
+        inputs_batch = slice_arrays(inputs, batch_ids)
+        targets_batch = slice_arrays(targets, batch_ids)
+        if sample_weights:
+          sample_weights_batch = slice_arrays(sample_weights, batch_ids)
         else:
-          ins_batch = slice_arrays(ins, batch_ids)
+          sample_weights_batch = None
       except TypeError:
         raise TypeError('TypeError while preparing batch. '
                         'If using HDF5 input data, '
@@ -402,20 +432,22 @@ def fit_loop(
 
       callbacks.on_batch_begin(batch_index, batch_logs)
 
-      ins_batch_converted = []
-      for ib in ins_batch:
-        ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
-      eager_model_inputs = []
-      eager_model_outputs = []
-      for i in range(len(model.inputs)):
-        eager_model_inputs.append(ins_batch_converted[i])
-
-      for i in range(len(model.inputs), len(ins_batch_converted)):
-        eager_model_outputs.append(ins_batch_converted[i])
-
-      outs, loss, loss_metrics = _process_single_batch(eager_model_inputs,
-                                                       eager_model_outputs,
-                                                       model)
+      inputs_batch = [
+          ops.convert_to_tensor(val, dtype=K.floatx()) for val in inputs_batch]
+      targets_batch = [
+          ops.convert_to_tensor(val, dtype=K.floatx()) for val in targets_batch]
+      if sample_weights:
+        sample_weights_batch = [
+            ops.convert_to_tensor(val, dtype=K.floatx())
+            if val is not None else None
+            for val in sample_weights_batch]
+
+      outs, loss, loss_metrics = _process_single_batch(
+          model,
+          inputs_batch,
+          targets_batch,
+          sample_weights=sample_weights_batch,
+          training=True)
 
       if not isinstance(outs, list):
         outs = [outs]
@@ -423,8 +455,8 @@ def fit_loop(
       for l, o in zip(out_labels, outs):
         batch_logs[l] = o
       # Required for Eager mode
-      metrics_names, metrics_results = _eager_metrics_fn(model, outs,
-                                                         eager_model_outputs)
+      metrics_names, metrics_results = _eager_metrics_fn(
+          model, outs, targets_batch)
       batch_logs['loss'] = tensor_util.constant_value(K.mean(loss))
 
       # TODO(anjalisridhar): Move this to compile to avoid duplicate code.
@@ -458,7 +490,10 @@ def fit_loop(
       if batch_index == len(batches) - 1:  # Last batch.
         if do_validation:
           val_outs = test_loop(
-              model, val_ins, batch_size=batch_size, verbose=0)
+              model, val_inputs, val_targets,
+              sample_weights=val_sample_weights,
+              batch_size=batch_size,
+              verbose=0)
           if not isinstance(val_outs, list):
             val_outs = [val_outs]
           # Same labels assumed.
@@ -471,12 +506,18 @@ def fit_loop(
   return model.history
 
 
-def test_loop(model, ins, batch_size=None, verbose=0, steps=None):
+def test_loop(model, inputs, targets,
+              sample_weights=None,
+              batch_size=None,
+              verbose=0,
+              steps=None):
   """Abstract method to loop over some data in batches.
 
   Arguments:
       model: Model instance that is being evaluated in Eager mode.
-      ins: list of tensors to be fed to `f`.
+      inputs: List of input arrays.
+      targets: List of target arrays.
+      sample_weights: Optional list of sample weight arrays.
       batch_size: integer batch size or `None`.
       verbose: verbosity mode.
       steps: Total number of steps (batches of samples)
@@ -490,7 +531,11 @@ def test_loop(model, ins, batch_size=None, verbose=0, steps=None):
       the display labels for the scalar outputs.
   """
   K.set_learning_phase(False)
-  num_samples = model._check_num_samples(ins, batch_size, steps, 'steps')
+  feed_data = inputs + targets
+  if sample_weights:
+    feed_data += sample_weights
+  num_samples = training_utils.check_num_samples(
+      feed_data, batch_size=batch_size, steps=steps, steps_name='steps')
   outs = []
   if verbose == 1:
     progbar = Progbar(target=num_samples)
@@ -498,28 +543,30 @@ def test_loop(model, ins, batch_size=None, verbose=0, steps=None):
   index_array = np.arange(num_samples)
   for batch_index, (batch_start, batch_end) in enumerate(batches):
     batch_ids = index_array[batch_start:batch_end]
-    if isinstance(ins[-1], float):
-      # Do not slice the training phase flag.
-      ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
+    inputs_batch = slice_arrays(inputs, batch_ids)
+    targets_batch = slice_arrays(targets, batch_ids)
+    if sample_weights:
+      sample_weights_batch = slice_arrays(sample_weights, batch_ids)
     else:
-      ins_batch = slice_arrays(ins, batch_ids)
-
-    ins_batch_converted = []
-    for ib in ins_batch:
-      ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
-
-    eager_model_inputs = []
-    eager_model_outputs = []
-    for i in range(len(model.inputs)):
-      eager_model_inputs.append(ins_batch_converted[i])
-
-    for i in range(len(model.inputs), len(ins_batch_converted)):
-      eager_model_outputs.append(ins_batch_converted[i])
-
-    loss_outs, loss, loss_metrics = _model_loss(model, eager_model_inputs,
-                                                eager_model_outputs)
-    _, metrics_results = _eager_metrics_fn(model, loss_outs,
-                                           eager_model_outputs)
+      sample_weights_batch = None
+
+    inputs_batch = [
+        ops.convert_to_tensor(val, dtype=K.floatx()) for val in inputs_batch]
+    targets_batch = [
+        ops.convert_to_tensor(val, dtype=K.floatx()) for val in targets_batch]
+    if sample_weights:
+      sample_weights_batch = [
+          ops.convert_to_tensor(val, dtype=K.floatx())
+          if val is not None else None
+          for val in sample_weights_batch]
+
+    loss_outs, loss, loss_metrics = _model_loss(
+        model,
+        inputs_batch,
+        targets_batch,
+        sample_weights=sample_weights_batch,
+        training=False)
+    _, metrics_results = _eager_metrics_fn(model, loss_outs, targets_batch)
     batch_outs = []
     for _, v in zip(model.metrics_names,
                     [K.mean(loss)] + loss_metrics + metrics_results):
@@ -545,12 +592,15 @@ def test_loop(model, ins, batch_size=None, verbose=0, steps=None):
   return outs
 
 
-def predict_loop(model, ins, batch_size=32, verbose=0, steps=None):
+def predict_loop(model, inputs,
+                 batch_size=32,
+                 verbose=0,
+                 steps=None):
   """Abstract method to loop over some data in batches.
 
   Arguments:
       model:
-      ins: list of tensors to be fed to `f`.
+      inputs: List of input arrays.
       batch_size: integer batch size.
       verbose: verbosity mode.
       steps: Total number of steps (batches of samples)
@@ -563,7 +613,8 @@ def predict_loop(model, ins, batch_size=32, verbose=0, steps=None):
       (if the model has multiple outputs).
   """
   K.set_learning_phase(False)
-  num_samples = model._check_num_samples(ins, batch_size, steps, 'steps')
+  num_samples = training_utils.check_num_samples(
+      inputs, batch_size, steps, 'steps')
   if verbose == 1:
     if steps is not None:
       progbar = Progbar(target=steps)
@@ -575,24 +626,21 @@ def predict_loop(model, ins, batch_size=32, verbose=0, steps=None):
   index_array = np.arange(num_samples)
   for batch_index, (batch_start, batch_end) in enumerate(batches):
     batch_ids = index_array[batch_start:batch_end]
-    if ins and isinstance(ins[-1], float):
-      # Do not slice the training phase flag.
-      ins_batch = slice_arrays(ins[:-1], batch_ids) + [ins[-1]]
-    else:
-      ins_batch = slice_arrays(ins, batch_ids)
+    inputs_batch = slice_arrays(inputs, batch_ids)
 
-    ins_batch_converted = []
-    for ib in ins_batch:
-      ins_batch_converted.append(ops.convert_to_tensor(ib, dtype=K.floatx()))
+    inputs_batch = [
+        ops.convert_to_tensor(val, dtype=K.floatx()) for val in inputs_batch]
 
-    eager_model_inputs = []
-    for i in range(len(model.inputs)):
-      eager_model_inputs.append(ins_batch_converted[i])
-
-    if len(eager_model_inputs) == 1:
-      batch_outs = model.call(eager_model_inputs[0])
+    if len(inputs_batch) == 1:
+      if model._expects_training_arg:
+        batch_outs = model.call(inputs_batch[0], training=False)
+      else:
+        batch_outs = model.call(inputs_batch[0])
     else:
-      batch_outs = model.call(eager_model_inputs)
+      if model._expects_training_arg:
+        batch_outs = model.call(inputs_batch, training=False)
+      else:
+        batch_outs = model.call(inputs_batch)
 
     if not isinstance(batch_outs, list):
       batch_outs = [batch_outs]
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_eager_test.py b/tensorflow/python/keras/_impl/keras/engine/training_eager_test.py
index 45601f964a090fd927a22eb525d3c1c154fd71db..8848b393d5e602e564cb357c32a937eaabd68203 100644
--- a/tensorflow/python/keras/_impl/keras/engine/training_eager_test.py
+++ b/tensorflow/python/keras/_impl/keras/engine/training_eager_test.py
@@ -24,7 +24,6 @@ import numpy as np
 from tensorflow.python.framework import ops
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
-from tensorflow.python.keras._impl.keras.utils.generic_utils import slice_arrays
 from tensorflow.python.platform import test
 from tensorflow.python.training.rmsprop import RMSPropOptimizer
 
@@ -315,10 +314,9 @@ class LossWeightingTest(test.TestCase):
   def test_class_weights(self):
     num_classes = 5
     batch_size = 5
-    epochs = 5
     weighted_class = 3
-    train_samples = 3000
-    test_samples = 3000
+    train_samples = 300
+    test_samples = 300
     input_dim = 5
 
     model = keras.models.Sequential()
@@ -343,16 +341,16 @@ class LossWeightingTest(test.TestCase):
     test_ids = np.where(int_y_test == np.array(weighted_class))[0]
 
     class_weight = dict([(i, 1.) for i in range(num_classes)])
-    class_weight[weighted_class] = 2.
+    class_weight[weighted_class] = 4.
 
     sample_weight = np.ones((y_train.shape[0]))
-    sample_weight[int_y_train == weighted_class] = 2.
+    sample_weight[int_y_train == weighted_class] = 4.
 
     model.fit(
         x_train,
         y_train,
         batch_size=batch_size,
-        epochs=epochs // 3,
+        epochs=2,
         verbose=0,
         class_weight=class_weight,
         validation_data=(x_train, y_train, sample_weight))
@@ -360,14 +358,14 @@ class LossWeightingTest(test.TestCase):
         x_train,
         y_train,
         batch_size=batch_size,
-        epochs=epochs // 2,
+        epochs=2,
         verbose=0,
         class_weight=class_weight)
     model.fit(
         x_train,
         y_train,
         batch_size=batch_size,
-        epochs=epochs // 2,
+        epochs=2,
         verbose=0,
         class_weight=class_weight,
         validation_split=0.1)
@@ -382,10 +380,9 @@ class LossWeightingTest(test.TestCase):
   def test_sample_weights(self):
     num_classes = 5
     batch_size = 5
-    epochs = 5
     weighted_class = 3
-    train_samples = 3000
-    test_samples = 3000
+    train_samples = 300
+    test_samples = 300
     input_dim = 5
 
     model = keras.models.Sequential()
@@ -397,36 +394,32 @@ class LossWeightingTest(test.TestCase):
                   optimizer=RMSPropOptimizer(learning_rate=0.001))
 
     np.random.seed(43)
-    (x_train, y_train), (x_test, y_test) = testing_utils.get_test_data(
+    (x_train, y_train), _ = testing_utils.get_test_data(
         train_samples=train_samples,
         test_samples=test_samples,
         input_shape=(input_dim,),
         num_classes=num_classes)
-    int_y_test = y_test.copy()
     int_y_train = y_train.copy()
-    # convert class vectors to binary class matrices
     y_train = keras.utils.to_categorical(y_train, num_classes)
-    y_test = keras.utils.to_categorical(y_test, num_classes)
-    test_ids = np.where(int_y_test == np.array(weighted_class))[0]
 
     class_weight = dict([(i, 1.) for i in range(num_classes)])
-    class_weight[weighted_class] = 2.
+    class_weight[weighted_class] = 4.
 
     sample_weight = np.ones((y_train.shape[0]))
-    sample_weight[int_y_train == weighted_class] = 2.
+    sample_weight[int_y_train == weighted_class] = 4.
 
     model.fit(
         x_train,
         y_train,
         batch_size=batch_size,
-        epochs=epochs // 3,
+        epochs=2,
         verbose=0,
         sample_weight=sample_weight)
     model.fit(
         x_train,
         y_train,
         batch_size=batch_size,
-        epochs=epochs // 3,
+        epochs=2,
         verbose=0,
         sample_weight=sample_weight,
         validation_split=0.1)
@@ -539,209 +532,6 @@ class LossWeightingTest(test.TestCase):
       model.fit(x_np, [y_np, y_np], epochs=1, sample_weight={'1': bad_w_np})
 
 
-class TestDynamicTrainability(test.TestCase):
-
-  def test_trainable_warning(self):
-    x = np.random.random((5, 3))
-    y = np.random.random((5, 2))
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(2, input_dim=3))
-    model.trainable = False
-    model.compile(RMSPropOptimizer(learning_rate=0.001), 'mse')
-    model.trainable = True
-    with self.assertRaises(ValueError):
-      model.train_on_batch(x, y)
-
-  def test_trainable_argument(self):
-    x = np.random.random((5, 3))
-    y = np.random.random((5, 2))
-
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(2, input_dim=3, trainable=False))
-    model.compile(RMSPropOptimizer(learning_rate=0.001), 'mse')
-    out = model.predict(x)
-    with self.assertRaises(ValueError):
-      model.train_on_batch(x, y)
-    out_2 = model.predict(x)
-    self.assertAllClose(out, out_2)
-
-    # test with nesting
-    inputs = keras.layers.Input(shape=(3,))
-    output = model(inputs)
-    model = keras.models.Model(inputs, output)
-    model.compile(RMSPropOptimizer(learning_rate=0.001), 'mse')
-    out = model.predict(x)
-    with self.assertRaises(ValueError):
-      model.train_on_batch(x, y)
-    out_2 = model.predict(x)
-    self.assertAllClose(out, out_2)
-
-  def test_layer_trainability_switch(self):
-    # with constructor argument, in Sequential
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(2, trainable=False, input_dim=1))
-    self.assertListEqual(model.trainable_weights, [])
-
-    # by setting the `trainable` argument, in Sequential
-    model = keras.models.Sequential()
-    layer = keras.layers.Dense(2, input_dim=1)
-    model.add(layer)
-    self.assertListEqual(model.trainable_weights, layer.trainable_weights)
-    layer.trainable = False
-    self.assertListEqual(model.trainable_weights, [])
-
-    # with constructor argument, in Model
-    x = keras.layers.Input(shape=(1,))
-    y = keras.layers.Dense(2, trainable=False)(x)
-    model = keras.models.Model(x, y)
-    self.assertListEqual(model.trainable_weights, [])
-
-    # by setting the `trainable` argument, in Model
-    x = keras.layers.Input(shape=(1,))
-    layer = keras.layers.Dense(2)
-    y = layer(x)
-    model = keras.models.Model(x, y)
-    self.assertListEqual(model.trainable_weights, layer.trainable_weights)
-    layer.trainable = False
-    self.assertListEqual(model.trainable_weights, [])
-
-  def test_model_trainability_switch(self):
-    # a non-trainable model has no trainable weights
-    x = keras.layers.Input(shape=(1,))
-    y = keras.layers.Dense(2)(x)
-    model = keras.models.Model(x, y)
-    model.trainable = False
-    self.assertListEqual(model.trainable_weights, [])
-
-    # same for Sequential
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(2, input_dim=1))
-    model.trainable = False
-    self.assertListEqual(model.trainable_weights, [])
-
-  def test_nested_model_trainability(self):
-
-    # a Sequential inside a Model
-    inner_model = keras.models.Sequential()
-    inner_model.add(keras.layers.Dense(2, input_dim=1))
-
-    x = keras.layers.Input(shape=(1,))
-    y = inner_model(x)
-    outer_model = keras.models.Model(x, y)
-    self.assertListEqual(outer_model.trainable_weights,
-                         inner_model.trainable_weights)
-    inner_model.trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-    inner_model.trainable = True
-    inner_model.layers[-1].trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-
-    # a Sequential inside a Sequential
-    inner_model = keras.models.Sequential()
-    inner_model.add(keras.layers.Dense(2, input_dim=1))
-    outer_model = keras.models.Sequential()
-    outer_model.add(inner_model)
-    self.assertListEqual(outer_model.trainable_weights,
-                         inner_model.trainable_weights)
-    inner_model.trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-    inner_model.trainable = True
-    inner_model.layers[-1].trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-
-    # a Model inside a Model
-    x = keras.layers.Input(shape=(1,))
-    y = keras.layers.Dense(2)(x)
-    inner_model = keras.models.Model(x, y)
-    x = keras.layers.Input(shape=(1,))
-    y = inner_model(x)
-    outer_model = keras.models.Model(x, y)
-    self.assertListEqual(outer_model.trainable_weights,
-                         inner_model.trainable_weights)
-    inner_model.trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-    inner_model.trainable = True
-    inner_model.layers[-1].trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-
-    # a Model inside a Sequential
-    x = keras.layers.Input(shape=(1,))
-    y = keras.layers.Dense(2)(x)
-    inner_model = keras.models.Model(x, y)
-    outer_model = keras.models.Sequential()
-    outer_model.add(inner_model)
-    self.assertListEqual(outer_model.trainable_weights,
-                         inner_model.trainable_weights)
-    inner_model.trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-    inner_model.trainable = True
-    inner_model.layers[-1].trainable = False
-    self.assertListEqual(outer_model.trainable_weights, [])
-
-
-class TestTrainingUtils(test.TestCase):
-
-  def test_check_array_lengths(self):
-    keras.engine.training._check_array_lengths(None, None, None)
-    a_np = np.random.random((4, 3, 3))
-    keras.engine.training._check_array_lengths(a_np, a_np, a_np)
-    keras.engine.training._check_array_lengths(
-        [a_np, a_np], [a_np, a_np], [a_np, a_np])
-    keras.engine.training._check_array_lengths([None], [None], [None])
-
-    b_np = np.random.random((3, 4))
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths(a_np, None, None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths(a_np, a_np, None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], [None], None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], [b_np], None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], None, [b_np])
-
-  def test_slice_arrays(self):
-    input_a = np.random.random((10, 3))
-    slice_arrays(None)
-    slice_arrays(input_a, 0)
-    slice_arrays(input_a, 0, 1)
-    slice_arrays(input_a, stop=2)
-    input_a = [None, [1, 1], None, [1, 1]]
-    slice_arrays(input_a, 0)
-    slice_arrays(input_a, 0, 1)
-    slice_arrays(input_a, stop=2)
-    input_a = [None]
-    slice_arrays(input_a, 0)
-    slice_arrays(input_a, 0, 1)
-    slice_arrays(input_a, stop=2)
-    input_a = None
-    slice_arrays(input_a, 0)
-    slice_arrays(input_a, 0, 1)
-    slice_arrays(input_a, stop=2)
-
-  def test_fit_with_BatchNorm(self):
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(10, input_dim=4))
-    model.add(keras.layers.BatchNormalization())
-    model.add(keras.layers.Activation('tanh'))
-    model.add(keras.layers.Dropout(0.2))
-
-    input_a_np = np.random.random((10, 4))
-    output_b_np = np.random.random((10, 10))
-
-    model.compile(loss='binary_crossentropy', optimizer=RMSPropOptimizer(0.001))
-    model.fit(input_a_np, output_b_np, epochs=1, batch_size=5, verbose=0)
-
-  def test_fit_with_regularization(self):
-    model = keras.models.Sequential()
-    with self.assertRaises(ValueError):
-      model.add(
-          keras.layers.Dense(4, input_dim=3,
-                             kernel_regularizer=keras.regularizers.l2(0.01),
-                             activity_regularizer=keras.regularizers.l1(0.01)))
-
-
 if __name__ == '__main__':
   # Bazel sets these environment variables to very long paths.
   # Tempfile uses them to create long paths, and in turn multiprocessing
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_generator.py b/tensorflow/python/keras/_impl/keras/engine/training_generator.py
new file mode 100644
index 0000000000000000000000000000000000000000..4af62c85d573f0985bb1428c76ff4bc413dd9253
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/training_generator.py
@@ -0,0 +1,439 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Part of the Keras training engine related to Python generators of array data.
+"""
+# pylint: disable=protected-access
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras import callbacks as cbks
+from tensorflow.python.keras._impl.keras.utils.data_utils import GeneratorEnqueuer
+from tensorflow.python.keras._impl.keras.utils.data_utils import OrderedEnqueuer
+from tensorflow.python.keras._impl.keras.utils.data_utils import Sequence
+from tensorflow.python.keras._impl.keras.utils.generic_utils import Progbar
+from tensorflow.python.platform import tf_logging as logging
+
+
+def fit_generator(model,
+                  generator,
+                  steps_per_epoch=None,
+                  epochs=1,
+                  verbose=1,
+                  callbacks=None,
+                  validation_data=None,
+                  validation_steps=None,
+                  class_weight=None,
+                  max_queue_size=10,
+                  workers=1,
+                  use_multiprocessing=False,
+                  shuffle=True,
+                  initial_epoch=0):
+  """See docstring for `Model.fit_generator`."""
+  wait_time = 0.01  # in seconds
+  epoch = initial_epoch
+
+  do_validation = bool(validation_data)
+  model._make_train_function()
+  if do_validation:
+    model._make_test_function()
+
+  is_sequence = isinstance(generator, Sequence)
+  if not is_sequence and use_multiprocessing and workers > 1:
+    logging.warning(
+        UserWarning('Using a generator with `use_multiprocessing=True`'
+                    ' and multiple workers may duplicate your data.'
+                    ' Please consider using the `keras.utils.Sequence`'
+                    ' class.'))
+  if steps_per_epoch is None:
+    if is_sequence:
+      steps_per_epoch = len(generator)
+    else:
+      raise ValueError('`steps_per_epoch=None` is only valid for a'
+                       ' generator based on the `keras.utils.Sequence`'
+                       ' class. Please specify `steps_per_epoch` or use'
+                       ' the `keras.utils.Sequence` class.')
+
+  # python 2 has 'next', 3 has '__next__'
+  # avoid any explicit version checks
+  val_gen = (
+      hasattr(validation_data, 'next') or
+      hasattr(validation_data, '__next__') or
+      isinstance(validation_data, Sequence))
+  if (val_gen and not isinstance(validation_data, Sequence) and
+      not validation_steps):
+    raise ValueError('`validation_steps=None` is only valid for a'
+                     ' generator based on the `keras.utils.Sequence`'
+                     ' class. Please specify `validation_steps` or use'
+                     ' the `keras.utils.Sequence` class.')
+
+  # Prepare display labels.
+  out_labels = model.metrics_names
+  callback_metrics = out_labels + ['val_%s' % n for n in out_labels]
+
+  # prepare callbacks
+  model.history = cbks.History()
+  callbacks = [cbks.BaseLogger()] + (callbacks or []) + [model.history]
+  if verbose:
+    callbacks += [cbks.ProgbarLogger(count_mode='steps')]
+  callbacks = cbks.CallbackList(callbacks)
+
+  # it's possible to callback a different model than self:
+  if hasattr(model, 'callback_model') and model.callback_model:
+    callback_model = model.callback_model
+  else:
+    callback_model = model
+  callbacks.set_model(callback_model)
+  callbacks.set_params({
+      'epochs': epochs,
+      'steps': steps_per_epoch,
+      'verbose': verbose,
+      'do_validation': do_validation,
+      'metrics': callback_metrics,
+  })
+  callbacks.on_train_begin()
+
+  enqueuer = None
+  val_enqueuer = None
+
+  try:
+    if do_validation:
+      if val_gen:
+        if workers > 0:
+          if isinstance(validation_data, Sequence):
+            val_enqueuer = OrderedEnqueuer(
+                validation_data, use_multiprocessing=use_multiprocessing)
+            if validation_steps is None:
+              validation_steps = len(validation_data)
+          else:
+            val_enqueuer = GeneratorEnqueuer(
+                validation_data,
+                use_multiprocessing=use_multiprocessing,
+                wait_time=wait_time)
+          val_enqueuer.start(workers=workers, max_queue_size=max_queue_size)
+          validation_generator = val_enqueuer.get()
+        else:
+          validation_generator = validation_data
+      else:
+        if len(validation_data) == 2:
+          val_x, val_y = validation_data  # pylint: disable=unpacking-non-sequence
+          val_sample_weight = None
+        elif len(validation_data) == 3:
+          val_x, val_y, val_sample_weight = validation_data  # pylint: disable=unpacking-non-sequence
+        else:
+          raise ValueError(
+              '`validation_data` should be a tuple '
+              '`(val_x, val_y, val_sample_weight)` '
+              'or `(val_x, val_y)`. Found: ' + str(validation_data))
+        val_x, val_y, val_sample_weights = model._standardize_user_data(
+            val_x, val_y, val_sample_weight)
+        val_data = val_x + val_y + val_sample_weights
+        if model.uses_learning_phase and not isinstance(
+            K.learning_phase(), int):
+          val_data += [0]
+        for cbk in callbacks:
+          cbk.validation_data = val_data
+
+    if workers > 0:
+      if is_sequence:
+        enqueuer = OrderedEnqueuer(
+            generator,
+            use_multiprocessing=use_multiprocessing,
+            shuffle=shuffle)
+      else:
+        enqueuer = GeneratorEnqueuer(
+            generator,
+            use_multiprocessing=use_multiprocessing,
+            wait_time=wait_time)
+      enqueuer.start(workers=workers, max_queue_size=max_queue_size)
+      output_generator = enqueuer.get()
+    else:
+      output_generator = generator
+
+    callback_model.stop_training = False
+    # Construct epoch logs.
+    epoch_logs = {}
+    while epoch < epochs:
+      callbacks.on_epoch_begin(epoch)
+      steps_done = 0
+      batch_index = 0
+      while steps_done < steps_per_epoch:
+        generator_output = next(output_generator)
+
+        if not hasattr(generator_output, '__len__'):
+          raise ValueError('Output of generator should be '
+                           'a tuple `(x, y, sample_weight)` '
+                           'or `(x, y)`. Found: ' + str(generator_output))
+
+        if len(generator_output) == 2:
+          x, y = generator_output
+          sample_weight = None
+        elif len(generator_output) == 3:
+          x, y, sample_weight = generator_output
+        else:
+          raise ValueError('Output of generator should be '
+                           'a tuple `(x, y, sample_weight)` '
+                           'or `(x, y)`. Found: ' + str(generator_output))
+        # build batch logs
+        batch_logs = {}
+        if isinstance(x, list):
+          batch_size = x[0].shape[0]
+        elif isinstance(x, dict):
+          batch_size = list(x.values())[0].shape[0]
+        else:
+          batch_size = x.shape[0]
+        batch_logs['batch'] = batch_index
+        batch_logs['size'] = batch_size
+        callbacks.on_batch_begin(batch_index, batch_logs)
+
+        outs = model.train_on_batch(
+            x, y, sample_weight=sample_weight, class_weight=class_weight)
+
+        if not isinstance(outs, list):
+          outs = [outs]
+        for l, o in zip(out_labels, outs):
+          batch_logs[l] = o
+
+        callbacks.on_batch_end(batch_index, batch_logs)
+
+        batch_index += 1
+        steps_done += 1
+
+        # Epoch finished.
+        if steps_done >= steps_per_epoch and do_validation:
+          if val_gen:
+            val_outs = evaluate_generator(
+                model, validation_generator, validation_steps, workers=0)
+          else:
+            # No need for try/except because
+            # data has already been validated.
+            val_outs = model.evaluate(
+                val_x,
+                val_y,
+                batch_size=batch_size,
+                sample_weight=val_sample_weights,
+                verbose=0)
+          if not isinstance(val_outs, list):
+            val_outs = [val_outs]
+          # Same labels assumed.
+          for l, o in zip(out_labels, val_outs):
+            epoch_logs['val_' + l] = o
+
+        if callback_model.stop_training:
+          break
+
+      callbacks.on_epoch_end(epoch, epoch_logs)
+      epoch += 1
+      if callback_model.stop_training:
+        break
+
+  finally:
+    try:
+      if enqueuer is not None:
+        enqueuer.stop()
+    finally:
+      if val_enqueuer is not None:
+        val_enqueuer.stop()
+
+  callbacks.on_train_end()
+  return model.history
+
+
+def evaluate_generator(model,
+                       generator,
+                       steps=None,
+                       max_queue_size=10,
+                       workers=1,
+                       use_multiprocessing=False):
+  """See docstring for `Model.evaluate_generator`."""
+  model._make_test_function()
+
+  steps_done = 0
+  wait_time = 0.01
+  all_outs = []
+  batch_sizes = []
+  is_sequence = isinstance(generator, Sequence)
+  if not is_sequence and use_multiprocessing and workers > 1:
+    logging.warning(
+        UserWarning('Using a generator with `use_multiprocessing=True`'
+                    ' and multiple workers may duplicate your data.'
+                    ' Please consider using the `keras.utils.Sequence`'
+                    ' class.'))
+  if steps is None:
+    if is_sequence:
+      steps = len(generator)
+    else:
+      raise ValueError('`steps=None` is only valid for a generator'
+                       ' based on the `keras.utils.Sequence` class.'
+                       ' Please specify `steps` or use the'
+                       ' `keras.utils.Sequence` class.')
+  enqueuer = None
+
+  try:
+    if workers > 0:
+      if is_sequence:
+        enqueuer = OrderedEnqueuer(
+            generator, use_multiprocessing=use_multiprocessing)
+      else:
+        enqueuer = GeneratorEnqueuer(
+            generator,
+            use_multiprocessing=use_multiprocessing,
+            wait_time=wait_time)
+      enqueuer.start(workers=workers, max_queue_size=max_queue_size)
+      output_generator = enqueuer.get()
+    else:
+      output_generator = generator
+
+    while steps_done < steps:
+      generator_output = next(output_generator)
+      if not hasattr(generator_output, '__len__'):
+        raise ValueError('Output of generator should be a tuple '
+                         '(x, y, sample_weight) '
+                         'or (x, y). Found: ' + str(generator_output))
+      if len(generator_output) == 2:
+        x, y = generator_output
+        sample_weight = None
+      elif len(generator_output) == 3:
+        x, y, sample_weight = generator_output
+      else:
+        raise ValueError('Output of generator should be a tuple '
+                         '(x, y, sample_weight) '
+                         'or (x, y). Found: ' + str(generator_output))
+      outs = model.test_on_batch(x, y, sample_weight=sample_weight)
+
+      if isinstance(x, list):
+        batch_size = x[0].shape[0]
+      elif isinstance(x, dict):
+        batch_size = list(x.values())[0].shape[0]
+      else:
+        batch_size = x.shape[0]
+      if batch_size == 0:
+        raise ValueError('Received an empty batch. '
+                         'Batches should at least contain one item.')
+      all_outs.append(outs)
+
+      steps_done += 1
+      batch_sizes.append(batch_size)
+
+  finally:
+    if enqueuer is not None:
+      enqueuer.stop()
+
+  if not isinstance(outs, list):
+    return np.average(np.asarray(all_outs), weights=batch_sizes)
+  else:
+    averages = []
+    for i in range(len(outs)):
+      averages.append(
+          np.average([out[i] for out in all_outs], weights=batch_sizes))
+    return averages
+
+
+def predict_generator(model,
+                      generator,
+                      steps=None,
+                      max_queue_size=10,
+                      workers=1,
+                      use_multiprocessing=False,
+                      verbose=0):
+  """See docstring for `Model.predict_generator`."""
+  model._make_predict_function()
+
+  steps_done = 0
+  wait_time = 0.01
+  all_outs = []
+  is_sequence = isinstance(generator, Sequence)
+  if not is_sequence and use_multiprocessing and workers > 1:
+    logging.warning(
+        UserWarning('Using a generator with `use_multiprocessing=True`'
+                    ' and multiple workers may duplicate your data.'
+                    ' Please consider using the `keras.utils.Sequence`'
+                    ' class.'))
+  if steps is None:
+    if is_sequence:
+      steps = len(generator)
+    else:
+      raise ValueError('`steps=None` is only valid for a generator'
+                       ' based on the `keras.utils.Sequence` class.'
+                       ' Please specify `steps` or use the'
+                       ' `keras.utils.Sequence` class.')
+  enqueuer = None
+
+  try:
+    if workers > 0:
+      if is_sequence:
+        enqueuer = OrderedEnqueuer(
+            generator, use_multiprocessing=use_multiprocessing)
+      else:
+        enqueuer = GeneratorEnqueuer(
+            generator,
+            use_multiprocessing=use_multiprocessing,
+            wait_time=wait_time)
+      enqueuer.start(workers=workers, max_queue_size=max_queue_size)
+      output_generator = enqueuer.get()
+    else:
+      output_generator = generator
+
+    if verbose == 1:
+      progbar = Progbar(target=steps)
+
+    while steps_done < steps:
+      generator_output = next(output_generator)
+      if isinstance(generator_output, tuple):
+        # Compatibility with the generators
+        # used for training.
+        if len(generator_output) == 2:
+          x, _ = generator_output
+        elif len(generator_output) == 3:
+          x, _, _ = generator_output
+        else:
+          raise ValueError('Output of generator should be '
+                           'a tuple `(x, y, sample_weight)` '
+                           'or `(x, y)`. Found: ' + str(generator_output))
+      else:
+        # Assumes a generator that only
+        # yields inputs (not targets and sample weights).
+        x = generator_output
+
+      outs = model.predict_on_batch(x)
+      if not isinstance(outs, list):
+        outs = [outs]
+
+      if not all_outs:
+        for out in outs:
+          all_outs.append([])
+
+      for i, out in enumerate(outs):
+        all_outs[i].append(out)
+      steps_done += 1
+      if verbose == 1:
+        progbar.update(steps_done)
+
+  finally:
+    if enqueuer is not None:
+      enqueuer.stop()
+
+  if len(all_outs) == 1:
+    if steps_done == 1:
+      return all_outs[0][0]
+    else:
+      return np.concatenate(all_outs[0])
+  if steps_done == 1:
+    return [out[0] for out in all_outs]
+  else:
+    return [np.concatenate(out) for out in all_outs]
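
Note on the return value of `evaluate_generator` above: per-batch outputs are combined with `np.average(..., weights=batch_sizes)`, so a smaller final batch contributes in proportion to its sample count rather than counting as a full batch. A minimal sketch of that arithmetic, using only NumPy; the helper name `weighted_batch_average` and the numbers are hypothetical and are not part of this change:

```python
import numpy as np


def weighted_batch_average(batch_values, batch_sizes):
    """Average per-batch scalars, weighting each batch by its sample count."""
    return float(np.average(np.asarray(batch_values), weights=batch_sizes))


# Hypothetical per-batch losses; the last batch is smaller than the others.
batch_losses = [0.5, 1.0, 2.0]
batch_sizes = [32, 32, 16]

weighted = weighted_batch_average(batch_losses, batch_sizes)
unweighted = float(np.mean(batch_losses))
print(weighted, unweighted)
```

With these numbers the weighted result is (0.5*32 + 1.0*32 + 2.0*16) / 80, while the plain mean treats the 16-sample batch like a full one and comes out higher.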
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_test.py b/tensorflow/python/keras/_impl/keras/engine/training_test.py
index 9651eb9f14f1275dc79c8d3b1fb54690772086a1..38ba0f0eaea69486d279c18451ed29dbd8617ec7 100644
--- a/tensorflow/python/keras/_impl/keras/engine/training_test.py
+++ b/tensorflow/python/keras/_impl/keras/engine/training_test.py
@@ -25,7 +25,7 @@ import numpy as np
 
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
-from tensorflow.python.keras._impl.keras.engine.training import _weighted_masked_objective
+from tensorflow.python.keras._impl.keras.engine.training_utils import weighted_masked_objective
 from tensorflow.python.keras._impl.keras.utils.generic_utils import slice_arrays
 from tensorflow.python.platform import test
 
@@ -705,7 +705,7 @@ class LossMaskingTest(test.TestCase):
 
   def test_loss_masking(self):
     with self.test_session():
-      weighted_loss = _weighted_masked_objective(keras.losses.get('mae'))
+      weighted_loss = weighted_masked_objective(keras.losses.get('mae'))
       shape = (3, 4, 2)
       x = np.arange(24).reshape(shape)
       y = 2 * x
@@ -1037,24 +1037,16 @@ class TestGeneratorMethods(test.TestCase):
 class TestTrainingUtils(test.TestCase):
 
   def test_check_array_lengths(self):
-    keras.engine.training._check_array_lengths(None, None, None)
+    keras.engine.training_utils.check_array_lengths(None, None, None)
     a_np = np.random.random((4, 3, 3))
-    keras.engine.training._check_array_lengths(a_np, a_np, a_np)
-    keras.engine.training._check_array_lengths(
+    keras.engine.training_utils.check_array_lengths(a_np, a_np, a_np)
+    keras.engine.training_utils.check_array_lengths(
         [a_np, a_np], [a_np, a_np], [a_np, a_np])
-    keras.engine.training._check_array_lengths([None], [None], [None])
+    keras.engine.training_utils.check_array_lengths([None], [None], [None])
 
     b_np = np.random.random((3, 4))
     with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths(a_np, None, None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths(a_np, a_np, None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], [None], None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], [b_np], None)
-    with self.assertRaises(ValueError):
-      keras.engine.training._check_array_lengths([a_np], None, [b_np])
+      keras.engine.training_utils.check_array_lengths([a_np], [b_np], None)
 
   def test_slice_arrays(self):
     input_a = np.random.random((10, 3))
diff --git a/tensorflow/python/keras/_impl/keras/engine/training_utils.py b/tensorflow/python/keras/_impl/keras/engine/training_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..105638ce1087e8668b49b6653a847667e8f9157d
--- /dev/null
+++ b/tensorflow/python/keras/_impl/keras/engine/training_utils.py
@@ -0,0 +1,534 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Training-related utilities.
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import copy
+
+import numpy as np
+
+from tensorflow.python.framework import tensor_util
+from tensorflow.python.keras._impl.keras import backend as K
+from tensorflow.python.keras._impl.keras import losses
+
+
+def check_num_samples(ins,
+                      batch_size=None,
+                      steps=None,
+                      steps_name='steps'):
+  """Determine the number of samples provided for training and evaluation.
+
+  The number of samples is not defined when running with `steps`,
+  in which case the number of samples is set to `None`.
+
+  Arguments:
+      ins: List of tensors to be fed to the Keras function.
+      batch_size: Integer batch size or `None` if not defined.
+      steps: Total number of steps (batches of samples)
+          before declaring `_predict_loop` finished.
+          Ignored with the default value of `None`.
+      steps_name: The public API's parameter name for `steps`.
+
+  Raises:
+      ValueError: when `steps` is `None` and the attribute `ins.shape`
+      does not exist. Also raises ValueError when `steps` is not `None`
+      and `batch_size` is not `None` because they are mutually
+      exclusive.
+
+  Returns:
+      When steps is `None`, returns the number of samples to be
+      processed based on the size of the first dimension of the
+      first input numpy array. When steps is not `None` and
+      `batch_size` is `None`, returns `None`.
+
+  Raises:
+      ValueError: In case of invalid arguments.
+  """
+  if steps is not None:
+    num_samples = None
+    if batch_size is not None:
+      raise ValueError(
+          'If ' + steps_name + ' is set, the `batch_size` must be None.')
+  elif ins and hasattr(ins[0], 'shape'):
+    num_samples = ins[0].shape[0]
+  else:
+    raise ValueError(
+        'Either the input data should have '
+        'a defined shape, or ' + steps_name + ' should be specified.')
+  return num_samples
+
+
+def standardize_input_data(data,
+                           names,
+                           shapes=None,
+                           check_batch_axis=True,
+                           exception_prefix=''):
+  """Normalizes inputs and targets provided by users.
+
+  Users may pass data as a list of arrays, dictionary of arrays,
+  or as a single array. We normalize this to an ordered list of
+  arrays (same order as `names`), while checking that the provided
+  arrays have shapes that match the network's expectations.
+
+  Arguments:
+      data: User-provided input data (polymorphic).
+      names: List of expected array names.
+      shapes: Optional list of expected array shapes.
+      check_batch_axis: Boolean; whether to check that
+          the batch axis of the arrays matches the expected
+          value found in `shapes`.
+      exception_prefix: String prefix used for exception formatting.
+
+  Returns:
+      List of standardized input arrays (one array per model input).
+
+  Raises:
+      ValueError: in case of improperly formatted user-provided data.
+  """
+  if not names:
+    if data is not None and hasattr(data, '__len__') and len(data):
+      raise ValueError('Error when checking model ' + exception_prefix + ': '
+                       'expected no data, but got: ' + str(data))
+    return []
+  if data is None:
+    return [None for _ in range(len(names))]
+
+  if isinstance(data, dict):
+    try:
+      data = [
+          data[x].values
+          if data[x].__class__.__name__ == 'DataFrame' else data[x]
+          for x in names
+      ]
+    except KeyError as e:
+      raise ValueError('No data provided for "' + e.args[0] + '". Need data '
+                       'for each key in: ' + str(names))
+  elif isinstance(data, list):
+    if isinstance(data[0], list):
+      data = [np.asarray(d) for d in data]
+    elif len(names) == 1 and isinstance(data[0], (float, int)):
+      data = [np.asarray(data)]
+    else:
+      data = [
+          x.values if x.__class__.__name__ == 'DataFrame' else x for x in data
+      ]
+  else:
+    data = data.values if data.__class__.__name__ == 'DataFrame' else data
+    data = [data]
+  data = [
+      np.expand_dims(x, 1) if x is not None and x.ndim == 1 else x for x in data
+  ]
+
+  if len(data) != len(names):
+    if data and hasattr(data[0], 'shape'):
+      raise ValueError('Error when checking model ' + exception_prefix +
+                       ': the list of Numpy arrays that you are passing to '
+                       'your model is not the size the model expected. '
+                       'Expected to see ' + str(len(names)) + ' array(s), '
+                       'but instead got the following list of ' +
+                       str(len(data)) + ' arrays: ' + str(data)[:200] + '...')
+    elif len(names) > 1:
+      raise ValueError(
+          'Error when checking model ' + exception_prefix +
+          ': you are passing a list as input to your model, '
+          'but the model expects a list of ' + str(len(names)) +
+          ' Numpy arrays instead. The list you passed was: ' + str(data)[:200])
+    elif len(data) == 1 and not hasattr(data[0], 'shape'):
+      raise TypeError('Error when checking model ' + exception_prefix +
+                      ': data should be a Numpy array, or list/dict of '
+                      'Numpy arrays. Found: ' + str(data)[:200] + '...')
+    elif len(names) == 1:
+      data = [np.asarray(data)]
+
+  # Check shapes compatibility.
+  if shapes:
+    for i in range(len(names)):
+      if shapes[i] is not None:
+        data_shape = data[i].shape
+        shape = shapes[i]
+        if data[i].ndim != len(shape):
+          raise ValueError('Error when checking ' + exception_prefix +
+                           ': expected ' + names[i] + ' to have ' +
+                           str(len(shape)) + ' dimensions, but got array '
+                           'with shape ' + str(data_shape))
+        if not check_batch_axis:
+          data_shape = data_shape[1:]
+          shape = shape[1:]
+        for dim, ref_dim in zip(data_shape, shape):
+          if ref_dim != dim and ref_dim:
+            raise ValueError(
+                'Error when checking ' + exception_prefix + ': expected ' +
+                names[i] + ' to have shape ' + str(shape) +
+                ' but got array with shape ' + str(data_shape))
+  return data
+
+
+def standardize_sample_or_class_weights(x_weight, output_names, weight_type):
+  """Maps `sample_weight` or `class_weight` to model outputs.
+
+  Arguments:
+      x_weight: User-provided `sample_weight` or `class_weight` argument.
+      output_names: List of output names (strings) in the model.
+      weight_type: A string used purely for exception printing.
+
+  Returns:
+      A list of `sample_weight` or `class_weight` where there is exactly
+          one element per model output.
+
+  Raises:
+      ValueError: In case of invalid user-provided argument.
+  """
+  if x_weight is None or len(x_weight) == 0:  # pylint: disable=g-explicit-length-test
+    return [None for _ in output_names]
+  if len(output_names) == 1:
+    if isinstance(x_weight, list) and len(x_weight) == 1:
+      return x_weight
+    if isinstance(x_weight, dict) and output_names[0] in x_weight:
+      return [x_weight[output_names[0]]]
+    else:
+      return [x_weight]
+  if isinstance(x_weight, list):
+    if len(x_weight) != len(output_names):
+      raise ValueError('Provided `' + weight_type + '` was a list of ' +
+                       str(len(x_weight)) + ' elements, but the model has ' +
+                       str(len(output_names)) + ' outputs. '
+                       'You should provide one `' + weight_type + '` '
+                       'array per model output.')
+    return x_weight
+  if isinstance(x_weight, dict):
+    x_weights = []
+    for name in output_names:
+      x_weights.append(x_weight.get(name))
+    return x_weights
+  else:
+    raise TypeError(
+        'The model has multiple outputs, so `' + weight_type + '` '
+        'should be either a list or a dict. '
+        'Provided `' + weight_type + '` type not understood: ' + str(x_weight))
+
+
+def standardize_class_weights(class_weight, output_names):
+  return standardize_sample_or_class_weights(class_weight, output_names,
+                                             'class_weight')
+
+
+def standardize_sample_weights(sample_weight, output_names):
+  return standardize_sample_or_class_weights(sample_weight, output_names,
+                                             'sample_weight')
+
+
+def check_array_lengths(inputs, targets, weights=None):
+  """Does user input validation for numpy arrays.
+
+  Arguments:
+      inputs: list of Numpy arrays of inputs.
+      targets: list of Numpy arrays of targets.
+      weights: list of Numpy arrays of sample weights.
+
+  Raises:
+      ValueError: in case of incorrectly formatted data.
+  """
+
+  def set_of_lengths(x):
+    # Return the set of distinct first-dimension lengths found in `x`,
+    # ignoring `None` entries; a `None` input yields an empty set.
+    if x is None:
+      return set()
+    else:
+      return set([y.shape[0] for y in x if y is not None])
+
+  set_x = set_of_lengths(inputs)
+  set_y = set_of_lengths(targets)
+  set_w = set_of_lengths(weights)
+  if len(set_x) > 1:
+    raise ValueError('All input arrays (x) should have '
+                     'the same number of samples. Got array shapes: ' +
+                     str([x.shape for x in inputs]))
+  if len(set_y) > 1:
+    raise ValueError('All target arrays (y) should have '
+                     'the same number of samples. Got array shapes: ' +
+                     str([y.shape for y in targets]))
+  if set_x and set_y and list(set_x)[0] != list(set_y)[0]:
+    raise ValueError('Input arrays should have '
+                     'the same number of samples as target arrays. '
+                     'Found ' + str(list(set_x)[0]) + ' input samples '
+                     'and ' + str(list(set_y)[0]) + ' target samples.')
+  if len(set_w) > 1:
+    raise ValueError('All sample_weight arrays should have '
+                     'the same number of samples. Got array shapes: ' +
+                     str([w.shape for w in weights]))
+  if set_y and set_w and list(set_y)[0] != list(set_w)[0]:
+    raise ValueError('Sample_weight arrays should have '
+                     'the same number of samples as target arrays. Got ' +
+                     str(list(set_y)[0]) + ' target samples and ' +
+                     str(list(set_w)[0]) + ' sample_weight samples.')
+
+
+def check_loss_and_target_compatibility(targets, loss_fns, output_shapes):
+  """Does validation on the compatibility of targets and loss functions.
+
+  This helps prevent users from using loss functions incorrectly. This check
+  is purely for UX purposes.
+
+  Arguments:
+      targets: list of Numpy arrays of targets.
+      loss_fns: list of loss functions.
+      output_shapes: list of shapes of model outputs.
+
+  Raises:
+      ValueError: if a loss function or target array
+          is incompatible with an output.
+  """
+  key_losses = {
+      losses.mean_squared_error, losses.binary_crossentropy,
+      losses.categorical_crossentropy
+  }
+  for y, loss, shape in zip(targets, loss_fns, output_shapes):
+    if y is None or loss is None or tensor_util.is_tensor(y):
+      continue
+    if loss is losses.categorical_crossentropy:
+      if y.shape[-1] == 1:
+        raise ValueError('You are passing a target array of shape ' + str(
+            y.shape) + ' while using as loss `categorical_crossentropy`. '
+                         '`categorical_crossentropy` expects '
+                         'targets to be binary matrices (1s and 0s) '
+                         'of shape (samples, classes). '
+                         'If your targets are integer classes, '
+                         'you can convert them to the expected format via:\n'
+                         '```\n'
+                         'from keras.utils import to_categorical\n'
+                         'y_binary = to_categorical(y_int)\n'
+                         '```\n'
+                         '\n'
+                         'Alternatively, you can use the loss function '
+                         '`sparse_categorical_crossentropy` instead, '
+                         'which does expect integer targets.')
+    if loss in key_losses:
+      for target_dim, out_dim in zip(y.shape[1:], shape[1:]):
+        if out_dim is not None and target_dim != out_dim:
+          raise ValueError('A target array with shape ' + str(y.shape) +
+                           ' was passed for an output of shape ' + str(shape) +
+                           ' while using as loss `' + loss.__name__ + '`. '
+                           'This loss expects '
+                           'targets to have the same shape '
+                           'as the output.')
+
+
+def collect_metrics(metrics, output_names):
+  """Maps metric functions to model outputs.
+
+  Arguments:
+      metrics: a list or dict of metric functions.
+      output_names: a list of the names (strings) of model outputs.
+
+  Returns:
+      A list (one entry per model output) of lists of metric functions.
+      For instance, if the model has 2 outputs, and for the first output
+      we want to compute "binary_accuracy" and "binary_crossentropy",
+      and just "binary_accuracy" for the second output,
+      the list would look like:
+          `[[binary_accuracy, binary_crossentropy], [binary_accuracy]]`
+
+  Raises:
+      TypeError: if an incorrect type is passed for the `metrics` argument.
+  """
+  if not metrics:
+    return [[] for _ in output_names]
+  if isinstance(metrics, list):
+    # we then apply all metrics to all outputs.
+    return [copy.copy(metrics) for _ in output_names]
+  elif isinstance(metrics, dict):
+    nested_metrics = []
+    for name in output_names:
+      output_metrics = metrics.get(name, [])
+      if not isinstance(output_metrics, list):
+        output_metrics = [output_metrics]
+      nested_metrics.append(output_metrics)
+    return nested_metrics
+  else:
+    raise TypeError('Type of `metrics` argument not understood. '
+                    'Expected a list or dictionary, found: ' + str(metrics))
+
+
+def batch_shuffle(index_array, batch_size):
+  """Shuffles an array in a batch-wise fashion.
+
+  Useful for shuffling HDF5 arrays
+  (where one cannot access arbitrary indices).
+
+  Arguments:
+      index_array: array of indices to be shuffled.
+      batch_size: integer.
+
+  Returns:
+      The `index_array` array, shuffled in a batch-wise fashion.
+  """
+  batch_count = int(len(index_array) / batch_size)
+  # to reshape we need to be cleanly divisible by batch size
+  # we stash extra items and reappend them after shuffling
+  last_batch = index_array[batch_count * batch_size:]
+  index_array = index_array[:batch_count * batch_size]
+  index_array = index_array.reshape((batch_count, batch_size))
+  np.random.shuffle(index_array)
+  index_array = index_array.flatten()
+  return np.append(index_array, last_batch)
+
+
+def weighted_masked_objective(fn):
+  """Adds support for masking and sample-weighting to an objective function.
+
+  It transforms an objective function `fn(y_true, y_pred)`
+  into a sample-weighted, cost-masked objective function
+  `fn(y_true, y_pred, weights, mask)`.
+
+  Arguments:
+      fn: The objective function to wrap,
+          with signature `fn(y_true, y_pred)`.
+
+  Returns:
+      A function with signature `fn(y_true, y_pred, weights, mask)`.
+  """
+  if fn is None:
+    return None
+
+  def weighted(y_true, y_pred, weights, mask=None):
+    """Wrapper function.
+
+    Arguments:
+        y_true: `y_true` argument of `fn`.
+        y_pred: `y_pred` argument of `fn`.
+        weights: Weights tensor.
+        mask: Mask tensor.
+
+    Returns:
+        Scalar tensor.
+    """
+    # score_array has ndim >= 2
+    score_array = fn(y_true, y_pred)
+    if mask is not None:
+      # Cast the mask to floatX to avoid float64 upcasting in theano
+      mask = K.cast(mask, K.floatx())
+      # mask should have the same shape as score_array
+      score_array *= mask
+      # The loss per batch should be proportional
+      # to the number of unmasked samples.
+      score_array /= K.mean(mask)
+
+    # apply sample weighting
+    if weights is not None:
+      # reduce score_array to same ndim as weight array
+      ndim = K.ndim(score_array)
+      weight_ndim = K.ndim(weights)
+      score_array = K.mean(score_array, axis=list(range(weight_ndim, ndim)))
+      score_array *= weights
+      score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
+    return K.mean(score_array)
+
+  return weighted
+
+
+def standardize_weights(y,
+                        sample_weight=None,
+                        class_weight=None,
+                        sample_weight_mode=None):
+  """Performs sample weight validation and standardization.
+
+  Everything gets normalized to a single sample-wise (or timestep-wise)
+  weight array.
+
+  Arguments:
+      y: Numpy array of model targets to be weighted.
+      sample_weight: User-provided `sample_weight` argument.
+      class_weight: User-provided `class_weight` argument.
+      sample_weight_mode: One of `None` or `"temporal"`.
+          `"temporal"` indicates that we expect 2D weight data
+          that will be applied to the last 2 dimensions of
+          the targets (i.e. we are weighting timesteps, not samples).
+
+  Returns:
+      A numpy array of target weights, one entry per sample to weight.
+
+  Raises:
+      ValueError: In case of invalid user-provided arguments.
+  """
+  if sample_weight_mode is not None:
+    if sample_weight_mode != 'temporal':
+      raise ValueError('"sample_weight_mode" '
+                       'should be None or "temporal". '
+                       'Found: ' + str(sample_weight_mode))
+    if len(y.shape) < 3:
+      raise ValueError('Found a sample_weight array for '
+                       'an input with shape ' + str(y.shape) + '. '
+                       'Timestep-wise sample weighting (use of '
+                       'sample_weight_mode="temporal") is restricted to '
+                       'outputs that are at least 3D, i.e. that have '
+                       'a time dimension.')
+    if sample_weight is not None and len(sample_weight.shape) != 2:
+      raise ValueError('Found a sample_weight array with shape ' +
+                       str(sample_weight.shape) + '. '
+                       'In order to use timestep-wise sample weighting, '
+                       'you should pass a 2D sample_weight array.')
+  else:
+    if sample_weight is not None and len(sample_weight.shape) != 1:
+      raise ValueError('Found a sample_weight array with shape ' +
+                       str(sample_weight.shape) + '. '
+                       'In order to use timestep-wise sample weights, '
+                       'you should specify '
+                       'sample_weight_mode="temporal" '
+                       'in compile(). If you just mean to use '
+                       'sample-wise weights, make sure your '
+                       'sample_weight array is 1D.')
+
+  if sample_weight is not None:
+    if len(sample_weight.shape) > len(y.shape):
+      raise ValueError(
+          'Found a sample_weight with shape ' + str(sample_weight.shape) + '. '
+          'Expected sample_weight with rank '
+          'less than or equal to ' + str(len(y.shape)))
+
+    if y.shape[:sample_weight.ndim] != sample_weight.shape:
+      raise ValueError(
+          'Found a sample_weight array with shape ' + str(sample_weight.shape) +
+          ' for an input with shape ' + str(y.shape) + '. '
+          'sample_weight cannot be broadcast.')
+    return sample_weight
+  elif isinstance(class_weight, dict):
+    if len(y.shape) > 2:
+      raise ValueError('`class_weight` not supported for '
+                       '3+ dimensional targets.')
+    if y.shape[1] > 1:
+      y_classes = np.argmax(y, axis=1)
+    elif y.shape[1] == 1:
+      y_classes = np.reshape(y, y.shape[0])
+    else:
+      y_classes = y
+
+    weights = np.asarray(
+        [class_weight[cls] for cls in y_classes if cls in class_weight])
+
+    if len(weights) != len(y_classes):
+      # subtract the sets to pick all missing classes
+      existing_classes = set(y_classes)
+      existing_class_weight = set(class_weight.keys())
+      raise ValueError('`class_weight` must contain all classes in the data.'
+                       ' The classes %s exist in the data but not in '
+                       '`class_weight`.' %
+                       (existing_classes - existing_class_weight))
+    return weights
+  else:
+    return None
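
Note on the `class_weight` branch of `standardize_weights` above: each target row is reduced to a class index (argmax for one-hot targets, a reshape for a single column) and then looked up in the `class_weight` dict, with an error if any class in the data has no weight. The following is a simplified NumPy-only sketch of that mapping, not the function from this change; the name `class_weights_to_sample_weights` is hypothetical, and unlike the original it also accepts 1D label arrays:

```python
import numpy as np


def class_weights_to_sample_weights(y, class_weight):
    """Map a {class_index: weight} dict to one weight per sample in `y`."""
    y = np.asarray(y)
    if y.ndim > 2:
        raise ValueError('class weights are not supported for 3+ dimensional targets.')
    if y.ndim == 2 and y.shape[1] > 1:
        # One-hot (or probability) rows: the class is the largest entry.
        y_classes = np.argmax(y, axis=1)
    else:
        # A single column or a flat array of integer labels.
        y_classes = y.reshape(-1)
    labels = y_classes.tolist()
    missing = set(labels) - set(class_weight)
    if missing:
        raise ValueError('Classes %s exist in the data but not in class_weight.'
                         % sorted(missing))
    return np.asarray([class_weight[label] for label in labels])


# Hypothetical one-hot targets for a 3-class problem.
y_one_hot = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]])
sample_weights = class_weights_to_sample_weights(
    y_one_hot, {0: 1.0, 1: 2.0, 2: 0.5})
print(sample_weights)
```

The resulting array has one entry per sample, which is the shape the sample-wise weighting path expects.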
diff --git a/tensorflow/python/keras/_impl/keras/estimator.py b/tensorflow/python/keras/_impl/keras/estimator.py
index db0140c2df4d20f9e18e6c1401c6c6aa197bcf1f..0bf5bd41dc915fbecbce4c3a6191e925612dbebb 100644
--- a/tensorflow/python/keras/_impl/keras/estimator.py
+++ b/tensorflow/python/keras/_impl/keras/estimator.py
@@ -222,18 +222,18 @@ def _save_first_checkpoint(keras_model, estimator, custom_objects,
   Returns:
     The model_fn for a keras Estimator.
   """
-  with ops.Graph().as_default() as g, g.device(estimator._device_fn):
-    random_seed.set_random_seed(estimator.config.tf_random_seed)
-    training_util.create_global_step()
-    model = _clone_and_build_model(model_fn_lib.ModeKeys.TRAIN, keras_model,
-                                   custom_objects)
-
-    if isinstance(model, models.Sequential):
-      model = model.model
-    # Load weights and save to checkpoint if there is no checkpoint
-    latest_path = saver_lib.latest_checkpoint(estimator.model_dir)
-    if not latest_path:
-      with session.Session() as sess:
+  # Load weights and save to checkpoint if there is no checkpoint
+  latest_path = saver_lib.latest_checkpoint(estimator.model_dir)
+  if not latest_path:
+    with ops.Graph().as_default():
+      random_seed.set_random_seed(estimator.config.tf_random_seed)
+      training_util.create_global_step()
+      model = _clone_and_build_model(model_fn_lib.ModeKeys.TRAIN, keras_model,
+                                     custom_objects)
+      if isinstance(model, models.Sequential):
+        model = model.model
+      # save to checkpoint
+      with session.Session(config=estimator._session_config) as sess:
         model.set_weights(keras_weights)
         # Make update ops and initialize all variables.
         if not model.train_function:
diff --git a/tensorflow/python/keras/_impl/keras/estimator_test.py b/tensorflow/python/keras/_impl/keras/estimator_test.py
index 9fc48b4117e7ee2c717d5418754254aa02b82869..88dd14b856a4ee9dfbee61d6fd1bdb96af24b50c 100644
--- a/tensorflow/python/keras/_impl/keras/estimator_test.py
+++ b/tensorflow/python/keras/_impl/keras/estimator_test.py
@@ -17,6 +17,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+import json
 from math import log10
 import os
 import tempfile
@@ -62,7 +63,7 @@ def simple_functional_model():
   return model
 
 
-def get_resource_for_simple_model(is_sequential, is_evaluate):
+def get_resource_for_simple_model(is_sequential=True, is_evaluate=False):
   model = simple_sequential_model(
   ) if is_sequential else simple_functional_model()
   if is_sequential:
@@ -352,6 +353,30 @@ class TestKerasEstimator(test_util.TensorFlowTestCase):
           model_dir=tempfile.mkdtemp(dir=self._base_dir),
           custom_objects=custom_objects)
 
+  def test_tf_config(self):
+    keras_model, (_, _), (_, _), _, _ = get_resource_for_simple_model()
+    keras_model.compile(
+        loss='categorical_crossentropy',
+        optimizer='rmsprop',
+        metrics=['mse', keras.metrics.categorical_accuracy])
+
+    tf_config = json.dumps({
+        'cluster': {
+            run_config_lib.TaskType.PS: ['localhost:1234'],
+            run_config_lib.TaskType.WORKER: ['localhost:1236'],
+            run_config_lib.TaskType.MASTER: ['localhost:1238']
+        },
+        'task': {
+            'type': run_config_lib.TaskType.MASTER,
+            'index': 0
+        }
+    })
+    with test.mock.patch.dict('os.environ', {'TF_CONFIG': tf_config}):
+      with self.test_session():
+        keras.estimator.model_to_estimator(
+            keras_model=keras_model,
+            model_dir=tempfile.mkdtemp(dir=self._base_dir))
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/keras/_impl/keras/integration_test.py b/tensorflow/python/keras/_impl/keras/integration_test.py
index 15c3d14727a44c9726a1c2c86f47640bcc490e70..280f7ed1b11e2026ac196eb319f7d5da8301f060 100644
--- a/tensorflow/python/keras/_impl/keras/integration_test.py
+++ b/tensorflow/python/keras/_impl/keras/integration_test.py
@@ -23,7 +23,6 @@ import numpy as np
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
 from tensorflow.python.layers import core as tf_core_layers
-from tensorflow.python.layers import network as tf_network_layers
 from tensorflow.python.ops import nn
 from tensorflow.python.platform import test
 
@@ -275,10 +274,10 @@ class KerasIntegrationTest(test.TestCase):
       y_train = keras.utils.to_categorical(y_train)
       y_test = keras.utils.to_categorical(y_test)
 
-      inputs = tf_network_layers.Input(shape=(10,))
+      inputs = keras.Input(shape=(10,))
       x = tf_core_layers.Dense(32, activation=nn.relu)(inputs)
       outputs = tf_core_layers.Dense(2, activation=nn.softmax)(x)
-      model = keras.models.Model(inputs, outputs)
+      model = keras.Model(inputs, outputs)
       model.summary()
 
       model.compile(loss='categorical_crossentropy',
diff --git a/tensorflow/python/keras/_impl/keras/layers/advanced_activations.py b/tensorflow/python/keras/_impl/keras/layers/advanced_activations.py
index 7cac17c51a9adcf8fc62154b6633de60bab18387..c40ee109aaea7dacea72e095b1d8cea3ed2e9bf8 100644
--- a/tensorflow/python/keras/_impl/keras/layers/advanced_activations.py
+++ b/tensorflow/python/keras/_impl/keras/layers/advanced_activations.py
@@ -25,7 +25,7 @@ from tensorflow.python.keras._impl.keras import initializers
 from tensorflow.python.keras._impl.keras import regularizers
 from tensorflow.python.keras._impl.keras.engine import InputSpec
 from tensorflow.python.keras._impl.keras.engine import Layer
-from tensorflow.python.keras._impl.keras.engine.topology import shape_type_conversion
+from tensorflow.python.keras._impl.keras.engine.base_layer import shape_type_conversion
 from tensorflow.python.util.tf_export import tf_export
 
 
diff --git a/tensorflow/python/keras/_impl/keras/layers/convolutional_recurrent.py b/tensorflow/python/keras/_impl/keras/layers/convolutional_recurrent.py
index d2792b9636214d21e9658018f853fb6c0808abb4..d95a0942452afa82e277c358be5c3b2ba061ac98 100644
--- a/tensorflow/python/keras/_impl/keras/layers/convolutional_recurrent.py
+++ b/tensorflow/python/keras/_impl/keras/layers/convolutional_recurrent.py
@@ -26,7 +26,7 @@ from tensorflow.python.keras._impl.keras import constraints
 from tensorflow.python.keras._impl.keras import initializers
 from tensorflow.python.keras._impl.keras import regularizers
 from tensorflow.python.keras._impl.keras.engine import InputSpec
-from tensorflow.python.keras._impl.keras.engine.topology import shape_type_conversion
+from tensorflow.python.keras._impl.keras.engine.base_layer import shape_type_conversion
 from tensorflow.python.keras._impl.keras.layers.recurrent import Recurrent
 from tensorflow.python.keras._impl.keras.utils import conv_utils
 from tensorflow.python.util.tf_export import tf_export
diff --git a/tensorflow/python/keras/_impl/keras/layers/convolutional_test.py b/tensorflow/python/keras/_impl/keras/layers/convolutional_test.py
index 4a6228121b4f8839daa98e35748b2c5867ccca96..c612e97a9d67f7398c78a7da1107f8e067bf9371 100644
--- a/tensorflow/python/keras/_impl/keras/layers/convolutional_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/convolutional_test.py
@@ -22,6 +22,8 @@ import copy
 
 import numpy as np
 
+from tensorflow.python.eager import context
+from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
 from tensorflow.python.platform import test
@@ -43,6 +45,7 @@ class Convolution1DTest(test.TestCase):
             kwargs=test_kwargs,
             input_shape=(num_samples, length, stack_size))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_conv1d(self):
     kwargs = {
         'filters': 2,
@@ -114,6 +117,7 @@ class Conv2DTest(test.TestCase):
             kwargs=test_kwargs,
             input_shape=(num_samples, num_row, num_col, stack_size))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_conv2d(self):
     kwargs = {
         'filters': 2,
@@ -188,6 +192,7 @@ class Conv2DTransposeTest(test.TestCase):
             kwargs=test_kwargs,
             input_shape=(num_samples, num_row, num_col, stack_size))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_conv2dtranspose(self):
     kwargs = {
         'filters': 2,
@@ -253,6 +258,7 @@ class Conv3DTransposeTest(test.TestCase):
             kwargs=test_kwargs,
             input_shape=(num_samples, depth, num_row, num_col, stack_size))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_conv3dtranspose(self):
     kwargs = {
         'filters': 2,
@@ -316,6 +322,7 @@ class SeparableConv1DTest(test.TestCase):
             kwargs=test_kwargs,
             input_shape=(num_samples, length, stack_size))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_separable_conv1d(self):
     kwargs = {
         'filters': 2,
@@ -391,6 +398,7 @@ class SeparableConv2DTest(test.TestCase):
             kwargs=test_kwargs,
             input_shape=(num_samples, num_row, num_col, stack_size))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_separable_conv2d(self):
     kwargs = {
         'filters': 2,
@@ -469,6 +477,7 @@ class Conv3DTest(test.TestCase):
             kwargs=test_kwargs,
             input_shape=(num_samples, depth, num_row, num_col, stack_size))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_conv3d(self):
     kwargs = {
         'filters': 2,
@@ -520,6 +529,7 @@ class Conv3DTest(test.TestCase):
 
 class ZeroPaddingTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_zero_padding_1d(self):
     num_samples = 2
     input_dim = 2
@@ -543,7 +553,10 @@ class ZeroPaddingTest(test.TestCase):
       layer = keras.layers.ZeroPadding1D(padding=2)
       layer.build(shape)
       output = layer(keras.backend.variable(inputs))
-      np_output = keras.backend.eval(output)
+      if context.in_eager_mode():
+        np_output = output.numpy()
+      else:
+        np_output = keras.backend.eval(output)
       for offset in [0, 1, -1, -2]:
         np.testing.assert_allclose(np_output[:, offset, :], 0.)
       np.testing.assert_allclose(np_output[:, 2:-2, :], 1.)
@@ -551,7 +564,10 @@ class ZeroPaddingTest(test.TestCase):
       layer = keras.layers.ZeroPadding1D(padding=(1, 2))
       layer.build(shape)
       output = layer(keras.backend.variable(inputs))
-      np_output = keras.backend.eval(output)
+      if context.in_eager_mode():
+        np_output = output.numpy()
+      else:
+        np_output = keras.backend.eval(output)
       for left_offset in [0]:
         np.testing.assert_allclose(np_output[:, left_offset, :], 0.)
       for right_offset in [-1, -2]:
@@ -565,6 +581,7 @@ class ZeroPaddingTest(test.TestCase):
     with self.assertRaises(ValueError):
       keras.layers.ZeroPadding1D(padding=None)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_zero_padding_2d(self):
     num_samples = 2
     stack_size = 2
@@ -593,7 +610,10 @@ class ZeroPaddingTest(test.TestCase):
             padding=(2, 2), data_format=data_format)
         layer.build(inputs.shape)
         output = layer(keras.backend.variable(inputs))
-        np_output = keras.backend.eval(output)
+        if context.in_eager_mode():
+          np_output = output.numpy()
+        else:
+          np_output = keras.backend.eval(output)
         if data_format == 'channels_last':
           for offset in [0, 1, -1, -2]:
             np.testing.assert_allclose(np_output[:, offset, :, :], 0.)
@@ -609,7 +629,10 @@ class ZeroPaddingTest(test.TestCase):
             padding=((1, 2), (3, 4)), data_format=data_format)
         layer.build(inputs.shape)
         output = layer(keras.backend.variable(inputs))
-        np_output = keras.backend.eval(output)
+        if context.in_eager_mode():
+          np_output = output.numpy()
+        else:
+          np_output = keras.backend.eval(output)
         if data_format == 'channels_last':
           for top_offset in [0]:
             np.testing.assert_allclose(np_output[:, top_offset, :, :], 0.)
@@ -637,6 +660,7 @@ class ZeroPaddingTest(test.TestCase):
       with self.assertRaises(ValueError):
         keras.layers.ZeroPadding2D(padding=None)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_zero_padding_3d(self):
     num_samples = 2
     stack_size = 2
@@ -659,7 +683,10 @@ class ZeroPaddingTest(test.TestCase):
       layer = keras.layers.ZeroPadding3D(padding=(2, 2, 2))
       layer.build(inputs.shape)
       output = layer(keras.backend.variable(inputs))
-      np_output = keras.backend.eval(output)
+      if context.in_eager_mode():
+        np_output = output.numpy()
+      else:
+        np_output = keras.backend.eval(output)
       for offset in [0, 1, -1, -2]:
         np.testing.assert_allclose(np_output[:, offset, :, :, :], 0.)
         np.testing.assert_allclose(np_output[:, :, offset, :, :], 0.)
@@ -675,11 +702,13 @@ class ZeroPaddingTest(test.TestCase):
 
 class UpSamplingTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_upsampling_1d(self):
     with self.test_session(use_gpu=True):
       testing_utils.layer_test(
           keras.layers.UpSampling1D, kwargs={'size': 2}, input_shape=(3, 5, 4))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_upsampling_2d(self):
     num_samples = 2
     stack_size = 2
@@ -708,7 +737,10 @@ class UpSamplingTest(test.TestCase):
                 size=(length_row, length_col), data_format=data_format)
             layer.build(inputs.shape)
             output = layer(keras.backend.variable(inputs))
-            np_output = keras.backend.eval(output)
+            if context.in_eager_mode():
+              np_output = output.numpy()
+            else:
+              np_output = keras.backend.eval(output)
             if data_format == 'channels_first':
               assert np_output.shape[2] == length_row * input_num_row
               assert np_output.shape[3] == length_col * input_num_col
@@ -726,6 +758,7 @@ class UpSamplingTest(test.TestCase):
 
             np.testing.assert_allclose(np_output, expected_out)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_upsampling_3d(self):
     num_samples = 2
     stack_size = 2
@@ -757,7 +790,10 @@ class UpSamplingTest(test.TestCase):
                   data_format=data_format)
               layer.build(inputs.shape)
               output = layer(keras.backend.variable(inputs))
-              np_output = keras.backend.eval(output)
+              if context.in_eager_mode():
+                np_output = output.numpy()
+              else:
+                np_output = keras.backend.eval(output)
               if data_format == 'channels_first':
                 assert np_output.shape[2] == length_dim1 * input_len_dim1
                 assert np_output.shape[3] == length_dim2 * input_len_dim2
@@ -782,6 +818,7 @@ class UpSamplingTest(test.TestCase):
 
 class CroppingTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_cropping_1d(self):
     num_samples = 2
     time_length = 4
@@ -800,6 +837,7 @@ class CroppingTest(test.TestCase):
     with self.assertRaises(ValueError):
       keras.layers.Cropping1D(cropping=None)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_cropping_2d(self):
     num_samples = 2
     stack_size = 2
@@ -827,7 +865,10 @@ class CroppingTest(test.TestCase):
             cropping=cropping, data_format=data_format)
         layer.build(inputs.shape)
         output = layer(keras.backend.variable(inputs))
-        np_output = keras.backend.eval(output)
+        if context.in_eager_mode():
+          np_output = output.numpy()
+        else:
+          np_output = keras.backend.eval(output)
         # compare with numpy
         if data_format == 'channels_first':
           expected_out = inputs[:, :, cropping[0][0]:-cropping[0][1], cropping[
@@ -851,7 +892,10 @@ class CroppingTest(test.TestCase):
             cropping=cropping, data_format=data_format)
         layer.build(inputs.shape)
         output = layer(keras.backend.variable(inputs))
-        np_output = keras.backend.eval(output)
+        if context.in_eager_mode():
+          np_output = output.numpy()
+        else:
+          np_output = keras.backend.eval(output)
         # compare with input
         np.testing.assert_allclose(np_output, inputs)
 
@@ -861,6 +905,7 @@ class CroppingTest(test.TestCase):
     with self.assertRaises(ValueError):
       keras.layers.Cropping2D(cropping=None)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_cropping_3d(self):
     num_samples = 2
     stack_size = 2
@@ -892,7 +937,10 @@ class CroppingTest(test.TestCase):
                 cropping=cropping, data_format=data_format)
             layer.build(inputs.shape)
             output = layer(keras.backend.variable(inputs))
-            np_output = keras.backend.eval(output)
+            if context.in_eager_mode():
+              np_output = output.numpy()
+            else:
+              np_output = keras.backend.eval(output)
             # compare with numpy
             if data_format == 'channels_first':
               expected_out = inputs[:, :,
diff --git a/tensorflow/python/keras/_impl/keras/layers/core_test.py b/tensorflow/python/keras/_impl/keras/layers/core_test.py
index bdb99c91c289cf808fec7b891376dbfcf5504aca..2ca816adbdcecaf371776d99f3da60d0d8790832 100644
--- a/tensorflow/python/keras/_impl/keras/layers/core_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/core_test.py
@@ -20,11 +20,9 @@ from __future__ import print_function
 
 import numpy as np
 
-from tensorflow.python.eager import context
-from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
-from tensorflow.python.ops import init_ops
 from tensorflow.python.platform import test
 
 
@@ -52,146 +50,134 @@ class CoreLayersTest(test.TestCase):
       dropout = keras.layers.Dropout(0.5)
       self.assertEqual(True, dropout.supports_masking)
 
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.SpatialDropout1D,
-          kwargs={'rate': 0.5},
-          input_shape=(2, 3, 4))
-
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.SpatialDropout2D,
-          kwargs={'rate': 0.5},
-          input_shape=(2, 3, 4, 5))
-
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.SpatialDropout2D,
-          kwargs={'rate': 0.5, 'data_format': 'channels_first'},
-          input_shape=(2, 3, 4, 5))
-
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.SpatialDropout3D,
-          kwargs={'rate': 0.5},
-          input_shape=(2, 3, 4, 4, 5))
-
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.SpatialDropout3D,
-          kwargs={'rate': 0.5, 'data_format': 'channels_first'},
-          input_shape=(2, 3, 4, 4, 5))
-
+  @tf_test_util.run_in_graph_and_eager_modes()
+  def test_spatial_dropout(self):
+    testing_utils.layer_test(
+        keras.layers.SpatialDropout1D,
+        kwargs={'rate': 0.5},
+        input_shape=(2, 3, 4))
+
+    testing_utils.layer_test(
+        keras.layers.SpatialDropout2D,
+        kwargs={'rate': 0.5},
+        input_shape=(2, 3, 4, 5))
+
+    testing_utils.layer_test(
+        keras.layers.SpatialDropout2D,
+        kwargs={'rate': 0.5, 'data_format': 'channels_first'},
+        input_shape=(2, 3, 4, 5))
+
+    testing_utils.layer_test(
+        keras.layers.SpatialDropout3D,
+        kwargs={'rate': 0.5},
+        input_shape=(2, 3, 4, 4, 5))
+
+    testing_utils.layer_test(
+        keras.layers.SpatialDropout3D,
+        kwargs={'rate': 0.5, 'data_format': 'channels_first'},
+        input_shape=(2, 3, 4, 4, 5))
+
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_activation(self):
     # with string argument
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Activation,
-          kwargs={'activation': 'relu'},
-          input_shape=(3, 2))
+    testing_utils.layer_test(
+        keras.layers.Activation,
+        kwargs={'activation': 'relu'},
+        input_shape=(3, 2))
 
     # with function argument
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Activation,
-          kwargs={'activation': keras.backend.relu},
-          input_shape=(3, 2))
+    testing_utils.layer_test(
+        keras.layers.Activation,
+        kwargs={'activation': keras.backend.relu},
+        input_shape=(3, 2))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_reshape(self):
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Reshape,
-          kwargs={'target_shape': (8, 1)},
-          input_shape=(3, 2, 4))
-
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Reshape,
-          kwargs={'target_shape': (-1, 1)},
-          input_shape=(3, 2, 4))
-
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Reshape,
-          kwargs={'target_shape': (1, -1)},
-          input_shape=(3, 2, 4))
-
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Reshape,
-          kwargs={'target_shape': (-1, 1)},
-          input_shape=(None, None, 2))
-
+    testing_utils.layer_test(
+        keras.layers.Reshape,
+        kwargs={'target_shape': (8, 1)},
+        input_shape=(3, 2, 4))
+
+    testing_utils.layer_test(
+        keras.layers.Reshape,
+        kwargs={'target_shape': (-1, 1)},
+        input_shape=(3, 2, 4))
+
+    testing_utils.layer_test(
+        keras.layers.Reshape,
+        kwargs={'target_shape': (1, -1)},
+        input_shape=(3, 2, 4))
+
+    testing_utils.layer_test(
+        keras.layers.Reshape,
+        kwargs={'target_shape': (-1, 1)},
+        input_shape=(None, None, 2))
+
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_permute(self):
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Permute, kwargs={'dims': (2, 1)}, input_shape=(3, 2, 4))
+    testing_utils.layer_test(
+        keras.layers.Permute, kwargs={'dims': (2, 1)}, input_shape=(3, 2, 4))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_flatten(self):
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Flatten, kwargs={}, input_shape=(3, 2, 4))
+    testing_utils.layer_test(
+        keras.layers.Flatten, kwargs={}, input_shape=(3, 2, 4))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_repeat_vector(self):
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.RepeatVector, kwargs={'n': 3}, input_shape=(3, 2))
+    testing_utils.layer_test(
+        keras.layers.RepeatVector, kwargs={'n': 3}, input_shape=(3, 2))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_lambda(self):
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Lambda,
-          kwargs={'function': lambda x: x + 1},
-          input_shape=(3, 2))
-
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Lambda,
-          kwargs={
-              'function': lambda x, a, b: x * a + b,
-              'arguments': {
-                  'a': 0.6,
-                  'b': 0.4
-              }
-          },
-          input_shape=(3, 2))
-
-    with self.test_session():
-      # test serialization with function
-      def f(x):
-        return x + 1
-
-      ld = keras.layers.Lambda(f)
-      config = ld.get_config()
-      ld = keras.layers.deserialize({
-          'class_name': 'Lambda',
-          'config': config
-      })
-
-      # test with lambda
-      ld = keras.layers.Lambda(
-          lambda x: keras.backend.concatenate([keras.backend.square(x), x]))
-      config = ld.get_config()
-      ld = keras.layers.Lambda.from_config(config)
-
+    testing_utils.layer_test(
+        keras.layers.Lambda,
+        kwargs={'function': lambda x: x + 1},
+        input_shape=(3, 2))
+
+    testing_utils.layer_test(
+        keras.layers.Lambda,
+        kwargs={
+            'function': lambda x, a, b: x * a + b,
+            'arguments': {
+                'a': 0.6,
+                'b': 0.4
+            }
+        },
+        input_shape=(3, 2))
+
+    # test serialization with function
+    def f(x):
+      return x + 1
+
+    ld = keras.layers.Lambda(f)
+    config = ld.get_config()
+    ld = keras.layers.deserialize({
+        'class_name': 'Lambda',
+        'config': config
+    })
+
+    # test with lambda
+    ld = keras.layers.Lambda(
+        lambda x: keras.backend.concatenate([keras.backend.square(x), x]))
+    config = ld.get_config()
+    ld = keras.layers.Lambda.from_config(config)
+
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_dense(self):
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Dense, kwargs={'units': 3}, input_shape=(3, 2))
+    testing_utils.layer_test(
+        keras.layers.Dense, kwargs={'units': 3}, input_shape=(3, 2))
 
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Dense, kwargs={'units': 3}, input_shape=(3, 4, 2))
+    testing_utils.layer_test(
+        keras.layers.Dense, kwargs={'units': 3}, input_shape=(3, 4, 2))
 
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Dense, kwargs={'units': 3}, input_shape=(None, None, 2))
+    testing_utils.layer_test(
+        keras.layers.Dense, kwargs={'units': 3}, input_shape=(None, None, 2))
 
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Dense, kwargs={'units': 3}, input_shape=(3, 4, 5, 2))
+    testing_utils.layer_test(
+        keras.layers.Dense, kwargs={'units': 3}, input_shape=(3, 4, 5, 2))
 
-    # Test regularization
+  def test_dense_regularization(self):
     with self.test_session():
       layer = keras.layers.Dense(
           3,
@@ -202,7 +188,7 @@ class CoreLayersTest(test.TestCase):
       layer(keras.backend.variable(np.ones((2, 4))))
       self.assertEqual(3, len(layer.losses))
 
-    # Test constraints
+  def test_dense_constraints(self):
     with self.test_session():
       k_constraint = keras.constraints.max_norm(0.01)
       b_constraint = keras.constraints.max_norm(0.01)
@@ -212,12 +198,6 @@ class CoreLayersTest(test.TestCase):
       self.assertEqual(layer.kernel.constraint, k_constraint)
       self.assertEqual(layer.bias.constraint, b_constraint)
 
-  def test_eager_dense(self):
-    with context.eager_mode():
-      l = keras.layers.Dense(units=3,
-                             kernel_initializer=init_ops.zeros_initializer())
-      self.assertAllEqual(l(constant_op.constant([[1.0]])), [[0., 0., 0.]])
-
   def test_activity_regularization(self):
     with self.test_session():
       layer = keras.layers.ActivityRegularization(l1=0.1)
diff --git a/tensorflow/python/keras/_impl/keras/layers/embeddings.py b/tensorflow/python/keras/_impl/keras/layers/embeddings.py
index ca92899a455cd28a756e9efff63655d7c43c9f45..006ecd3135be25d43133daed1603734ecd1be955 100644
--- a/tensorflow/python/keras/_impl/keras/layers/embeddings.py
+++ b/tensorflow/python/keras/_impl/keras/layers/embeddings.py
@@ -23,7 +23,7 @@ from tensorflow.python.keras._impl.keras import constraints
 from tensorflow.python.keras._impl.keras import initializers
 from tensorflow.python.keras._impl.keras import regularizers
 from tensorflow.python.keras._impl.keras.engine import Layer
-from tensorflow.python.keras._impl.keras.engine.topology import shape_type_conversion
+from tensorflow.python.keras._impl.keras.engine.base_layer import shape_type_conversion
 from tensorflow.python.util.tf_export import tf_export
 
 
diff --git a/tensorflow/python/keras/_impl/keras/layers/embeddings_test.py b/tensorflow/python/keras/_impl/keras/layers/embeddings_test.py
index 1712111b877cf1fee4353c5542f33a973a26de95..26fd1f1c114587c2f1b3e0155f1259dd5f0dcf60 100644
--- a/tensorflow/python/keras/_impl/keras/layers/embeddings_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/embeddings_test.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
 from tensorflow.python.platform import test
@@ -25,47 +26,44 @@ from tensorflow.python.platform import test
 
 class EmbeddingTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_embedding(self):
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Embedding,
-          kwargs={'output_dim': 4,
-                  'input_dim': 10,
-                  'input_length': 2},
-          input_shape=(3, 2),
-          input_dtype='int32',
-          expected_output_dtype='float32')
+    testing_utils.layer_test(
+        keras.layers.Embedding,
+        kwargs={'output_dim': 4,
+                'input_dim': 10,
+                'input_length': 2},
+        input_shape=(3, 2),
+        input_dtype='int32',
+        expected_output_dtype='float32')
 
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Embedding,
-          kwargs={'output_dim': 4,
-                  'input_dim': 10,
-                  'mask_zero': True},
-          input_shape=(3, 2),
-          input_dtype='int32',
-          expected_output_dtype='float32')
+    testing_utils.layer_test(
+        keras.layers.Embedding,
+        kwargs={'output_dim': 4,
+                'input_dim': 10,
+                'mask_zero': True},
+        input_shape=(3, 2),
+        input_dtype='int32',
+        expected_output_dtype='float32')
 
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Embedding,
-          kwargs={'output_dim': 4,
-                  'input_dim': 10,
-                  'mask_zero': True},
-          input_shape=(3, 4, 2),
-          input_dtype='int32',
-          expected_output_dtype='float32')
+    testing_utils.layer_test(
+        keras.layers.Embedding,
+        kwargs={'output_dim': 4,
+                'input_dim': 10,
+                'mask_zero': True},
+        input_shape=(3, 4, 2),
+        input_dtype='int32',
+        expected_output_dtype='float32')
 
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.Embedding,
-          kwargs={'output_dim': 4,
-                  'input_dim': 10,
-                  'mask_zero': True,
-                  'input_length': (None, 2)},
-          input_shape=(3, 4, 2),
-          input_dtype='int32',
-          expected_output_dtype='float32')
+    testing_utils.layer_test(
+        keras.layers.Embedding,
+        kwargs={'output_dim': 4,
+                'input_dim': 10,
+                'mask_zero': True,
+                'input_length': (None, 2)},
+        input_shape=(3, 4, 2),
+        input_dtype='int32',
+        expected_output_dtype='float32')
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/python/keras/_impl/keras/layers/gru_test.py b/tensorflow/python/keras/_impl/keras/layers/gru_test.py
index c57fbac41cc43995ef3249414ed03928e7ffd044..48e7e14f5ab73b534ab0d1c765ad2572b2930b2b 100644
--- a/tensorflow/python/keras/_impl/keras/layers/gru_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/gru_test.py
@@ -20,64 +20,66 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
 from tensorflow.python.platform import test
+from tensorflow.python.training.rmsprop import RMSPropOptimizer
 
 
 class GRULayerTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_return_sequences_GRU(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.GRU,
-          kwargs={'units': units,
-                  'return_sequences': True},
-          input_shape=(num_samples, timesteps, embedding_dim))
+    testing_utils.layer_test(
+        keras.layers.GRU,
+        kwargs={'units': units,
+                'return_sequences': True},
+        input_shape=(num_samples, timesteps, embedding_dim))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_dynamic_behavior_GRU(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      layer = keras.layers.GRU(units, input_shape=(None, embedding_dim))
-      model = keras.models.Sequential()
-      model.add(layer)
-      model.compile('sgd', 'mse')
-      x = np.random.random((num_samples, timesteps, embedding_dim))
-      y = np.random.random((num_samples, units))
-      model.train_on_batch(x, y)
-
+    layer = keras.layers.GRU(units, input_shape=(None, embedding_dim))
+    model = keras.models.Sequential()
+    model.add(layer)
+    model.compile(RMSPropOptimizer(0.01), 'mse')
+    x = np.random.random((num_samples, timesteps, embedding_dim))
+    y = np.random.random((num_samples, units))
+    model.train_on_batch(x, y)
+
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_dropout_GRU(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.GRU,
-          kwargs={'units': units,
-                  'dropout': 0.1,
-                  'recurrent_dropout': 0.1},
-          input_shape=(num_samples, timesteps, embedding_dim))
-
+    testing_utils.layer_test(
+        keras.layers.GRU,
+        kwargs={'units': units,
+                'dropout': 0.1,
+                'recurrent_dropout': 0.1},
+        input_shape=(num_samples, timesteps, embedding_dim))
+
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_implementation_mode_GRU(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      for mode in [0, 1, 2]:
-        testing_utils.layer_test(
-            keras.layers.GRU,
-            kwargs={'units': units,
-                    'implementation': mode},
-            input_shape=(num_samples, timesteps, embedding_dim))
+    for mode in [0, 1, 2]:
+      testing_utils.layer_test(
+          keras.layers.GRU,
+          kwargs={'units': units,
+                  'implementation': mode},
+          input_shape=(num_samples, timesteps, embedding_dim))
 
   def test_statefulness_GRU(self):
     num_samples = 2
diff --git a/tensorflow/python/keras/_impl/keras/layers/local.py b/tensorflow/python/keras/_impl/keras/layers/local.py
index df0efe6b8b7eaa0259eb6f4e246269551b3e0c15..13d96e939220c11a4090cf535e3efa4365fe8b62 100644
--- a/tensorflow/python/keras/_impl/keras/layers/local.py
+++ b/tensorflow/python/keras/_impl/keras/layers/local.py
@@ -25,7 +25,7 @@ from tensorflow.python.keras._impl.keras import initializers
 from tensorflow.python.keras._impl.keras import regularizers
 from tensorflow.python.keras._impl.keras.engine import InputSpec
 from tensorflow.python.keras._impl.keras.engine import Layer
-from tensorflow.python.keras._impl.keras.engine.topology import shape_type_conversion
+from tensorflow.python.keras._impl.keras.engine.base_layer import shape_type_conversion
 from tensorflow.python.keras._impl.keras.utils import conv_utils
 from tensorflow.python.util.tf_export import tf_export
 
diff --git a/tensorflow/python/keras/_impl/keras/layers/local_test.py b/tensorflow/python/keras/_impl/keras/layers/local_test.py
index a815a0fadc8215c00f3db4749e323f96e44b66f3..93741d24b9a74cf9e8a83069f7c4235b1f489818 100644
--- a/tensorflow/python/keras/_impl/keras/layers/local_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/local_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
 from tensorflow.python.platform import test
@@ -27,6 +28,7 @@ from tensorflow.python.platform import test
 
 class LocallyConnectedLayersTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_locallyconnected_1d(self):
     num_samples = 2
     num_steps = 8
@@ -39,16 +41,15 @@ class LocallyConnectedLayersTest(test.TestCase):
         if padding == 'same' and strides != 1:
           continue
 
-        with self.test_session():
-          testing_utils.layer_test(
-              keras.layers.LocallyConnected1D,
-              kwargs={
-                  'filters': filters,
-                  'kernel_size': filter_length,
-                  'padding': padding,
-                  'strides': strides
-              },
-              input_shape=(num_samples, num_steps, input_dim))
+        testing_utils.layer_test(
+            keras.layers.LocallyConnected1D,
+            kwargs={
+                'filters': filters,
+                'kernel_size': filter_length,
+                'padding': padding,
+                'strides': strides
+            },
+            input_shape=(num_samples, num_steps, input_dim))
 
   def test_locallyconnected_1d_regularization(self):
     num_samples = 2
@@ -86,6 +87,7 @@ class LocallyConnectedLayersTest(test.TestCase):
       self.assertEqual(layer.kernel.constraint, k_constraint)
       self.assertEqual(layer.bias.constraint, b_constraint)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_locallyconnected_2d(self):
     num_samples = 8
     filters = 3
@@ -98,20 +100,18 @@ class LocallyConnectedLayersTest(test.TestCase):
         if padding == 'same' and strides != (1, 1):
           continue
 
-        with self.test_session():
-          testing_utils.layer_test(
-              keras.layers.LocallyConnected2D,
-              kwargs={
-                  'filters': filters,
-                  'kernel_size': 3,
-                  'padding': padding,
-                  'kernel_regularizer': 'l2',
-                  'bias_regularizer': 'l2',
-                  'activity_regularizer': 'l2',
-                  'strides': strides,
-                  'data_format': 'channels_last'
-              },
-              input_shape=(num_samples, num_row, num_col, stack_size))
+        testing_utils.layer_test(
+            keras.layers.LocallyConnected2D,
+            kwargs={
+                'filters': filters,
+                'kernel_size': 3,
+                'padding': padding,
+                'kernel_regularizer': 'l2',
+                'bias_regularizer': 'l2',
+                'strides': strides,
+                'data_format': 'channels_last'
+            },
+            input_shape=(num_samples, num_row, num_col, stack_size))
 
   def test_locallyconnected_2d_channels_first(self):
     num_samples = 8
diff --git a/tensorflow/python/keras/_impl/keras/layers/lstm_test.py b/tensorflow/python/keras/_impl/keras/layers/lstm_test.py
index deb1d7c0c685e51ed756cbcdd5aec81ee60b5f96..11a5e0aeaacfa7520361ae41ac3d40607e8a9050 100644
--- a/tensorflow/python/keras/_impl/keras/layers/lstm_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/lstm_test.py
@@ -20,28 +20,29 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
 from tensorflow.python.platform import test
+from tensorflow.python.training.rmsprop import RMSPropOptimizer
 
 
 class LSTMLayerTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_return_sequences_LSTM(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.LSTM,
-          kwargs={'units': units,
-                  'return_sequences': True},
-          input_shape=(num_samples, timesteps, embedding_dim))
+    testing_utils.layer_test(
+        keras.layers.LSTM,
+        kwargs={'units': units,
+                'return_sequences': True},
+        input_shape=(num_samples, timesteps, embedding_dim))
 
   def test_static_shape_inference_LSTM(self):
     # Github issue: 15165
-    num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
@@ -53,48 +54,47 @@ class LSTMLayerTest(test.TestCase):
     layer = keras.layers.LSTM(units, return_sequences=True)
     model.add(layer)
     outputs = model.layers[-1].output
-    self.assertEquals(outputs.get_shape().as_list(),
-                      [None, timesteps, units])
+    self.assertEqual(outputs.get_shape().as_list(), [None, timesteps, units])
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_dynamic_behavior_LSTM(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      layer = keras.layers.LSTM(units, input_shape=(None, embedding_dim))
-      model = keras.models.Sequential()
-      model.add(layer)
-      model.compile('sgd', 'mse')
-      x = np.random.random((num_samples, timesteps, embedding_dim))
-      y = np.random.random((num_samples, units))
-      model.train_on_batch(x, y)
+    layer = keras.layers.LSTM(units, input_shape=(None, embedding_dim))
+    model = keras.models.Sequential()
+    model.add(layer)
+    model.compile(RMSPropOptimizer(0.001), 'mse')
+    x = np.random.random((num_samples, timesteps, embedding_dim))
+    y = np.random.random((num_samples, units))
+    model.train_on_batch(x, y)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_dropout_LSTM(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.LSTM,
-          kwargs={'units': units,
-                  'dropout': 0.1,
-                  'recurrent_dropout': 0.1},
-          input_shape=(num_samples, timesteps, embedding_dim))
-
+    testing_utils.layer_test(
+        keras.layers.LSTM,
+        kwargs={'units': units,
+                'dropout': 0.1,
+                'recurrent_dropout': 0.1},
+        input_shape=(num_samples, timesteps, embedding_dim))
+
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_implementation_mode_LSTM(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      for mode in [0, 1, 2]:
-        testing_utils.layer_test(
-            keras.layers.LSTM,
-            kwargs={'units': units,
-                    'implementation': mode},
-            input_shape=(num_samples, timesteps, embedding_dim))
+    for mode in [0, 1, 2]:
+      testing_utils.layer_test(
+          keras.layers.LSTM,
+          kwargs={'units': units,
+                  'implementation': mode},
+          input_shape=(num_samples, timesteps, embedding_dim))
 
   def test_statefulness_LSTM(self):
     num_samples = 2
diff --git a/tensorflow/python/keras/_impl/keras/layers/merge.py b/tensorflow/python/keras/_impl/keras/layers/merge.py
index cdf2878e83e32147d30d6b29742b7e9013a1facb..c660cbd449b11a139f64cfa8b3a35310a597491c 100644
--- a/tensorflow/python/keras/_impl/keras/layers/merge.py
+++ b/tensorflow/python/keras/_impl/keras/layers/merge.py
@@ -21,8 +21,8 @@ from __future__ import division
 from __future__ import print_function
 
 from tensorflow.python.keras._impl.keras import backend as K
-from tensorflow.python.keras._impl.keras.engine.topology import Layer
-from tensorflow.python.keras._impl.keras.engine.topology import shape_type_conversion
+from tensorflow.python.keras._impl.keras.engine.base_layer import Layer
+from tensorflow.python.keras._impl.keras.engine.base_layer import shape_type_conversion
 from tensorflow.python.util.tf_export import tf_export
 
 
diff --git a/tensorflow/python/keras/_impl/keras/layers/merge_test.py b/tensorflow/python/keras/_impl/keras/layers/merge_test.py
index bb03dda1fc645222c1ced97cfce8d459586dd89d..b2fe06f93e33ed63d6a2aa29522ecb552f582440 100644
--- a/tensorflow/python/keras/_impl/keras/layers/merge_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/merge_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.ops import array_ops
 from tensorflow.python.platform import test
@@ -27,24 +28,25 @@ from tensorflow.python.platform import test
 
 class MergeLayersTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_merge_add(self):
-    with self.test_session():
-      i1 = keras.layers.Input(shape=(4, 5))
-      i2 = keras.layers.Input(shape=(4, 5))
-      i3 = keras.layers.Input(shape=(4, 5))
+    i1 = keras.layers.Input(shape=(4, 5))
+    i2 = keras.layers.Input(shape=(4, 5))
+    i3 = keras.layers.Input(shape=(4, 5))
 
-      o = keras.layers.add([i1, i2, i3])
-      self.assertListEqual(o.get_shape().as_list(), [None, 4, 5])
-      model = keras.models.Model([i1, i2, i3], o)
+    o = keras.layers.add([i1, i2, i3])
+    self.assertListEqual(o.get_shape().as_list(), [None, 4, 5])
+    model = keras.models.Model([i1, i2, i3], o)
 
-      x1 = np.random.random((2, 4, 5))
-      x2 = np.random.random((2, 4, 5))
-      x3 = np.random.random((2, 4, 5))
-      out = model.predict([x1, x2, x3])
-      self.assertEqual(out.shape, (2, 4, 5))
-      self.assertAllClose(out, x1 + x2 + x3, atol=1e-4)
+    x1 = np.random.random((2, 4, 5))
+    x2 = np.random.random((2, 4, 5))
+    x3 = np.random.random((2, 4, 5))
+    out = model.predict([x1, x2, x3])
+    self.assertEqual(out.shape, (2, 4, 5))
+    self.assertAllClose(out, x1 + x2 + x3, atol=1e-4)
 
-      # test masking
+  def test_merge_add_masking(self):
+    with self.test_session():
       i1 = keras.layers.Input(shape=(4, 5))
       i2 = keras.layers.Input(shape=(4, 5))
       m1 = keras.layers.Masking()(i1)
@@ -54,11 +56,13 @@ class MergeLayersTest(test.TestCase):
       mask = layer.output_mask
       self.assertListEqual(mask.get_shape().as_list(), [None, 4])
 
-      # test missing shape
+  def test_merge_add_dynamic_shape(self):
+    with self.test_session():
       i1 = array_ops.placeholder(shape=(4, None), dtype='float32')
       i2 = array_ops.placeholder(shape=(4, 5), dtype='float32')
       layer = keras.layers.Add()
       o = layer([i1, i2])
+      self.assertListEqual(o.get_shape().as_list(), [4, 5])
 
   def test_merge_elementwise_errors(self):
     i1 = keras.layers.Input(shape=(4, 5))
@@ -72,79 +76,82 @@ class MergeLayersTest(test.TestCase):
     with self.assertRaises(ValueError):
       keras.layers.add([i1])
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_merge_multiply(self):
-    with self.test_session():
-      i1 = keras.layers.Input(shape=(4, 5))
-      i2 = keras.layers.Input(shape=(4, 5))
-      i3 = keras.layers.Input(shape=(4, 5))
-      o = keras.layers.multiply([i1, i2, i3])
-      self.assertListEqual(o.get_shape().as_list(), [None, 4, 5])
-      model = keras.models.Model([i1, i2, i3], o)
-
-      x1 = np.random.random((2, 4, 5))
-      x2 = np.random.random((2, 4, 5))
-      x3 = np.random.random((2, 4, 5))
-      out = model.predict([x1, x2, x3])
-      self.assertEqual(out.shape, (2, 4, 5))
-      self.assertAllClose(out, x1 * x2 * x3, atol=1e-4)
-
+    i1 = keras.layers.Input(shape=(4, 5))
+    i2 = keras.layers.Input(shape=(4, 5))
+    i3 = keras.layers.Input(shape=(4, 5))
+    o = keras.layers.multiply([i1, i2, i3])
+    self.assertListEqual(o.get_shape().as_list(), [None, 4, 5])
+    model = keras.models.Model([i1, i2, i3], o)
+
+    x1 = np.random.random((2, 4, 5))
+    x2 = np.random.random((2, 4, 5))
+    x3 = np.random.random((2, 4, 5))
+    out = model.predict([x1, x2, x3])
+    self.assertEqual(out.shape, (2, 4, 5))
+    self.assertAllClose(out, x1 * x2 * x3, atol=1e-4)
+
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_merge_average(self):
-    with self.test_session():
-      i1 = keras.layers.Input(shape=(4, 5))
-      i2 = keras.layers.Input(shape=(4, 5))
-      o = keras.layers.average([i1, i2])
-      self.assertListEqual(o.get_shape().as_list(), [None, 4, 5])
-      model = keras.models.Model([i1, i2], o)
+    i1 = keras.layers.Input(shape=(4, 5))
+    i2 = keras.layers.Input(shape=(4, 5))
+    o = keras.layers.average([i1, i2])
+    self.assertListEqual(o.get_shape().as_list(), [None, 4, 5])
+    model = keras.models.Model([i1, i2], o)
 
-      x1 = np.random.random((2, 4, 5))
-      x2 = np.random.random((2, 4, 5))
-      out = model.predict([x1, x2])
-      self.assertEqual(out.shape, (2, 4, 5))
-      self.assertAllClose(out, 0.5 * (x1 + x2), atol=1e-4)
+    x1 = np.random.random((2, 4, 5))
+    x2 = np.random.random((2, 4, 5))
+    out = model.predict([x1, x2])
+    self.assertEqual(out.shape, (2, 4, 5))
+    self.assertAllClose(out, 0.5 * (x1 + x2), atol=1e-4)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_merge_maximum(self):
-    with self.test_session():
-      i1 = keras.layers.Input(shape=(4, 5))
-      i2 = keras.layers.Input(shape=(4, 5))
-      o = keras.layers.maximum([i1, i2])
-      self.assertListEqual(o.get_shape().as_list(), [None, 4, 5])
-      model = keras.models.Model([i1, i2], o)
+    i1 = keras.layers.Input(shape=(4, 5))
+    i2 = keras.layers.Input(shape=(4, 5))
+    o = keras.layers.maximum([i1, i2])
+    self.assertListEqual(o.get_shape().as_list(), [None, 4, 5])
+    model = keras.models.Model([i1, i2], o)
 
-      x1 = np.random.random((2, 4, 5))
-      x2 = np.random.random((2, 4, 5))
-      out = model.predict([x1, x2])
-      self.assertEqual(out.shape, (2, 4, 5))
-      self.assertAllClose(out, np.maximum(x1, x2), atol=1e-4)
+    x1 = np.random.random((2, 4, 5))
+    x2 = np.random.random((2, 4, 5))
+    out = model.predict([x1, x2])
+    self.assertEqual(out.shape, (2, 4, 5))
+    self.assertAllClose(out, np.maximum(x1, x2), atol=1e-4)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_merge_minimum(self):
-    with self.test_session():
-      i1 = keras.layers.Input(shape=(4, 5))
-      i2 = keras.layers.Input(shape=(4, 5))
-      o = keras.layers.minimum([i1, i2])
-      self.assertListEqual(o.get_shape().as_list(), [None, 4, 5])
-      model = keras.models.Model([i1, i2], o)
+    i1 = keras.layers.Input(shape=(4, 5))
+    i2 = keras.layers.Input(shape=(4, 5))
+    o = keras.layers.minimum([i1, i2])
+    self.assertListEqual(o.get_shape().as_list(), [None, 4, 5])
+    model = keras.models.Model([i1, i2], o)
 
-      x1 = np.random.random((2, 4, 5))
-      x2 = np.random.random((2, 4, 5))
-      out = model.predict([x1, x2])
-      self.assertEqual(out.shape, (2, 4, 5))
-      self.assertAllClose(out, np.minimum(x1, x2), atol=1e-4)
+    x1 = np.random.random((2, 4, 5))
+    x2 = np.random.random((2, 4, 5))
+    out = model.predict([x1, x2])
+    self.assertEqual(out.shape, (2, 4, 5))
+    self.assertAllClose(out, np.minimum(x1, x2), atol=1e-4)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_merge_concatenate(self):
+    i1 = keras.layers.Input(shape=(4, 5))
+    i2 = keras.layers.Input(shape=(4, 5))
+    o = keras.layers.concatenate([i1, i2], axis=1)
+    self.assertListEqual(o.get_shape().as_list(), [None, 8, 5])
+    model = keras.models.Model([i1, i2], o)
+
+    x1 = np.random.random((2, 4, 5))
+    x2 = np.random.random((2, 4, 5))
+    out = model.predict([x1, x2])
+    self.assertEqual(out.shape, (2, 8, 5))
+    self.assertAllClose(out, np.concatenate([x1, x2], axis=1), atol=1e-4)
+
+  def test_merge_concatenate_masking(self):
     with self.test_session():
       i1 = keras.layers.Input(shape=(4, 5))
       i2 = keras.layers.Input(shape=(4, 5))
-      o = keras.layers.concatenate([i1, i2], axis=1)
-      self.assertListEqual(o.get_shape().as_list(), [None, 8, 5])
-      model = keras.models.Model([i1, i2], o)
-
-      x1 = np.random.random((2, 4, 5))
-      x2 = np.random.random((2, 4, 5))
-      out = model.predict([x1, x2])
-      self.assertEqual(out.shape, (2, 8, 5))
-      self.assertAllClose(out, np.concatenate([x1, x2], axis=1), atol=1e-4)
-
-      # test masking
       m1 = keras.layers.Masking()(i1)
       layer = keras.layers.Concatenate()
       o = layer([m1, i2])
@@ -162,35 +169,35 @@ class MergeLayersTest(test.TestCase):
     with self.assertRaisesRegexp(ValueError, 'called on a list'):
       keras.layers.concatenate([i1], axis=-1)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_merge_dot(self):
-    with self.test_session():
-      i1 = keras.layers.Input(shape=(4,))
-      i2 = keras.layers.Input(shape=(4,))
-      o = keras.layers.dot([i1, i2], axes=1)
-      self.assertListEqual(o.get_shape().as_list(), [None, 1])
-      model = keras.models.Model([i1, i2], o)
-      _ = keras.layers.Dot(axes=1).get_config()
-
-      x1 = np.random.random((2, 4))
-      x2 = np.random.random((2, 4))
-      out = model.predict([x1, x2])
-      self.assertEqual(out.shape, (2, 1))
-      expected = np.zeros((2, 1))
-      expected[0, 0] = np.dot(x1[0], x2[0])
-      expected[1, 0] = np.dot(x1[1], x2[1])
-      self.assertAllClose(out, expected, atol=1e-4)
-
-      # Test with negative tuple of axes.
-      o = keras.layers.dot([i1, i2], axes=(-1, -1))
-      self.assertListEqual(o.get_shape().as_list(), [None, 1])
-      model = keras.models.Model([i1, i2], o)
-      out = model.predict([x1, x2])
-      self.assertEqual(out.shape, (2, 1))
-      self.assertAllClose(out, expected, atol=1e-4)
-
-      # test compute_output_shape
-      layer = keras.layers.Dot(axes=-1)
-      self.assertEqual(layer.compute_output_shape([(4, 5), (4, 5)]), (4, 1))
+    i1 = keras.layers.Input(shape=(4,))
+    i2 = keras.layers.Input(shape=(4,))
+    o = keras.layers.dot([i1, i2], axes=1)
+    self.assertListEqual(o.get_shape().as_list(), [None, 1])
+    model = keras.models.Model([i1, i2], o)
+    _ = keras.layers.Dot(axes=1).get_config()
+
+    x1 = np.random.random((2, 4))
+    x2 = np.random.random((2, 4))
+    out = model.predict([x1, x2])
+    self.assertEqual(out.shape, (2, 1))
+    expected = np.zeros((2, 1))
+    expected[0, 0] = np.dot(x1[0], x2[0])
+    expected[1, 0] = np.dot(x1[1], x2[1])
+    self.assertAllClose(out, expected, atol=1e-4)
+
+    # Test with negative tuple of axes.
+    o = keras.layers.dot([i1, i2], axes=(-1, -1))
+    self.assertListEqual(o.get_shape().as_list(), [None, 1])
+    model = keras.models.Model([i1, i2], o)
+    out = model.predict([x1, x2])
+    self.assertEqual(out.shape, (2, 1))
+    self.assertAllClose(out, expected, atol=1e-4)
+
+    # test compute_output_shape
+    layer = keras.layers.Dot(axes=-1)
+    self.assertEqual(layer.compute_output_shape([(4, 5), (4, 5)]), (4, 1))
 
   def test_dot_errors(self):
     i1 = keras.layers.Input(shape=(4, 5))
@@ -208,6 +215,7 @@ class MergeLayersTest(test.TestCase):
       dot = keras.layers.Dot(1)
       dot.compute_output_shape(1)
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_merge_subtract(self):
     i1 = keras.layers.Input(shape=(4, 5))
     i2 = keras.layers.Input(shape=(4, 5))
diff --git a/tensorflow/python/keras/_impl/keras/layers/noise.py b/tensorflow/python/keras/_impl/keras/layers/noise.py
index 9010f4961585af58b7eae43dcd224e0c39606239..e309d160e5a9be97ff5f5356dad9dfaf85430233 100644
--- a/tensorflow/python/keras/_impl/keras/layers/noise.py
+++ b/tensorflow/python/keras/_impl/keras/layers/noise.py
@@ -22,7 +22,7 @@ import numpy as np
 
 from tensorflow.python.keras._impl.keras import backend as K
 from tensorflow.python.keras._impl.keras.engine import Layer
-from tensorflow.python.keras._impl.keras.engine.topology import shape_type_conversion
+from tensorflow.python.keras._impl.keras.engine.base_layer import shape_type_conversion
 from tensorflow.python.util.tf_export import tf_export
 
 
diff --git a/tensorflow/python/keras/_impl/keras/layers/noise_test.py b/tensorflow/python/keras/_impl/keras/layers/noise_test.py
index f9b4d9cd090ffec1a5acd9118ea6a65798bd72a6..af4f031ec95bb56b72c1f1018e0e529d8ff55564 100644
--- a/tensorflow/python/keras/_impl/keras/layers/noise_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/noise_test.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
 from tensorflow.python.platform import test
@@ -39,12 +40,12 @@ class NoiseLayersTest(test.TestCase):
           kwargs={'rate': 0.5},
           input_shape=(3, 2, 3))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_AlphaDropout(self):
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.AlphaDropout,
-          kwargs={'rate': 0.2},
-          input_shape=(3, 2, 3))
+    testing_utils.layer_test(
+        keras.layers.AlphaDropout,
+        kwargs={'rate': 0.2},
+        input_shape=(3, 2, 3))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/python/keras/_impl/keras/layers/pooling_test.py b/tensorflow/python/keras/_impl/keras/layers/pooling_test.py
index ec0a5ae560f49ee39ecffb64f4ac65d3e800024c..70049f0976b7170005183bb4b028079b39a23afb 100644
--- a/tensorflow/python/keras/_impl/keras/layers/pooling_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/pooling_test.py
@@ -18,6 +18,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.eager import context
+from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
 from tensorflow.python.platform import test
@@ -25,81 +27,85 @@ from tensorflow.python.platform import test
 
 class GlobalPoolingTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes(use_gpu=True)
   def test_globalpooling_1d(self):
-    with self.test_session(use_gpu=True):
-      testing_utils.layer_test(keras.layers.pooling.GlobalMaxPooling1D,
-                               input_shape=(3, 4, 5))
-      testing_utils.layer_test(
-          keras.layers.pooling.GlobalAveragePooling1D, input_shape=(3, 4, 5))
+    testing_utils.layer_test(keras.layers.pooling.GlobalMaxPooling1D,
+                             input_shape=(3, 4, 5))
+    testing_utils.layer_test(
+        keras.layers.pooling.GlobalAveragePooling1D, input_shape=(3, 4, 5))
 
+  @tf_test_util.run_in_graph_and_eager_modes(use_gpu=True)
   def test_globalpooling_2d(self):
-    with self.test_session(use_gpu=True):
-      testing_utils.layer_test(
-          keras.layers.pooling.GlobalMaxPooling2D,
-          kwargs={'data_format': 'channels_first'},
-          input_shape=(3, 4, 5, 6))
-      testing_utils.layer_test(
-          keras.layers.pooling.GlobalMaxPooling2D,
-          kwargs={'data_format': 'channels_last'},
-          input_shape=(3, 5, 6, 4))
-      testing_utils.layer_test(
-          keras.layers.pooling.GlobalAveragePooling2D,
-          kwargs={'data_format': 'channels_first'},
-          input_shape=(3, 4, 5, 6))
-      testing_utils.layer_test(
-          keras.layers.pooling.GlobalAveragePooling2D,
-          kwargs={'data_format': 'channels_last'},
-          input_shape=(3, 5, 6, 4))
-
+    testing_utils.layer_test(
+        keras.layers.pooling.GlobalMaxPooling2D,
+        kwargs={'data_format': 'channels_first'},
+        input_shape=(3, 4, 5, 6))
+    testing_utils.layer_test(
+        keras.layers.pooling.GlobalMaxPooling2D,
+        kwargs={'data_format': 'channels_last'},
+        input_shape=(3, 5, 6, 4))
+    testing_utils.layer_test(
+        keras.layers.pooling.GlobalAveragePooling2D,
+        kwargs={'data_format': 'channels_first'},
+        input_shape=(3, 4, 5, 6))
+    testing_utils.layer_test(
+        keras.layers.pooling.GlobalAveragePooling2D,
+        kwargs={'data_format': 'channels_last'},
+        input_shape=(3, 5, 6, 4))
+
+  @tf_test_util.run_in_graph_and_eager_modes(use_gpu=True)
   def test_globalpooling_3d(self):
-    with self.test_session(use_gpu=True):
-      testing_utils.layer_test(
-          keras.layers.pooling.GlobalMaxPooling3D,
-          kwargs={'data_format': 'channels_first'},
-          input_shape=(3, 4, 3, 4, 3))
-      testing_utils.layer_test(
-          keras.layers.pooling.GlobalMaxPooling3D,
-          kwargs={'data_format': 'channels_last'},
-          input_shape=(3, 4, 3, 4, 3))
-      testing_utils.layer_test(
-          keras.layers.pooling.GlobalAveragePooling3D,
-          kwargs={'data_format': 'channels_first'},
-          input_shape=(3, 4, 3, 4, 3))
-      testing_utils.layer_test(
-          keras.layers.pooling.GlobalAveragePooling3D,
-          kwargs={'data_format': 'channels_last'},
-          input_shape=(3, 4, 3, 4, 3))
+    testing_utils.layer_test(
+        keras.layers.pooling.GlobalMaxPooling3D,
+        kwargs={'data_format': 'channels_first'},
+        input_shape=(3, 4, 3, 4, 3))
+    testing_utils.layer_test(
+        keras.layers.pooling.GlobalMaxPooling3D,
+        kwargs={'data_format': 'channels_last'},
+        input_shape=(3, 4, 3, 4, 3))
+    testing_utils.layer_test(
+        keras.layers.pooling.GlobalAveragePooling3D,
+        kwargs={'data_format': 'channels_first'},
+        input_shape=(3, 4, 3, 4, 3))
+    testing_utils.layer_test(
+        keras.layers.pooling.GlobalAveragePooling3D,
+        kwargs={'data_format': 'channels_last'},
+        input_shape=(3, 4, 3, 4, 3))
 
 
 class Pooling2DTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes(use_gpu=True)
   def test_maxpooling_2d(self):
     pool_size = (3, 3)
-    with self.test_session(use_gpu=True):
-      for strides in [(1, 1), (2, 2)]:
-        testing_utils.layer_test(
-            keras.layers.MaxPooling2D,
-            kwargs={
-                'strides': strides,
-                'padding': 'valid',
-                'pool_size': pool_size
-            },
-            input_shape=(3, 5, 6, 4))
-
-  def test_averagepooling_2d(self):
-    with self.test_session(use_gpu=True):
+    for strides in [(1, 1), (2, 2)]:
       testing_utils.layer_test(
-          keras.layers.AveragePooling2D,
-          kwargs={'strides': (2, 2),
-                  'padding': 'same',
-                  'pool_size': (2, 2)},
-          input_shape=(3, 5, 6, 4))
-      testing_utils.layer_test(
-          keras.layers.AveragePooling2D,
-          kwargs={'strides': (2, 2),
-                  'padding': 'valid',
-                  'pool_size': (3, 3)},
+          keras.layers.MaxPooling2D,
+          kwargs={
+              'strides': strides,
+              'padding': 'valid',
+              'pool_size': pool_size
+          },
           input_shape=(3, 5, 6, 4))
+
+  @tf_test_util.run_in_graph_and_eager_modes(use_gpu=True)
+  def test_averagepooling_2d(self):
+    testing_utils.layer_test(
+        keras.layers.AveragePooling2D,
+        kwargs={'strides': (2, 2),
+                'padding': 'same',
+                'pool_size': (2, 2)},
+        input_shape=(3, 5, 6, 4))
+    testing_utils.layer_test(
+        keras.layers.AveragePooling2D,
+        kwargs={'strides': (2, 2),
+                'padding': 'valid',
+                'pool_size': (3, 3)},
+        input_shape=(3, 5, 6, 4))
+
+    # This part of the test can only run on GPU but doesn't appear
+    # to be properly assigned to a GPU when running in eager mode.
+    if not context.in_eager_mode():
       # Only runs on GPU with CUDA, channels_first is not supported on CPU.
       # TODO(b/62340061): Support channels_first on CPU.
       if test.is_gpu_available(cuda_only=True):
@@ -116,66 +122,66 @@ class Pooling2DTest(test.TestCase):
 
 class Pooling3DTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes(use_gpu=True)
   def test_maxpooling_3d(self):
     pool_size = (3, 3, 3)
-    with self.test_session(use_gpu=True):
-      testing_utils.layer_test(
-          keras.layers.MaxPooling3D,
-          kwargs={'strides': 2,
-                  'padding': 'valid',
-                  'pool_size': pool_size},
-          input_shape=(3, 11, 12, 10, 4))
-      testing_utils.layer_test(
-          keras.layers.MaxPooling3D,
-          kwargs={
-              'strides': 3,
-              'padding': 'valid',
-              'data_format': 'channels_first',
-              'pool_size': pool_size
-          },
-          input_shape=(3, 4, 11, 12, 10))
-
+    testing_utils.layer_test(
+        keras.layers.MaxPooling3D,
+        kwargs={'strides': 2,
+                'padding': 'valid',
+                'pool_size': pool_size},
+        input_shape=(3, 11, 12, 10, 4))
+    testing_utils.layer_test(
+        keras.layers.MaxPooling3D,
+        kwargs={
+            'strides': 3,
+            'padding': 'valid',
+            'data_format': 'channels_first',
+            'pool_size': pool_size
+        },
+        input_shape=(3, 4, 11, 12, 10))
+
+  @tf_test_util.run_in_graph_and_eager_modes(use_gpu=True)
   def test_averagepooling_3d(self):
     pool_size = (3, 3, 3)
-    with self.test_session(use_gpu=True):
-      testing_utils.layer_test(
-          keras.layers.AveragePooling3D,
-          kwargs={'strides': 2,
-                  'padding': 'valid',
-                  'pool_size': pool_size},
-          input_shape=(3, 11, 12, 10, 4))
-      testing_utils.layer_test(
-          keras.layers.AveragePooling3D,
-          kwargs={
-              'strides': 3,
-              'padding': 'valid',
-              'data_format': 'channels_first',
-              'pool_size': pool_size
-          },
-          input_shape=(3, 4, 11, 12, 10))
+    testing_utils.layer_test(
+        keras.layers.AveragePooling3D,
+        kwargs={'strides': 2,
+                'padding': 'valid',
+                'pool_size': pool_size},
+        input_shape=(3, 11, 12, 10, 4))
+    testing_utils.layer_test(
+        keras.layers.AveragePooling3D,
+        kwargs={
+            'strides': 3,
+            'padding': 'valid',
+            'data_format': 'channels_first',
+            'pool_size': pool_size
+        },
+        input_shape=(3, 4, 11, 12, 10))
 
 
 class Pooling1DTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes(use_gpu=True)
   def test_maxpooling_1d(self):
-    with self.test_session(use_gpu=True):
-      for padding in ['valid', 'same']:
-        for stride in [1, 2]:
-          testing_utils.layer_test(
-              keras.layers.MaxPooling1D,
-              kwargs={'strides': stride,
-                      'padding': padding},
-              input_shape=(3, 5, 4))
+    for padding in ['valid', 'same']:
+      for stride in [1, 2]:
+        testing_utils.layer_test(
+            keras.layers.MaxPooling1D,
+            kwargs={'strides': stride,
+                    'padding': padding},
+            input_shape=(3, 5, 4))
 
+  @tf_test_util.run_in_graph_and_eager_modes(use_gpu=True)
   def test_averagepooling_1d(self):
-    with self.test_session(use_gpu=True):
-      for padding in ['valid', 'same']:
-        for stride in [1, 2]:
-          testing_utils.layer_test(
-              keras.layers.AveragePooling1D,
-              kwargs={'strides': stride,
-                      'padding': padding},
-              input_shape=(3, 5, 4))
+    for padding in ['valid', 'same']:
+      for stride in [1, 2]:
+        testing_utils.layer_test(
+            keras.layers.AveragePooling1D,
+            kwargs={'strides': stride,
+                    'padding': padding},
+            input_shape=(3, 5, 4))
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/python/keras/_impl/keras/layers/recurrent.py b/tensorflow/python/keras/_impl/keras/layers/recurrent.py
index 2e9003f52d7617e96950f76637759e577a8b5e4f..0264c7ae0119b36261a0a5467576c47a12a30801 100644
--- a/tensorflow/python/keras/_impl/keras/layers/recurrent.py
+++ b/tensorflow/python/keras/_impl/keras/layers/recurrent.py
@@ -22,6 +22,7 @@ from __future__ import print_function
 import numbers
 import numpy as np
 
+from tensorflow.python.eager import context
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.keras._impl.keras import activations
 from tensorflow.python.keras._impl.keras import backend as K
@@ -30,7 +31,7 @@ from tensorflow.python.keras._impl.keras import initializers
 from tensorflow.python.keras._impl.keras import regularizers
 from tensorflow.python.keras._impl.keras.engine import InputSpec
 from tensorflow.python.keras._impl.keras.engine import Layer
-from tensorflow.python.keras._impl.keras.engine.topology import shape_type_conversion
+from tensorflow.python.keras._impl.keras.engine.base_layer import shape_type_conversion
 from tensorflow.python.keras._impl.keras.utils.generic_utils import has_arg
 from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.util.tf_export import tf_export
@@ -935,7 +936,9 @@ class SimpleRNNCell(Layer):
 
     # Properly set learning phase on output tensor.
     if 0 < self.dropout + self.recurrent_dropout:
-      if training is None:
+      if training is None and not context.in_eager_mode():
+        # This would be harmless to set in eager mode, but eager tensors
+        # disallow setting arbitrary attributes.
         output._uses_learning_phase = True
     return output, [output]
 
@@ -1299,23 +1302,6 @@ class GRUCell(Layer):
           constraint=self.bias_constraint)
     else:
       self.bias = None
-
-    self.kernel_z = self.kernel[:, :self.units]
-    self.recurrent_kernel_z = self.recurrent_kernel[:, :self.units]
-    self.kernel_r = self.kernel[:, self.units:self.units * 2]
-    self.recurrent_kernel_r = self.recurrent_kernel[:, self.units:
-                                                    self.units * 2]
-    self.kernel_h = self.kernel[:, self.units * 2:]
-    self.recurrent_kernel_h = self.recurrent_kernel[:, self.units * 2:]
-
-    if self.use_bias:
-      self.bias_z = self.bias[:self.units]
-      self.bias_r = self.bias[self.units:self.units * 2]
-      self.bias_h = self.bias[self.units * 2:]
-    else:
-      self.bias_z = None
-      self.bias_r = None
-      self.bias_h = None
     self.built = True
 
   def call(self, inputs, states, training=None):
@@ -1350,13 +1336,13 @@ class GRUCell(Layer):
         inputs_z = inputs
         inputs_r = inputs
         inputs_h = inputs
-      x_z = K.dot(inputs_z, self.kernel_z)
-      x_r = K.dot(inputs_r, self.kernel_r)
-      x_h = K.dot(inputs_h, self.kernel_h)
+      x_z = K.dot(inputs_z, self.kernel[:, :self.units])
+      x_r = K.dot(inputs_r, self.kernel[:, self.units:self.units * 2])
+      x_h = K.dot(inputs_h, self.kernel[:, self.units * 2:])
       if self.use_bias:
-        x_z = K.bias_add(x_z, self.bias_z)
-        x_r = K.bias_add(x_r, self.bias_r)
-        x_h = K.bias_add(x_h, self.bias_h)
+        x_z = K.bias_add(x_z, self.bias[:self.units])
+        x_r = K.bias_add(x_r, self.bias[self.units:self.units * 2])
+        x_h = K.bias_add(x_h, self.bias[self.units * 2:])
 
       if 0. < self.recurrent_dropout < 1.:
         h_tm1_z = h_tm1 * rec_dp_mask[0]
@@ -1367,11 +1353,14 @@ class GRUCell(Layer):
         h_tm1_r = h_tm1
         h_tm1_h = h_tm1
       z = self.recurrent_activation(
-          x_z + K.dot(h_tm1_z, self.recurrent_kernel_z))
+          x_z + K.dot(h_tm1_z, self.recurrent_kernel[:, :self.units]))
       r = self.recurrent_activation(
-          x_r + K.dot(h_tm1_r, self.recurrent_kernel_r))
+          x_r + K.dot(h_tm1_r, self.recurrent_kernel[:, self.units:
+                                                     self.units * 2]))
 
-      hh = self.activation(x_h + K.dot(r * h_tm1_h, self.recurrent_kernel_h))
+      hh = self.activation(x_h + K.dot(r * h_tm1_h,
+                                       self.recurrent_kernel[:,
+                                                             self.units * 2:]))
     else:
       if 0. < self.dropout < 1.:
         inputs *= dp_mask[0]
@@ -1395,44 +1384,34 @@ class GRUCell(Layer):
       hh = self.activation(x_h + recurrent_h)
     h = z * h_tm1 + (1 - z) * hh
     if 0 < self.dropout + self.recurrent_dropout:
-      if training is None:
+      if training is None and not context.in_eager_mode():
+        # This would be harmless to set in eager mode, but eager tensors
+        # disallow setting arbitrary attributes.
         h._uses_learning_phase = True
     return h, [h]
 
   def get_config(self):
     config = {
-        'units':
-            self.units,
-        'activation':
-            activations.serialize(self.activation),
+        'units': self.units,
+        'activation': activations.serialize(self.activation),
         'recurrent_activation':
             activations.serialize(self.recurrent_activation),
-        'use_bias':
-            self.use_bias,
-        'kernel_initializer':
-            initializers.serialize(self.kernel_initializer),
+        'use_bias': self.use_bias,
+        'kernel_initializer': initializers.serialize(self.kernel_initializer),
         'recurrent_initializer':
             initializers.serialize(self.recurrent_initializer),
-        'bias_initializer':
-            initializers.serialize(self.bias_initializer),
-        'kernel_regularizer':
-            regularizers.serialize(self.kernel_regularizer),
+        'bias_initializer': initializers.serialize(self.bias_initializer),
+        'kernel_regularizer': regularizers.serialize(self.kernel_regularizer),
         'recurrent_regularizer':
             regularizers.serialize(self.recurrent_regularizer),
-        'bias_regularizer':
-            regularizers.serialize(self.bias_regularizer),
-        'kernel_constraint':
-            constraints.serialize(self.kernel_constraint),
+        'bias_regularizer': regularizers.serialize(self.bias_regularizer),
+        'kernel_constraint': constraints.serialize(self.kernel_constraint),
         'recurrent_constraint':
             constraints.serialize(self.recurrent_constraint),
-        'bias_constraint':
-            constraints.serialize(self.bias_constraint),
-        'dropout':
-            self.dropout,
-        'recurrent_dropout':
-            self.recurrent_dropout,
-        'implementation':
-            self.implementation
+        'bias_constraint': constraints.serialize(self.bias_constraint),
+        'dropout': self.dropout,
+        'recurrent_dropout': self.recurrent_dropout,
+        'implementation': self.implementation
     }
     base_config = super(GRUCell, self).get_config()
     return dict(list(base_config.items()) + list(config.items()))
@@ -1809,29 +1788,6 @@ class LSTMCell(Layer):
           constraint=self.bias_constraint)
     else:
       self.bias = None
-
-    self.kernel_i = self.kernel[:, :self.units]
-    self.kernel_f = self.kernel[:, self.units:self.units * 2]
-    self.kernel_c = self.kernel[:, self.units * 2:self.units * 3]
-    self.kernel_o = self.kernel[:, self.units * 3:]
-
-    self.recurrent_kernel_i = self.recurrent_kernel[:, :self.units]
-    self.recurrent_kernel_f = self.recurrent_kernel[:, self.units:
-                                                    self.units * 2]
-    self.recurrent_kernel_c = self.recurrent_kernel[:, self.units * 2:
-                                                    self.units * 3]
-    self.recurrent_kernel_o = self.recurrent_kernel[:, self.units * 3:]
-
-    if self.use_bias:
-      self.bias_i = self.bias[:self.units]
-      self.bias_f = self.bias[self.units:self.units * 2]
-      self.bias_c = self.bias[self.units * 2:self.units * 3]
-      self.bias_o = self.bias[self.units * 3:]
-    else:
-      self.bias_i = None
-      self.bias_f = None
-      self.bias_c = None
-      self.bias_o = None
     self.built = True
 
   def call(self, inputs, states, training=None):
@@ -1869,15 +1825,15 @@ class LSTMCell(Layer):
         inputs_f = inputs
         inputs_c = inputs
         inputs_o = inputs
-      x_i = K.dot(inputs_i, self.kernel_i)
-      x_f = K.dot(inputs_f, self.kernel_f)
-      x_c = K.dot(inputs_c, self.kernel_c)
-      x_o = K.dot(inputs_o, self.kernel_o)
+      x_i = K.dot(inputs_i, self.kernel[:, :self.units])
+      x_f = K.dot(inputs_f, self.kernel[:, self.units:self.units * 2])
+      x_c = K.dot(inputs_c, self.kernel[:, self.units * 2:self.units * 3])
+      x_o = K.dot(inputs_o, self.kernel[:, self.units * 3:])
       if self.use_bias:
-        x_i = K.bias_add(x_i, self.bias_i)
-        x_f = K.bias_add(x_f, self.bias_f)
-        x_c = K.bias_add(x_c, self.bias_c)
-        x_o = K.bias_add(x_o, self.bias_o)
+        x_i = K.bias_add(x_i, self.bias[:self.units])
+        x_f = K.bias_add(x_f, self.bias[self.units:self.units * 2])
+        x_c = K.bias_add(x_c, self.bias[self.units * 2:self.units * 3])
+        x_o = K.bias_add(x_o, self.bias[self.units * 3:])
 
       if 0 < self.recurrent_dropout < 1.:
         h_tm1_i = h_tm1 * rec_dp_mask[0]
@@ -1890,13 +1846,15 @@ class LSTMCell(Layer):
         h_tm1_c = h_tm1
         h_tm1_o = h_tm1
       i = self.recurrent_activation(
-          x_i + K.dot(h_tm1_i, self.recurrent_kernel_i))
+          x_i + K.dot(h_tm1_i, self.recurrent_kernel[:, :self.units]))
       f = self.recurrent_activation(
-          x_f + K.dot(h_tm1_f, self.recurrent_kernel_f))
+          x_f + K.dot(h_tm1_f,
+                      self.recurrent_kernel[:, self.units: self.units * 2]))
       c = f * c_tm1 + i * self.activation(
-          x_c + K.dot(h_tm1_c, self.recurrent_kernel_c))
+          x_c + K.dot(h_tm1_c,
+                      self.recurrent_kernel[:, self.units * 2: self.units * 3]))
       o = self.recurrent_activation(
-          x_o + K.dot(h_tm1_o, self.recurrent_kernel_o))
+          x_o + K.dot(h_tm1_o, self.recurrent_kernel[:, self.units * 3:]))
     else:
       if 0. < self.dropout < 1.:
         inputs *= dp_mask[0]
@@ -1919,7 +1877,9 @@ class LSTMCell(Layer):
 
     h = o * self.activation(c)
     if 0 < self.dropout + self.recurrent_dropout:
-      if training is None:
+      if training is None and not context.in_eager_mode():
+        # This would be harmless to set in eager mode, but eager tensors
+        # disallow setting arbitrary attributes.
         h._uses_learning_phase = True
     return h, [h, c]
 
diff --git a/tensorflow/python/keras/_impl/keras/layers/simplernn_test.py b/tensorflow/python/keras/_impl/keras/layers/simplernn_test.py
index 7edebdacd07d74fe6b5a982d12645fb5556bdf75..8c7189cd4718450a85c015e08ab3a58cc5d86531 100644
--- a/tensorflow/python/keras/_impl/keras/layers/simplernn_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/simplernn_test.py
@@ -20,64 +20,66 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.keras._impl.keras import testing_utils
 from tensorflow.python.platform import test
+from tensorflow.python.training.rmsprop import RMSPropOptimizer
 
 
 class SimpleRNNLayerTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_return_sequences_SimpleRNN(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.SimpleRNN,
-          kwargs={'units': units,
-                  'return_sequences': True},
-          input_shape=(num_samples, timesteps, embedding_dim))
+    testing_utils.layer_test(
+        keras.layers.SimpleRNN,
+        kwargs={'units': units,
+                'return_sequences': True},
+        input_shape=(num_samples, timesteps, embedding_dim))
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_dynamic_behavior_SimpleRNN(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      layer = keras.layers.SimpleRNN(units, input_shape=(None, embedding_dim))
-      model = keras.models.Sequential()
-      model.add(layer)
-      model.compile('sgd', 'mse')
-      x = np.random.random((num_samples, timesteps, embedding_dim))
-      y = np.random.random((num_samples, units))
-      model.train_on_batch(x, y)
-
+    layer = keras.layers.SimpleRNN(units, input_shape=(None, embedding_dim))
+    model = keras.models.Sequential()
+    model.add(layer)
+    model.compile(RMSPropOptimizer(0.01), 'mse')
+    x = np.random.random((num_samples, timesteps, embedding_dim))
+    y = np.random.random((num_samples, units))
+    model.train_on_batch(x, y)
+
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_dropout_SimpleRNN(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      testing_utils.layer_test(
-          keras.layers.SimpleRNN,
-          kwargs={'units': units,
-                  'dropout': 0.1,
-                  'recurrent_dropout': 0.1},
-          input_shape=(num_samples, timesteps, embedding_dim))
-
+    testing_utils.layer_test(
+        keras.layers.SimpleRNN,
+        kwargs={'units': units,
+                'dropout': 0.1,
+                'recurrent_dropout': 0.1},
+        input_shape=(num_samples, timesteps, embedding_dim))
+
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_implementation_mode_SimpleRNN(self):
     num_samples = 2
     timesteps = 3
     embedding_dim = 4
     units = 2
-    with self.test_session():
-      for mode in [0, 1, 2]:
-        testing_utils.layer_test(
-            keras.layers.SimpleRNN,
-            kwargs={'units': units,
-                    'implementation': mode},
-            input_shape=(num_samples, timesteps, embedding_dim))
+    for mode in [0, 1, 2]:
+      testing_utils.layer_test(
+          keras.layers.SimpleRNN,
+          kwargs={'units': units,
+                  'implementation': mode},
+          input_shape=(num_samples, timesteps, embedding_dim))
 
   def test_statefulness_SimpleRNN(self):
     num_samples = 2
diff --git a/tensorflow/python/keras/_impl/keras/layers/wrappers.py b/tensorflow/python/keras/_impl/keras/layers/wrappers.py
index 61f1a758e4701e6925af88b7fed9c48cf42ca735..76ddd9299dd669da35d89a6fe8fc521ce4c26337 100644
--- a/tensorflow/python/keras/_impl/keras/layers/wrappers.py
+++ b/tensorflow/python/keras/_impl/keras/layers/wrappers.py
@@ -25,7 +25,7 @@ from tensorflow.python.framework import tensor_shape
 from tensorflow.python.keras._impl.keras import backend as K
 from tensorflow.python.keras._impl.keras.engine import InputSpec
 from tensorflow.python.keras._impl.keras.engine import Layer
-from tensorflow.python.keras._impl.keras.engine.topology import shape_type_conversion
+from tensorflow.python.keras._impl.keras.engine.base_layer import shape_type_conversion
 from tensorflow.python.keras._impl.keras.utils.generic_utils import has_arg
 from tensorflow.python.layers import utils as tf_layers_util
 from tensorflow.python.util.tf_export import tf_export
diff --git a/tensorflow/python/keras/_impl/keras/layers/wrappers_test.py b/tensorflow/python/keras/_impl/keras/layers/wrappers_test.py
index c81d6b883cb0aa2b30331e35b387457072dbf3c3..8fcf66e90ff1289a06a996768ae5de2f1548a27c 100644
--- a/tensorflow/python/keras/_impl/keras/layers/wrappers_test.py
+++ b/tensorflow/python/keras/_impl/keras/layers/wrappers_test.py
@@ -20,44 +20,43 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.framework import test_util as tf_test_util
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.platform import test
+from tensorflow.python.training.rmsprop import RMSPropOptimizer
 
 
 class TimeDistributedTest(test.TestCase):
 
+  @tf_test_util.run_in_graph_and_eager_modes()
   def test_timedistributed_dense(self):
-    # first, test with Dense layer
-    with self.test_session():
-      model = keras.models.Sequential()
-      model.add(
-          keras.layers.TimeDistributed(
-              keras.layers.Dense(2), input_shape=(3, 4)))
-      model.compile(optimizer='rmsprop', loss='mse')
-      model.fit(
-          np.random.random((10, 3, 4)),
-          np.random.random((10, 3, 2)),
-          epochs=1,
-          batch_size=10)
-
-      # test config
-      model.get_config()
+    model = keras.models.Sequential()
+    model.add(
+        keras.layers.TimeDistributed(
+            keras.layers.Dense(2), input_shape=(3, 4)))
+    model.compile(optimizer=RMSPropOptimizer(0.01), loss='mse')
+    model.fit(
+        np.random.random((10, 3, 4)),
+        np.random.random((10, 3, 2)),
+        epochs=1,
+        batch_size=10)
+
+    # test config
+    model.get_config()
 
   def test_timedistributed_static_batch_size(self):
-    with self.test_session():
-      model = keras.models.Sequential()
-      model.add(
-          keras.layers.TimeDistributed(
-              keras.layers.Dense(2), input_shape=(3, 4), batch_size=10))
-      model.compile(optimizer='rmsprop', loss='mse')
-      model.fit(
-          np.random.random((10, 3, 4)),
-          np.random.random((10, 3, 2)),
-          epochs=1,
-          batch_size=10)
+    model = keras.models.Sequential()
+    model.add(
+        keras.layers.TimeDistributed(
+            keras.layers.Dense(2), input_shape=(3, 4), batch_size=10))
+    model.compile(optimizer=RMSPropOptimizer(0.01), loss='mse')
+    model.fit(
+        np.random.random((10, 3, 4)),
+        np.random.random((10, 3, 2)),
+        epochs=1,
+        batch_size=10)
 
   def test_timedistributed_conv2d(self):
-    # test with Conv2D
     with self.test_session():
       model = keras.models.Sequential()
       model.add(
@@ -73,7 +72,6 @@ class TimeDistributedTest(test.TestCase):
       model.summary()
 
   def test_timedistributed_stacked(self):
-    # test stacked layers
     with self.test_session():
       model = keras.models.Sequential()
       model.add(
@@ -167,7 +165,7 @@ class BidirectionalTest(test.TestCase):
         model.add(
             keras.layers.Bidirectional(
                 rnn(output_dim), merge_mode=mode, input_shape=(timesteps, dim)))
-        model.compile(loss='mse', optimizer='sgd')
+        model.compile(optimizer=RMSPropOptimizer(0.01), loss='mse')
         model.fit(x, y, epochs=1, batch_size=1)
 
         # test compute output shape
diff --git a/tensorflow/python/keras/_impl/keras/model_subclassing_test.py b/tensorflow/python/keras/_impl/keras/model_subclassing_test.py
index 275985aa36fc6d85768ae05f14cf65e710ad7353..58b144365be6cd8ea5b2ea82e275eacdee6b6c84 100644
--- a/tensorflow/python/keras/_impl/keras/model_subclassing_test.py
+++ b/tensorflow/python/keras/_impl/keras/model_subclassing_test.py
@@ -174,19 +174,18 @@ class ModelSubclassingTest(test.TestCase):
     num_samples = 100
     input_dim = 50
 
-    with self.test_session():
-      model = SimpleTestModel(num_classes=num_classes,
-                              use_dp=True,
-                              use_bn=True)
-      model.compile(loss='mse',
-                    optimizer=RMSPropOptimizer(learning_rate=0.001),
-                    metrics=['acc'])
+    model = SimpleTestModel(num_classes=num_classes,
+                            use_dp=True,
+                            use_bn=True)
+    model.compile(loss='mse',
+                  optimizer=RMSPropOptimizer(learning_rate=0.001),
+                  metrics=['acc'])
 
-      x = np.ones((num_samples, input_dim))
-      y = np.zeros((num_samples, num_classes))
+    x = np.ones((num_samples, input_dim))
+    y = np.zeros((num_samples, num_classes))
 
-      model.fit(x, y, epochs=2, batch_size=32, verbose=0)
-      _ = model.evaluate(x, y, verbose=0)
+    model.fit(x, y, epochs=2, batch_size=32, verbose=0)
+    _ = model.evaluate(x, y, verbose=0)
 
   @test_util.run_in_graph_and_eager_modes()
   def test_multi_io_workflow_with_np_arrays(self):
@@ -194,21 +193,20 @@ class ModelSubclassingTest(test.TestCase):
     num_samples = 1000
     input_dim = 50
 
-    with self.test_session():
-      model = MultiIOTestModel(num_classes=num_classes,
-                               use_dp=True,
-                               use_bn=True)
-      model.compile(loss='mse',
-                    optimizer=RMSPropOptimizer(learning_rate=0.001),
-                    metrics=['acc'])
+    model = MultiIOTestModel(num_classes=num_classes,
+                             use_dp=True,
+                             use_bn=True)
+    model.compile(loss='mse',
+                  optimizer=RMSPropOptimizer(learning_rate=0.001),
+                  metrics=['acc'])
 
-      x1 = np.ones((num_samples, input_dim))
-      x2 = np.ones((num_samples, input_dim))
-      y1 = np.zeros((num_samples, num_classes[0]))
-      y2 = np.zeros((num_samples, num_classes[1]))
+    x1 = np.ones((num_samples, input_dim))
+    x2 = np.ones((num_samples, input_dim))
+    y1 = np.zeros((num_samples, num_classes[0]))
+    y2 = np.zeros((num_samples, num_classes[1]))
 
-      model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0)
-      _ = model.evaluate([x1, x2], [y1, y2], verbose=0)
+    model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0)
+    _ = model.evaluate([x1, x2], [y1, y2], verbose=0)
 
   def test_single_io_workflow_with_tensors(self):
 
@@ -321,14 +319,13 @@ class ModelSubclassingTest(test.TestCase):
     x = np.ones((num_samples, input_dim))
     y = np.ones((num_samples, input_dim))
 
-    with self.test_session():
-      model = BNNet()
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      y_ref = model.predict(x)
+    model = BNNet()
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    y_ref = model.predict(x)
 
-      model.train_on_batch(x, y)
-      y_new = model.predict(x)
-      self.assertGreater(np.sum(np.abs(y_ref - y_new)), 0.1)
+    model.train_on_batch(x, y)
+    y_new = model.predict(x)
+    self.assertGreater(np.sum(np.abs(y_ref - y_new)), 0.1)
 
   @test_util.run_in_graph_and_eager_modes()
   def test_training_and_inference_behavior(self):
@@ -350,14 +347,13 @@ class ModelSubclassingTest(test.TestCase):
         x = self.dp(inputs)
         return self.dense(x)
 
-    with self.test_session():
-      model = DPNet()
-      x = np.ones((num_samples, input_dim))
-      y = model.predict(x)
-      self.assertEqual(np.sum(y), np.sum(x))
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      loss = model.train_on_batch(x, y)
-      self.assertGreater(loss, 0.1)
+    model = DPNet()
+    x = np.ones((num_samples, input_dim))
+    y = model.predict(x)
+    self.assertEqual(np.sum(y), np.sum(x))
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    loss = model.train_on_batch(x, y)
+    self.assertGreater(loss, 0.1)
 
   @test_util.run_in_graph_and_eager_modes()
   def test_training_methods(self):
@@ -373,21 +369,20 @@ class ModelSubclassingTest(test.TestCase):
     y1 = np.zeros((num_samples, num_classes[0]))
     y2 = np.zeros((num_samples, num_classes[1]))
 
-    with self.test_session():
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32)
-      model.fit({'input_1': x1, 'input_2': x2},
-                {'output_1': y1, 'output_2': y2},
-                epochs=2, batch_size=32)
-      model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32,
-                validation_data=([x1, x2], [y1, y2]))
-
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      model.train_on_batch([x1, x2], [y1, y2])
-      model.train_on_batch({'input_1': x1, 'input_2': x2},
-                           {'output_1': y1, 'output_2': y2})
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0)
+    model.fit({'input_1': x1, 'input_2': x2},
+              {'output_1': y1, 'output_2': y2},
+              epochs=2, batch_size=32)
+    model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0,
+              validation_data=([x1, x2], [y1, y2]))
+
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    model.train_on_batch([x1, x2], [y1, y2])
+    model.train_on_batch({'input_1': x1, 'input_2': x2},
+                         {'output_1': y1, 'output_2': y2})
 
   @test_util.run_in_graph_and_eager_modes(assert_no_eager_garbage=True)
   def test_inference_methods(self):
@@ -402,17 +397,16 @@ class ModelSubclassingTest(test.TestCase):
     y1 = np.zeros((num_samples, num_classes[0]))
     y2 = np.zeros((num_samples, num_classes[1]))
 
-    with self.test_session():
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      model.evaluate([x1, x2], [y1, y2])
-      model.test_on_batch([x1, x2], [y1, y2])
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    model.evaluate([x1, x2], [y1, y2])
+    model.test_on_batch([x1, x2], [y1, y2])
 
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.predict([x1, x2])
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.predict([x1, x2])
 
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.predict_on_batch([x1, x2])
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.predict_on_batch([x1, x2])
 
   @test_util.run_in_graph_and_eager_modes()
   def test_trainable_mutation(self):
@@ -435,26 +429,25 @@ class ModelSubclassingTest(test.TestCase):
     y1 = np.zeros((num_samples, num_classes[0]))
     y2 = np.zeros((num_samples, num_classes[1]))
 
-    with self.test_session():
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
-      model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32)
-      y_ref_1, y_ref_2 = model.predict([x1, x2])
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    model.fit([x1, x2], [y1, y2], epochs=2, batch_size=32, verbose=0)
+    y_ref_1, y_ref_2 = model.predict([x1, x2])
 
-      fd, fname = tempfile.mkstemp('.h5')
-      model.save_weights(fname)
+    fd, fname = tempfile.mkstemp('.h5')
+    model.save_weights(fname)
 
-      model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
-      # need to build the model before loading weights
-      # (otherwise no weights to load)
-      model._set_inputs([x1, x2])
-      model.load_weights(fname)
+    model = MultiIOTestModel(num_classes=num_classes, use_bn=True)
+    # need to build the model before loading weights
+    # (otherwise no weights to load)
+    model._set_inputs([x1, x2])
+    model.load_weights(fname)
 
-      y1, y2 = model.predict([x1, x2])
-      self.assertAllClose(y_ref_1, y1, atol=1e-5)
-      self.assertAllClose(y_ref_2, y2, atol=1e-5)
-      os.close(fd)
-      os.remove(fname)
+    y1, y2 = model.predict([x1, x2])
+    self.assertAllClose(y_ref_1, y1, atol=1e-5)
+    self.assertAllClose(y_ref_2, y2, atol=1e-5)
+    os.close(fd)
+    os.remove(fname)
 
   @test_util.run_in_graph_and_eager_modes()
   def test_summary(self):
@@ -488,23 +481,22 @@ class ModelSubclassingTest(test.TestCase):
     num_samples = 100
     input_dim = 50
 
-    with self.test_session():
-      model = NestedTestModel1(num_classes=num_classes)
-      model.compile(loss='mse',
-                    optimizer=RMSPropOptimizer(learning_rate=0.001),
-                    metrics=['acc'])
+    model = NestedTestModel1(num_classes=num_classes)
+    model.compile(loss='mse',
+                  optimizer=RMSPropOptimizer(learning_rate=0.001),
+                  metrics=['acc'])
 
-      x = np.ones((num_samples, input_dim))
-      y = np.zeros((num_samples, num_classes))
+    x = np.ones((num_samples, input_dim))
+    y = np.zeros((num_samples, num_classes))
 
-      model.fit(x, y, epochs=2, batch_size=32, verbose=0)
-      _ = model.evaluate(x, y, verbose=0)
+    model.fit(x, y, epochs=2, batch_size=32, verbose=0)
+    _ = model.evaluate(x, y, verbose=0)
 
-      self.assertEqual(len(model.weights), 8 + len(model.test_net.weights))
-      self.assertEqual(len(model.non_trainable_weights),
-                       2 + len(model.test_net.non_trainable_weights))
-      self.assertEqual(len(model.trainable_weights),
-                       6 + len(model.test_net.trainable_weights))
+    self.assertEqual(len(model.weights), 8 + len(model.test_net.weights))
+    self.assertEqual(len(model.non_trainable_weights),
+                     2 + len(model.test_net.non_trainable_weights))
+    self.assertEqual(len(model.trainable_weights),
+                     6 + len(model.test_net.trainable_weights))
 
   @test_util.run_in_graph_and_eager_modes()
   def test_graph_nested_in_subclass(self):
@@ -512,23 +504,22 @@ class ModelSubclassingTest(test.TestCase):
     num_samples = 100
     input_dim = 50
 
-    with self.test_session():
-      model = NestedTestModel2(num_classes=num_classes)
-      model.compile(loss='mse',
-                    optimizer=RMSPropOptimizer(learning_rate=0.001),
-                    metrics=['acc'])
+    model = NestedTestModel2(num_classes=num_classes)
+    model.compile(loss='mse',
+                  optimizer=RMSPropOptimizer(learning_rate=0.001),
+                  metrics=['acc'])
 
-      x = np.ones((num_samples, input_dim))
-      y = np.zeros((num_samples, num_classes))
+    x = np.ones((num_samples, input_dim))
+    y = np.zeros((num_samples, num_classes))
 
-      model.fit(x, y, epochs=2, batch_size=32, verbose=0)
-      _ = model.evaluate(x, y, verbose=0)
+    model.fit(x, y, epochs=2, batch_size=32, verbose=0)
+    _ = model.evaluate(x, y, verbose=0)
 
-      self.assertEqual(len(model.weights), 8 + len(model.test_net.weights))
-      self.assertEqual(len(model.non_trainable_weights),
-                       2 + len(model.test_net.non_trainable_weights))
-      self.assertEqual(len(model.trainable_weights),
-                       6 + len(model.test_net.trainable_weights))
+    self.assertEqual(len(model.weights), 8 + len(model.test_net.weights))
+    self.assertEqual(len(model.non_trainable_weights),
+                     2 + len(model.test_net.non_trainable_weights))
+    self.assertEqual(len(model.trainable_weights),
+                     6 + len(model.test_net.trainable_weights))
 
   @test_util.run_in_graph_and_eager_modes()
   def test_subclass_nested_in_graph(self):
@@ -536,22 +527,51 @@ class ModelSubclassingTest(test.TestCase):
     num_samples = 100
     input_dim = 50
 
-    with self.test_session():
-      model = get_nested_model_3(input_dim=input_dim, num_classes=num_classes)
-      model.compile(loss='mse',
-                    optimizer=RMSPropOptimizer(learning_rate=0.001),
-                    metrics=['acc'])
+    model = get_nested_model_3(input_dim=input_dim, num_classes=num_classes)
+    model.compile(loss='mse',
+                  optimizer=RMSPropOptimizer(learning_rate=0.001),
+                  metrics=['acc'])
 
-      x = np.ones((num_samples, input_dim))
-      y = np.zeros((num_samples, num_classes))
+    x = np.ones((num_samples, input_dim))
+    y = np.zeros((num_samples, num_classes))
+
+    model.fit(x, y, epochs=2, batch_size=32, verbose=0)
+    _ = model.evaluate(x, y, verbose=0)
+
+    self.assertEqual(len(model.weights), 16)
+    self.assertEqual(
+        len(model.non_trainable_weights), 4)
+    self.assertEqual(len(model.trainable_weights), 12)
+
+  @test_util.run_in_graph_and_eager_modes()
+  def test_support_for_manual_training_arg(self):
+    # In most cases, the `training` argument is left unspecified, in which
+    # case it defaults to the value corresponding to the Model method being
+    # used (fit -> True, predict -> False, etc.).
+    # If the user writes their model `call` method to take
+    # an explicit `training` argument, we must check that the correct value
+    # is being passed to the model for each method call.
+
+    class DPNet(keras.Model):
+
+      def __init__(self):
+        super(DPNet, self).__init__()
+        self.dp = keras.layers.Dropout(0.5)
+        self.dense = keras.layers.Dense(1,
+                                        use_bias=False,
+                                        kernel_initializer='ones')
 
-      model.fit(x, y, epochs=2, batch_size=32, verbose=0)
-      _ = model.evaluate(x, y, verbose=0)
+      def call(self, inputs, training=False):
+        x = self.dp(inputs, training=training)
+        return self.dense(x)
 
-      self.assertEqual(len(model.weights), 16)
-      self.assertEqual(
-          len(model.non_trainable_weights), 4)
-      self.assertEqual(len(model.trainable_weights), 12)
+    model = DPNet()
+    x = np.ones((10, 10))
+    y = model.predict(x)
+    self.assertEqual(np.sum(y), np.sum(x))
+    model.compile(loss='mse', optimizer=RMSPropOptimizer(learning_rate=0.001))
+    loss = model.train_on_batch(x, y)
+    self.assertGreater(loss, 0.1)
 
 
 if __name__ == '__main__':
diff --git a/tensorflow/python/keras/_impl/keras/models.py b/tensorflow/python/keras/_impl/keras/models.py
index 4c3ec7dbe458bfb78d38950b1bad7a474bb55ad3..9602e7ba39b290f33c7ca9d0d1b5b35838667531 100644
--- a/tensorflow/python/keras/_impl/keras/models.py
+++ b/tensorflow/python/keras/_impl/keras/models.py
@@ -13,1305 +13,30 @@
 # limitations under the License.
 # ==============================================================================
 # pylint: disable=protected-access
-"""Home of the Sequential model, and the `save_model`/`load_model` functions.
+"""Code for model cloning, plus model-related API entries.
 """
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import copy
-import json
-import os
-
-import numpy as np
-
-from tensorflow.python.framework import ops
 from tensorflow.python.keras._impl.keras import backend as K
-from tensorflow.python.keras._impl.keras import layers as layer_module
-from tensorflow.python.keras._impl.keras import optimizers
-from tensorflow.python.keras._impl.keras.engine import topology
-from tensorflow.python.keras._impl.keras.engine.topology import Input
-from tensorflow.python.keras._impl.keras.engine.topology import InputLayer
-from tensorflow.python.keras._impl.keras.engine.topology import Layer
-from tensorflow.python.keras._impl.keras.engine.topology import TFBaseLayer
-from tensorflow.python.keras._impl.keras.engine.training import Model
+from tensorflow.python.keras._impl.keras.engine import saving
+from tensorflow.python.keras._impl.keras.engine import sequential
+from tensorflow.python.keras._impl.keras.engine import training
+from tensorflow.python.keras._impl.keras.engine.input_layer import Input
+from tensorflow.python.keras._impl.keras.engine.input_layer import InputLayer
+from tensorflow.python.keras._impl.keras.utils import generic_utils
 from tensorflow.python.keras._impl.keras.utils.generic_utils import has_arg
-from tensorflow.python.keras._impl.keras.utils.io_utils import ask_to_proceed_with_overwrite
-from tensorflow.python.platform import tf_logging as logging
-from tensorflow.python.util.tf_export import tf_export
-
-
-# pylint: disable=g-import-not-at-top
-try:
-  import h5py
-except ImportError:
-  h5py = None
-
-try:
-  import yaml
-except ImportError:
-  yaml = None
-# pylint: enable=g-import-not-at-top
-
-
-@tf_export('keras.models.save_model')
-def save_model(model, filepath, overwrite=True, include_optimizer=True):
-  """Save a model to a HDF5 file.
-
-  The saved model contains:
-      - the model's configuration (topology)
-      - the model's weights
-      - the model's optimizer's state (if any)
-
-  Thus the saved model can be reinstantiated in
-  the exact same state, without any of the code
-  used for model definition or training.
-
-  Arguments:
-      model: Keras model instance to be saved.
-      filepath: String, path where to save the model.
-      overwrite: Whether we should overwrite any existing
-          model at the target location, or instead
-          ask the user with a manual prompt.
-      include_optimizer: If True, save optimizer's state together.
-
-  Raises:
-      ImportError: if h5py is not available.
-  """
-
-  if h5py is None:
-    raise ImportError('`save_model` requires h5py.')
-
-  def get_json_type(obj):
-    """Serialize any object to a JSON-serializable structure.
-
-    Arguments:
-        obj: the object to serialize
-
-    Returns:
-        JSON-serializable structure representing `obj`.
-
-    Raises:
-        TypeError: if `obj` cannot be serialized.
-    """
-    # if obj is a serializable Keras class instance
-    # e.g. optimizer, layer
-    if hasattr(obj, 'get_config'):
-      return {'class_name': obj.__class__.__name__, 'config': obj.get_config()}
-
-    # if obj is any numpy type
-    if type(obj).__module__ == np.__name__:
-      if isinstance(obj, np.ndarray):
-        return {'type': type(obj), 'value': obj.tolist()}
-      else:
-        return obj.item()
-
-    # misc functions (e.g. loss function)
-    if callable(obj):
-      return obj.__name__
-
-    # if obj is a python 'type'
-    if type(obj).__name__ == type.__name__:
-      return obj.__name__
-
-    raise TypeError('Not JSON Serializable:', obj)
-
-  from tensorflow.python.keras._impl.keras import __version__ as keras_version  # pylint: disable=g-import-not-at-top
-
-  # If file exists and should not be overwritten.
-  if not overwrite and os.path.isfile(filepath):
-    proceed = ask_to_proceed_with_overwrite(filepath)
-    if not proceed:
-      return
-
-  with h5py.File(filepath, mode='w') as f:
-    f.attrs['keras_version'] = str(keras_version).encode('utf8')
-    f.attrs['backend'] = K.backend().encode('utf8')
-    f.attrs['model_config'] = json.dumps(
-        {
-            'class_name': model.__class__.__name__,
-            'config': model.get_config()
-        },
-        default=get_json_type).encode('utf8')
-
-    model_weights_group = f.create_group('model_weights')
-    model_layers = model.layers
-    topology.save_weights_to_hdf5_group(model_weights_group, model_layers)
-
-    if include_optimizer and hasattr(model, 'optimizer'):
-      if isinstance(model.optimizer, optimizers.TFOptimizer):
-        logging.warning(
-            'TensorFlow optimizers do not '
-            'make it possible to access '
-            'optimizer attributes or optimizer state '
-            'after instantiation. '
-            'As a result, we cannot save the optimizer '
-            'as part of the model save file.'
-            'You will have to compile your model again after loading it. '
-            'Prefer using a Keras optimizer instead '
-            '(see keras.io/optimizers).')
-      else:
-        f.attrs['training_config'] = json.dumps(
-            {
-                'optimizer_config': {
-                    'class_name': model.optimizer.__class__.__name__,
-                    'config': model.optimizer.get_config()
-                },
-                'loss': model.loss,
-                'metrics': model.metrics,
-                'sample_weight_mode': model.sample_weight_mode,
-                'loss_weights': model.loss_weights,
-            },
-            default=get_json_type).encode('utf8')
-
-        # Save optimizer weights.
-        symbolic_weights = getattr(model.optimizer, 'weights')
-        if symbolic_weights:
-          optimizer_weights_group = f.create_group('optimizer_weights')
-          weight_values = K.batch_get_value(symbolic_weights)
-          weight_names = []
-          for w, val in zip(symbolic_weights, weight_values):
-            name = str(w.name)
-            weight_names.append(name.encode('utf8'))
-          optimizer_weights_group.attrs['weight_names'] = weight_names
-          for name, val in zip(weight_names, weight_values):
-            param_dset = optimizer_weights_group.create_dataset(
-                name, val.shape, dtype=val.dtype)
-            if not val.shape:
-              # scalar
-              param_dset[()] = val
-            else:
-              param_dset[:] = val
-    f.flush()
-
-
-@tf_export('keras.models.load_model')
-def load_model(filepath, custom_objects=None, compile=True):  # pylint: disable=redefined-builtin
-  """Loads a model saved via `save_model`.
-
-  Arguments:
-      filepath: String, path to the saved model.
-      custom_objects: Optional dictionary mapping names
-          (strings) to custom classes or functions to be
-          considered during deserialization.
-      compile: Boolean, whether to compile the model
-          after loading.
-
-  Returns:
-      A Keras model instance. If an optimizer was found
-      as part of the saved model, the model is already
-      compiled. Otherwise, the model is uncompiled and
-      a warning will be displayed. When `compile` is set
-      to False, the compilation is omitted without any
-      warning.
-
-  Raises:
-      ImportError: if h5py is not available.
-      ValueError: In case of an invalid savefile.
-  """
-  if h5py is None:
-    raise ImportError('`load_model` requires h5py.')
-
-  if not custom_objects:
-    custom_objects = {}
-
-  def convert_custom_objects(obj):
-    """Handles custom object lookup.
-
-    Arguments:
-        obj: object, dict, or list.
-
-    Returns:
-        The same structure, where occurrences
-            of a custom object name have been replaced
-            with the custom object.
-    """
-    if isinstance(obj, list):
-      deserialized = []
-      for value in obj:
-        deserialized.append(convert_custom_objects(value))
-      return deserialized
-    if isinstance(obj, dict):
-      deserialized = {}
-      for key, value in obj.items():
-        deserialized[key] = convert_custom_objects(value)
-      return deserialized
-    if obj in custom_objects:
-      return custom_objects[obj]
-    return obj
-
-  with h5py.File(filepath, mode='r') as f:
-    # instantiate model
-    model_config = f.attrs.get('model_config')
-    if model_config is None:
-      raise ValueError('No model found in config file.')
-    model_config = json.loads(model_config.decode('utf-8'))
-    model = model_from_config(model_config, custom_objects=custom_objects)
-
-    # set weights
-    topology.load_weights_from_hdf5_group(f['model_weights'], model.layers)
-
-    # Early return if compilation is not required.
-    if not compile:
-      return model
-
-    # instantiate optimizer
-    training_config = f.attrs.get('training_config')
-    if training_config is None:
-      logging.warning('No training configuration found in save file: '
-                      'the model was *not* compiled. Compile it manually.')
-      return model
-    training_config = json.loads(training_config.decode('utf-8'))
-    optimizer_config = training_config['optimizer_config']
-    optimizer = optimizers.deserialize(
-        optimizer_config, custom_objects=custom_objects)
-
-    # Recover loss functions and metrics.
-    loss = convert_custom_objects(training_config['loss'])
-    metrics = convert_custom_objects(training_config['metrics'])
-    sample_weight_mode = training_config['sample_weight_mode']
-    loss_weights = training_config['loss_weights']
-
-    # Compile model.
-    model.compile(
-        optimizer=optimizer,
-        loss=loss,
-        metrics=metrics,
-        loss_weights=loss_weights,
-        sample_weight_mode=sample_weight_mode)
-
-    # Set optimizer weights.
-    if 'optimizer_weights' in f:
-      # Build train function (to get weight updates).
-      if isinstance(model, Sequential):
-        model.model._make_train_function()
-      else:
-        model._make_train_function()
-      optimizer_weights_group = f['optimizer_weights']
-      optimizer_weight_names = [
-          n.decode('utf8')
-          for n in optimizer_weights_group.attrs['weight_names']
-      ]
-      optimizer_weight_values = [
-          optimizer_weights_group[n] for n in optimizer_weight_names
-      ]
-      try:
-        model.optimizer.set_weights(optimizer_weight_values)
-      except ValueError:
-        logging.warning('Error in loading the saved optimizer '
-                        'state. As a result, your model is '
-                        'starting with a freshly initialized '
-                        'optimizer.')
-  return model
-
-
-@tf_export('keras.models.model_from_config')
-def model_from_config(config, custom_objects=None):
-  """Instantiates a Keras model from its config.
-
-  Arguments:
-      config: Configuration dictionary.
-      custom_objects: Optional dictionary mapping names
-          (strings) to custom classes or functions to be
-          considered during deserialization.
-
-  Returns:
-      A Keras model instance (uncompiled).
-
-  Raises:
-      TypeError: if `config` is not a dictionary.
-  """
-  if isinstance(config, list):
-    raise TypeError('`model_from_config` expects a dictionary, not a list. '
-                    'Maybe you meant to use '
-                    '`Sequential.from_config(config)`?')
-  return layer_module.deserialize(config, custom_objects=custom_objects)
-
-
-@tf_export('keras.models.model_from_yaml')
-def model_from_yaml(yaml_string, custom_objects=None):
-  """Parses a yaml model configuration file and returns a model instance.
-
-  Arguments:
-      yaml_string: YAML string encoding a model configuration.
-      custom_objects: Optional dictionary mapping names
-          (strings) to custom classes or functions to be
-          considered during deserialization.
-
-  Returns:
-      A Keras model instance (uncompiled).
-
-  Raises:
-      ImportError: if yaml module is not found.
-  """
-  if yaml is None:
-    raise ImportError('Requires yaml module installed.')
-  config = yaml.load(yaml_string)
-  return layer_module.deserialize(config, custom_objects=custom_objects)
-
-
-@tf_export('keras.models.model_from_json')
-def model_from_json(json_string, custom_objects=None):
-  """Parses a JSON model configuration file and returns a model instance.
-
-  Arguments:
-      json_string: JSON string encoding a model configuration.
-      custom_objects: Optional dictionary mapping names
-          (strings) to custom classes or functions to be
-          considered during deserialization.
-
-  Returns:
-      A Keras model instance (uncompiled).
-  """
-  config = json.loads(json_string)
-  return layer_module.deserialize(config, custom_objects=custom_objects)
-
-
-@tf_export('keras.models.Sequential', 'keras.Sequential')
-class Sequential(Model):
-  """Linear stack of layers.
-
-  Arguments:
-      layers: list of layers to add to the model.
-
-  # Note
-      The first layer passed to a Sequential model
-      should have a defined input shape. What that
-      means is that it should have received an `input_shape`
-      or `batch_input_shape` argument,
-      or for some type of layers (recurrent, Dense...)
-      an `input_dim` argument.
-
-  Example:
-
-      ```python
-          model = Sequential()
-          # first layer must have a defined input shape
-          model.add(Dense(32, input_dim=500))
-          # afterwards, Keras does automatic shape inference
-          model.add(Dense(32))
-
-          # also possible (equivalent to the above):
-          model = Sequential()
-          model.add(Dense(32, input_shape=(500,)))
-          model.add(Dense(32))
-
-          # also possible (equivalent to the above):
-          model = Sequential()
-          # here the batch dimension is None,
-          # which means any batch size will be accepted by the model.
-          model.add(Dense(32, batch_input_shape=(None, 500)))
-          model.add(Dense(32))
-      ```
-  """
-
-  def __init__(self, layers=None, name=None):
-    self._is_graph_network = True
-    self._is_compiled = False
-    self._layers = []  # Stack of layers.
-    self.model = None  # Internal Model instance.
-    self.inputs = []  # List of input tensors
-    self.outputs = []  # List of length 1: the output tensor (unique).
-    self._trainable = True
-    self._initial_weights = None
-    self._input_layers = []
-
-    # Model attributes.
-    self._inbound_nodes = []
-    self._outbound_nodes = []
-    self.built = False
-
-    # Set model name.
-    if not name:
-      prefix = 'sequential_'
-      name = prefix + str(K.get_uid(prefix))
-    self._name = name
-
-    # Used by Layer base class.
-    self._dtype = None
-    self._activity_regularizer = None
-
-    # The following properties are not actually used by Keras;
-    # they exist for compatibility with TF's variable scoping mechanism.
-    self._updates = []
-    self._losses = []
-    self._scope = None
-    self._reuse = None
-    self._base_name = name
-    self._graph = ops.get_default_graph()
-
-    # Add to the model any layers passed to the constructor.
-    if layers:
-      for layer in layers:
-        self.add(layer)
-
-  def add(self, layer):
-    """Adds a layer instance on top of the layer stack.
-
-    Arguments:
-        layer: layer instance.
-
-    Raises:
-        TypeError: If `layer` is not a layer instance.
-        ValueError: In case the `layer` argument does not
-            know its input shape.
-        ValueError: In case the `layer` argument has
-            multiple output tensors, or is already connected
-            somewhere else (forbidden in `Sequential` models).
-    """
-    if not isinstance(layer, (Layer, TFBaseLayer)):
-      raise TypeError('The added layer must be '
-                      'an instance of class Layer. '
-                      'Found: ' + str(layer))
-    if not self.outputs:
-      # First layer in model: check that it is an input layer.
-      if not isinstance(layer, InputLayer):
-        # Create an input layer.
-        # First, we need to infer its expected input shape and dtype.
-        if isinstance(layer, (Model, Sequential)):
-          # We were passed a model as first layer.
-          # This requires a specific way to figure out the
-          # input shape and dtype.
-          if not layer.layers:
-            raise ValueError('Cannot add an empty model '
-                             'to a `Sequential` model.')
-          # In case of nested models: recover the first layer
-          # of the deepest model to infer input shape and dtype.
-          first_layer = layer.layers[0]
-          while isinstance(first_layer, (Model, Sequential)):
-            first_layer = first_layer.layers[0]
-          batch_shape = first_layer._batch_input_shape
-          dtype = first_layer.dtype
-        else:
-          # We were passed a regular layer, and it should
-          # know about its input shape. Otherwise, that's an error.
-          if not hasattr(layer, '_batch_input_shape'):
-            raise ValueError('The first layer in a '
-                             'Sequential model must '
-                             'get an `input_shape` argument.')
-          batch_shape = layer._batch_input_shape
-          dtype = layer.dtype
-        # Instantiate the input layer.
-        x = Input(
-            batch_shape=batch_shape, dtype=dtype, name=layer.name + '_input')
-        # This will build the current layer
-        # and create the node connecting the current layer
-        # to the input layer we just created.
-        layer(x)
-
-      if len(layer._inbound_nodes[-1].output_tensors) != 1:
-        raise ValueError('All layers in a Sequential model '
-                         'should have a single output tensor. '
-                         'For multi-output layers, '
-                         'use the functional API.')
-
-      self.outputs = [layer._inbound_nodes[-1].output_tensors[0]]
-      self.inputs = topology.get_source_inputs(self.outputs[0])
-
-      # We create an input node, which we will keep updated
-      # as we add more layers
-      topology.Node(
-          outbound_layer=self,
-          inbound_layers=[],
-          node_indices=[],
-          tensor_indices=[],
-          input_tensors=self.inputs,
-          output_tensors=self.outputs)
-    else:
-      output_tensor = layer(self.outputs[0])
-      if isinstance(output_tensor, list):
-        raise TypeError('All layers in a Sequential model '
-                        'should have a single output tensor. '
-                        'For multi-output layers, '
-                        'use the functional API.')
-      self.outputs = [output_tensor]
-      # update self._inbound_nodes
-      self._inbound_nodes[0].output_tensors = self.outputs
-      self._inbound_nodes[0].output_shapes = [K.int_shape(self.outputs[0])]
-
-    self._layers.append(layer)
-    self.built = False
-
-  def pop(self):
-    """Removes the last layer in the model.
-
-    Raises:
-        TypeError: if there are no layers in the model.
-    """
-    if not self.layers:
-      raise TypeError('There are no layers in the model.')
-
-    self.layers.pop()
-    if not self.layers:
-      self.outputs = []
-      self._inbound_nodes = []
-      self._outbound_nodes = []
-    else:
-      self.layers[-1]._outbound_nodes = []
-      self.outputs = [self.layers[-1].output]
-      # update self._inbound_nodes
-      self._inbound_nodes[0].output_tensors = self.outputs
-      self._inbound_nodes[0].output_shapes = [K.int_shape(self.outputs[0])]
-    self.built = False
-
-  def get_layer(self, name=None, index=None):
-    """Retrieve a layer that is part of the model.
-
-    Returns a layer based on either its name (unique)
-    or its index in the graph. Indices are based on
-    order of horizontal graph traversal (bottom-up).
-
-    Arguments:
-        name: string, name of layer.
-        index: integer, index of layer.
-
-    Returns:
-        A layer instance.
-    """
-    if not self.built:
-      self.build()
-    return self.model.get_layer(name, index)
-
-  def call(self, inputs, mask=None):
-    if not self.built:
-      self.build()
-    return self.model.call(inputs, mask)
-
-  def build(self, input_shape=None):
-    if not self.inputs or not self.outputs:
-      raise TypeError('Sequential model cannot be built: model is empty.'
-                      ' Add some layers first.')
-    # actually create the model
-    self.model = Model(self.inputs, self.outputs[0], name=self.name + '_model')
-    self.model.trainable = self.trainable
-
-    # mirror model attributes
-    self.supports_masking = self.model.supports_masking
-    self._output_mask_cache = self.model._output_mask_cache
-    self._output_tensor_cache = self.model._output_tensor_cache
-    self._output_shape_cache = self.model._output_shape_cache
-    self._input_layers = self.model._input_layers
-    self._output_layers = self.model._output_layers
-    self._input_coordinates = self.model._input_coordinates
-    self._output_coordinates = self.model._output_coordinates
-    self._nodes_by_depth = self.model._nodes_by_depth
-    self._network_nodes = self.model._network_nodes
-    self.output_names = self.model.output_names
-    self.input_names = self.model.input_names
-    self._feed_input_names = self.model._feed_input_names
-    self._feed_inputs = self.model._feed_inputs
-
-    # Make sure child model callbacks
-    # will call the parent Sequential model.
-    self.model.callback_model = self
-
-    self.built = True
-
-  @property
-  def uses_learning_phase(self):
-    if not self.built:
-      self.build()
-    return self.model.uses_learning_phase
-
-  def _gather_list_attr(self, attr):
-    all_attrs = []
-    for layer in self.layers:
-      all_attrs += getattr(layer, attr, [])
-    return all_attrs
-
-  @property
-  def trainable(self):
-    return self._trainable
-
-  @trainable.setter
-  def trainable(self, value):
-    if self.model:
-      self.model.trainable = value
-    self._trainable = value
-
-  @property
-  def trainable_weights(self):
-    if not self.trainable:
-      return []
-    return self._gather_list_attr('trainable_weights')
-
-  @property
-  def non_trainable_weights(self):
-    weights = self._gather_list_attr('non_trainable_weights')
-    if not self.trainable:
-      trainable_weights = self._gather_list_attr('trainable_weights')
-      return trainable_weights + weights
-    return weights
-
-  @property
-  def regularizers(self):
-    if not self.built:
-      self.build()
-    return self.model.regularizers
-
-  def get_weights(self):
-    """Retrieves the weights of the model.
-
-    Returns:
-        A flat list of Numpy arrays
-        (one array per model weight).
-    """
-    if not self.built:
-      self.build()
-    return self.model.get_weights()
-
-  def set_weights(self, weights):
-    """Sets the weights of the model.
-
-    Arguments:
-        weights: Should be a list
-            of Numpy arrays with shapes and types matching
-            the output of `model.get_weights()`.
-    """
-    if not self.built:
-      self.build()
-    self.model.set_weights(weights)
-
-  def load_weights(self, filepath, by_name=False):
-    if h5py is None:
-      raise ImportError('`load_weights` requires h5py.')
-    f = h5py.File(filepath, mode='r')
-    if 'layer_names' not in f.attrs and 'model_weights' in f:
-      f = f['model_weights']
-    layers = self.layers
-    if by_name:
-      topology.load_weights_from_hdf5_group_by_name(f, layers)
-    else:
-      topology.load_weights_from_hdf5_group(f, layers)
-    if hasattr(f, 'close'):
-      f.close()
-
-  def save_weights(self, filepath, overwrite=True):
-    if h5py is None:
-      raise ImportError('`save_weights` requires h5py.')
-    # If file exists and should not be overwritten:
-    if not overwrite and os.path.isfile(filepath):
-      proceed = ask_to_proceed_with_overwrite(filepath)
-      if not proceed:
-        return
-    layers = self.layers
-    f = h5py.File(filepath, 'w')
-    topology.save_weights_to_hdf5_group(f, layers)
-    f.flush()
-    f.close()
-
-  def compile(self,
-              optimizer,
-              loss,
-              metrics=None,
-              sample_weight_mode=None,
-              weighted_metrics=None,
-              target_tensors=None,
-              **kwargs):
-    """Configures the model for training.
-
-    Arguments:
-        optimizer: String (name of optimizer) or optimizer object.
-            See [optimizers](/optimizers).
-        loss: String (name of objective function) or objective function.
-            See [losses](/losses).
-            If the model has multiple outputs, you can use a different loss
-            on each output by passing a dictionary or a list of losses.
-            The loss value that will be minimized by the model
-            will then be the sum of all individual losses.
-        metrics: List of metrics to be evaluated by the model
-            during training and testing.
-            Typically you will use `metrics=['accuracy']`.
-            To specify different metrics for different outputs of a
-            multi-output model, you could also pass a dictionary,
-            such as `metrics={'output_a': 'accuracy'}`.
-        sample_weight_mode: If you need to do timestep-wise
-            sample weighting (2D weights), set this to `"temporal"`.
-            `None` defaults to sample-wise weights (1D).
-            If the model has multiple outputs, you can use a different
-            `sample_weight_mode` on each output by passing a
-            dictionary or a list of modes.
-        weighted_metrics: list of metrics to be evaluated and weighted
-             by `sample_weight` or `class_weight` during training and testing.
-        target_tensors: By default, Keras will create a placeholder for the
-            model's target, which will be fed with the target data during
-            training. If instead you would like to use your own
-            target tensor (in turn, Keras will not expect external
-            Numpy data for these targets at training time), you
-            can specify them via the `target_tensors` argument.
-            It should be a single tensor
-            (for a single-output `Sequential` model).
-        **kwargs: These arguments are passed into `tf.Session.run`.
-
-    Example:
-        ```python
-            model = Sequential()
-            model.add(Dense(32, input_shape=(500,)))
-            model.add(Dense(10, activation='softmax'))
-            model.compile(optimizer='rmsprop',
-                          loss='categorical_crossentropy',
-                          metrics=['accuracy'])
-        ```
-    """
-    # create the underlying model
-    self.build()
-    # call compile method of Model class
-    self.model.compile(
-        optimizer,
-        loss,
-        metrics=metrics,
-        sample_weight_mode=sample_weight_mode,
-        weighted_metrics=weighted_metrics,
-        target_tensors=target_tensors,
-        **kwargs)
-    self.optimizer = self.model.optimizer
-    self.loss = self.model.loss
-    self.metrics = self.model.metrics
-    self.loss_weights = self.model.loss_weights
-    self.sample_weight_mode = self.model.sample_weight_mode
-    self.weighted_metrics = self.model.weighted_metrics
-    self.targets = self.model.targets
-    self.metrics_tensors = self.model.metrics_tensors
-    self.metrics_names = self.model.metrics_names
-    self.sample_weights = self.model.sample_weights
-    self.total_loss = self.model.total_loss
-
-  def fit(self,
-          x=None,
-          y=None,
-          batch_size=None,
-          epochs=1,
-          verbose=1,
-          callbacks=None,
-          validation_split=0.,
-          validation_data=None,
-          shuffle=True,
-          class_weight=None,
-          sample_weight=None,
-          initial_epoch=0,
-          steps_per_epoch=None,
-          validation_steps=None,
-          **kwargs):
-    """Trains the model for a fixed number of epochs.
-
-    Arguments:
-        x: Numpy array of training data.
-            If the input layer in the model is named, you can also pass a
-            dictionary mapping the input name to a Numpy array.
-            `x` can be `None` (default) if feeding from
-            TensorFlow data tensors.
-        y: Numpy array of target (label) data.
-            If the output layer in the model is named, you can also pass a
-            dictionary mapping the output name to a Numpy array.
-            `y` can be `None` (default) if feeding from
-            TensorFlow data tensors.
-        batch_size: Integer or `None`.
-            Number of samples per gradient update.
-            If unspecified, it will default to 32.
-        epochs: Integer. Number of epochs to train the model.
-            An epoch is an iteration over the entire `x` and `y`
-            data provided.
-            Note that in conjunction with `initial_epoch`,
-            `epochs` is to be understood as "final epoch".
-            The model is not trained for a number of iterations
-            given by `epochs`, but merely until the epoch
-            of index `epochs` is reached.
-        verbose: 0, 1, or 2. Verbosity mode.
-            0 = silent, 1 = progress bar, 2 = one line per epoch.
-        callbacks: List of `keras.callbacks.Callback` instances.
-            List of callbacks to apply during training.
-            See [callbacks](/callbacks).
-        validation_split: Float between 0 and 1:
-            Fraction of the training data to be used as validation data.
-            The model will set apart this fraction of the training data,
-            will not train on it, and will evaluate
-            the loss and any model metrics
-            on this data at the end of each epoch.
-            The validation data is selected from the last samples
-            in the `x` and `y` data provided, before shuffling.
-        validation_data: tuple `(x_val, y_val)` or tuple
-            `(x_val, y_val, val_sample_weights)` on which to evaluate
-            the loss and any model metrics at the end of each epoch.
-            The model will not be trained on this data.
-            This will override `validation_split`.
-        shuffle: Boolean (whether to shuffle the training data
-            before each epoch) or str (for 'batch').
-            'batch' is a special option for dealing with the
-            limitations of HDF5 data; it shuffles in batch-sized chunks.
-            Has no effect when `steps_per_epoch` is not `None`.
-        class_weight: Optional dictionary mapping class indices (integers)
-            to a weight (float) value, used for weighting the loss function
-            (during training only).
-            This can be useful to tell the model to
-            "pay more attention" to samples from
-            an under-represented class.
-        sample_weight: Optional Numpy array of weights for
-            the training samples, used for weighting the loss function
-            (during training only). You can either pass a flat (1D)
-            Numpy array with the same length as the input samples
-            (1:1 mapping between weights and samples),
-            or in the case of temporal data,
-            you can pass a 2D array with shape
-            `(samples, sequence_length)`,
-            to apply a different weight to every timestep of every sample.
-            In this case you should make sure to specify
-            `sample_weight_mode="temporal"` in `compile()`.
-        initial_epoch: Epoch at which to start training
-            (useful for resuming a previous training run).
-        steps_per_epoch: Total number of steps (batches of samples)
-            before declaring one epoch finished and starting the
-            next epoch. When training with input tensors such as
-            TensorFlow data tensors, the default `None` is equal to
-            the number of unique samples in your dataset divided by
-            the batch size, or 1 if that cannot be determined.
-        validation_steps: Only relevant if `steps_per_epoch`
-            is specified. Total number of steps (batches of samples)
-            to validate before stopping.
-        **kwargs: Used for backwards compatibility support.
-
-    Returns:
-        A `History` object. Its `History.history` attribute is
-        a record of training loss values and metrics values
-        at successive epochs, as well as validation loss values
-        and validation metrics values (if applicable).
-
-    Raises:
-        RuntimeError: If the model was never compiled.
-        ValueError: In case of mismatch between the provided input data
-            and what the model expects.
-    """
-    if not self.built:
-      raise RuntimeError('The model needs to be compiled before being used.')
-    return self.model.fit(
-        x,
-        y,
-        batch_size=batch_size,
-        epochs=epochs,
-        verbose=verbose,
-        callbacks=callbacks,
-        validation_split=validation_split,
-        validation_data=validation_data,
-        shuffle=shuffle,
-        class_weight=class_weight,
-        sample_weight=sample_weight,
-        initial_epoch=initial_epoch,
-        steps_per_epoch=steps_per_epoch,
-        validation_steps=validation_steps)
-
-  def evaluate(self, x, y, batch_size=32, verbose=1, sample_weight=None):
-    """Computes the loss on some input data, batch by batch.
-
-    Arguments:
-        x: input data, as a Numpy array or list of Numpy arrays
-            (if the model has multiple inputs).
-        y: labels, as a Numpy array.
-        batch_size: integer. Number of samples per evaluation batch.
-        verbose: verbosity mode, 0 or 1.
-        sample_weight: sample weights, as a Numpy array.
-
-    Returns:
-        Scalar test loss (if the model has no metrics)
-        or list of scalars (if the model computes other metrics).
-        The attribute `model.metrics_names` will give you
-        the display labels for the scalar outputs.
-
-    Raises:
-        RuntimeError: if the model was never compiled.
-    """
-    if not self.built:
-      raise RuntimeError('The model needs to be compiled before being used.')
-    return self.model.evaluate(
-        x,
-        y,
-        batch_size=batch_size,
-        verbose=verbose,
-        sample_weight=sample_weight)
-
-  def predict(self, x, batch_size=32, verbose=0):
-    """Generates output predictions for the input samples.
-
-    The input samples are processed batch by batch.
-
-    Arguments:
-        x: the input data, as a Numpy array.
-        batch_size: integer.
-        verbose: verbosity mode, 0 or 1.
-
-    Returns:
-        A Numpy array of predictions.
-    """
-    if not self.built:
-      self.build()
-    return self.model.predict(x, batch_size=batch_size, verbose=verbose)
-
-  def predict_on_batch(self, x):
-    """Returns predictions for a single batch of samples.
-
-    Arguments:
-        x: input data, as a Numpy array or list of Numpy arrays
-            (if the model has multiple inputs).
-
-    Returns:
-        A Numpy array of predictions.
-    """
-    if not self.built:
-      self.build()
-    return self.model.predict_on_batch(x)
-
-  def train_on_batch(self, x, y, class_weight=None, sample_weight=None):
-    """Single gradient update over one batch of samples.
-
-    Arguments:
-        x: input data, as a Numpy array or list of Numpy arrays
-            (if the model has multiple inputs).
-        y: labels, as a Numpy array.
-        class_weight: dictionary mapping classes to a weight value,
-            used for scaling the loss function (during training only).
-        sample_weight: sample weights, as a Numpy array.
-
-    Returns:
-        Scalar training loss (if the model has no metrics)
-        or list of scalars (if the model computes other metrics).
-        The attribute `model.metrics_names` will give you
-        the display labels for the scalar outputs.
-
-    Raises:
-        RuntimeError: if the model was never compiled.
-    """
-    if not self.built:
-      raise RuntimeError('The model needs to be compiled before being used.')
-    return self.model.train_on_batch(
-        x, y, sample_weight=sample_weight, class_weight=class_weight)
-
-  def test_on_batch(self, x, y, sample_weight=None):
-    """Evaluates the model over a single batch of samples.
-
-    Arguments:
-        x: input data, as a Numpy array or list of Numpy arrays
-            (if the model has multiple inputs).
-        y: labels, as a Numpy array.
-        sample_weight: sample weights, as a Numpy array.
-
-    Returns:
-        Scalar test loss (if the model has no metrics)
-        or list of scalars (if the model computes other metrics).
-        The attribute `model.metrics_names` will give you
-        the display labels for the scalar outputs.
-
-    Raises:
-        RuntimeError: if the model was never compiled.
-    """
-    if not self.built:
-      raise RuntimeError('The model needs to be compiled before being used.')
-    return self.model.test_on_batch(x, y, sample_weight=sample_weight)
-
-  def predict_proba(self, x, batch_size=32, verbose=0):
-    """Generates class probability predictions for the input samples.
-
-    The input samples are processed batch by batch.
-
-    Arguments:
-        x: input data, as a Numpy array or list of Numpy arrays
-            (if the model has multiple inputs).
-        batch_size: integer.
-        verbose: verbosity mode, 0 or 1.
-
-    Returns:
-        A Numpy array of probability predictions.
-    """
-    preds = self.predict(x, batch_size, verbose)
-    if preds.min() < 0. or preds.max() > 1.:
-      logging.warning('Network returning invalid probability values. '
-                      'The last layer might not normalize predictions '
-                      'into probabilities '
-                      '(like softmax or sigmoid would).')
-    return preds
-
-  def predict_classes(self, x, batch_size=32, verbose=0):
-    """Generate class predictions for the input samples.
-
-    The input samples are processed batch by batch.
-
-    Arguments:
-        x: input data, as a Numpy array or list of Numpy arrays
-            (if the model has multiple inputs).
-        batch_size: integer.
-        verbose: verbosity mode, 0 or 1.
-
-    Returns:
-        A numpy array of class predictions.
-    """
-    proba = self.predict(x, batch_size=batch_size, verbose=verbose)
-    if proba.shape[-1] > 1:
-      return proba.argmax(axis=-1)
-    else:
-      return (proba > 0.5).astype('int32')
-
-  def fit_generator(self,
-                    generator,
-                    steps_per_epoch=None,
-                    epochs=1,
-                    verbose=1,
-                    callbacks=None,
-                    validation_data=None,
-                    validation_steps=None,
-                    class_weight=None,
-                    max_queue_size=10,
-                    workers=1,
-                    use_multiprocessing=False,
-                    shuffle=True,
-                    initial_epoch=0,
-                    **kwargs):
-    """Fits the model on data generated batch-by-batch by a Python generator.
-
-    The generator is run in parallel to the model, for efficiency.
-    For instance, this allows you to do real-time data augmentation
-    on images on CPU in parallel to training your model on GPU.
-
-    Arguments:
-        generator: A generator.
-            The output of the generator must be either
-            - a tuple (inputs, targets)
-            - a tuple (inputs, targets, sample_weights).
-            All arrays should contain the same number of samples.
-            The generator is expected to loop over its data
-            indefinitely. An epoch finishes when `steps_per_epoch`
-            batches have been seen by the model.
-        steps_per_epoch: Total number of steps (batches of samples)
-            to yield from `generator` before declaring one epoch
-            finished and starting the next epoch. It should typically
-            be equal to the number of samples of your dataset
-            divided by the batch size.
-            Optional for `Sequence`: if unspecified, will use
-            the `len(generator)` as a number of steps.
-        epochs: Integer, total number of iterations on the data.
-            Note that in conjunction with initial_epoch, the parameter
-            epochs is to be understood as "final epoch". The model is
-            not trained for n steps given by epochs, but until the
-            epoch epochs is reached.
-        verbose: Verbosity mode, 0, 1, or 2.
-        callbacks: List of callbacks to be called during training.
-        validation_data: This can be either
-            - A generator for the validation data
-            - A tuple (inputs, targets)
-            - A tuple (inputs, targets, sample_weights).
-        validation_steps: Only relevant if `validation_data`
-            is a generator.
-            Number of steps to yield from validation generator
-            at the end of every epoch. It should typically
-            be equal to the number of samples of your
-            validation dataset divided by the batch size.
-            Optional for `Sequence`: if unspecified, will use
-            the `len(validation_data)` as a number of steps.
-        class_weight: Dictionary mapping class indices to a weight
-            for the class.
-        max_queue_size: Maximum size for the generator queue
-        workers: Maximum number of processes to spin up
-        use_multiprocessing: If True, use process-based parallelism.
-            Note that because
-            this implementation relies on multiprocessing,
-            you should not pass
-            non-picklable arguments to the generator
-            as they can't be passed
-            easily to child processes.
-        shuffle: Whether to shuffle the order of the batches at
-            the beginning of each epoch. Only used with instances
-            of `Sequence` (keras.utils.Sequence).
-        initial_epoch: Epoch at which to start training
-            (useful for resuming a previous training run)
-        **kwargs: support for legacy arguments.
-
-    Returns:
-        A `History` object.
-
-    Raises:
-        RuntimeError: if the model was never compiled.
-        ValueError: In case the generator yields
-            data in an invalid format.
-
-    Example:
-
-    ```python
-        def generate_arrays_from_file(path):
-            while 1:
-                f = open(path)
-                for line in f:
-                    # create Numpy arrays of input data
-                    # and labels, from each line in the file
-                    x, y = process_line(line)
-                    yield (x, y)
-                f.close()
-
-        model.fit_generator(generate_arrays_from_file('/my_file.txt'),
-                            steps_per_epoch=1000, epochs=10)
-    ```
-    """
-    # Legacy support
-    if 'max_q_size' in kwargs:
-      max_queue_size = kwargs.pop('max_q_size')
-      logging.warning('The argument `max_q_size` has been renamed '
-                      '`max_queue_size`. Update your method calls accordingly.')
-    if 'pickle_safe' in kwargs:
-      use_multiprocessing = kwargs.pop('pickle_safe')
-      logging.warning('The argument `pickle_safe` has been renamed '
-                      '`use_multiprocessing`. '
-                      'Update your method calls accordingly.')
-    if kwargs:
-      raise ValueError('Unrecognized keyword arguments: ' + str(kwargs))
-
-    if not self.built:
-      raise RuntimeError('The model needs to be compiled before being used.')
-    return self.model.fit_generator(
-        generator,
-        steps_per_epoch,
-        epochs,
-        verbose=verbose,
-        callbacks=callbacks,
-        validation_data=validation_data,
-        validation_steps=validation_steps,
-        class_weight=class_weight,
-        max_queue_size=max_queue_size,
-        workers=workers,
-        use_multiprocessing=use_multiprocessing,
-        shuffle=shuffle,
-        initial_epoch=initial_epoch)
-
-  def evaluate_generator(self,
-                         generator,
-                         steps=None,
-                         max_queue_size=10,
-                         workers=1,
-                         use_multiprocessing=False,
-                         **kwargs):
-    """Evaluates the model on a data generator.
-
-    The generator should return the same kind of data
-    as accepted by `test_on_batch`.
-
-    Arguments:
-        generator: Generator yielding tuples (inputs, targets)
-            or (inputs, targets, sample_weights)
-        steps: Total number of steps (batches of samples)
-            to yield from `generator` before stopping.
-            Optional for `Sequence`: if unspecified, will use
-            the `len(generator)` as a number of steps.
-        max_queue_size: maximum size for the generator queue
-        workers: maximum number of processes to spin up
-        use_multiprocessing: if True, use process-based parallelism.
-            Note that because this implementation
-            relies on multiprocessing, you should not pass
-            non-picklable arguments to the generator
-            as they can't be passed easily to child processes.
-        **kwargs: support for legacy arguments.
-
-    Returns:
-        Scalar test loss (if the model has no metrics)
-        or list of scalars (if the model computes other metrics).
-        The attribute `model.metrics_names` will give you
-        the display labels for the scalar outputs.
-
-    Raises:
-        RuntimeError: if the model was never compiled.
-        ValueError: In case the generator yields
-            data in an invalid format.
-    """
-    # Legacy support
-    if 'max_q_size' in kwargs:
-      max_queue_size = kwargs.pop('max_q_size')
-      logging.warning('The argument `max_q_size` has been renamed '
-                      '`max_queue_size`. Update your method calls accordingly.')
-    if 'pickle_safe' in kwargs:
-      use_multiprocessing = kwargs.pop('pickle_safe')
-      logging.warning('The argument `pickle_safe` has been renamed '
-                      '`use_multiprocessing`. '
-                      'Update your method calls accordingly.')
-    if kwargs:
-      raise ValueError('Unrecognized keyword arguments: ' + str(kwargs))
-
-    if not self.built:
-      raise RuntimeError('The model needs to be compiled before being used.')
-    return self.model.evaluate_generator(
-        generator,
-        steps,
-        max_queue_size=max_queue_size,
-        workers=workers,
-        use_multiprocessing=use_multiprocessing)
-
-  def predict_generator(self,
-                        generator,
-                        steps=None,
-                        max_queue_size=10,
-                        workers=1,
-                        use_multiprocessing=False,
-                        verbose=0,
-                        **kwargs):
-    """Generates predictions for the input samples from a data generator.
-
-    The generator should return the same kind of data as accepted by
-    `predict_on_batch`.
-
-    Arguments:
-        generator: generator yielding batches of input samples.
-        steps: Total number of steps (batches of samples)
-            to yield from `generator` before stopping.
-            Optional for `Sequence`: if unspecified, will use
-            the `len(generator)` as a number of steps.
-        max_queue_size: maximum size for the generator queue
-        workers: maximum number of processes to spin up
-        use_multiprocessing: if True, use process-based parallelism.
-            Note that because this implementation
-            relies on multiprocessing, you should not pass
-            non-picklable arguments to the generator
-            as they can't be passed easily to child processes.
-        verbose: verbosity mode, 0 or 1.
-        **kwargs: support for legacy arguments.
-
-    Returns:
-        A Numpy array of predictions.
-
-    Raises:
-        ValueError: In case the generator yields
-            data in an invalid format.
-    """
-    # Legacy support
-    if 'max_q_size' in kwargs:
-      max_queue_size = kwargs.pop('max_q_size')
-      logging.warning('The argument `max_q_size` has been renamed '
-                      '`max_queue_size`. Update your method calls accordingly.')
-    if 'pickle_safe' in kwargs:
-      use_multiprocessing = kwargs.pop('pickle_safe')
-      logging.warning('The argument `pickle_safe` has been renamed '
-                      '`use_multiprocessing`. '
-                      'Update your method calls accordingly.')
-    if kwargs:
-      raise ValueError('Unrecognized keyword arguments: ' + str(kwargs))
-
-    if not self.built:
-      self.build()
-    return self.model.predict_generator(
-        generator,
-        steps,
-        max_queue_size=max_queue_size,
-        workers=workers,
-        use_multiprocessing=use_multiprocessing,
-        verbose=verbose)
 
-  def get_config(self):
-    config = []
-    for layer in self.layers:
-      config.append({
-          'class_name': layer.__class__.__name__,
-          'config': layer.get_config()
-      })
-    return copy.deepcopy(config)
 
-  @classmethod
-  def from_config(cls, config, custom_objects=None):
-    model = cls()
-    for conf in config:
-      layer = layer_module.deserialize(conf, custom_objects=custom_objects)
-      model.add(layer)
-    return model
+# API entries importable from `keras.models`:
+Model = training.Model  # pylint: disable=invalid-name
+Sequential = sequential.Sequential  # pylint: disable=invalid-name
+save_model = saving.save_model
+load_model = saving.load_model
+model_from_config = saving.model_from_config
+model_from_yaml = saving.model_from_yaml
+model_from_json = saving.model_from_json
 
 
 def _clone_functional_model(model, input_tensors=None):
@@ -1365,7 +90,7 @@ def _clone_functional_model(model, input_tensors=None):
   else:
     # Make sure that all input tensors come from a Keras layer.
     # If tensor comes from an input layer: cache the input layer.
-    input_tensors = topology._to_list(input_tensors)
+    input_tensors = generic_utils.to_list(input_tensors)
     input_tensors_ = []
     for i, x in enumerate(input_tensors):
       if not K.is_keras_tensor(x):
@@ -1402,7 +127,7 @@ def _clone_functional_model(model, input_tensors=None):
         # Reuse previously cloned layer.
         layer = layer_map[layer]
         # Don't call InputLayer multiple times.
-        if isinstance(layer, topology.InputLayer):
+        if isinstance(layer, InputLayer):
           continue
 
       # Gather inputs to call the new layer.
@@ -1427,8 +152,9 @@ def _clone_functional_model(model, input_tensors=None):
           if has_arg(layer.call, 'mask'):
             if 'mask' not in kwargs:
               kwargs['mask'] = computed_mask
-          output_tensors = topology._to_list(layer(computed_tensor, **kwargs))
-          output_masks = topology._to_list(
+          output_tensors = generic_utils.to_list(layer(computed_tensor,
+                                                       **kwargs))
+          output_masks = generic_utils.to_list(
               layer.compute_mask(computed_tensor, computed_mask))
           computed_tensors = [computed_tensor]
           computed_masks = [computed_mask]
@@ -1438,8 +164,9 @@ def _clone_functional_model(model, input_tensors=None):
           if has_arg(layer.call, 'mask'):
             if 'mask' not in kwargs:
               kwargs['mask'] = computed_masks
-          output_tensors = topology._to_list(layer(computed_tensors, **kwargs))
-          output_masks = topology._to_list(
+          output_tensors = generic_utils.to_list(layer(computed_tensors,
+                                                       **kwargs))
+          output_masks = generic_utils.to_list(
               layer.compute_mask(computed_tensors, computed_masks))
         # Update tensor_map.
         for x, y, mask in zip(reference_output_tensors, output_tensors,
@@ -1489,14 +216,14 @@ def _clone_sequential_model(model, input_tensors=None):
   if input_tensors is None:
     return Sequential(layers=layers, name=model.name)
   else:
-    if len(topology._to_list(input_tensors)) != 1:
+    if len(generic_utils.to_list(input_tensors)) != 1:
       raise ValueError('To clone a `Sequential` model, we expect '
                        ' at most one tensor '
                        'as part of `input_tensors`.')
-    x = topology._to_list(input_tensors)[0]
+    x = generic_utils.to_list(input_tensors)[0]
     if K.is_keras_tensor(x):
       origin_layer = x._keras_history[0]
-      if isinstance(origin_layer, topology.InputLayer):
+      if isinstance(origin_layer, InputLayer):
         return Sequential(layers=[origin_layer] + layers, name=model.name)
       else:
         raise ValueError('Cannot clone a `Sequential` model on top '
diff --git a/tensorflow/python/keras/_impl/keras/models_test.py b/tensorflow/python/keras/_impl/keras/models_test.py
index 04017e4b28b27e52f88a7746fc44510c29edffce..5978ddd987c63b9d87a31be6837172f08512ef73 100644
--- a/tensorflow/python/keras/_impl/keras/models_test.py
+++ b/tensorflow/python/keras/_impl/keras/models_test.py
@@ -12,362 +12,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tests for training routines."""
+"""Tests for `models.py` (model cloning, mainly)."""
 
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
-import os
-import shutil
-import tempfile
-
 import numpy as np
 
 from tensorflow.python.keras._impl import keras
 from tensorflow.python.platform import test
-from tensorflow.python.training import training as training_module
-
-try:
-  import h5py  # pylint:disable=g-import-not-at-top
-except ImportError:
-  h5py = None
-
-
-class TestModelSaving(test.TestCase):
-
-  def test_sequential_model_saving(self):
-    if h5py is None:
-      return  # Skip test if models cannot be saved.
-
-    with self.test_session():
-      model = keras.models.Sequential()
-      model.add(keras.layers.Dense(2, input_shape=(3,)))
-      model.add(keras.layers.RepeatVector(3))
-      model.add(keras.layers.TimeDistributed(keras.layers.Dense(3)))
-      model.compile(loss=keras.losses.MSE,
-                    optimizer=keras.optimizers.RMSprop(lr=0.0001),
-                    metrics=[keras.metrics.categorical_accuracy],
-                    sample_weight_mode='temporal')
-      x = np.random.random((1, 3))
-      y = np.random.random((1, 3, 3))
-      model.train_on_batch(x, y)
-
-      out = model.predict(x)
-      fd, fname = tempfile.mkstemp('.h5')
-      keras.models.save_model(model, fname)
-
-      new_model = keras.models.load_model(fname)
-      os.close(fd)
-      os.remove(fname)
-
-      out2 = new_model.predict(x)
-      self.assertAllClose(out, out2, atol=1e-05)
-
-      # test that new updates are the same with both models
-      x = np.random.random((1, 3))
-      y = np.random.random((1, 3, 3))
-      model.train_on_batch(x, y)
-      new_model.train_on_batch(x, y)
-      out = model.predict(x)
-      out2 = new_model.predict(x)
-      self.assertAllClose(out, out2, atol=1e-05)
-
-  def test_sequential_model_saving_2(self):
-    if h5py is None:
-      return  # Skip test if models cannot be saved.
-
-    with self.test_session():
-      # test with custom optimizer, loss
-
-      class CustomOp(keras.optimizers.RMSprop):
-        pass
-
-      def custom_loss(y_true, y_pred):
-        return keras.losses.mse(y_true, y_pred)
-
-      model = keras.models.Sequential()
-      model.add(keras.layers.Dense(2, input_shape=(3,)))
-      model.add(keras.layers.Dense(3))
-      model.compile(loss=custom_loss, optimizer=CustomOp(), metrics=['acc'])
-
-      x = np.random.random((1, 3))
-      y = np.random.random((1, 3))
-      model.train_on_batch(x, y)
-
-      out = model.predict(x)
-      fd, fname = tempfile.mkstemp('.h5')
-      keras.models.save_model(model, fname)
-
-      model = keras.models.load_model(
-          fname,
-          custom_objects={'CustomOp': CustomOp,
-                          'custom_loss': custom_loss})
-      os.close(fd)
-      os.remove(fname)
-
-      out2 = model.predict(x)
-      self.assertAllClose(out, out2, atol=1e-05)
-
-  def test_functional_model_saving(self):
-    if h5py is None:
-      return  # Skip test if models cannot be saved.
-
-    with self.test_session():
-      inputs = keras.layers.Input(shape=(3,))
-      x = keras.layers.Dense(2)(inputs)
-      output = keras.layers.Dense(3)(x)
-
-      model = keras.models.Model(inputs, output)
-      model.compile(loss=keras.losses.MSE,
-                    optimizer=keras.optimizers.RMSprop(lr=0.0001),
-                    metrics=[keras.metrics.categorical_accuracy])
-      x = np.random.random((1, 3))
-      y = np.random.random((1, 3))
-      model.train_on_batch(x, y)
-
-      out = model.predict(x)
-      fd, fname = tempfile.mkstemp('.h5')
-      keras.models.save_model(model, fname)
-
-      model = keras.models.load_model(fname)
-      os.close(fd)
-      os.remove(fname)
-
-      out2 = model.predict(x)
-      self.assertAllClose(out, out2, atol=1e-05)
-
-  def test_saving_without_compilation(self):
-    if h5py is None:
-      return  # Skip test if models cannot be saved.
-
-    with self.test_session():
-      model = keras.models.Sequential()
-      model.add(keras.layers.Dense(2, input_shape=(3,)))
-      model.add(keras.layers.Dense(3))
-      model.compile(loss='mse', optimizer='sgd', metrics=['acc'])
-
-      fd, fname = tempfile.mkstemp('.h5')
-      keras.models.save_model(model, fname)
-      model = keras.models.load_model(fname)
-      os.close(fd)
-      os.remove(fname)
-
-  def test_saving_with_tf_optimizer(self):
-    if h5py is None:
-      return  # Skip test if models cannot be saved.
-
-    with self.test_session():
-      model = keras.models.Sequential()
-      model.add(keras.layers.Dense(2, input_shape=(3,)))
-      model.add(keras.layers.Dense(3))
-      model.compile(loss='mse',
-                    optimizer=training_module.AdadeltaOptimizer(0.1),
-                    metrics=['acc'])
-
-      fd, fname = tempfile.mkstemp('.h5')
-      keras.models.save_model(model, fname)
-      model = keras.models.load_model(fname)
-      os.close(fd)
-      os.remove(fname)
-
-  def test_saving_right_after_compilation(self):
-    if h5py is None:
-      return  # Skip test if models cannot be saved.
-
-    with self.test_session():
-      model = keras.models.Sequential()
-      model.add(keras.layers.Dense(2, input_shape=(3,)))
-      model.add(keras.layers.Dense(3))
-      model.compile(loss='mse', optimizer='sgd', metrics=['acc'])
-      model.model._make_train_function()
-
-      fd, fname = tempfile.mkstemp('.h5')
-      keras.models.save_model(model, fname)
-      model = keras.models.load_model(fname)
-      os.close(fd)
-      os.remove(fname)
-
-  def test_saving_lambda_numpy_array_arguments(self):
-    if h5py is None:
-      return  # Skip test if models cannot be saved.
-
-    mean = np.random.random((4, 2, 3))
-    std = np.abs(np.random.random((4, 2, 3))) + 1e-5
-    inputs = keras.layers.Input(shape=(4, 2, 3))
-    output = keras.layers.Lambda(lambda image, mu, std: (image - mu) / std,
-                                 arguments={'mu': mean, 'std': std})(inputs)
-    model = keras.models.Model(inputs, output)
-    model.compile(loss='mse', optimizer='sgd', metrics=['acc'])
-
-    fd, fname = tempfile.mkstemp('.h5')
-    keras.models.save_model(model, fname)
-
-    model = keras.models.load_model(fname)
-    os.close(fd)
-    os.remove(fname)
-
-    self.assertAllClose(mean, model.layers[1].arguments['mu'])
-    self.assertAllClose(std, model.layers[1].arguments['std'])
-
-
-class TestSequential(test.TestCase):
-  """Most Sequential model API tests are covered in `training_test.py`.
-  """
-
-  def test_basic_methods(self):
-    model = keras.models.Sequential()
-    model.add(keras.layers.Dense(1, input_dim=2))
-    model.add(keras.layers.Dropout(0.3, name='dp'))
-    model.add(keras.layers.Dense(2, kernel_regularizer='l2',
-                                 kernel_constraint='max_norm'))
-    model.build()
-    self.assertEqual(model.state_updates, model.model.state_updates)
-    self.assertEqual(model.get_layer(name='dp').name, 'dp')
-
-  def test_sequential_pop(self):
-    num_hidden = 5
-    input_dim = 3
-    batch_size = 5
-    num_classes = 2
-    with self.test_session():
-      model = keras.models.Sequential()
-      model.add(keras.layers.Dense(num_hidden, input_dim=input_dim))
-      model.add(keras.layers.Dense(num_classes))
-      model.compile(loss='mse', optimizer='sgd')
-      x = np.random.random((batch_size, input_dim))
-      y = np.random.random((batch_size, num_classes))
-      model.fit(x, y, epochs=1)
-      model.pop()
-      self.assertEqual(len(model.layers), 1)
-      self.assertEqual(model.output_shape, (None, num_hidden))
-      model.compile(loss='mse', optimizer='sgd')
-      y = np.random.random((batch_size, num_hidden))
-      model.fit(x, y, epochs=1)
-
-      # Test popping single-layer model
-      model = keras.models.Sequential()
-      model.add(keras.layers.Dense(num_hidden, input_dim=input_dim))
-      model.pop()
-      self.assertEqual(len(model.layers), 0)
-      self.assertEqual(len(model.outputs), 0)
-
-      # Invalid use case
-      model = keras.models.Sequential()
-      with self.assertRaises(TypeError):
-        model.pop()
-
-  def test_sequential_weight_loading(self):
-    if h5py is None:
-      return
-
-    temp_dir = self.get_temp_dir()
-    self.addCleanup(shutil.rmtree, temp_dir)
-    h5_path = os.path.join(temp_dir, 'test.h5')
-
-    num_hidden = 5
-    input_dim = 3
-    batch_size = 5
-    num_classes = 2
-
-    with self.test_session():
-      model = keras.models.Sequential()
-      model.add(keras.layers.Dense(num_hidden, input_dim=input_dim))
-      model.add(keras.layers.Dense(num_classes))
-
-      x = np.random.random((batch_size, input_dim))
-      ref_y = model.predict(x)
-
-      model.save_weights(h5_path)
-
-      model = keras.models.Sequential()
-      model.add(keras.layers.Dense(num_hidden, input_dim=input_dim))
-      model.add(keras.layers.Dense(num_classes))
-      model.load_weights(h5_path)
-      y = model.predict(x)
-
-      self.assertAllClose(y, ref_y)
-
-  def test_invalid_use_cases(self):
-    with self.test_session():
-      # Added objects must be layer instances
-      with self.assertRaises(TypeError):
-        model = keras.models.Sequential()
-        model.add(None)
-
-      # Added layers must have an inputs shape
-      with self.assertRaises(ValueError):
-        model = keras.models.Sequential()
-        model.add(keras.layers.Dense(1))
-
-      # Added layers cannot have multiple outputs
-      class MyLayer(keras.layers.Layer):
-
-        def call(self, inputs):
-          return [3 * inputs, 2 * inputs]
-
-        def compute_output_shape(self, input_shape):
-          return [input_shape, input_shape]
-
-      with self.assertRaises(ValueError):
-        model = keras.models.Sequential()
-        model.add(MyLayer(input_shape=(3,)))
-      with self.assertRaises(TypeError):
-        model = keras.models.Sequential()
-        model.add(keras.layers.Dense(1, input_dim=1))
-        model.add(MyLayer())
-
-      # Building empty model
-      model = keras.models.Sequential()
-      with self.assertRaises(TypeError):
-        model.build()
-
-  def test_nested_sequential_trainability(self):
-    input_dim = 20
-    num_units = 10
-    num_classes = 2
-
-    inner_model = keras.models.Sequential()
-    inner_model.add(keras.layers.Dense(num_units, input_shape=(input_dim,)))
-
-    model = keras.models.Sequential()
-    model.add(inner_model)
-    model.add(keras.layers.Dense(num_classes))
-
-    self.assertEqual(len(model.trainable_weights), 4)
-    inner_model.trainable = False
-    self.assertEqual(len(model.trainable_weights), 2)
-    inner_model.trainable = True
-    self.assertEqual(len(model.trainable_weights), 4)
-
-  def test_sequential_update_disabling(self):
-    val_a = np.random.random((10, 4))
-    val_out = np.random.random((10, 4))
-
-    with self.test_session():
-      model = keras.models.Sequential()
-      model.add(keras.layers.BatchNormalization(input_shape=(4,)))
-
-      model.trainable = False
-      assert not model.updates
-
-      model.compile('sgd', 'mse')
-      assert not model.updates
-      assert not model.model.updates
-
-      x1 = model.predict(val_a)
-      model.train_on_batch(val_a, val_out)
-      x2 = model.predict(val_a)
-      self.assertAllClose(x1, x2, atol=1e-7)
-
-      model.trainable = True
-      model.compile('sgd', 'mse')
-      assert model.updates
-      assert model.model.updates
-
-      model.train_on_batch(val_a, val_out)
-      x2 = model.predict(val_a)
-      assert np.abs(np.sum(x1 - x2)) > 1e-5
 
 
 class TestModelCloning(test.TestCase):
diff --git a/tensorflow/python/keras/_impl/keras/optimizers.py b/tensorflow/python/keras/_impl/keras/optimizers.py
index 76a97156ed7d9ca89b0d94f31bed3a23eca9609d..6520128c5b65451aef20ec9626245fba5ef29927 100644
--- a/tensorflow/python/keras/_impl/keras/optimizers.py
+++ b/tensorflow/python/keras/_impl/keras/optimizers.py
@@ -704,8 +704,10 @@ class TFOptimizer(Optimizer):
     return self.optimizer.compute_gradients(loss, params)
 
   def get_updates(self, loss, params):
-    grads = self.optimizer.compute_gradients(loss, params)
     self.updates = [K.update_add(self.iterations, 1)]
+    if not params:
+      return self.updates
+    grads = self.optimizer.compute_gradients(loss, params)
     opt_update = self.optimizer.apply_gradients(
         grads, global_step=self.iterations)
     self.updates.append(opt_update)
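Reviewer's note on the hunk above: with the new ordering, the iteration counter is still updated when `params` is empty, and the gradient computation is skipped entirely. An empty `params` can presumably occur when a model has no trainable weights; the patch does not say. The control flow can be sketched as follows. This is only an illustration: `FakeOptimizer` and the string placeholders are hypothetical stand-ins, not names from the patch or from TensorFlow.

```python
class FakeOptimizer:
    """Hypothetical stand-in for the wrapped optimizer; counts gradient calls."""

    def __init__(self):
        self.compute_calls = 0

    def compute_gradients(self, loss, params):
        self.compute_calls += 1
        # Pretend every parameter has gradient 1.0.
        return [(1.0, p) for p in params]

    def apply_gradients(self, grads):
        return ("apply", grads)


def get_updates(optimizer, loss, params):
    # Mirror the patched ordering: the iteration update is registered first.
    updates = ["iterations += 1"]
    # Early return when there is nothing to differentiate.
    if not params:
        return updates
    # Only now are gradients computed and applied.
    grads = optimizer.compute_gradients(loss, params)
    updates.append(optimizer.apply_gradients(grads))
    return updates
```

In the sketch, calling `get_updates` with an empty list never touches `compute_gradients`, which is the behaviour the added `if not params` guard appears intended to provide.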
diff --git a/tensorflow/python/keras/_impl/keras/testing_utils.py b/tensorflow/python/keras/_impl/keras/testing_utils.py
index fa1ee2fa3da3fbc7650ee80960b00907013cc37c..60799ee1e038b4466351248bb5de7c8fc0de02a2 100644
--- a/tensorflow/python/keras/_impl/keras/testing_utils.py
+++ b/tensorflow/python/keras/_impl/keras/testing_utils.py
@@ -22,6 +22,7 @@ import numpy as np
 
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.keras._impl import keras
+from tensorflow.python.training.rmsprop import RMSPropOptimizer
 from tensorflow.python.util import tf_inspect
 
 
@@ -145,7 +146,7 @@ def layer_test(layer_cls, kwargs=None, input_shape=None, input_dtype=None,
     np.testing.assert_allclose(output, actual_output, rtol=1e-3)
 
   # test training mode (e.g. useful for dropout tests)
-  model.compile('rmsprop', 'mse')
+  model.compile(RMSPropOptimizer(0.01), 'mse')
   model.train_on_batch(input_data, actual_output)
 
   # test as first layer in Sequential API
@@ -181,9 +182,5 @@ def layer_test(layer_cls, kwargs=None, input_shape=None, input_dtype=None,
     output = recovered_model.predict(input_data)
     np.testing.assert_allclose(output, actual_output, rtol=1e-3)
 
-  # test training mode (e.g. useful for dropout tests)
-  model.compile('rmsprop', 'mse')
-  model.train_on_batch(input_data, actual_output)
-
   # for further checks in the caller function
   return actual_output
diff --git a/tensorflow/python/keras/_impl/keras/utils/__init__.py b/tensorflow/python/keras/_impl/keras/utils/__init__.py
index 370ae0dd0f0d00059f1b0cc79459abe75c8ca494..0c9f19a0c8dcf3bf929e102b31679a03b27728f7 100644
--- a/tensorflow/python/keras/_impl/keras/utils/__init__.py
+++ b/tensorflow/python/keras/_impl/keras/utils/__init__.py
@@ -31,8 +31,8 @@ from tensorflow.python.keras._impl.keras.utils.generic_utils import serialize_ke
 from tensorflow.python.keras._impl.keras.utils.io_utils import HDF5Matrix
 from tensorflow.python.keras._impl.keras.utils.layer_utils import convert_all_kernels_in_model
 from tensorflow.python.keras._impl.keras.utils.layer_utils import print_summary
+from tensorflow.python.keras._impl.keras.utils.multi_gpu_utils import multi_gpu_model
 from tensorflow.python.keras._impl.keras.utils.np_utils import normalize
 from tensorflow.python.keras._impl.keras.utils.np_utils import to_categorical
-from tensorflow.python.keras._impl.keras.utils.training_utils import multi_gpu_model
 from tensorflow.python.keras._impl.keras.utils.vis_utils import plot_model
 
diff --git a/tensorflow/python/keras/_impl/keras/utils/generic_utils.py b/tensorflow/python/keras/_impl/keras/utils/generic_utils.py
index 462d600bf827768b0f2e6265aebdaad48e70fcd9..5196bf17400c33d876daa430a9d3d5b4f4b491a1 100644
--- a/tensorflow/python/keras/_impl/keras/utils/generic_utils.py
+++ b/tensorflow/python/keras/_impl/keras/utils/generic_utils.py
@@ -509,3 +509,20 @@ def slice_arrays(arrays, start=None, stop=None):
       return arrays[start:stop]
     else:
       return [None]
+
+
+def to_list(x):
+  """Normalizes a list/tensor into a list.
+
+  If a tensor is passed, we return
+  a list of size 1 containing the tensor.
+
+  Arguments:
+      x: target object to be normalized.
+
+  Returns:
+      A list.
+  """
+  if isinstance(x, list):
+    return x
+  return [x]
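Reviewer's note on `to_list` above: the check is `isinstance(x, list)`, so only real lists pass through unchanged; anything else, including a tuple, is wrapped in a one-element list. The short example below copies the function body from the hunk and exercises it with plain Python values rather than tensors (the sample values are illustrative only).

```python
def to_list(x):
    """Same body as the helper added in the hunk above."""
    if isinstance(x, list):
        return x
    return [x]


values = [1, 2]
same_object = to_list(values) is values  # a list is returned as-is, not copied
wrapped_scalar = to_list(3)              # a non-list is wrapped
wrapped_tuple = to_list((1, 2))          # a tuple is not a list, so it is wrapped whole
```

Because a list input is returned without copying, callers that mutate the result also mutate the original list.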
diff --git a/tensorflow/python/keras/_impl/keras/utils/training_utils.py b/tensorflow/python/keras/_impl/keras/utils/multi_gpu_utils.py
similarity index 100%
rename from tensorflow/python/keras/_impl/keras/utils/training_utils.py
rename to tensorflow/python/keras/_impl/keras/utils/multi_gpu_utils.py
diff --git a/tensorflow/python/keras/_impl/keras/utils/training_utils_test.py b/tensorflow/python/keras/_impl/keras/utils/multi_gpu_utils_test.py
similarity index 100%
rename from tensorflow/python/keras/_impl/keras/utils/training_utils_test.py
rename to tensorflow/python/keras/_impl/keras/utils/multi_gpu_utils_test.py
diff --git a/tensorflow/python/keras/utils/__init__.py b/tensorflow/python/keras/utils/__init__.py
index 91cc8607274a80a14dd27a64274da7f8f0aafab1..2f74cf031d0520c8d874b7269c52e3b9e1b9931b 100644
--- a/tensorflow/python/keras/utils/__init__.py
+++ b/tensorflow/python/keras/utils/__init__.py
@@ -30,9 +30,9 @@ from tensorflow.python.keras._impl.keras.utils.generic_utils import Progbar
 from tensorflow.python.keras._impl.keras.utils.generic_utils import serialize_keras_object
 from tensorflow.python.keras._impl.keras.utils.io_utils import HDF5Matrix
 from tensorflow.python.keras._impl.keras.utils.layer_utils import convert_all_kernels_in_model
+from tensorflow.python.keras._impl.keras.utils.multi_gpu_utils import multi_gpu_model
 from tensorflow.python.keras._impl.keras.utils.np_utils import normalize
 from tensorflow.python.keras._impl.keras.utils.np_utils import to_categorical
-from tensorflow.python.keras._impl.keras.utils.training_utils import multi_gpu_model
 from tensorflow.python.keras._impl.keras.utils.vis_utils import plot_model
 
 del absolute_import
diff --git a/tensorflow/python/kernel_tests/BUILD b/tensorflow/python/kernel_tests/BUILD
index d4ceb2e489c8a20d26eaf9d89b12992d2b8673d7..0f13e8bba585b833c290470ad58705c015c4a4e0 100644
--- a/tensorflow/python/kernel_tests/BUILD
+++ b/tensorflow/python/kernel_tests/BUILD
@@ -712,6 +712,18 @@ cuda_py_test(
     ],
 )
 
+tf_py_test(
+    name = "regex_replace_op_test",
+    size = "small",
+    srcs = ["regex_replace_op_test.py"],
+    additional_deps = [
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:constant_op",
+        "//tensorflow/python:dtypes",
+        "//tensorflow/python:string_ops",
+    ],
+)
+
 tf_py_test(
     name = "save_restore_ops_test",
     size = "small",
@@ -2892,6 +2904,40 @@ tf_py_test(
     ],
 )
 
+tf_py_test(
+    name = "accumulate_n_test",
+    size = "small",
+    srcs = ["accumulate_n_test.py"],
+    additional_deps = [
+        "//third_party/py/numpy",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:gradients",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:platform_test",
+        "//tensorflow/python:variables",
+    ],
+)
+
+tf_py_test(
+    name = "accumulate_n_eager_test",
+    size = "small",
+    srcs = ["accumulate_n_eager_test.py"],
+    additional_deps = [
+        "//third_party/py/numpy",
+        "//tensorflow/python:client_testlib",
+        "//tensorflow/python:framework_for_generated_wrappers",
+        "//tensorflow/python:framework_test_lib",
+        "//tensorflow/python:gradients",
+        "//tensorflow/python:math_ops",
+        "//tensorflow/python:resource_variable_ops",
+        "//tensorflow/python/eager:backprop",
+        "//tensorflow/python/eager:context",
+        "//tensorflow/python/eager:tape",
+    ],
+)
+
 filegroup(
     name = "all_files",
     srcs = glob(
diff --git a/tensorflow/contrib/framework/python/ops/accumulate_n_v2_eager_test.py b/tensorflow/python/kernel_tests/accumulate_n_eager_test.py
similarity index 72%
rename from tensorflow/contrib/framework/python/ops/accumulate_n_v2_eager_test.py
rename to tensorflow/python/kernel_tests/accumulate_n_eager_test.py
index 35974b9e21d2d7423777a95a99f51c9cb4b453b2..dc11b7deceb9040584aca1f629f4d003aef39428 100644
--- a/tensorflow/contrib/framework/python/ops/accumulate_n_v2_eager_test.py
+++ b/tensorflow/python/kernel_tests/accumulate_n_eager_test.py
@@ -12,48 +12,41 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tests for new version of accumulate_n op that will eventually go into
-`ops.math_ops`.
-
-These test cases spefically exercise the `eager` APIs. They need to be in a
-separate file from the remaining tests because eager mode is currently something
-you can turn on but can't turn off for the lifetime of the current process."""
+"""Tests for new version of accumulate_n op."""
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 import numpy as np
 
-from tensorflow.contrib.framework.python.ops import accumulate_n_v2 as av2
-
 from tensorflow.python.eager import backprop
 
 
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.platform import test
 
 
-
 class AccumulateNV2EagerTest(test_util.TensorFlowTestCase):
-  """Tests of the new, differentiable version of accumulate_n"""
+  """Tests of the new, differentiable version of accumulate_n."""
 
   def testMinimalEagerMode(self):
     forty = constant_op.constant(40)
     two = constant_op.constant(2)
-    answer = av2.accumulate_n_v2([forty, two])
+    answer = math_ops.accumulate_n([forty, two])
     self.assertEqual(42, answer.numpy())
 
-
   def testFloat(self):
     np.random.seed(12345)
     x = [np.random.random((1, 2, 3, 4, 5)) - 0.5 for _ in range(5)]
     tf_x = ops.convert_n_to_tensor(x)
     with self.test_session(use_gpu=True):
-      self.assertAllClose(sum(x), av2.accumulate_n_v2(tf_x).numpy())
-      self.assertAllClose(x[0] * 5, av2.accumulate_n_v2([tf_x[0]] * 5).numpy())
+      self.assertAllClose(sum(x), math_ops.accumulate_n(tf_x).numpy())
+      self.assertAllClose(x[0] * 5,
+                          math_ops.accumulate_n([tf_x[0]] * 5).numpy())
 
   def testGrad(self):
     np.random.seed(42)
@@ -65,16 +58,14 @@ class AccumulateNV2EagerTest(test_util.TensorFlowTestCase):
     ]
 
     def fn(first, second, third):
-      return av2.accumulate_n_v2([first, second, third])
+      return math_ops.accumulate_n([first, second, third])
 
     grad_fn = backprop.gradients_function(fn)
     grad = grad_fn(input_vars[0], input_vars[1], input_vars[2])
-    self.assertAllEqual(np.repeat(1.0, num_inputs), # d/dx (x + y + ...) = 1
+    self.assertAllEqual(np.repeat(1.0, num_inputs),  # d/dx (x + y + ...) = 1
                         [elem.numpy() for elem in grad])
 
 
-
 if __name__ == "__main__":
   ops.enable_eager_execution()
   test.main()
-
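Reviewer's note on `testGrad` above: the expected value comes from the identity in the inline comment, d/dx (x + y + ...) = 1 for every input. A rough numerical check of that identity in plain Python is sketched below. `accumulate` and `numeric_grads` are hypothetical helpers written for this note and stand in for neither `math_ops.accumulate_n` nor `backprop.gradients_function`.

```python
def accumulate(values):
    """Plain-Python stand-in for an n-ary sum; not the TensorFlow op."""
    total = 0.0
    for v in values:
        total += v
    return total


def numeric_grads(fn, inputs, eps=1e-6):
    """Central-difference estimate of d fn / d inputs[i] for each i."""
    grads = []
    for i in range(len(inputs)):
        up = list(inputs)
        down = list(inputs)
        up[i] += eps
        down[i] -= eps
        grads.append((fn(up) - fn(down)) / (2 * eps))
    return grads


grads = numeric_grads(accumulate, [3.0, 1.5, -2.0])
```

Each entry of `grads` should come out close to 1.0, up to floating-point error, which matches the `np.repeat(1.0, num_inputs)` expectation in the test.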
diff --git a/tensorflow/contrib/framework/python/ops/accumulate_n_v2_test.py b/tensorflow/python/kernel_tests/accumulate_n_test.py
similarity index 79%
rename from tensorflow/contrib/framework/python/ops/accumulate_n_v2_test.py
rename to tensorflow/python/kernel_tests/accumulate_n_test.py
index 45962098e93acfac414396ddbeaa847701ff2b4b..0a6d4aea370eb788de0c65b4758a3210a7d2944d 100644
--- a/tensorflow/contrib/framework/python/ops/accumulate_n_v2_test.py
+++ b/tensorflow/python/kernel_tests/accumulate_n_test.py
@@ -12,42 +12,42 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Tests for new version of accumulate_n op that will eventually go into
-`ops.math_ops`."""
+"""Tests for new version of accumulate_n op."""
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
 import numpy as np
 
-from tensorflow.contrib.framework.python.ops import accumulate_n_v2 as av2
-
 from tensorflow.python.framework import dtypes as dtypes_lib
 from tensorflow.python.framework import ops
 from tensorflow.python.framework import test_util
 from tensorflow.python.ops import gradients
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import googletest
 
 
 class AccumulateNV2Test(test_util.TensorFlowTestCase):
-  """Tests of the new, differentiable version of accumulate_n"""
+  """Tests of the new, differentiable version of accumulate_n."""
 
   def testFloat(self):
     np.random.seed(12345)
     x = [np.random.random((1, 2, 3, 4, 5)) - 0.5 for _ in range(5)]
     tf_x = ops.convert_n_to_tensor(x)
     with self.test_session(use_gpu=True):
-      self.assertAllClose(sum(x), av2.accumulate_n_v2(tf_x).eval())
-      self.assertAllClose(x[0] * 5, av2.accumulate_n_v2([tf_x[0]] * 5).eval())
+      self.assertAllClose(sum(x), math_ops.accumulate_n(tf_x).eval())
+      self.assertAllClose(x[0] * 5,
+                          math_ops.accumulate_n([tf_x[0]] * 5).eval())
 
   def testInt(self):
     np.random.seed(54321)
     x = [np.random.randint(-128, 128, (5, 4, 3, 2, 1)) for _ in range(6)]
     tf_x = ops.convert_n_to_tensor(x)
     with self.test_session(use_gpu=True):
-      self.assertAllEqual(sum(x), av2.accumulate_n_v2(tf_x).eval())
-      self.assertAllEqual(x[0] * 6, av2.accumulate_n_v2([tf_x[0]] * 6).eval())
+      self.assertAllEqual(sum(x), math_ops.accumulate_n(tf_x).eval())
+      self.assertAllEqual(x[0] * 6,
+                          math_ops.accumulate_n([tf_x[0]] * 6).eval())
 
   def testGrad(self):
     np.random.seed(42)
@@ -55,9 +55,9 @@ class AccumulateNV2Test(test_util.TensorFlowTestCase):
       with self.test_session(use_gpu=True) as sess:
         input_vars = [
             variables.Variable(10.0 * np.random.random())
-            for i in range(0, num_inputs)
+            for _ in range(0, num_inputs)
         ]
-        accum_n = av2.accumulate_n_v2(input_vars)
+        accum_n = math_ops.accumulate_n(input_vars)
         sess.run(variables.global_variables_initializer())
         accum_n_grad = gradients.gradients(accum_n, input_vars)
         self.assertAllEqual(
@@ -77,7 +77,7 @@ class AccumulateNV2Test(test_util.TensorFlowTestCase):
           ops.convert_to_tensor(x, dtype=dtypes_lib.float32)
           for x in random_arrays
       ]
-      tf_val = av2.accumulate_n_v2(random_tensors)
+      tf_val = math_ops.accumulate_n(random_tensors)
       np_val = random_arrays[0]
       for random_array in random_arrays[1:]:
         np_val += random_array
@@ -86,7 +86,7 @@ class AccumulateNV2Test(test_util.TensorFlowTestCase):
   def testZeroArgs(self):
     with self.test_session():
       with self.assertRaises(ValueError):
-        tf_val = av2.accumulate_n_v2([])
+        tf_val = math_ops.accumulate_n([])
         tf_val.eval()
 
   def testWrongShape(self):
@@ -94,28 +94,28 @@ class AccumulateNV2Test(test_util.TensorFlowTestCase):
       with self.assertRaises(ValueError):
         a = variables.Variable(0.2)
         b = variables.Variable(0.1)
-        tf_val = av2.accumulate_n_v2([a, b], shape=[2, 2])  # Should be shape=[]
+        math_ops.accumulate_n([a, b], shape=[2, 2])  # Should be shape=[]
 
   def testIncompatibleShapes(self):
     with self.test_session():
       with self.assertRaises(ValueError):
         a = variables.Variable(np.array([0.1, 0.2]))
         b = variables.Variable(np.array([[0.3], [0.4]]))
-        tf_val = av2.accumulate_n_v2([a, b])
+        math_ops.accumulate_n([a, b])
 
   def testWrongType(self):
     with self.test_session():
       with self.assertRaises(TypeError):
         a = variables.Variable(0.2, dtype=np.float32)
         b = variables.Variable(0.1, dtype=np.float32)
-        tf_val = av2.accumulate_n_v2([a, b], tensor_dtype=np.int32)
+        math_ops.accumulate_n([a, b], tensor_dtype=np.int32)
 
   def testWrongTypeOneInput(self):
     # Scenario that used to trigger a bug, even when testWrongType() worked
     with self.test_session():
       with self.assertRaises(TypeError):
         a = variables.Variable(0.2, dtype=np.float32)
-        tf_val = av2.accumulate_n_v2([a], tensor_dtype=np.int32)
+        math_ops.accumulate_n([a], tensor_dtype=np.int32)
 
 
 if __name__ == "__main__":
diff --git a/tensorflow/python/kernel_tests/array_ops_test.py b/tensorflow/python/kernel_tests/array_ops_test.py
index 365cf72108de5a1e5e1eb47891a6ad64151add22..d35f62b18601a6e33b1981e6b4d564d7e43da4b7 100644
--- a/tensorflow/python/kernel_tests/array_ops_test.py
+++ b/tensorflow/python/kernel_tests/array_ops_test.py
@@ -1223,7 +1223,7 @@ class SnapshotOpTest(test_util.TensorFlowTestCase):
     for dtype in [dtypes.int32, dtypes.int64, dtypes.float32, dtypes.float64]:
       with self.test_session(use_gpu=True):
         x = constant_op.constant([0, 1, 2, 3], dtype=dtype)
-        y = gen_array_ops._snapshot(x)
+        y = gen_array_ops.snapshot(x)
         self.assertAllEqual(y.eval(), [0, 1, 2, 3])
 
 
diff --git a/tensorflow/python/kernel_tests/batchtospace_op_test.py b/tensorflow/python/kernel_tests/batchtospace_op_test.py
index 0c802476a0e788aff3de84ab736fa8f1de5daab4..6143cd3baa6317fc512d80f94b494710037d4082 100644
--- a/tensorflow/python/kernel_tests/batchtospace_op_test.py
+++ b/tensorflow/python/kernel_tests/batchtospace_op_test.py
@@ -44,7 +44,7 @@ class CppOpImpl(object):
 
   @staticmethod
   def batch_to_space(*args, **kwargs):
-    return gen_array_ops._batch_to_space(*args, **kwargs)
+    return gen_array_ops.batch_to_space(*args, **kwargs)
 
 
 class BatchToSpaceDepthToSpace(test.TestCase, PythonOpImpl):
diff --git a/tensorflow/python/kernel_tests/bcast_ops_test.py b/tensorflow/python/kernel_tests/bcast_ops_test.py
index 9e512346053a4c3af089170f47313606c4a307c2..cb46fcb0076c1ca437089f5b9d87100667e2a404 100644
--- a/tensorflow/python/kernel_tests/bcast_ops_test.py
+++ b/tensorflow/python/kernel_tests/bcast_ops_test.py
@@ -20,8 +20,8 @@ from __future__ import print_function
 
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
-from tensorflow.python.ops.gen_array_ops import _broadcast_args
 from tensorflow.python.ops.gen_array_ops import _broadcast_gradient_args
+from tensorflow.python.ops.gen_array_ops import broadcast_args
 from tensorflow.python.platform import test
 
 
@@ -29,7 +29,7 @@ class BcastOpsTest(test.TestCase):
 
   def _GetBroadcastShape(self, xs, ys):
     with self.test_session() as sess:
-      return sess.run(_broadcast_args(xs, ys))
+      return sess.run(broadcast_args(xs, ys))
 
   def _GetGradientArgs(self, xs, ys):
     with self.test_session() as sess:
diff --git a/tensorflow/python/kernel_tests/checkpoint_ops_test.py b/tensorflow/python/kernel_tests/checkpoint_ops_test.py
index a786d0a47e569f71812086fb93c21dc12660a2a5..7f147ba53a71539962f424158731e359724f664f 100644
--- a/tensorflow/python/kernel_tests/checkpoint_ops_test.py
+++ b/tensorflow/python/kernel_tests/checkpoint_ops_test.py
@@ -50,7 +50,7 @@ class GenerateVocabRemappingTest(test.TestCase):
 
   def test_generate_remapping_with_no_vocab_changes(self):
     """Tests where vocab does not change at all."""
-    remapping, num_present = gen_checkpoint_ops._generate_vocab_remapping(
+    remapping, num_present = gen_checkpoint_ops.generate_vocab_remapping(
         new_vocab_file=self.old_vocab_file,
         old_vocab_file=self.old_vocab_file,
         num_new_vocab=3,
@@ -63,7 +63,7 @@ class GenerateVocabRemappingTest(test.TestCase):
 
   def test_generate_remapping_with_shifted_vocab(self):
     """Tests where vocab is the same, but shifted / ordered differently."""
-    remapping, num_present = gen_checkpoint_ops._generate_vocab_remapping(
+    remapping, num_present = gen_checkpoint_ops.generate_vocab_remapping(
         new_vocab_file=self.new_vocab_file,
         old_vocab_file=self.old_vocab_file,
         num_new_vocab=3,
@@ -76,7 +76,7 @@ class GenerateVocabRemappingTest(test.TestCase):
 
   def test_generate_remapping_with_offset(self):
     """Tests offset and num_new_vocab logic."""
-    remapping, num_present = gen_checkpoint_ops._generate_vocab_remapping(
+    remapping, num_present = gen_checkpoint_ops.generate_vocab_remapping(
         new_vocab_file=self.new_vocab_file,
         old_vocab_file=self.old_vocab_file,
         num_new_vocab=1,
@@ -89,7 +89,7 @@ class GenerateVocabRemappingTest(test.TestCase):
 
   def test_generate_remapping_with_old_vocab_size(self):
     """Tests where old_vocab_size is specified."""
-    remapping, num_present = gen_checkpoint_ops._generate_vocab_remapping(
+    remapping, num_present = gen_checkpoint_ops.generate_vocab_remapping(
         new_vocab_file=self.new_vocab_file,
         old_vocab_file=self.old_vocab_file,
         num_new_vocab=3,
@@ -132,7 +132,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
 
     # No column remapping, new weight matrix has second row, then first row.
     row_remapping = [1, 0]
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=row_remapping,
@@ -147,7 +147,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     # No row remapping, new weight matrix has third col, then first col.
     row_remapping = list(range(self.old_num_rows))
     col_remapping = [2, 0]
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=row_remapping,
@@ -162,7 +162,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     # Both row and column remappings.
     row_remapping = [1, 0, 4]
     col_remapping = [1, 15]
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=row_remapping,
@@ -177,7 +177,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
   def test_load_and_remap_with_init(self):
     """Tests the op's load and remap where there are missing entries."""
     init_val = 42
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=[2, -1, 0],
@@ -196,7 +196,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     """Tests when all the rows are missing and need to be initialized."""
     num_rows = 7
     initializing_values = [42] * num_rows * self.old_num_cols
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=[-1] * num_rows,
@@ -214,7 +214,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     num_rows = 7
     num_cols = 4
     initializing_values = [42] * num_rows * num_cols
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=[-1] * num_rows,
@@ -235,7 +235,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     invalid_remapping = [1, 0, 0, 0, 1, 2]
 
     # Invalid row remapping.
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=invalid_remapping,
@@ -247,7 +247,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
       remapped_matrix.eval()
 
     # Invalid column remapping.
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=list(range(self.old_num_rows)),
@@ -260,7 +260,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
 
   def test_load_and_remap_incorrect_initializing_values(self):
     """Tests that errors are raised with incorrect number of init values."""
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=[2, -1, 0],
@@ -275,7 +275,7 @@ class LoadAndRemapMatrixTest(test.TestCase):
     with self.test_session(), self.assertRaises(errors.InvalidArgumentError):
       remapped_matrix.eval()
 
-    remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+    remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
         ckpt_path=[self.bundle_file],
         old_tensor_name=self.old_tensor_name,
         row_remapping=[2, -1, 0],
@@ -314,7 +314,7 @@ class LoadAndRemapMatrixWithMaxRowsTest(test.TestCase):
       num_rows, num_cols = np_value.shape
 
       # Tests loading the entire tensor (except reversed).
-      remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+      remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
           ckpt_path=ckpt_path,
           old_tensor_name=old_tensor_name,
           # Simply reverses the rows of the matrix.
@@ -332,7 +332,7 @@ class LoadAndRemapMatrixWithMaxRowsTest(test.TestCase):
       self.assertGreater(num_rows, 2)
       prefix_rows = 2
       suffix_rows = 3
-      remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+      remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
           ckpt_path=ckpt_path,
           old_tensor_name=old_tensor_name,
           # Reverses the rows of the matrix, then prepends and appends
@@ -353,7 +353,7 @@ class LoadAndRemapMatrixWithMaxRowsTest(test.TestCase):
       # Tests when everything is taken from initializing_values.
       new_rows = 7
       initializing_values = [42] * new_rows * num_cols
-      remapped_matrix = gen_checkpoint_ops._load_and_remap_matrix(
+      remapped_matrix = gen_checkpoint_ops.load_and_remap_matrix(
           ckpt_path=ckpt_path,
           old_tensor_name=old_tensor_name,
           # Nothing is loaded from the old tensor.
diff --git a/tensorflow/python/kernel_tests/concat_op_test.py b/tensorflow/python/kernel_tests/concat_op_test.py
index 127bc6bb20ae6b415da94672de68cc4b8ceaa287..81c6a4aa6e6edc4b49338a0bfd354302b1dfac0b 100644
--- a/tensorflow/python/kernel_tests/concat_op_test.py
+++ b/tensorflow/python/kernel_tests/concat_op_test.py
@@ -526,7 +526,7 @@ class ConcatOpTest(test.TestCase):
     with self.test_session(use_gpu=True):
       t1 = []
       t2 = []
-      output = gen_array_ops._concat_v2([t1, t2], 0).eval()
+      output = gen_array_ops.concat_v2([t1, t2], 0).eval()
       self.assertFalse(output)  # Checks that output is empty
 
   def testConcatInvalidAxis(self):
@@ -534,20 +534,20 @@ class ConcatOpTest(test.TestCase):
       with self.test_session(use_gpu=True):
         t1 = [1]
         t2 = [2]
-        gen_array_ops._concat_v2([t1, t2], 1).eval()
+        gen_array_ops.concat_v2([t1, t2], 1).eval()
 
   def testConcatNegativeAxis(self):
     with self.test_session(use_gpu=True):
       t1 = [[1, 2, 3], [4, 5, 6]]
       t2 = [[7, 8, 9], [10, 11, 12]]
 
-      c = gen_array_ops._concat_v2([t1, t2], -2)
+      c = gen_array_ops.concat_v2([t1, t2], -2)
       self.assertEqual([4, 3], c.get_shape().as_list())
       output = c.eval()
       self.assertAllEqual([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                           output)
 
-      c = gen_array_ops._concat_v2([t1, t2], -1)
+      c = gen_array_ops.concat_v2([t1, t2], -1)
       self.assertEqual([2, 6], c.get_shape().as_list())
       output = c.eval()
       self.assertAllEqual([[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]], output)
@@ -615,7 +615,7 @@ class ConcatOffsetTest(test.TestCase):
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([2, 7, 5], dtypes.int32)
       s2 = constant_op.constant([2, 20, 5], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1, s2])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1, s2])
       ans = sess.run(off)
       self.assertAllEqual(ans, [[0, 0, 0], [0, 3, 0], [0, 10, 0]])
 
@@ -624,7 +624,7 @@ class ConcatOffsetTest(test.TestCase):
       cdim = constant_op.constant(1, dtypes.int32)
       s0 = constant_op.constant([[2, 3, 5]], dtypes.int32)
       s1 = constant_op.constant([[2, 7, 5]], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1])
       with self.assertRaisesRegexp(errors_impl.InvalidArgumentError,
                                    r"should be a vector"):
         sess.run(off)
@@ -634,7 +634,7 @@ class ConcatOffsetTest(test.TestCase):
       cdim = constant_op.constant(4, dtypes.int32)
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([2, 7, 5], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1])
       with self.assertRaisesRegexp(errors_impl.InvalidArgumentError,
                                    r"Concat dim is out of range: 4 vs. 3"):
         sess.run(off)
@@ -644,7 +644,7 @@ class ConcatOffsetTest(test.TestCase):
       cdim = constant_op.constant(1, dtypes.int32)
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([2, 7, 5, 10], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1])
       with self.assertRaisesRegexp(errors_impl.InvalidArgumentError,
                                    r"should contain 3 elem"):
         sess.run(off)
@@ -654,7 +654,7 @@ class ConcatOffsetTest(test.TestCase):
       cdim = constant_op.constant(1, dtypes.int32)
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([2, 7, 10], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1])
       with self.assertRaisesRegexp(
           errors_impl.InvalidArgumentError,
           r"All dimensions except 1 must match. Input 1 has shape \[2 7 10\] "
@@ -667,7 +667,7 @@ class ConcatOffsetTest(test.TestCase):
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([2, 7, 5], dtypes.int32)
       s2 = constant_op.constant([2, 20, 5], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1, s2])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1, s2])
       ans = sess.run(off)
       self.assertAllEqual(ans, [[0, 0, 0], [0, 3, 0], [0, 10, 0]])
 
@@ -675,7 +675,7 @@ class ConcatOffsetTest(test.TestCase):
       s0 = constant_op.constant([2, 3, 5], dtypes.int32)
       s1 = constant_op.constant([1, 3, 5], dtypes.int32)
       s2 = constant_op.constant([3, 3, 5], dtypes.int32)
-      off = gen_array_ops._concat_offset(cdim, [s0, s1, s2])
+      off = gen_array_ops.concat_offset(cdim, [s0, s1, s2])
       ans = sess.run(off)
       self.assertAllEqual(ans, [[0, 0, 0], [2, 0, 0], [3, 0, 0]])
 
diff --git a/tensorflow/python/kernel_tests/control_flow_ops_py_test.py b/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
index 15ff0ec09b65a8ba242473fb7b25ee00424e0926..b429fa5c423effce0f0ccb0ad34875dab2808777 100644
--- a/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
+++ b/tensorflow/python/kernel_tests/control_flow_ops_py_test.py
@@ -552,7 +552,7 @@ class ControlFlowTest(test.TestCase):
 
   def testCondRef(self):
     with self.test_session():
-      x = gen_state_ops._variable(
+      x = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="x",
@@ -580,7 +580,7 @@ class ControlFlowTest(test.TestCase):
 
   def testUninitializedRefIdentity(self):
     with self.test_session() as sess:
-      v = gen_state_ops._variable(
+      v = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="v",
@@ -1620,7 +1620,7 @@ class ControlFlowTest(test.TestCase):
 
   def testWhileStack_1(self):
     with self.test_session():
-      s = gen_data_flow_ops._stack_v2(-1, dtypes.int32, stack_name="foo")
+      s = gen_data_flow_ops.stack_v2(-1, dtypes.int32, stack_name="foo")
       i = constant_op.constant(0)
 
       def c(i):
@@ -1629,7 +1629,7 @@ class ControlFlowTest(test.TestCase):
       def b(i):
         ni = math_ops.add(i, 1)
         ni = control_flow_ops.with_dependencies(
-            [gen_data_flow_ops._stack_push_v2(s, i)], ni)
+            [gen_data_flow_ops.stack_push_v2(s, i)], ni)
         return ni
 
       r = control_flow_ops.while_loop(c, b, [i], parallel_iterations=1)
@@ -1641,7 +1641,7 @@ class ControlFlowTest(test.TestCase):
 
       def b1(i, x):
         ni = math_ops.subtract(i, 1)
-        nx = x + gen_data_flow_ops._stack_pop_v2(s, dtypes.int32)
+        nx = x + gen_data_flow_ops.stack_pop_v2(s, dtypes.int32)
         return [ni, nx]
 
       _, rx = control_flow_ops.while_loop(
@@ -1840,6 +1840,23 @@ class ControlFlowTest(test.TestCase):
                                       [tensor_shape.unknown_shape()])
       self.assertAllClose(9.0, r.eval(feed_dict={x: 1.0}))
 
+  def testCondGradInNestedWhiles(self):
+    def outer_body(i, x):
+      _, x = control_flow_ops.while_loop(
+          lambda j, x: j < 3, inner_body, [0, 0.0])
+      return i + 1, x
+
+    def inner_body(j, x):
+      y = control_flow_ops.cond(math_ops.less(x, 1), lambda: 2 * x, lambda: x)
+      return j + 1, gradients_impl.gradients(y, x)[0]
+
+    i, x = control_flow_ops.while_loop(lambda i, x: i < 3, outer_body, [0, 0.0])
+
+    with self.test_session() as sess:
+      i_val, x_val = sess.run([i, x])
+      self.assertEqual(i_val, 3)
+      self.assertAllClose(x_val, 1.0)
+
   def testWhile_NestedInput(self):
     with self.test_session() as sess:
       named = collections.namedtuple("named", ("a", "b"))
diff --git a/tensorflow/python/kernel_tests/conv_ops_test.py b/tensorflow/python/kernel_tests/conv_ops_test.py
index edfb20d6a2b80cec930ddf696e8f0f69623a4de7..f4fe01f868da25660171c614bbf84390aead3ade 100644
--- a/tensorflow/python/kernel_tests/conv_ops_test.py
+++ b/tensorflow/python/kernel_tests/conv_ops_test.py
@@ -302,25 +302,20 @@ class Conv2DTest(test.TestCase):
                                padding, dilations):
     expected_results = []
     computed_results = []
-    default_dilations = (dilations[0] == 1 and dilations[1] == 1)
     for data_format, use_gpu in GetTestConfigs():
-      # If any dilation rate is larger than 1, only do test on the GPU
-      # because we currently do not have a CPU implementation for arbitrary
-      # dilation rates.
-      if default_dilations or use_gpu:
-        expected, computed = self._ComputeReferenceDilatedConv(
-            tensor_in_sizes, filter_in_sizes, strides, dilations, padding,
-            data_format, use_gpu)
-        expected_results.append(expected)
-        computed_results.append(computed)
-        tolerance = 1e-2 if use_gpu else 1e-5
-        expected_values = self.evaluate(expected_results)
-        computed_values = self.evaluate(computed_results)
-        for e_value, c_value in zip(expected_values, computed_values):
-          print("expected = ", e_value)
-          print("actual = ", c_value)
-          self.assertAllClose(
-              e_value.flatten(), c_value.flatten(), atol=tolerance, rtol=1e-4)
+      expected, computed = self._ComputeReferenceDilatedConv(
+          tensor_in_sizes, filter_in_sizes, strides, dilations, padding,
+          data_format, use_gpu)
+      expected_results.append(expected)
+      computed_results.append(computed)
+      tolerance = 1e-2 if use_gpu else 1e-5
+      expected_values = self.evaluate(expected_results)
+      computed_values = self.evaluate(computed_results)
+      for e_value, c_value in zip(expected_values, computed_values):
+        print("expected = ", e_value)
+        print("actual = ", c_value)
+        self.assertAllClose(
+            e_value.flatten(), c_value.flatten(), atol=tolerance, rtol=1e-4)
 
   def _VerifyValues(self, tensor_in_sizes, filter_in_sizes, strides, padding,
                     expected):
@@ -365,13 +360,12 @@ class Conv2DTest(test.TestCase):
 
   @test_util.run_in_graph_and_eager_modes()
   def testConv2D2x2Filter2x1Dilation(self):
-    if test.is_gpu_available(cuda_only=True):
-      self._VerifyDilatedConvValues(
-          tensor_in_sizes=[1, 4, 4, 1],
-          filter_in_sizes=[2, 2, 1, 1],
-          strides=[1, 1],
-          dilations=[2, 1],
-          padding="VALID")
+    self._VerifyDilatedConvValues(
+        tensor_in_sizes=[1, 4, 4, 1],
+        filter_in_sizes=[2, 2, 1, 1],
+        strides=[1, 1],
+        dilations=[2, 1],
+        padding="VALID")
 
   @test_util.run_in_graph_and_eager_modes()
   def testConv2DEmpty(self):
@@ -385,13 +379,12 @@ class Conv2DTest(test.TestCase):
 
   @test_util.run_in_graph_and_eager_modes()
   def testConv2DEmptyDilation(self):
-    if test.is_gpu_available(cuda_only=True):
-      self._VerifyDilatedConvValues(
-          tensor_in_sizes=[0, 2, 3, 3],
-          filter_in_sizes=[1, 1, 3, 3],
-          strides=[1, 1],
-          dilations=[2, 1],
-          padding="VALID")
+    self._VerifyDilatedConvValues(
+        tensor_in_sizes=[0, 2, 3, 3],
+        filter_in_sizes=[1, 1, 3, 3],
+        strides=[1, 1],
+        dilations=[2, 1],
+        padding="VALID")
 
   @test_util.run_in_graph_and_eager_modes()
   def testConv2D2x2Filter(self):
@@ -406,13 +399,12 @@ class Conv2DTest(test.TestCase):
 
   @test_util.run_in_graph_and_eager_modes()
   def testConv2D2x2FilterDilation(self):
-    if test.is_gpu_available(cuda_only=True):
-      self._VerifyDilatedConvValues(
-          tensor_in_sizes=[1, 2, 3, 3],
-          filter_in_sizes=[2, 2, 3, 3],
-          strides=[1, 1],
-          dilations=[1, 2],
-          padding="VALID")
+    self._VerifyDilatedConvValues(
+        tensor_in_sizes=[1, 2, 3, 3],
+        filter_in_sizes=[2, 2, 3, 3],
+        strides=[1, 1],
+        dilations=[1, 2],
+        padding="VALID")
 
   @test_util.run_in_graph_and_eager_modes()
   def testConv2D1x2Filter(self):
@@ -430,13 +422,12 @@ class Conv2DTest(test.TestCase):
 
   @test_util.run_in_graph_and_eager_modes()
   def testConv2D1x2FilterDilation(self):
-    if test.is_gpu_available(cuda_only=True):
-      self._VerifyDilatedConvValues(
-          tensor_in_sizes=[1, 2, 3, 3],
-          filter_in_sizes=[1, 2, 3, 3],
-          strides=[1, 1],
-          dilations=[2, 1],
-          padding="VALID")
+    self._VerifyDilatedConvValues(
+        tensor_in_sizes=[1, 2, 3, 3],
+        filter_in_sizes=[1, 2, 3, 3],
+        strides=[1, 1],
+        dilations=[2, 1],
+        padding="VALID")
 
   @test_util.run_in_graph_and_eager_modes()
   def testConv2D2x2FilterStride2(self):
@@ -512,13 +503,12 @@ class Conv2DTest(test.TestCase):
 
   @test_util.run_in_graph_and_eager_modes()
   def testConv2DKernelSizeMatchesInputSizeDilation(self):
-    if test.is_gpu_available(cuda_only=True):
-      self._VerifyDilatedConvValues(
-          tensor_in_sizes=[1, 3, 3, 1],
-          filter_in_sizes=[2, 2, 1, 2],
-          strides=[1, 1],
-          dilations=[2, 2],
-          padding="VALID")
+    self._VerifyDilatedConvValues(
+        tensor_in_sizes=[1, 3, 3, 1],
+        filter_in_sizes=[2, 2, 1, 2],
+        strides=[1, 1],
+        dilations=[2, 2],
+        padding="VALID")
 
   # TODO(yzhwang): this currently fails.
   # self._VerifyValues(tensor_in_sizes=[1, 8, 8, 1],
@@ -1523,36 +1513,6 @@ class Conv2DTest(test.TestCase):
                 strides=[1, 1, 1, 1],
                 padding="VALID"))
 
-  def testCPUConv2DNCHWUnimplemented(self):
-    with self.test_session(use_gpu=False):
-      with self.assertRaisesRegexp(errors_impl.UnimplementedError,
-                                   "NHWC tensor format for now"):
-        conv = self._SetupValuesForDevice(
-            tensor_in_sizes=[1, 4, 4, 1],
-            filter_in_sizes=[2, 2, 1, 1],
-            dilations=[1, 1],
-            strides=[1, 1],
-            padding="VALID",
-            data_format="NCHW",
-            dtype=dtypes.float32,
-            use_gpu=False)
-        self.evaluate(conv)
-
-  def testCPUConv2DDilatedUnimplemented(self):
-    with self.test_session(use_gpu=False):
-      with self.assertRaisesRegexp(errors_impl.UnimplementedError,
-                                   "dilated rate of 1 for now"):
-        conv = self._SetupValuesForDevice(
-            tensor_in_sizes=[1, 4, 4, 1],
-            filter_in_sizes=[2, 2, 1, 1],
-            dilations=[2, 1],
-            strides=[1, 1],
-            padding="VALID",
-            data_format="NHWC",
-            dtype=dtypes.float32,
-            use_gpu=False)
-        self.evaluate(conv)
-
 
 class DepthwiseConv2DTest(test.TestCase):
 
@@ -1887,7 +1847,7 @@ def GetInceptionFwdTest(input_size, filter_size, stride, padding,
 def GetInceptionFwdDilatedConvTest(input_size, filter_size, stride, padding):
 
   def Test(self):
-    if test.is_gpu_available(cuda_only=True) and stride == 1:
+    if stride == 1:
       tf_logging.info("Testing InceptionFwd with dilations %s",
                       (input_size, filter_size, stride, padding))
       self._VerifyDilatedConvValues(
diff --git a/tensorflow/python/kernel_tests/cwise_ops_test.py b/tensorflow/python/kernel_tests/cwise_ops_test.py
index 0d9b46c30dbbed20dd940e0427fbf6f6d5415106..8db0bb6f0dc495e7be2cd717787acf87156f42af 100644
--- a/tensorflow/python/kernel_tests/cwise_ops_test.py
+++ b/tensorflow/python/kernel_tests/cwise_ops_test.py
@@ -495,11 +495,11 @@ class UnaryOpTest(test.TestCase):
     dtype_tols = [(np.float32, 5e-4), (np.float64, 1e-6), (np.complex64, 5e-4),
                   (np.complex128, 1e-6)]
     op_range = [
-        (gen_math_ops._reciprocal_grad, [-2, 2]),
-        (gen_math_ops._rsqrt_grad, [0.1, 3]),
-        (gen_math_ops._sigmoid_grad, [-2, 2]),
-        (gen_math_ops._sqrt_grad, [0.1, 3]),
-        (gen_math_ops._tanh_grad, [-2, 2]),
+        (gen_math_ops.reciprocal_grad, [-2, 2]),
+        (gen_math_ops.rsqrt_grad, [0.1, 3]),
+        (gen_math_ops.sigmoid_grad, [-2, 2]),
+        (gen_math_ops.sqrt_grad, [0.1, 3]),
+        (gen_math_ops.tanh_grad, [-2, 2]),
     ]
 
     def rand(dtype):
diff --git a/tensorflow/python/kernel_tests/determinant_op_test.py b/tensorflow/python/kernel_tests/determinant_op_test.py
index 222038b22ef3c766efd14fd9b1c9044a0b6e9125..a52b2c0dc32c26ecd5ef08aa3f8678f0006cd4fe 100644
--- a/tensorflow/python/kernel_tests/determinant_op_test.py
+++ b/tensorflow/python/kernel_tests/determinant_op_test.py
@@ -65,7 +65,7 @@ class DeterminantOpTest(test.TestCase):
       self._compareDeterminantBase(matrix_x,
                                    linalg_ops.matrix_determinant(matrix_x))
       self._compareLogDeterminantBase(
-          matrix_x, gen_linalg_ops._log_matrix_determinant(matrix_x))
+          matrix_x, gen_linalg_ops.log_matrix_determinant(matrix_x))
 
   def testBasic(self):
     # 2x2 matrices
diff --git a/tensorflow/python/kernel_tests/dynamic_partition_op_test.py b/tensorflow/python/kernel_tests/dynamic_partition_op_test.py
index fedbf9e696923a34968e7a907e4099c520d1447b..5e8937ad2c36afb2b1ddb58ffb238a45e09e4b30 100644
--- a/tensorflow/python/kernel_tests/dynamic_partition_op_test.py
+++ b/tensorflow/python/kernel_tests/dynamic_partition_op_test.py
@@ -326,6 +326,18 @@ class DynamicPartitionTest(test.TestCase):
     with self.assertRaises(ValueError):
       data_flow_ops.dynamic_partition(data, indices, num_partitions=4)
 
+  # See https://github.com/tensorflow/tensorflow/issues/17106
+  def testCUBBug(self):
+    x = constant_op.constant(np.random.randn(3072))
+    inds = [0]*189 + [1]*184 + [2]*184 + [3]*191 + [4]*192 + [5]*195 + [6]*195
+    inds += [7]*195 + [8]*188 + [9]*195 + [10]*188 + [11]*202 + [12]*194
+    inds += [13]*194 + [14]*194 + [15]*192
+    self.assertEqual(len(inds), x.shape[0])
+    partitioned = data_flow_ops.dynamic_partition(x, inds, 16)
+    with self.test_session() as sess:
+      res = sess.run(partitioned)
+    self.assertEqual(res[-1].shape[0], 192)
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/fractional_avg_pool_op_test.py b/tensorflow/python/kernel_tests/fractional_avg_pool_op_test.py
index feec9934e459590bb1dd0bc5c7cf40013d3d8b88..faac7d8365dfaa1b6b32f8fe66a76c3694aa0d5b 100644
--- a/tensorflow/python/kernel_tests/fractional_avg_pool_op_test.py
+++ b/tensorflow/python/kernel_tests/fractional_avg_pool_op_test.py
@@ -347,7 +347,7 @@ class FractionalAvgPoolGradTest(test.TestCase):
 
   Two types of tests for FractionalAvgPoolGrad.
   1) Test fractional_avg_pool_grad() directly.
-    This type of test relies on gen_nn_ops._avg_pool_grad() returns the
+    This type of test relies on gen_nn_ops.avg_pool_grad() returning the
   correct result. For example:
     * input_tensor_shape = (1, 10, 10, 1)
     * window_size = (1, 2, 2, 1)
@@ -404,13 +404,13 @@ class FractionalAvgPoolGradTest(test.TestCase):
                 num_elements *= dim_size
               output_backprop = (self._PRNG.rand(num_elements) *
                                  1000).reshape(output_data.shape)
-              input_backprop_tensor = gen_nn_ops._avg_pool_grad(
+              input_backprop_tensor = gen_nn_ops.avg_pool_grad(
                   input_tensor.get_shape(), output_backprop, window_size,
                   stride_size, padding)
               input_backprop = input_backprop_tensor.eval()
               row_seq = list(range(0, num_rows + 1, row_window_size))
               col_seq = list(range(0, num_cols + 1, col_window_size))
-              fap_input_backprop_tensor = gen_nn_ops._fractional_avg_pool_grad(
+              fap_input_backprop_tensor = gen_nn_ops.fractional_avg_pool_grad(
                   input_tensor.get_shape(),
                   output_backprop,
                   row_seq,
@@ -443,7 +443,7 @@ class FractionalAvgPoolGradTest(test.TestCase):
                 num_elements *= dim_size
               output_backprop = (self._PRNG.rand(num_elements) *
                                  1000).reshape(output_data.shape)
-              input_backprop_tensor = gen_nn_ops._avg_pool_grad(
+              input_backprop_tensor = gen_nn_ops.avg_pool_grad(
                   input_tensor.get_shape(), output_backprop, window_size,
                   stride_size, padding)
               input_backprop = input_backprop_tensor.eval()
@@ -451,7 +451,7 @@ class FractionalAvgPoolGradTest(test.TestCase):
               col_seq = list(range(0, num_cols, col_window_size - 1))
               row_seq[-1] += 1
               col_seq[-1] += 1
-              fap_input_backprop_tensor = gen_nn_ops._fractional_avg_pool_grad(
+              fap_input_backprop_tensor = gen_nn_ops.fractional_avg_pool_grad(
                   input_tensor.get_shape(),
                   output_backprop,
                   row_seq,
diff --git a/tensorflow/python/kernel_tests/fractional_max_pool_op_test.py b/tensorflow/python/kernel_tests/fractional_max_pool_op_test.py
index 5983ae7759dbf3eb2db9867def829ce8dbeb4b73..6477c9ebc4c35fcc5963b27a0f5c50624a73fa09 100644
--- a/tensorflow/python/kernel_tests/fractional_max_pool_op_test.py
+++ b/tensorflow/python/kernel_tests/fractional_max_pool_op_test.py
@@ -318,7 +318,7 @@ class FractionalMaxPoolGradTest(test.TestCase):
 
   Two types of tests for FractionalMaxPoolGrad.
   1) Test fractional_max_pool_grad() directly.
-    This type of test relies on gen_nn_ops._max_pool_grad() returns the correct
+    This type of test relies on gen_nn_ops.max_pool_grad() returning the correct
   result. For example:
     * input_tensor_shape = (1, 10, 10, 1)
     * window_size = (1, 2, 2, 1)
@@ -384,16 +384,13 @@ class FractionalMaxPoolGradTest(test.TestCase):
                                               stride_size, padding)
               output_data = output_tensor.eval()
               output_backprop = self._PRNG.randint(100, size=output_data.shape)
-              input_backprop_tensor = gen_nn_ops._max_pool_grad(input_tensor,
-                                                                output_tensor,
-                                                                output_backprop,
-                                                                window_size,
-                                                                stride_size,
-                                                                padding)
+              input_backprop_tensor = gen_nn_ops.max_pool_grad(
+                  input_tensor, output_tensor, output_backprop, window_size,
+                  stride_size, padding)
               input_backprop = input_backprop_tensor.eval()
               row_seq = list(range(0, num_rows + 1, row_window_size))
               col_seq = list(range(0, num_cols + 1, col_window_size))
-              fmp_input_backprop_tensor = gen_nn_ops._fractional_max_pool_grad(
+              fmp_input_backprop_tensor = gen_nn_ops.fractional_max_pool_grad(
                   input_tensor,
                   output_tensor,
                   output_backprop,
@@ -422,18 +419,15 @@ class FractionalMaxPoolGradTest(test.TestCase):
                                               stride_size, padding)
               output_data = output_tensor.eval()
               output_backprop = self._PRNG.randint(100, size=output_data.shape)
-              input_backprop_tensor = gen_nn_ops._max_pool_grad(input_tensor,
-                                                                output_tensor,
-                                                                output_backprop,
-                                                                window_size,
-                                                                stride_size,
-                                                                padding)
+              input_backprop_tensor = gen_nn_ops.max_pool_grad(
+                  input_tensor, output_tensor, output_backprop, window_size,
+                  stride_size, padding)
               input_backprop = input_backprop_tensor.eval()
               row_seq = list(range(0, num_rows, row_window_size - 1))
               col_seq = list(range(0, num_cols, col_window_size - 1))
               row_seq[-1] += 1
               col_seq[-1] += 1
-              fmp_input_backprop_tensor = gen_nn_ops._fractional_max_pool_grad(
+              fmp_input_backprop_tensor = gen_nn_ops.fractional_max_pool_grad(
                   input_tensor,
                   output_tensor,
                   output_backprop,
@@ -591,7 +585,7 @@ class FractionalMaxPoolGradTest(test.TestCase):
       output_tensor = constant_op.constant(
           output_data_not_overlapping, shape=output_size)
       grad = constant_op.constant(output_backprop, shape=output_size)
-      r = gen_nn_ops._fractional_max_pool_grad(
+      r = gen_nn_ops.fractional_max_pool_grad(
           input_tensor,
           output_tensor,
           grad,
@@ -606,7 +600,7 @@ class FractionalMaxPoolGradTest(test.TestCase):
       # Test when overlapping is True
       output_tensor = constant_op.constant(
           output_data_overlapping, shape=output_size)
-      r = gen_nn_ops._fractional_max_pool_grad(
+      r = gen_nn_ops.fractional_max_pool_grad(
           input_tensor, output_tensor, grad, row_seq, col_seq, overlapping=True)
       input_backprop_overlapping = r.eval()
       self.assertShapeEqual(
diff --git a/tensorflow/python/kernel_tests/linalg/linear_operator_diag_test.py b/tensorflow/python/kernel_tests/linalg/linear_operator_diag_test.py
index 343d158498833dd92361bc41d433e28296fc4c9a..8cb9f9e6213cda8daae7b629fc31d4721fd48fa7 100644
--- a/tensorflow/python/kernel_tests/linalg/linear_operator_diag_test.py
+++ b/tensorflow/python/kernel_tests/linalg/linear_operator_diag_test.py
@@ -129,7 +129,7 @@ class LinearOperatorDiagTest(
     with self.test_session() as sess:
       x = random_ops.random_normal(shape=(2, 2, 3, 4))
 
-      # This LinearOperatorDiag will be brodacast to (2, 2, 3, 3) during solve
+      # This LinearOperatorDiag will be broadcast to (2, 2, 3, 3) during solve
       # and matmul with 'x' as the argument.
       diag = random_ops.random_uniform(shape=(2, 1, 3))
       operator = linalg.LinearOperatorDiag(diag, is_self_adjoint=True)
diff --git a/tensorflow/python/kernel_tests/losses_test.py b/tensorflow/python/kernel_tests/losses_test.py
index 197dbf44afaea2cfaf5a1ffebb6ac0a6be09d165..1123c20a165ba93bd380fa471a8be91f7005d7bb 100644
--- a/tensorflow/python/kernel_tests/losses_test.py
+++ b/tensorflow/python/kernel_tests/losses_test.py
@@ -20,6 +20,7 @@ from __future__ import print_function
 
 import numpy as np
 
+from tensorflow.python.eager import context
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors_impl
@@ -32,11 +33,25 @@ from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.ops.losses import losses
+from tensorflow.python.ops.losses import losses_impl
 from tensorflow.python.ops.losses import util
 from tensorflow.python.platform import test
 from tensorflow.python.training import momentum as momentum_lib
 
 
+safe_div = losses_impl._safe_div  # pylint: disable=protected-access
+
+
+class SafeDivTest(test.TestCase):
+
+  def testEager(self):
+    with context.eager_mode():
+      self.assertAllEqual(safe_div(constant_op.constant(1.0),
+                                   constant_op.constant(0.0)), 0.0)
+      self.assertAllEqual(safe_div(constant_op.constant(1.0),
+                                   0.0), 0.0)
+
+
 class AbsoluteDifferenceLossTest(test.TestCase):
 
   def setUp(self):
diff --git a/tensorflow/python/kernel_tests/matrix_exponential_op_test.py b/tensorflow/python/kernel_tests/matrix_exponential_op_test.py
index 6203a412d7faec4fe9f6179141301579b5900291..a0c66c77d8850d3144678870983730537a253556 100644
--- a/tensorflow/python/kernel_tests/matrix_exponential_op_test.py
+++ b/tensorflow/python/kernel_tests/matrix_exponential_op_test.py
@@ -48,7 +48,7 @@ class ExponentialOpTest(test.TestCase):
   def _verifyExponential(self, x, np_type):
     inp = x.astype(np_type)
     with self.test_session(use_gpu=True):
-      tf_ans = gen_linalg_ops._matrix_exponential(inp)
+      tf_ans = gen_linalg_ops.matrix_exponential(inp)
       if x.size == 0:
         np_ans = np.empty(x.shape, dtype=np_type)
       else:
@@ -116,13 +116,13 @@ class ExponentialOpTest(test.TestCase):
     # When the exponential of a non-square matrix is attempted we should return
     # an error
     with self.assertRaises(ValueError):
-      gen_linalg_ops._matrix_exponential(np.array([[1., 2., 3.], [3., 4., 5.]]))
+      gen_linalg_ops.matrix_exponential(np.array([[1., 2., 3.], [3., 4., 5.]]))
 
   def testWrongDimensions(self):
     # The input to the exponential should be at least a 2-dimensional tensor.
     tensor3 = constant_op.constant([1., 2.])
     with self.assertRaises(ValueError):
-      gen_linalg_ops._matrix_exponential(tensor3)
+      gen_linalg_ops.matrix_exponential(tensor3)
 
   def testEmpty(self):
     self._verifyExponentialReal(np.empty([0, 2, 2]))
@@ -143,8 +143,8 @@ class ExponentialOpTest(test.TestCase):
     with self.test_session(use_gpu=True) as sess:
       matrix1 = random_ops.random_normal([5, 5], seed=42)
       matrix2 = random_ops.random_normal([5, 5], seed=42)
-      expm1 = gen_linalg_ops._matrix_exponential(matrix1)
-      expm2 = gen_linalg_ops._matrix_exponential(matrix2)
+      expm1 = gen_linalg_ops.matrix_exponential(matrix1)
+      expm2 = gen_linalg_ops.matrix_exponential(matrix2)
       expm = sess.run([expm1, expm2])
       self.assertAllEqual(expm[0], expm[1])
 
@@ -180,7 +180,7 @@ class MatrixExponentialBenchmark(test.Benchmark):
           session.Session() as sess, \
           ops.device("/cpu:0"):
         matrix = self._GenerateMatrix(shape)
-        expm = gen_linalg_ops._matrix_exponential(matrix)
+        expm = gen_linalg_ops.matrix_exponential(matrix)
         variables.global_variables_initializer().run()
         self.run_op_benchmark(
             sess,
diff --git a/tensorflow/python/kernel_tests/matrix_logarithm_op_test.py b/tensorflow/python/kernel_tests/matrix_logarithm_op_test.py
index 18ed59828c15f5ad21fe054cd6e40991c02bb356..24edc4f59fe6dd84da6732036eb53e2ad367bd06 100644
--- a/tensorflow/python/kernel_tests/matrix_logarithm_op_test.py
+++ b/tensorflow/python/kernel_tests/matrix_logarithm_op_test.py
@@ -39,8 +39,8 @@ class LogarithmOpTest(test.TestCase):
     inp = x.astype(np_type)
     with self.test_session(use_gpu=True):
       # Verify that expm(logm(A)) == A.
-      tf_ans = gen_linalg_ops._matrix_exponential(
-          gen_linalg_ops._matrix_logarithm(inp))
+      tf_ans = gen_linalg_ops.matrix_exponential(
+          gen_linalg_ops.matrix_logarithm(inp))
       out = tf_ans.eval()
       self.assertAllClose(inp, out, rtol=1e-4, atol=1e-3)
 
@@ -85,14 +85,14 @@ class LogarithmOpTest(test.TestCase):
     # When the logarithm of a non-square matrix is attempted we should return
     # an error
     with self.assertRaises(ValueError):
-      gen_linalg_ops._matrix_logarithm(
+      gen_linalg_ops.matrix_logarithm(
           np.array([[1., 2., 3.], [3., 4., 5.]], dtype=np.complex64))
 
   def testWrongDimensions(self):
     # The input to the logarithm should be at least a 2-dimensional tensor.
     tensor3 = constant_op.constant([1., 2.], dtype=dtypes.complex64)
     with self.assertRaises(ValueError):
-      gen_linalg_ops._matrix_logarithm(tensor3)
+      gen_linalg_ops.matrix_logarithm(tensor3)
 
   def testEmpty(self):
     self._verifyLogarithmComplex(np.empty([0, 2, 2], dtype=np.complex64))
@@ -115,8 +115,8 @@ class LogarithmOpTest(test.TestCase):
           random_ops.random_normal([5, 5], seed=42), dtypes.complex64)
       matrix2 = math_ops.cast(
           random_ops.random_normal([5, 5], seed=42), dtypes.complex64)
-      logm1 = gen_linalg_ops._matrix_logarithm(matrix1)
-      logm2 = gen_linalg_ops._matrix_logarithm(matrix2)
+      logm1 = gen_linalg_ops.matrix_logarithm(matrix1)
+      logm2 = gen_linalg_ops.matrix_logarithm(matrix2)
       logm = sess.run([logm1, logm2])
       self.assertAllEqual(logm[0], logm[1])
 
@@ -152,7 +152,7 @@ class MatrixLogarithmBenchmark(test.Benchmark):
           session.Session() as sess, \
           ops.device("/cpu:0"):
         matrix = self._GenerateMatrix(shape)
-        logm = gen_linalg_ops._matrix_logarithm(matrix)
+        logm = gen_linalg_ops.matrix_logarithm(matrix)
         variables.global_variables_initializer().run()
         self.run_op_benchmark(
             sess,
diff --git a/tensorflow/python/kernel_tests/metrics_test.py b/tensorflow/python/kernel_tests/metrics_test.py
index fd78c026c273da1ffecf9e1dfe8c9e6042a4be69..59e7afa2dcb1e02ed9c66e5cf75753f96552b4e0 100644
--- a/tensorflow/python/kernel_tests/metrics_test.py
+++ b/tensorflow/python/kernel_tests/metrics_test.py
@@ -417,7 +417,7 @@ class MeanTensorTest(test.TestCase):
 
       self.assertAllClose([[-0.9 / 4., 3.525]], sess.run(mean), 5)
 
-  def testWeighted1d(self):
+  def testBinaryWeighted1d(self):
     with self.test_session() as sess:
       # Create the queue that populates the values.
       values_queue = data_flow_ops.FIFOQueue(
@@ -444,6 +444,33 @@ class MeanTensorTest(test.TestCase):
         sess.run(update_op)
       self.assertAllClose([[3.25, 0.5]], sess.run(mean), 5)
 
+  def testWeighted1d(self):
+    with self.test_session() as sess:
+      # Create the queue that populates the values.
+      values_queue = data_flow_ops.FIFOQueue(
+          4, dtypes=dtypes_lib.float32, shapes=(1, 2))
+      _enqueue_vector(sess, values_queue, [0, 1])
+      _enqueue_vector(sess, values_queue, [-4.2, 9.1])
+      _enqueue_vector(sess, values_queue, [6.5, 0])
+      _enqueue_vector(sess, values_queue, [-3.2, 4.0])
+      values = values_queue.dequeue()
+
+      # Create the queue that populates the weights.
+      weights_queue = data_flow_ops.FIFOQueue(
+          4, dtypes=dtypes_lib.float32, shapes=(1, 1))
+      _enqueue_vector(sess, weights_queue, [[0.0025]])
+      _enqueue_vector(sess, weights_queue, [[0.005]])
+      _enqueue_vector(sess, weights_queue, [[0.01]])
+      _enqueue_vector(sess, weights_queue, [[0.0075]])
+      weights = weights_queue.dequeue()
+
+      mean, update_op = metrics.mean_tensor(values, weights)
+
+      sess.run(variables.local_variables_initializer())
+      for _ in range(4):
+        sess.run(update_op)
+      self.assertAllClose([[0.8, 3.12]], sess.run(mean), 5)
+
   def testWeighted2d_1(self):
     with self.test_session() as sess:
       # Create the queue that populates the values.
diff --git a/tensorflow/python/kernel_tests/pooling_ops_test.py b/tensorflow/python/kernel_tests/pooling_ops_test.py
index 4466beeec96509b3761e34d885276e1510c62d10..2f3bea5825f8889c5880c819ebf6b17aaa613f08 100644
--- a/tensorflow/python/kernel_tests/pooling_ops_test.py
+++ b/tensorflow/python/kernel_tests/pooling_ops_test.py
@@ -31,6 +31,7 @@ from tensorflow.python.ops import gen_nn_ops
 from tensorflow.python.ops import gradient_checker
 from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import variables
 import tensorflow.python.ops.nn_grad  # pylint: disable=unused-import
 from tensorflow.python.platform import test
 from tensorflow.python.platform import tf_logging
@@ -405,7 +406,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 3, 3, 3],
           ksize=[1, 2, 2, 1],
           strides=[1, 2, 2, 1],
@@ -427,7 +428,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 2, 3, 3],
           ksize=[1, 2, 2, 1],
           strides=[1, 2, 2, 1],
@@ -456,7 +457,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 2, 2, 1],
           ksize=[1, 1, 2, 1],
           strides=[1, 1, 1, 1],
@@ -485,7 +486,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 4, 4, 1],
           ksize=[1, 2, 2, 1],
           strides=[1, 1, 2, 1],
@@ -494,7 +495,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu,
           v2=v2)
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 4, 4, 1],
           ksize=[1, 2, 2, 1],
           strides=[1, 2, 1, 1],
@@ -519,7 +520,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 4, 4, 4],
           ksize=[1, 2, 2, 1],
           strides=[1, 2, 2, 1],
@@ -554,7 +555,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 8, 8, 8],
           ksize=[1, 3, 3, 1],
           strides=[1, 2, 2, 1],
@@ -565,7 +566,7 @@ class PoolingTest(test.TestCase):
 
   def _testMaxPoolEmptyInput(self, use_gpu):
     self._VerifyValues(
-        gen_nn_ops._max_pool_v2,
+        gen_nn_ops.max_pool_v2,
         input_sizes=[0, 8, 8, 8],
         ksize=[1, 3, 3, 1],
         strides=[1, 2, 2, 1],
@@ -600,7 +601,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 1, 1, 10],
           ksize=[1, 1, 1, 2],
           strides=[1, 1, 1, 2],
@@ -626,7 +627,7 @@ class PoolingTest(test.TestCase):
 
     for v2 in [True, False]:
       self._VerifyValues(
-          gen_nn_ops._max_pool_v2,
+          gen_nn_ops.max_pool_v2,
           input_sizes=[1, 2, 2, 6],
           ksize=[1, 1, 1, 3],
           strides=[1, 1, 1, 3],
@@ -648,7 +649,7 @@ class PoolingTest(test.TestCase):
 
       for v2 in [True, False]:
         self._VerifyValues(
-            gen_nn_ops._max_pool_v2,
+            gen_nn_ops.max_pool_v2,
             input_sizes=[1, 7, 7, 1],
             ksize=[1, 2, 2, 1],
             strides=[1, 3, 3, 1],
@@ -689,7 +690,7 @@ class PoolingTest(test.TestCase):
 
       for v2 in [True, False]:
         self._VerifyValues(
-            gen_nn_ops._max_pool_v2,
+            gen_nn_ops.max_pool_v2,
             input_sizes=[1, 3, 3, 1],
             ksize=[1, 1, 1, 1],
             strides=[1, 2, 2, 1],
@@ -699,7 +700,7 @@ class PoolingTest(test.TestCase):
             v2=v2)
 
         self._VerifyValues(
-            gen_nn_ops._max_pool_v2,
+            gen_nn_ops.max_pool_v2,
             input_sizes=[1, 4, 4, 1],
             ksize=[1, 1, 1, 1],
             strides=[1, 2, 2, 1],
@@ -731,7 +732,8 @@ class PoolingTest(test.TestCase):
                                             [1, 1, 1, 3], "evenly divide")
     if test.is_gpu_available():
       with self.test_session(use_gpu=True):
-        t = constant_op.constant(1.0, shape=[1, 2, 2, 4])
+        t = variables.Variable(np.ones([1, 2, 2, 4]))
+        variables.global_variables_initializer().run()
         with self.assertRaisesOpError("for CPU devices"):
           nn_ops.max_pool(
               t, ksize=[1, 1, 1, 2], strides=[1, 1, 1, 2],
@@ -764,8 +766,8 @@ class PoolingTest(test.TestCase):
         _, argmax_op = nn_ops.max_pool_with_argmax(t, ksize, strides, padding)
         argmax = argmax_op.eval()
         grad_in = constant_op.constant(tensor_output, shape=output_shape)
-        out_op = gen_nn_ops._max_pool_grad_with_argmax(t, grad_in, argmax,
-                                                       ksize, strides, padding)
+        out_op = gen_nn_ops.max_pool_grad_with_argmax(t, grad_in, argmax, ksize,
+                                                      strides, padding)
         gpu_val = out_op.eval()
         self.assertShapeEqual(gpu_val, out_op)
       with self.test_session(use_gpu=False):
@@ -773,8 +775,8 @@ class PoolingTest(test.TestCase):
         out_op = nn_ops.max_pool(t, ksize, strides, padding)
         orig_out = out_op.eval()
         grad_in = constant_op.constant(tensor_output, shape=output_shape)
-        out_op = gen_nn_ops._max_pool_grad(t, orig_out, grad_in, ksize, strides,
-                                           padding)
+        out_op = gen_nn_ops.max_pool_grad(t, orig_out, grad_in, ksize, strides,
+                                          padding)
         cpu_val = out_op.eval()
         self.assertShapeEqual(cpu_val, out_op)
       # The CPU version accumulates its gradient on fp16, so it's less
@@ -793,7 +795,7 @@ class PoolingTest(test.TestCase):
         _, argmax_op = nn_ops.max_pool_with_argmax(t, ksize, strides, padding)
         argmax = argmax_op.eval()
         grad_in = constant_op.constant(tensor_input, shape=input_shape)
-        out_op = gen_nn_ops._max_pool_grad_grad_with_argmax(
+        out_op = gen_nn_ops.max_pool_grad_grad_with_argmax(
             t, grad_in, argmax, ksize, strides, padding)
         gpu_val = out_op.eval()
         self.assertShapeEqual(gpu_val, out_op)
@@ -802,8 +804,8 @@ class PoolingTest(test.TestCase):
         out_op = nn_ops.max_pool(t, ksize, strides, padding)
         orig_out = out_op.eval()
         grad_in = constant_op.constant(tensor_input, shape=input_shape)
-        out_op = gen_nn_ops._max_pool_grad_grad(t, orig_out, grad_in, ksize,
-                                                strides, padding)
+        out_op = gen_nn_ops.max_pool_grad_grad(t, orig_out, grad_in, ksize,
+                                               strides, padding)
         cpu_val = out_op.eval()
         self.assertShapeEqual(cpu_val, out_op)
       # The CPU version accumulates its gradient on fp16, so it's less
@@ -842,7 +844,7 @@ class PoolingTest(test.TestCase):
       t = constant_op.constant(tensor_input, shape=[1, 2, 2, 1])
       argmax = constant_op.constant(
           tensor_argmax, shape=[1, 2, 2, 1], dtype=dtypes.int64)
-      out_op = gen_nn_ops._max_pool_grad_with_argmax(
+      out_op = gen_nn_ops.max_pool_grad_with_argmax(
           orig_in,
           t,
           argmax,
@@ -865,7 +867,7 @@ class PoolingTest(test.TestCase):
       t = constant_op.constant(tensor_input, shape=[1, 3, 3, 1])
       argmax = constant_op.constant(
           tensor_argmax, shape=[1, 2, 2, 1], dtype=dtypes.int64)
-      out_op = gen_nn_ops._max_pool_grad_grad_with_argmax(
+      out_op = gen_nn_ops.max_pool_grad_grad_with_argmax(
           orig_in,
           t,
           argmax,
@@ -1029,7 +1031,7 @@ class PoolingTest(test.TestCase):
     self.assertLess(err, err_tolerance)
 
   def _testMaxPoolGradValidPadding1_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[1, 3, 3, 1],
@@ -1043,7 +1045,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradValidPadding2_1_6(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 6, 6, 3],
@@ -1057,7 +1059,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradValidPadding2_1_7(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 7, 7, 3],
@@ -1071,7 +1073,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradValidPadding1_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[1, 3, 3, 1],
@@ -1085,7 +1087,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradValidPadding2_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 2, 2, 3],
@@ -1099,7 +1101,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradSamePadding1_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1113,7 +1115,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradSamePadding1_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1127,7 +1129,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradSamePadding2_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1141,7 +1143,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradSamePadding2_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1155,7 +1157,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradSamePadding3_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestGradient(
           pool_func,
           input_sizes=[1, 7, 7, 1],
@@ -1199,7 +1201,7 @@ class PoolingTest(test.TestCase):
     Returns:
       A Tensor.
     """
-    pool_func = gen_nn_ops.max_pool_grad_v2 if v2 else gen_nn_ops._max_pool_grad
+    pool_func = gen_nn_ops.max_pool_grad_v2 if v2 else gen_nn_ops.max_pool_grad
     return pool_func(orig_input, orig_output, grad,
                      [1, window_rows, window_cols, 1],
                      [1, row_stride, col_stride, 1], padding)
@@ -1208,9 +1210,11 @@ class PoolingTest(test.TestCase):
                              expected_input_backprop, input_sizes, output_sizes,
                              window_rows, window_cols, row_stride, col_stride,
                              padding, use_gpu, v2):
-    pool_func = gen_nn_ops._max_pool_v2 if v2 else nn_ops.max_pool
+    pool_func = gen_nn_ops.max_pool_v2 if v2 else nn_ops.max_pool
     with self.test_session(use_gpu=use_gpu):
-      input_tensor = constant_op.constant(input_data, shape=input_sizes)
+      input_tensor = variables.Variable(
+          np.array(input_data, dtype=np.float32).reshape(input_sizes))
+      variables.global_variables_initializer().run()
       output_tensor = pool_func(input_tensor, [1, window_rows, window_cols, 1],
                                 [1, row_stride, col_stride, 1], padding)
       output_backprop_tensor = constant_op.constant(
@@ -1504,7 +1508,7 @@ class PoolingTest(test.TestCase):
     self._testMaxPoolGradDirectWithNans2_2()
 
   def _testMaxPoolGradGradValidPadding1_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[1, 3, 3, 1],
@@ -1518,7 +1522,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradValidPadding2_1_6(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 6, 6, 3],
@@ -1532,7 +1536,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradValidPadding2_1_7(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 7, 7, 3],
@@ -1546,7 +1550,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradValidPadding2_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 2, 2, 3],
@@ -1560,7 +1564,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradSamePadding1_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1574,7 +1578,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradSamePadding2_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1588,7 +1592,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradSamePadding2_2(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[2, 2, 4, 3],
@@ -1602,7 +1606,7 @@ class PoolingTest(test.TestCase):
           use_gpu=use_gpu)
 
   def _testMaxPoolGradGradSamePadding3_1(self, data_format, use_gpu):
-    for pool_func in [gen_nn_ops._max_pool_v2, nn_ops.max_pool]:
+    for pool_func in [gen_nn_ops.max_pool_v2, nn_ops.max_pool]:
       self._ConstructAndTestSecondGradient(
           pool_func,
           input_sizes=[1, 7, 7, 1],
@@ -1644,7 +1648,7 @@ class PoolingTest(test.TestCase):
     Returns:
       A Tensor.
     """
-    return gen_nn_ops._max_pool_grad_grad(
+    return gen_nn_ops.max_pool_grad_grad(
         orig_input, orig_output, grad, [1, window_rows, window_cols, 1],
         [1, row_stride, col_stride, 1], padding)
 
diff --git a/tensorflow/python/kernel_tests/reduction_ops_test.py b/tensorflow/python/kernel_tests/reduction_ops_test.py
index 531478162971575739bbe37abfc57ca427ae22ae..589ea54973c10902c461f552d5c54b6fad6ecf67 100644
--- a/tensorflow/python/kernel_tests/reduction_ops_test.py
+++ b/tensorflow/python/kernel_tests/reduction_ops_test.py
@@ -30,6 +30,7 @@ from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gradient_checker
 from tensorflow.python.ops import math_ops
+from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
 
 # The maximum input rank to test.
@@ -212,7 +213,8 @@ class SumReductionTest(BaseReductionTest):
     arr = np.ones([68000], dtype=np.float16)
 
     with self.test_session(graph=ops.Graph(), use_gpu=True) as sess:
-      tf_arr = array_ops.constant(arr)
+      tf_arr = variables.Variable(arr)
+      variables.global_variables_initializer().run()
       tf_mean = math_ops.reduce_mean(tf_arr, 0, False)
       tf_out_mean = sess.run(tf_mean)
     self.assertAllClose(tf_out_mean, 1.)
@@ -887,11 +889,7 @@ class AnyReductionTest(test.TestCase):
 
 class CountNonzeroReductionTest(test.TestCase):
 
-  def _compare(self,
-               x,
-               reduction_axes,
-               keepdims,
-               use_gpu=False,
+  def _compare(self, x, reduction_axes, keepdims, use_gpu=False,
                feed_dict=None):
     np_ans = (x != 0).astype(np.int32)
     if reduction_axes is None:
diff --git a/tensorflow/python/kernel_tests/regex_replace_op_test.py b/tensorflow/python/kernel_tests/regex_replace_op_test.py
new file mode 100644
index 0000000000000000000000000000000000000000..6739ac32245668e98d37673fe9e9fe9d55cc0c5f
--- /dev/null
+++ b/tensorflow/python/kernel_tests/regex_replace_op_test.py
@@ -0,0 +1,71 @@
+# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Tests for RegexReplace op from string_ops."""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from tensorflow.python.framework import constant_op
+from tensorflow.python.framework import dtypes
+from tensorflow.python.ops import string_ops
+from tensorflow.python.platform import test
+
+
+class RegexReplaceOpTest(test.TestCase):
+
+  def testRemovePrefix(self):
+    values = ["a:foo", "a:bar", "a:foo", "b:baz", "b:qux", "ca:b"]
+    with self.test_session():
+      input_vector = constant_op.constant(values, dtypes.string)
+      stripped = string_ops.regex_replace(
+          input_vector, "^(a:|b:)", "", replace_global=False).eval()
+      self.assertAllEqual([b"foo", b"bar", b"foo", b"baz", b"qux", b"ca:b"],
+                          stripped)
+
+  def testRegexReplace(self):
+    values = ["aba\naba", "abcdabcde"]
+    with self.test_session():
+      input_vector = constant_op.constant(values, dtypes.string)
+      stripped = string_ops.regex_replace(input_vector, "a.*a", "(\\0)").eval()
+      self.assertAllEqual([b"(aba)\n(aba)", b"(abcda)bcde"], stripped)
+
+  def testEmptyMatch(self):
+    values = ["abc", "1"]
+    with self.test_session():
+      input_vector = constant_op.constant(values, dtypes.string)
+      stripped = string_ops.regex_replace(input_vector, "", "x").eval()
+      self.assertAllEqual([b"xaxbxcx", b"x1x"], stripped)
+
+  def testInvalidPattern(self):
+    values = ["abc", "1"]
+    with self.test_session():
+      input_vector = constant_op.constant(values, dtypes.string)
+      invalid_pattern = "A["
+      replace = string_ops.regex_replace(input_vector, invalid_pattern, "x")
+      with self.assertRaisesOpError("Invalid pattern"):
+        replace.eval()
+
+  def testGlobal(self):
+    values = ["ababababab", "abcabcabc", ""]
+    with self.test_session():
+      input_vector = constant_op.constant(values, dtypes.string)
+      stripped = string_ops.regex_replace(input_vector, "ab", "abc",
+                                          True).eval()
+      self.assertAllEqual([b"abcabcabcabcabc", b"abccabccabcc", b""], stripped)
+
+
+if __name__ == "__main__":
+  test.main()
diff --git a/tensorflow/python/kernel_tests/relu_op_test.py b/tensorflow/python/kernel_tests/relu_op_test.py
index 6b4091ae5d3c6e469a9cd5237b978eae4c75485f..25e947f09e137b37ea129ba6015a060aa01f02e4 100644
--- a/tensorflow/python/kernel_tests/relu_op_test.py
+++ b/tensorflow/python/kernel_tests/relu_op_test.py
@@ -19,12 +19,14 @@ from __future__ import division
 from __future__ import print_function
 
 import numpy as np
+from six.moves import xrange  # pylint: disable=redefined-builtin
 
 from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gradient_checker
 from tensorflow.python.ops import gradients_impl
+from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import random_ops
 from tensorflow.python.ops import variables
@@ -87,6 +89,35 @@ class ReluTest(test.TestCase):
     print("relu (float32) gradient err = ", err)
     self.assertLess(err, 1e-4)
 
+  # The fp16 gradient is inaccurate due to low precision.
+  # Instead of relying on compute_gradient_error, we compare the fp16 analytical
+  # gradient against its fp32 counterpart.
+  def testGradientFloat16(self):
+    with self.test_session(use_gpu=True) as sess:
+      # Randomly construct a 1D shape from [1, 40)
+      shape = random_ops.random_uniform(
+          [1], minval=1, maxval=40, dtype=dtypes.int32)
+
+      # Construct the fp32 graph and its gradient.
+      x = random_ops.random_uniform(shape, minval=-1, maxval=1, name="x")
+      y1 = nn_ops.relu(x, name="relu_fp32")
+      l1 = nn_ops.l2_loss(y1)
+      dx_f32 = gradients_impl.gradients(l1, x)
+
+      # Construct the fp16 graph and its gradient.
+      # It starts with the same x, in fp32. But before it reaches Relu, it is
+      # cast into fp16. So during backprop, the gradient computation is in fp16.
+      x2 = math_ops.cast(x, dtype=dtypes.float16, name="cast")
+      y2 = nn_ops.relu(x2, name="relu_fp16")
+      l2 = nn_ops.l2_loss(y2)
+      dx_f16 = gradients_impl.gradients(l2, x)
+
+      # Repeat the experiment 100 times. All tensor shapes and tensor values
+      # are randomly generated for each run.
+      for _ in xrange(100):
+        dx_f32_v, dx_f16_v = sess.run([dx_f32, dx_f16])
+        self.assertAllClose(dx_f32_v, dx_f16_v, atol=3e-4)
+
   def testGradientFloat64(self):
     with self.test_session():
       x = constant_op.constant(
diff --git a/tensorflow/python/kernel_tests/resource_variable_ops_test.py b/tensorflow/python/kernel_tests/resource_variable_ops_test.py
index 8503f3e0310125bb714942b32bbbf46596f9bddb..10ba9fa674236ecf17f21aa79a2bd25f49bf82e6 100644
--- a/tensorflow/python/kernel_tests/resource_variable_ops_test.py
+++ b/tensorflow/python/kernel_tests/resource_variable_ops_test.py
@@ -277,6 +277,20 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
     self.evaluate(v.assign(2.0))
     self.assertEqual(2.0, self.evaluate(v.value()))
 
+    # Tests for the 'read_value' argument:
+    assign_with_read = v.assign(3.0, read_value=True)
+    if context.in_graph_mode():
+      self.assertEqual(3.0, assign_with_read.eval())
+    else:
+      self.assertEqual(3.0, self.evaluate(assign_with_read))
+    assign_without_read = v.assign(4.0, read_value=False)
+    if context.in_graph_mode():
+      self.assertIsInstance(assign_without_read, ops.Operation)
+    else:
+      self.assertIsNone(assign_without_read)
+    self.evaluate(assign_without_read)
+    self.assertEqual(4.0, self.evaluate(v.value()))
+
   @test_util.run_in_graph_and_eager_modes()
   def testLoad(self):
     v = resource_variable_ops.ResourceVariable(1.0, name="var0")
@@ -329,6 +343,9 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
       w = resource_variable_ops.ResourceVariable.from_proto(v.to_proto())
       self.assertEquals(2, math_ops.add(w, 1).eval())
 
+      self.assertEquals(v._handle, w._handle)
+      self.assertEquals(v._graph_element, w._graph_element)
+
   @test_util.run_in_graph_and_eager_modes()
   def testAssignAddMethod(self):
     v = resource_variable_ops.ResourceVariable(1.0, name="var0")
@@ -336,6 +353,20 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
     self.evaluate(v.assign_add(1.0))
     self.assertEqual(2.0, self.evaluate(v.value()))
 
+    # Tests for the 'read_value' argument:
+    assign_with_read = v.assign_add(1.0, read_value=True)
+    if context.in_graph_mode():
+      self.assertEqual(3.0, assign_with_read.eval())
+    else:
+      self.assertEqual(3.0, self.evaluate(assign_with_read))
+    assign_without_read = v.assign_add(1.0, read_value=False)
+    if context.in_graph_mode():
+      self.assertIsInstance(assign_without_read, ops.Operation)
+    else:
+      self.assertIsNone(assign_without_read)
+    self.evaluate(assign_without_read)
+    self.assertEqual(4.0, self.evaluate(v.value()))
+
   @test_util.run_in_graph_and_eager_modes()
   def testAssignSubMethod(self):
     v = resource_variable_ops.ResourceVariable(3.0, name="var0")
@@ -343,6 +374,20 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
     self.evaluate(v.assign_sub(1.0))
     self.assertEqual(2.0, self.evaluate(v.value()))
 
+    # Tests for the 'read_value' argument:
+    assign_with_read = v.assign_sub(1.0, read_value=True)
+    if context.in_graph_mode():
+      self.assertEqual(1.0, assign_with_read.eval())
+    else:
+      self.assertEqual(1.0, self.evaluate(assign_with_read))
+    assign_without_read = v.assign_sub(1.0, read_value=False)
+    if context.in_graph_mode():
+      self.assertIsInstance(assign_without_read, ops.Operation)
+    else:
+      self.assertIsNone(assign_without_read)
+    self.evaluate(assign_without_read)
+    self.assertEqual(0.0, self.evaluate(v.value()))
+
   @test_util.run_in_graph_and_eager_modes()
   def testDestroyResource(self):
     v = resource_variable_ops.ResourceVariable(3.0, name="var0")
@@ -481,7 +526,6 @@ class ResourceVariableOpsTest(test_util.TensorFlowTestCase):
       self.assertEqual(dtypes.int32, v.dtype)
       self.assertEqual("foo/var7:0", v.name)
       self.assertAllEqual([10, 20, 35], v.shape.as_list())
-      self.assertEqual(context.get_default_context().device_name, v.device)
       self.assertTrue(isinstance(v.handle, ops.EagerTensor))
       self.assertEqual(constraint, v.constraint)
       self.assertAllEqual(init.numpy(), v.read_value().numpy())
diff --git a/tensorflow/python/kernel_tests/save_restore_ops_test.py b/tensorflow/python/kernel_tests/save_restore_ops_test.py
index 1bdfa9ebd8e1a4495e67004f59adfb51bf3a6602..cb9aa1e34d6eb82efa94e60e7b56c26b181cef04 100644
--- a/tensorflow/python/kernel_tests/save_restore_ops_test.py
+++ b/tensorflow/python/kernel_tests/save_restore_ops_test.py
@@ -31,11 +31,10 @@ class ShardedFileOpsTest(test.TestCase):
     with session.Session(
         target="", config=config_pb2.ConfigProto(device_count={"CPU": 2})):
       self.assertEqual(
-          gen_io_ops._sharded_filename("foo", 4, 100).eval(),
+          gen_io_ops.sharded_filename("foo", 4, 100).eval(),
           b"foo-00004-of-00100")
       self.assertEqual(
-          gen_io_ops._sharded_filespec("foo", 100).eval(),
-          b"foo-?????-of-00100")
+          gen_io_ops.sharded_filespec("foo", 100).eval(), b"foo-?????-of-00100")
 
 
 class ShapeInferenceTest(test.TestCase):
@@ -53,7 +52,7 @@ class ShapeInferenceTest(test.TestCase):
                         [dtypes.float32, dtypes.float32])
 
   def testRestoreSlice(self):
-    op = gen_io_ops._restore_slice("model", "var", "3 4 0,1:-", dtypes.float32)
+    op = gen_io_ops.restore_slice("model", "var", "3 4 0,1:-", dtypes.float32)
     self.assertEqual([1, 4], op.get_shape())
 
 
diff --git a/tensorflow/python/kernel_tests/scalar_test.py b/tensorflow/python/kernel_tests/scalar_test.py
index e65241981eac2d42207c1de7a261f7936e588f2a..0d8fd232946883ac1d95c4c2d9744af69175ab90 100644
--- a/tensorflow/python/kernel_tests/scalar_test.py
+++ b/tensorflow/python/kernel_tests/scalar_test.py
@@ -92,11 +92,11 @@ class ScalarTest(test.TestCase):
     self.check(array_ops.reshape, (7, 1), 'sizes input must be 1-D', [7])
 
   def testShardedFilename(self):
-    self.check(gen_io_ops._sharded_filename, ('foo', 4, [100]),
+    self.check(gen_io_ops.sharded_filename, ('foo', 4, [100]),
                'must be a scalar', b'foo-00004-of-00100')
 
   def testShardedFilespec(self):
-    self.check(gen_io_ops._sharded_filespec, ('foo', [100]), 'must be a scalar',
+    self.check(gen_io_ops.sharded_filespec, ('foo', [100]), 'must be a scalar',
                b'foo-?????-of-00100')
 
   def testUnsortedSegmentSum(self):
diff --git a/tensorflow/python/kernel_tests/softmax_op_test.py b/tensorflow/python/kernel_tests/softmax_op_test.py
index 4d89831aae9a5e95210a8defb180e09c9d38f4d6..2b8e99e18e6881143ee77c4f1ec5096635e5c1b2 100644
--- a/tensorflow/python/kernel_tests/softmax_op_test.py
+++ b/tensorflow/python/kernel_tests/softmax_op_test.py
@@ -20,7 +20,6 @@ from __future__ import print_function
 
 import numpy as np
 
-from tensorflow.python.framework import constant_op
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import test_util
@@ -166,7 +165,7 @@ class SoftmaxTest(test.TestCase):
 
   def testEmptyInput(self):
     with self.test_session():
-      x = constant_op.constant([[]], shape=[0, 3])
+      x = array_ops.placeholder(dtypes.float32, shape=[0, 3])
       self.assertEqual(0, array_ops.size(x).eval())
       # reshape would raise if logits is empty
       with self.assertRaises(errors_impl.InvalidArgumentError):
diff --git a/tensorflow/python/kernel_tests/spacetobatch_op_test.py b/tensorflow/python/kernel_tests/spacetobatch_op_test.py
index b943dfa4e5f2a06eddcb3af03764e5e046b715f4..2a9232b6aecb66328f10a62f2251246c4fcec6e6 100644
--- a/tensorflow/python/kernel_tests/spacetobatch_op_test.py
+++ b/tensorflow/python/kernel_tests/spacetobatch_op_test.py
@@ -86,11 +86,11 @@ class CppOpImpl(object):
 
   @staticmethod
   def space_to_batch(*args, **kwargs):
-    return gen_array_ops._space_to_batch(*args, **kwargs)
+    return gen_array_ops.space_to_batch(*args, **kwargs)
 
   @staticmethod
   def batch_to_space(*args, **kwargs):
-    return gen_array_ops._batch_to_space(*args, **kwargs)
+    return gen_array_ops.batch_to_space(*args, **kwargs)
 
 
 class SpaceToBatchTest(test.TestCase, PythonOpImpl):
diff --git a/tensorflow/python/kernel_tests/sparse_xent_op_test.py b/tensorflow/python/kernel_tests/sparse_xent_op_test.py
index cd5b711a0ed18aabff543aa4b6ecb1a885618caf..a841fe83a7f585a69ef33c437570359797484a4a 100644
--- a/tensorflow/python/kernel_tests/sparse_xent_op_test.py
+++ b/tensorflow/python/kernel_tests/sparse_xent_op_test.py
@@ -64,7 +64,7 @@ class SparseXentTest(test.TestCase):
   def _testXent(self, np_features, np_labels):
     np_loss, np_backprop = self._npXent(np_features, np_labels)
     with self.test_session(use_gpu=True) as sess:
-      loss, backprop = gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
+      loss, backprop = gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
           np_features, np_labels)
       tf_loss, tf_backprop = sess.run([loss, backprop])
     self.assertAllCloseAccordingToType(np_loss, tf_loss)
@@ -73,7 +73,7 @@ class SparseXentTest(test.TestCase):
   def testSingleClass(self):
     for label_dtype in np.int32, np.int64:
       with self.test_session(use_gpu=True) as sess:
-        loss, backprop = gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
+        loss, backprop = gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
             np.array([[1.], [-1.], [0.]]).astype(np.float32),
             np.array([0, 0, 0]).astype(label_dtype))
         tf_loss, tf_backprop = sess.run([loss, backprop])
@@ -87,8 +87,9 @@ class SparseXentTest(test.TestCase):
 
     if test.is_built_with_cuda() and test.is_gpu_available():
       with self.test_session(use_gpu=True) as sess:
-        loss, backprop = (gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
-            features, labels))
+        loss, backprop = (
+            gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
+                features, labels))
         tf_loss, tf_backprop = sess.run([loss, backprop])
         self.assertAllClose(
             [[np.nan] * 4, [0.25, 0.25, 0.25, -0.75],
@@ -100,8 +101,8 @@ class SparseXentTest(test.TestCase):
             [np.nan, 1.3862, 3.4420, np.nan], tf_loss, rtol=1e-3, atol=1e-3)
 
     with self.test_session(use_gpu=False) as sess:
-      loss, backprop = (gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
-          features, labels))
+      loss, backprop = (
+          gen_nn_ops.sparse_softmax_cross_entropy_with_logits(features, labels))
       with self.assertRaisesOpError("Received a label value of"):
         sess.run([loss, backprop])
 
diff --git a/tensorflow/python/kernel_tests/split_op_test.py b/tensorflow/python/kernel_tests/split_op_test.py
index 6171793b148f8d8f195b9548a13df89d29c5e96e..8cfee3eb933afcea7a58d5632948b87b0c4c10df 100644
--- a/tensorflow/python/kernel_tests/split_op_test.py
+++ b/tensorflow/python/kernel_tests/split_op_test.py
@@ -336,6 +336,20 @@ class SplitOpTest(test.TestCase):
     for s in splits:
       self.assertEqual(None, s.get_shape().ndims)
 
+  def testNonexistentDimTensor(self):
+    x = array_ops.placeholder(dtypes.int32)
+    values = np.zeros([5, 30])
+    splits = array_ops.placeholder(dtypes.int32)
+    with self.assertRaisesRegexp(ValueError, "Cannot infer"):
+      y = array_ops.split(values, splits, axis=x)
+
+    splits = array_ops.placeholder(dtypes.int32, [3])
+    y = array_ops.split(values, splits, axis=x)
+    with self.test_session(use_gpu=True) as sess:
+      with self.assertRaisesRegexp(errors_impl.InvalidArgumentError,
+                                   "must have exactly one element"):
+        sess.run(y, {x: np.array([], dtype=np.int32), splits: [4, 11, 15]})
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/kernel_tests/stack_ops_test.py b/tensorflow/python/kernel_tests/stack_ops_test.py
index aa409336f5c50178e4d0ca946190119fb0e4188e..afd2eaffab992bca4b3ae7b4f65e0370f325b548 100644
--- a/tensorflow/python/kernel_tests/stack_ops_test.py
+++ b/tensorflow/python/kernel_tests/stack_ops_test.py
@@ -34,11 +34,11 @@ class StackOpTest(test.TestCase):
 
   def _testStackPushPop(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
-      h = gen_data_flow_ops._stack_v2(
+      h = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, [[4.0, 5.0]])
+      c = gen_data_flow_ops.stack_push_v2(h, [[4.0, 5.0]])
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop_v2(h, dtypes.float32)
       self.assertAllClose([[4.0, 5.0]], c1.eval())
 
   def testStackPushPop(self):
@@ -49,11 +49,11 @@ class StackOpTest(test.TestCase):
     with self.test_session(use_gpu=use_gpu):
       a = np.arange(2000)
       x = constant_op.constant(a, dtype=dtypes.float32)
-      h = gen_data_flow_ops._stack_v2(
+      h = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, x, swap_memory=True)
+      c = gen_data_flow_ops.stack_push_v2(h, x, swap_memory=True)
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop_v2(h, dtypes.float32)
       self.assertAllClose(a, c1.eval())
 
   def testStackPushPopSwap(self):
@@ -63,7 +63,7 @@ class StackOpTest(test.TestCase):
   def _testStackWhileSwap(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
       n = constant_op.constant(0)
-      h = gen_data_flow_ops._stack_v2(
+      h = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
 
       def c(x):
@@ -72,7 +72,7 @@ class StackOpTest(test.TestCase):
       def b(x):
         with ops.control_dependencies([x]):
           a = constant_op.constant(np.ones(2000), dtype=dtypes.float32)
-          v = gen_data_flow_ops._stack_push_v2(h, a, swap_memory=True)
+          v = gen_data_flow_ops.stack_push_v2(h, a, swap_memory=True)
         with ops.control_dependencies([v]):
           return math_ops.add(x, 1)
 
@@ -86,7 +86,7 @@ class StackOpTest(test.TestCase):
 
       def b1(x, y):
         nx = math_ops.subtract(x, 1)
-        ny = y + gen_data_flow_ops._stack_pop_v2(h, dtypes.float32)
+        ny = y + gen_data_flow_ops.stack_pop_v2(h, dtypes.float32)
         return [nx, ny]
 
       _, ry = control_flow_ops.while_loop(
@@ -99,16 +99,16 @@ class StackOpTest(test.TestCase):
 
   def _testMultiStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
-      h1 = gen_data_flow_ops._stack_v2(
+      h1 = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_push_v2(h1, 4.0)
+      c1 = gen_data_flow_ops.stack_push_v2(h1, 4.0)
       with ops.control_dependencies([c1]):
-        c1 = gen_data_flow_ops._stack_pop_v2(h1, dtypes.float32)
-      h2 = gen_data_flow_ops._stack_v2(
+        c1 = gen_data_flow_ops.stack_pop_v2(h1, dtypes.float32)
+      h2 = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="bar")
-      c2 = gen_data_flow_ops._stack_push_v2(h2, 5.0)
+      c2 = gen_data_flow_ops.stack_push_v2(h2, 5.0)
       with ops.control_dependencies([c2]):
-        c2 = gen_data_flow_ops._stack_pop_v2(h2, dtypes.float32)
+        c2 = gen_data_flow_ops.stack_pop_v2(h2, dtypes.float32)
       r = c1 + c2
       self.assertAllClose(9.0, r.eval())
 
@@ -119,17 +119,17 @@ class StackOpTest(test.TestCase):
   def _testSameNameStacks(self, use_gpu):
     """Different stacks with the same name do not interfere."""
     with self.test_session(use_gpu=use_gpu) as sess:
-      h1 = gen_data_flow_ops._stack_v2(
+      h1 = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      h2 = gen_data_flow_ops._stack_v2(
+      h2 = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
 
-      c1 = gen_data_flow_ops._stack_push_v2(h1, 4.0)
+      c1 = gen_data_flow_ops.stack_push_v2(h1, 4.0)
       with ops.control_dependencies([c1]):
-        c2 = gen_data_flow_ops._stack_push_v2(h2, 5.0)
+        c2 = gen_data_flow_ops.stack_push_v2(h2, 5.0)
       with ops.control_dependencies([c2]):
-        pop1 = gen_data_flow_ops._stack_pop_v2(h1, dtypes.float32)
-        pop2 = gen_data_flow_ops._stack_pop_v2(h2, dtypes.float32)
+        pop1 = gen_data_flow_ops.stack_pop_v2(h1, dtypes.float32)
+        pop2 = gen_data_flow_ops.stack_pop_v2(h2, dtypes.float32)
 
       out1, out2 = sess.run([pop1, pop2])
       self.assertAllClose(out1, 4.0)
@@ -141,9 +141,9 @@ class StackOpTest(test.TestCase):
 
   def _testCloseStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu) as sess:
-      h = gen_data_flow_ops._stack_v2(
+      h = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_close_v2(h)
+      c1 = gen_data_flow_ops.stack_close_v2(h)
       sess.run(c1)
 
   def testCloseStack(self):
@@ -152,11 +152,11 @@ class StackOpTest(test.TestCase):
 
   def _testPushCloseStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu) as sess:
-      h = gen_data_flow_ops._stack_v2(
+      h = gen_data_flow_ops.stack_v2(
           -1, elem_type=dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push_v2(h, [[4.0, 5.0]])
+      c = gen_data_flow_ops.stack_push_v2(h, [[4.0, 5.0]])
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_close_v2(h)
+        c1 = gen_data_flow_ops.stack_close_v2(h)
       sess.run(c1)
 
   def testPushCloseStack(self):
@@ -170,9 +170,9 @@ class StackOpRefTest(test.TestCase):
   def _testStackPushPop(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
       h = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push(h, [[4.0, 5.0]])
+      c = gen_data_flow_ops.stack_push(h, [[4.0, 5.0]])
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop(h, dtypes.float32)
       self.assertAllClose([[4.0, 5.0]], c1.eval())
 
   def testStackPushPop(self):
@@ -184,9 +184,9 @@ class StackOpRefTest(test.TestCase):
       a = np.arange(2000)
       x = constant_op.constant(a, dtype=dtypes.float32)
       h = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push(h, x, swap_memory=True)
+      c = gen_data_flow_ops.stack_push(h, x, swap_memory=True)
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_pop(h, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop(h, dtypes.float32)
       self.assertAllClose(a, c1.eval())
 
   def testStackPushPopSwap(self):
@@ -196,13 +196,13 @@ class StackOpRefTest(test.TestCase):
   def _testMultiStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
       h1 = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_push(h1, 4.0)
+      c1 = gen_data_flow_ops.stack_push(h1, 4.0)
       with ops.control_dependencies([c1]):
-        c1 = gen_data_flow_ops._stack_pop(h1, dtypes.float32)
+        c1 = gen_data_flow_ops.stack_pop(h1, dtypes.float32)
       h2 = gen_data_flow_ops._stack(dtypes.float32, stack_name="bar")
-      c2 = gen_data_flow_ops._stack_push(h2, 5.0)
+      c2 = gen_data_flow_ops.stack_push(h2, 5.0)
       with ops.control_dependencies([c2]):
-        c2 = gen_data_flow_ops._stack_pop(h2, dtypes.float32)
+        c2 = gen_data_flow_ops.stack_pop(h2, dtypes.float32)
       r = c1 + c2
       self.assertAllClose(9.0, r.eval())
 
@@ -217,7 +217,7 @@ class StackOpRefTest(test.TestCase):
       def b(x):
         with ops.control_dependencies([x]):
           a = constant_op.constant(np.ones(2000), dtype=dtypes.float32)
-          v = gen_data_flow_ops._stack_push(h, a, swap_memory=True)
+          v = gen_data_flow_ops.stack_push(h, a, swap_memory=True)
         with ops.control_dependencies([v]):
           return math_ops.add(x, 1)
 
@@ -231,7 +231,7 @@ class StackOpRefTest(test.TestCase):
 
       def b1(x, y):
         nx = math_ops.subtract(x, 1)
-        ny = y + gen_data_flow_ops._stack_pop(h, dtypes.float32)
+        ny = y + gen_data_flow_ops.stack_pop(h, dtypes.float32)
         return [nx, ny]
 
       _, ry = control_flow_ops.while_loop(
@@ -249,9 +249,9 @@ class StackOpRefTest(test.TestCase):
   def _testSameNameStacks(self, use_gpu):
     with self.test_session(use_gpu=use_gpu):
       h1 = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_push(h1, 4.0)
+      c1 = gen_data_flow_ops.stack_push(h1, 4.0)
       h2 = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c2 = gen_data_flow_ops._stack_push(h2, 5.0)
+      c2 = gen_data_flow_ops.stack_push(h2, 5.0)
       _ = c1 + c2
       self.assertNotEqual(h1.eval()[1], h2.eval()[1])
 
@@ -262,7 +262,7 @@ class StackOpRefTest(test.TestCase):
   def _testCloseStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu) as sess:
       h = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c1 = gen_data_flow_ops._stack_close(h)
+      c1 = gen_data_flow_ops.stack_close(h)
       sess.run(c1)
 
   def testCloseStack(self):
@@ -272,9 +272,9 @@ class StackOpRefTest(test.TestCase):
   def _testPushCloseStack(self, use_gpu):
     with self.test_session(use_gpu=use_gpu) as sess:
       h = gen_data_flow_ops._stack(dtypes.float32, stack_name="foo")
-      c = gen_data_flow_ops._stack_push(h, [[4.0, 5.0]])
+      c = gen_data_flow_ops.stack_push(h, [[4.0, 5.0]])
       with ops.control_dependencies([c]):
-        c1 = gen_data_flow_ops._stack_close(h)
+        c1 = gen_data_flow_ops.stack_close(h)
       sess.run(c1)
 
   def testPushCloseStack(self):
diff --git a/tensorflow/python/kernel_tests/template_test.py b/tensorflow/python/kernel_tests/template_test.py
index a519b69b22cf51ab4f4173b215c21a71d83e9f99..c42ae5a77d5b0ab3b33e060675540f26339d0c87 100644
--- a/tensorflow/python/kernel_tests/template_test.py
+++ b/tensorflow/python/kernel_tests/template_test.py
@@ -356,6 +356,10 @@ class TemplateTest(test.TestCase):
     self.assertEqual("s1_1/nested/dummy:0", v5.name)
     self.assertEqual("s1_1/nested_1/dummy:0", v6.name)
 
+    self.assertEqual(2, len(tmpl1._checkpoint_dependencies))
+    self.assertEqual("nested", tmpl1._checkpoint_dependencies[0].name)
+    self.assertEqual("nested_1", tmpl1._checkpoint_dependencies[1].name)
+
   @test_util.run_in_graph_and_eager_modes()
   def test_nested_templates_with_defun(self):
 
diff --git a/tensorflow/python/kernel_tests/tensor_array_ops_test.py b/tensorflow/python/kernel_tests/tensor_array_ops_test.py
index aad2443eea7ad87faf481973e91ca3df32ccfb44..8f09f3d78bb5d3f65dba9e7037126266243e67bd 100644
--- a/tensorflow/python/kernel_tests/tensor_array_ops_test.py
+++ b/tensorflow/python/kernel_tests/tensor_array_ops_test.py
@@ -437,7 +437,7 @@ class TensorArrayTest(test.TestCase):
 
       # Test reading wrong datatype, which is only possible in graph mode
       if context.in_graph_mode():
-        r0_bad = gen_data_flow_ops._tensor_array_read_v3(
+        r0_bad = gen_data_flow_ops.tensor_array_read_v3(
             handle=w0.handle, index=0, dtype=dtypes.float64, flow_in=w0.flow)
         with self.assertRaisesOpError(
             "TensorArray dtype is float but Op requested dtype double."):
diff --git a/tensorflow/python/kernel_tests/unique_op_test.py b/tensorflow/python/kernel_tests/unique_op_test.py
index 6366d2e181c8cfabba8a78b664c25c85debc67ef..bbc040dc13fc151b970f130eeb76fa1639245416 100644
--- a/tensorflow/python/kernel_tests/unique_op_test.py
+++ b/tensorflow/python/kernel_tests/unique_op_test.py
@@ -66,9 +66,9 @@ class UniqueTest(test.TestCase):
     for dtype in [np.int32, np.int64]:
       x = np.array([[1, 0, 0], [1, 0, 0], [2, 0, 0]])
       with self.test_session() as sess:
-        y0, idx0 = gen_array_ops._unique_v2(x, axis=np.array([0], dtype))
+        y0, idx0 = gen_array_ops.unique_v2(x, axis=np.array([0], dtype))
         tf_y0, tf_idx0 = sess.run([y0, idx0])
-        y1, idx1 = gen_array_ops._unique_v2(x, axis=np.array([1], dtype))
+        y1, idx1 = gen_array_ops.unique_v2(x, axis=np.array([1], dtype))
         tf_y1, tf_idx1 = sess.run([y1, idx1])
       self.assertAllEqual(tf_y0, np.array([[1, 0, 0], [2, 0, 0]]))
       self.assertAllEqual(tf_idx0, np.array([0, 0, 1]))
@@ -80,7 +80,7 @@ class UniqueTest(test.TestCase):
     # by default, the axis will be wrapped to allow `axis=None`.
     x = np.random.randint(2, high=10, size=7000)
     with self.test_session() as sess:
-      y, idx = gen_array_ops._unique_v2(x, axis=np.array([], np.int32))
+      y, idx = gen_array_ops.unique_v2(x, axis=np.array([], np.int32))
       tf_y, tf_idx = sess.run([y, idx])
 
     self.assertEqual(len(x), len(tf_idx))
@@ -133,6 +133,39 @@ class UniqueWithCountsTest(test.TestCase):
       v = [1 if x[i] == value.decode('ascii') else 0 for i in range(7000)]
       self.assertEqual(count, sum(v))
 
+  def testInt32Axis(self):
+    for dtype in [np.int32, np.int64]:
+      x = np.array([[1, 0, 0], [1, 0, 0], [2, 0, 0]])
+      with self.test_session() as sess:
+        y0, idx0, count0 = gen_array_ops.unique_with_counts_v2(
+            x, axis=np.array([0], dtype))
+        tf_y0, tf_idx0, tf_count0 = sess.run([y0, idx0, count0])
+        y1, idx1, count1 = gen_array_ops.unique_with_counts_v2(
+            x, axis=np.array([1], dtype))
+        tf_y1, tf_idx1, tf_count1 = sess.run([y1, idx1, count1])
+      self.assertAllEqual(tf_y0, np.array([[1, 0, 0], [2, 0, 0]]))
+      self.assertAllEqual(tf_idx0, np.array([0, 0, 1]))
+      self.assertAllEqual(tf_count0, np.array([2, 1]))
+      self.assertAllEqual(tf_y1, np.array([[1, 0], [1, 0], [2, 0]]))
+      self.assertAllEqual(tf_idx1, np.array([0, 1, 1]))
+      self.assertAllEqual(tf_count1, np.array([1, 2]))
+
+  def testInt32V2(self):
+    # This test is only temporary. Once V2 is used
+    # by default, the axis will be wrapped to allow `axis=None`.
+    x = np.random.randint(2, high=10, size=7000)
+    with self.test_session() as sess:
+      y, idx, count = gen_array_ops.unique_with_counts_v2(
+          x, axis=np.array([], np.int32))
+      tf_y, tf_idx, tf_count = sess.run([y, idx, count])
+
+    self.assertEqual(len(x), len(tf_idx))
+    self.assertEqual(len(tf_y), len(np.unique(x)))
+    for i in range(len(x)):
+      self.assertEqual(x[i], tf_y[tf_idx[i]])
+    for value, count in zip(tf_y, tf_count):
+      self.assertEqual(count, np.sum(x == value))
+
 
 if __name__ == '__main__':
   test.main()
diff --git a/tensorflow/python/kernel_tests/variable_ops_test.py b/tensorflow/python/kernel_tests/variable_ops_test.py
index 79071029fd42374964d12f513e9c510bdc7400eb..cf369c071813120fef685b7220292d50b966cf11 100644
--- a/tensorflow/python/kernel_tests/variable_ops_test.py
+++ b/tensorflow/python/kernel_tests/variable_ops_test.py
@@ -165,26 +165,26 @@ class VariableOpTest(test.TestCase):
 
   def testTemporaryVariable(self):
     with self.test_session(use_gpu=True):
-      var = gen_state_ops._temporary_variable(
+      var = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="foo")
       var = state_ops.assign(var, [[4.0, 5.0]])
       var = state_ops.assign_add(var, [[6.0, 7.0]])
-      final = gen_state_ops._destroy_temporary_variable(var, var_name="foo")
+      final = gen_state_ops.destroy_temporary_variable(var, var_name="foo")
       self.assertAllClose([[10.0, 12.0]], final.eval())
 
   def testDestroyNonexistentTemporaryVariable(self):
     with self.test_session(use_gpu=True):
-      var = gen_state_ops._temporary_variable([1, 2], dtypes.float32)
-      final = gen_state_ops._destroy_temporary_variable(var, var_name="bad")
+      var = gen_state_ops.temporary_variable([1, 2], dtypes.float32)
+      final = gen_state_ops.destroy_temporary_variable(var, var_name="bad")
       with self.assertRaises(errors.NotFoundError):
         final.eval()
 
   def testDuplicateTemporaryVariable(self):
     with self.test_session(use_gpu=True):
-      var1 = gen_state_ops._temporary_variable(
+      var1 = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="dup")
       var1 = state_ops.assign(var1, [[1.0, 2.0]])
-      var2 = gen_state_ops._temporary_variable(
+      var2 = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="dup")
       var2 = state_ops.assign(var2, [[3.0, 4.0]])
       final = var1 + var2
@@ -193,25 +193,25 @@ class VariableOpTest(test.TestCase):
 
   def testDestroyTemporaryVariableTwice(self):
     with self.test_session(use_gpu=True):
-      var = gen_state_ops._temporary_variable([1, 2], dtypes.float32)
-      val1 = gen_state_ops._destroy_temporary_variable(var, var_name="dup")
-      val2 = gen_state_ops._destroy_temporary_variable(var, var_name="dup")
+      var = gen_state_ops.temporary_variable([1, 2], dtypes.float32)
+      val1 = gen_state_ops.destroy_temporary_variable(var, var_name="dup")
+      val2 = gen_state_ops.destroy_temporary_variable(var, var_name="dup")
       final = val1 + val2
       with self.assertRaises(errors.NotFoundError):
         final.eval()
 
   def testTemporaryVariableNoLeak(self):
     with self.test_session(use_gpu=True):
-      var = gen_state_ops._temporary_variable(
+      var = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="bar")
       final = array_ops.identity(var)
       final.eval()
 
   def testTwoTemporaryVariablesNoLeaks(self):
     with self.test_session(use_gpu=True):
-      var1 = gen_state_ops._temporary_variable(
+      var1 = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="var1")
-      var2 = gen_state_ops._temporary_variable(
+      var2 = gen_state_ops.temporary_variable(
           [1, 2], dtypes.float32, var_name="var2")
       final = var1 + var2
       final.eval()
diff --git a/tensorflow/python/kernel_tests/variables_test.py b/tensorflow/python/kernel_tests/variables_test.py
index b16c8c002c98a0351d1fc55fce061695327a18c9..27599868b74be323189b872c2147c6a33f84d170 100644
--- a/tensorflow/python/kernel_tests/variables_test.py
+++ b/tensorflow/python/kernel_tests/variables_test.py
@@ -687,7 +687,7 @@ class VariableContainerTest(test.TestCase):
         v1 = variables.Variable([1])
         with ops.container("l2"):
           v2 = variables.Variable([2])
-          special_v = gen_state_ops._variable(
+          special_v = gen_state_ops.variable(
               shape=[1],
               dtype=dtypes.float32,
               name="VariableInL3",
diff --git a/tensorflow/python/kernel_tests/xent_op_test.py b/tensorflow/python/kernel_tests/xent_op_test.py
index e152f02d8e983364603053dc5c8d14b5dfaf3605..e3e120a4eb01885ac5ac5e41f82ad3e480a83a77 100644
--- a/tensorflow/python/kernel_tests/xent_op_test.py
+++ b/tensorflow/python/kernel_tests/xent_op_test.py
@@ -48,7 +48,7 @@ class XentTest(test.TestCase):
   def _testXent(self, np_features, np_labels, use_gpu=False):
     np_loss, np_backprop = self._npXent(np_features, np_labels)
     with self.test_session(use_gpu=use_gpu) as sess:
-      loss, backprop = gen_nn_ops._softmax_cross_entropy_with_logits(
+      loss, backprop = gen_nn_ops.softmax_cross_entropy_with_logits(
           np_features, np_labels)
       tf_loss, tf_backprop = sess.run([loss, backprop])
     self.assertAllCloseAccordingToType(np_loss, tf_loss)
@@ -71,7 +71,7 @@ class XentTest(test.TestCase):
   def _testSingleClass(self, use_gpu=False):
     for dtype in np.float16, np.float32:
       with self.test_session(use_gpu=use_gpu) as sess:
-        loss, backprop = gen_nn_ops._softmax_cross_entropy_with_logits(
+        loss, backprop = gen_nn_ops.softmax_cross_entropy_with_logits(
             np.array([[1.], [-1.], [0.]]).astype(dtype),
             np.array([[-1.], [0.], [1.]]).astype(dtype))
         tf_loss, tf_backprop = sess.run([loss, backprop])
@@ -89,7 +89,7 @@ class XentTest(test.TestCase):
       np_labels = np.array([[[0., 0., 0., 1.]], [[0., .5, .5,
                                                   0.]]]).astype(dtype)
       self.assertRaisesRegexp(ValueError, "must be rank 2",
-                              gen_nn_ops._softmax_cross_entropy_with_logits,
+                              gen_nn_ops.softmax_cross_entropy_with_logits,
                               np_features, np_labels)
 
   def testNpXent(self):
@@ -131,14 +131,14 @@ class XentTest(test.TestCase):
   def testShapeMismatch(self):
     with self.test_session():
       with self.assertRaises(ValueError):
-        gen_nn_ops._softmax_cross_entropy_with_logits(
+        gen_nn_ops.softmax_cross_entropy_with_logits(
             [[0., 1.], [2., 3.]], [[0., 1., 0.], [1., 0., 0.]])
 
   def testNotMatrix(self):
     with self.test_session():
       with self.assertRaises(ValueError):
-        gen_nn_ops._softmax_cross_entropy_with_logits([0., 1., 2., 3.],
-                                                      [0., 1., 0., 1.])
+        gen_nn_ops.softmax_cross_entropy_with_logits([0., 1., 2., 3.],
+                                                     [0., 1., 0., 1.])
 
   def testHalf(self):
     self._testAll(
diff --git a/tensorflow/python/layers/base.py b/tensorflow/python/layers/base.py
index 8314c4aa87a5b54effc44c371703267517ffa07d..2ec9971b88465647be25d98cb1e18202f4479349 100644
--- a/tensorflow/python/layers/base.py
+++ b/tensorflow/python/layers/base.py
@@ -36,12 +36,13 @@ from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.ops import variables as tf_variables
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpointable
 from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
 
 
 @tf_export('layers.Layer')
-class Layer(object):
+class Layer(checkpointable.CheckpointableBase):
   """Base layer class.
 
   This is the class from which all layers inherit, implementing common
@@ -532,13 +533,17 @@ class Layer(object):
     with vs.variable_scope(
         self._scope, reuse=reuse, auxiliary_name_scope=False) as scope:
       with ops.name_scope(self._name_scope_name(scope)):
-        variable = vs.get_variable(name,
-                                   shape=shape,
-                                   initializer=initializer,
-                                   dtype=dtypes.as_dtype(dtype),
-                                   constraint=constraint,
-                                   trainable=trainable and self.trainable,
-                                   partitioner=partitioner)
+        variable = self._add_variable_with_custom_getter(
+            name=name,
+            shape=shape,
+            getter=vs.get_variable,
+            # Manage errors in Layer rather than Checkpointable.
+            overwrite=True,
+            initializer=initializer,
+            dtype=dtypes.as_dtype(dtype),
+            constraint=constraint,
+            trainable=trainable and self.trainable,
+            partitioner=partitioner)
 
         if init_graph is not None:  # pylint: disable=protected-access
           # The variable was created and initialized in a graph.
diff --git a/tensorflow/python/layers/layers.py b/tensorflow/python/layers/layers.py
index 1555846efde812b9e31f48315decaf1f86aa4f70..13a8e8e39caaf9c74d1c7d0ea4d6856f725256fd 100644
--- a/tensorflow/python/layers/layers.py
+++ b/tensorflow/python/layers/layers.py
@@ -68,7 +68,6 @@ from tensorflow.python.util.all_util import remove_undocumented
 # Base objects.
 from tensorflow.python.layers.base import Layer
 from tensorflow.python.layers.base import InputSpec
-from tensorflow.python.layers.network import Input
 
 # Core layers.
 from tensorflow.python.layers.core import Dense
diff --git a/tensorflow/python/layers/maxout.py b/tensorflow/python/layers/maxout.py
deleted file mode 100644
index 765a1c4fdafdfdc5d3ea6629d4d9290d8b658902..0000000000000000000000000000000000000000
--- a/tensorflow/python/layers/maxout.py
+++ /dev/null
@@ -1,111 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#   http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# =============================================================================
-
-# pylint: disable=unused-import,g-bad-import-order
-"""Contains the maxout layer
-"""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.python.eager import context
-from tensorflow.python.framework import ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import gen_array_ops
-
-from tensorflow.python.layers import base
-
-
-def maxout(inputs, num_units, axis=-1, name=None):
-  """Adds a maxout op from https://arxiv.org/abs/1302.4389
-
-  "Maxout Networks" Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron
-  Courville,
-   Yoshua Bengio
-
-   Usually the operation is performed in the filter/channel dimension. This can
-   also be
-   used after fully-connected layers to reduce number of features.
-
-   Arguments:
-   inputs: Tensor input
-   num_units: Specifies how many features will remain after maxout in the `axis`
-     dimension
-         (usually channel). This must be multiple of number of `axis`.
-   axis: The dimension where max pooling will be performed. Default is the
-   last dimension.
-   name: Optional scope for name_scope.
-
-   Returns:
-    A `Tensor` representing the results of the pooling operation.
-
-   Raises:
-    ValueError: if num_units is not multiple of number of features.
-  """
-  return MaxOut(num_units=num_units, axis=axis, name=name)(inputs)
-
-
-class MaxOut(base.Layer):
-  """Adds a maxout op from https://arxiv.org/abs/1302.4389
-
-  "Maxout Networks" Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron
-  Courville, Yoshua
-  Bengio
-
-  Usually the operation is performed in the filter/channel dimension. This can
-  also be
-  used after fully-connected layers to reduce number of features.
-
-  Arguments:
-    inputs: Tensor input
-    num_units: Specifies how many features will remain after maxout in the
-      `axis` dimension
-         (usually channel).
-    This must be multiple of number of `axis`.
-    axis: The dimension where max pooling will be performed. Default is the
-    last dimension.
-    name: Optional scope for name_scope.
-
-  Returns:
-    A `Tensor` representing the results of the pooling operation.
-
-  Raises:
-    ValueError: if num_units is not multiple of number of features.
-  """
-
-  def __init__(self, num_units, axis=-1, name=None, **kwargs):
-    super(MaxOut, self).__init__(name=name, trainable=False, **kwargs)
-    self.axis = axis
-    self.num_units = num_units
-
-  def call(self, inputs):
-    inputs = ops.convert_to_tensor(inputs)
-    shape = inputs.get_shape().as_list()
-    num_channels = shape[self.axis]
-    if num_channels % self.num_units:
-      raise ValueError('number of features({}) is not '
-                       'a multiple of num_units({})'.format(
-                           num_channels, self.num_units))
-    shape[self.axis] = -1
-    shape += [num_channels // self.num_units]
-
-    # Dealing with batches with arbitrary sizes
-    for i in range(len(shape)):
-      if shape[i] is None:
-        shape[i] = gen_array_ops.shape(inputs)[i]
-    outputs = math_ops.reduce_max(
-        gen_array_ops.reshape(inputs, shape), -1, keepdims=False)
-
-    return outputs
diff --git a/tensorflow/python/layers/maxout_test.py b/tensorflow/python/layers/maxout_test.py
deleted file mode 100644
index 26acac57c41da759f288f255c0cd523f9c6b1dbd..0000000000000000000000000000000000000000
--- a/tensorflow/python/layers/maxout_test.py
+++ /dev/null
@@ -1,61 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#   http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# =============================================================================
-
-# pylint: disable=unused-import,g-bad-import-order
-
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from tensorflow.python.layers import maxout
-from tensorflow.python.layers import convolutional as conv_layers
-from tensorflow.python.layers import core as core_layers
-
-from tensorflow.python.ops import random_ops
-from tensorflow.python.platform import test
-import numpy as np
-
-"""
-Contains the maxout layer tests
-"""
-
-
-class MaxOutTest(test.TestCase):
-  def test_simple(self):
-    inputs = random_ops.random_uniform((64, 10, 36), seed=1)
-    graph = maxout.maxout(inputs, num_units=3)
-    self.assertEqual(graph.get_shape().as_list(), [64, 10, 3])
-
-  def test_fully_connected(self):
-    inputs = random_ops.random_uniform((64, 50), seed=1)
-    graph = core_layers.dense(inputs, 50)
-    graph = maxout.maxout(graph, num_units=10)
-    self.assertEqual(graph.get_shape().as_list(), [64, 10])
-
-  def test_nchw(self):
-    inputs = random_ops.random_uniform((10, 100, 100, 3), seed=1)
-    graph = conv_layers.conv2d(inputs, 10, 3, padding="SAME")
-    graph = maxout.maxout(graph, num_units=1)
-    self.assertEqual(graph.get_shape().as_list(), [10, 100, 100, 1])
-
-  def test_invalid_shape(self):
-    inputs = random_ops.random_uniform((10, 100, 100, 3), seed=1)
-    graph = conv_layers.conv2d(inputs, 3, 10, strides=(1, 1))
-    with self.assertRaisesRegexp(ValueError, 'number of features'):
-      graph = maxout.maxout(graph, num_units=2)
-
-if __name__ == '__main__':
-  test.main()
diff --git a/tensorflow/python/layers/network.py b/tensorflow/python/layers/network.py
deleted file mode 100644
index 9f16559687c52a1149b78a1ccca796cadd8208d0..0000000000000000000000000000000000000000
--- a/tensorflow/python/layers/network.py
+++ /dev/null
@@ -1,1024 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# =============================================================================
-"""Contains Network, a composition of layers."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-
-from tensorflow.python.eager import context
-from tensorflow.python.estimator import util as estimator_util
-from tensorflow.python.framework import dtypes
-from tensorflow.python.framework import ops
-from tensorflow.python.framework import tensor_shape
-from tensorflow.python.layers import base
-from tensorflow.python.layers import utils as layers_util
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import variable_scope as vs
-from tensorflow.python.platform import tf_logging as logging
-from tensorflow.python.util import nest
-from tensorflow.python.util.tf_export import tf_export
-
-
-class InputLayer(base.Layer):
-  """Layer to be used as an entry point into a Network (a graph of layers).
-
-  It can either wrap an existing tensor (pass an `input_tensor` argument)
-  or create its a placeholder tensor (pass arguments `input_shape`
-  as well as `dtype`).
-
-  It is generally recommend to use the functional layer API via `Input`,
-  (which creates an `InputLayer`) without directly using `InputLayer`.
-
-  Arguments:
-      input_shape: Shape tuple (not including the batch axis), or `TensorShape`
-        instance (not including the batch axis).
-      batch_size: Optional input batch size (integer or None).
-      dtype: Datatype of the input.
-      input_tensor: Optional tensor to use as layer input
-          instead of creating a placeholder.
-      sparse: Boolean, whether the placeholder created
-          is meant to be sparse.
-      name: Name of the layer (string).
-
-    Raises:
-      RuntimeError: If created in Eager mode.
-  """
-
-  def __init__(self,
-               input_shape=None,
-               batch_size=None,
-               dtype=dtypes.float32,
-               input_tensor=None,
-               sparse=False,
-               name=None):
-    super(InputLayer, self).__init__(dtype=dtype, name=name)
-    self.built = True
-    self.sparse = sparse
-    self.batch_size = batch_size
-
-    if isinstance(input_shape, tensor_shape.TensorShape):
-      input_shape = tuple(input_shape.as_list())
-
-    if input_tensor is None:
-      if input_shape is not None:
-        batch_input_shape = (batch_size,) + tuple(input_shape)
-      else:
-        batch_input_shape = None
-
-      if context.in_eager_mode():
-        # In eager mode, create a temporary placeholder to call the layer on.
-        input_tensor = base._DeferredTensor(  # pylint: disable=protected-access
-            shape=batch_input_shape,
-            dtype=dtype,
-            name=self.name)
-      else:
-        # In graph mode, create a graph placeholder to call the layer on.
-        if sparse:
-          input_tensor = array_ops.sparse_placeholder(
-              shape=batch_input_shape,
-              dtype=dtype,
-              name=self.name)
-        else:
-          input_tensor = array_ops.placeholder(
-              shape=batch_input_shape,
-              dtype=dtype,
-              name=self.name)
-
-      # For compatibility with Keras API.
-      self.is_placeholder = True
-      self._batch_input_shape = batch_input_shape
-    else:
-      # For compatibility with Keras API.
-      self.is_placeholder = False
-      self._batch_input_shape = tuple(input_tensor.get_shape().as_list())
-
-    # Create an input node to add to self.outbound_node
-    # and set output_tensors' _keras_history.
-    input_tensor._keras_history = (self, 0, 0)  # pylint: disable=protected-access
-    base.Node(
-        self,
-        inbound_layers=[],
-        node_indices=[],
-        tensor_indices=[],
-        input_tensors=[input_tensor],
-        output_tensors=[input_tensor])
-
-
-@tf_export('layers.Input')
-def Input(  # pylint: disable=invalid-name
-    shape=None,
-    batch_size=None,
-    name=None,
-    dtype=dtypes.float32,
-    sparse=False,
-    tensor=None):
-  """`Input()` is used to instantiate an input tensor for use with a `Network`.
-
-  For instance, if a, b and c are tensors created via `Input`,
-  it becomes possible to do:
-
-  `network = Network(inputs=[a, b], outputs=c)`
-
-  Example:
-
-      ```python
-      # This is a logistic regression
-      x = tf.layers.Input(shape=(32,))
-      y = tf.layers.Dense(16, activation='softmax')(x)
-      network = tf.layers.Network(x, y)
-      ```
-
-  Arguments:
-      shape: A shape tuple (integer), not including the batch size.
-          For instance, `shape=(32,)` indicates that the expected input
-          will be batches of 32-dimensional vectors.
-      batch_size: Optional input batch size (integer or None).
-      name: An optional name string for the layer.
-          Should be unique in a model (do not reuse the same name twice).
-          It will be autogenerated if it isn't provided.
-      dtype: The data type expected by the input, as a string
-          (`float32`, `float64`, `int32`...)
-      sparse: A boolean specifying whether the placeholder
-          to be created is sparse.
-      tensor: Optional existing tensor to wrap into the `Input` layer.
-          If set, the layer will not create a placeholder tensor.
-
-  Returns:
-      A tensor: either a new placeholder (with history metadata) or
-      `tensor` (if passed), with added history metadata.
-
-  Raises:
-    RuntimeError: If called in Eager mode.
-  """
-  input_layer = InputLayer(
-      input_shape=shape,
-      batch_size=batch_size,
-      name=name,
-      dtype=dtype,
-      sparse=sparse,
-      input_tensor=tensor)
-  # Return tensor including `_keras_history` metadata.
-  # Note that in this case train_output and test_output are the same pointer.
-  outputs = input_layer._inbound_nodes[0].output_tensors  # pylint: disable=protected-access
-  if len(outputs) == 1:
-    return outputs[0]
-  else:
-    return outputs
-
-
-class GraphNetwork(base.Layer):
-  """A GraphNetwork is a directed acyclic graph of layers.
-
-  It is the topological form of a `tf.keras.models.Model`. A `Model` is simply a
-  `GraphNetwork` with added training/evaluation routines.
-
-  A `GraphNetwork` instance implements the full `Layer` API. In particular, a
-  `GraphNetwork` can be called on new inputs.
-
-  Example:
-
-      ```python
-      # This is a logistic regression
-      x = tf.layers.Input(shape=(32,))
-      y = tf.layers.Dense(16, activation='softmax')(x)
-      network = tf.layers.GraphNetwork(x, y)
-
-      # It is then possible to call the network on compatible inputs:
-      z = tf.layers.Input(shape=(32,))
-      w = network(z)
-
-      # It is possible to retrieve the same properties as a layer:
-      weights = network.trainable_weights
-      ```
-
-  Arguments:
-      inputs: Input tensor or list of input tensors.
-        Must come from `tf.layers.Input`.
-      output: Output tensor or list of output tensors. Must come from
-        tf.layers Layers or Keras layers.
-      name: Optional name of the model (string).
-
-  Attributes:
-    GraphNetwork has the same attributes as Layer. On top of it, it also has:
-      - layers: a list of the children layers of the network,
-        a list of layer instances, ordered from "earlier in the graph"
-        to "later in the graph".
-
-  Methods:
-    GraphNetwork has the same methods as Layer. On top of it, it also has:
-      - get_layer: retrieves a child layer by name or index in the graph.
-
-  Raises:
-    TypeError: If created when eager execution is enabled, with inputs that
-      don't come from a call to `Input` or outputs that don't come from layers.
-  """
-
-  def __init__(self, inputs, outputs, name=None):  # pylint: disable=super-init-not-called
-    if isinstance(inputs, (list, tuple)):
-      self.inputs = list(inputs)  # Tensor or list of tensors.
-    else:
-      self.inputs = [inputs]
-    if isinstance(outputs, (list, tuple)):
-      self.outputs = list(outputs)
-    else:
-      self.outputs = [outputs]
-
-    if context.in_eager_mode():
-      # Check that all inputs/outputs are DeferredTensors.
-      for tensor in self.inputs:
-        if not isinstance(tensor, base._DeferredTensor):  # pylint: disable=protected-access
-          raise TypeError('When eager execution is enabled, '
-                          'inputs must come from a call to '
-                          '`tf.keras.Input` (called after '
-                          'tfe.enable_eager_execution()). '
-                          'Received invalid input: ' + str(tensor))
-      for tensor in self.outputs:
-        if not isinstance(tensor, base._DeferredTensor):  # pylint: disable=protected-access
-          raise TypeError('When eager execution is enabled, '
-                          'outputs must come from a call to '
-                          'a layer (called after '
-                          'tfe.enable_eager_execution()). '
-                          'Received invalid output: ' + str(tensor))
-
-    self._init_set_name(name)
-    self._activity_regularizer = None
-    with vs.variable_scope(
-        None, default_name=self._base_name) as captured_scope:
-      self._scope = captured_scope
-    call_fn_args = estimator_util.fn_args(self.call)
-    self._compute_previous_mask = ('mask' in call_fn_args or
-                                   hasattr(self, 'compute_mask'))
-    self._call_has_scope_arg = 'scope' in call_fn_args
-
-    # This acts just like the `trainable` attribute of any layer instance.
-    # It does not affect users of the underlying layers, only users of the
-    # GraphNetwork instance.
-    self.trainable = True
-    # A GraphNetwork does not create weights of its own, thus it is already
-    # built.
-    self.built = True
-    # A GraphNetwork does not create weights of its own, thus has no dtype.
-    self._dtype = None
-    self._is_graph_network = True
-    # The following are implemented as property functions:
-    # self.trainable_weights
-    # self.non_trainable_weights
-    # self.input_spec
-
-    # Private attributes to implement compatibility with Layer.
-    self._updates = []
-    self._losses = []
-    self._scope = None
-    self._reuse = None
-    self._graph = ops.get_default_graph()
-
-    # All layers in order of horizontal graph traversal.
-    # Entries are unique. Includes input and output layers.
-    self._layers = []
-
-    # Check for redundancy in inputs.
-    if len(set(self.inputs)) != len(self.inputs):
-      raise ValueError('The list of inputs passed to the model '
-                       'is redundant. '
-                       'All inputs should only appear once.'
-                       ' Found: ' + str(self.inputs))
-
-    # # List of initial layers (1 to 1 mapping with self.inputs,
-    # # hence the same layer might appear twice)
-    # self._input_layers = []
-    # self._input_layers_node_indices = []
-    # self._input_layers_tensor_indices = []
-    # # list of layers (1 to 1 mapping with self.inputs,
-    # # hence the same layer might appear twice)
-    # self._output_layers = []
-    # self._output_layers_node_indices = []
-    # self._output_layers_tensor_indices = []
-
-    self._input_layers = []
-    self._output_layers = []
-    self._input_coordinates = []
-    self._output_coordinates = []
-
-    # This is for performance optimization when calling the GraphNetwork on new
-    # inputs. Every time the GraphNetwork is called on a set on input tensors,
-    # we compute the output tensors, output masks and output shapes in one pass,
-    # then cache them here. When any of these outputs is queried later, we
-    # retrieve it from there instead of recomputing it.
-    self._output_mask_cache = {}
-    self._output_tensor_cache = {}
-    self._output_shape_cache = {}
-
-    # User-provided arguments validation.
-    for x in self.inputs:
-      # Check that x has appropriate `_keras_history` metadata.
-      if not hasattr(x, '_keras_history'):
-        cls_name = self.__class__.__name__
-        raise ValueError('Input tensors to a ' + cls_name + ' ' +
-                         'must come from `tf.layers.Input`. '
-                         'Received: ' + str(x) +
-                         ' (missing previous layer metadata).')
-      # Check that x is an input tensor.
-      # pylint: disable=protected-access
-      layer, node_index, tensor_index = x._keras_history
-      if len(layer._inbound_nodes) > 1 or (
-          layer._inbound_nodes and layer._inbound_nodes[0].inbound_layers):
-        cls_name = self.__class__.__name__
-        logging.warning(cls_name + ' inputs must come from '
-                        '`tf.layers.Input` (thus holding past layer metadata), '
-                        'they cannot be the output of '
-                        'a previous non-Input layer. '
-                        'Here, a tensor specified as '
-                        'input to "' + self.name + '" was not an Input tensor, '
-                        'it was generated by layer ' + layer.name + '.\n'
-                        'Note that input tensors are '
-                        'instantiated via `tensor = tf.layers.Input(shape)`.\n'
-                        'The tensor that caused the issue was: ' + str(x.name))
-      # pylint: enable=protected-access
-    for x in self.outputs:
-      if not hasattr(x, '_keras_history'):
-        cls_name = self.__class__.__name__
-        raise ValueError('Output tensors to a ' + cls_name + ' must be '
-                         'the output of a TensorFlow `Layer` '
-                         '(thus holding past layer metadata). Found: ' + str(x))
-
-    # Build self._output_layers:
-    for x in self.outputs:
-      layer, node_index, tensor_index = x._keras_history  # pylint: disable=protected-access
-      self._output_layers.append(layer)
-      self._output_coordinates.append((layer, node_index, tensor_index))
-
-    # Build self._input_layers:
-    for x in self.inputs:
-      layer, node_index, tensor_index = x._keras_history  # pylint: disable=protected-access
-      # It's supposed to be an input layer, so only one node
-      # and one tensor output.
-      assert node_index == 0
-      assert tensor_index == 0
-      self._input_layers.append(layer)
-      self._input_coordinates.append((layer, node_index, tensor_index))
-
-    # Network_nodes: set of nodes included in the graph
-    # (not all nodes included in the layers
-    # are relevant to the current graph).
-    network_nodes = set()  # ids of all nodes relevant to the GraphNetwork
-    nodes_depths = {}  # dict {node: depth value}
-    layers_depths = {}  # dict {layer: depth value}
-    layer_indices = {}  # dict {layer: index in traversal}
-    nodes_in_decreasing_depth = []
-
-    def build_map_of_graph(tensor,
-                           finished_nodes,
-                           nodes_in_progress,
-                           layer,
-                           node_index,
-                           tensor_index):
-      """Builds a map of the graph of layers.
-
-      This recursively updates the map `layer_indices`,
-      the list `nodes_in_decreasing_depth` and the set `network_nodes`.
-
-      Arguments:
-          tensor: Some tensor in a graph.
-          finished_nodes: Set of nodes whose subgraphs have been traversed
-              completely. Useful to prevent duplicated work.
-          nodes_in_progress: Set of nodes that are currently active on the
-              recursion stack. Useful to detect cycles.
-          layer: Layer from which `tensor` comes from. If not provided,
-              will be obtained from `tensor._keras_history`.
-          node_index: Node index from which `tensor` comes from.
-          tensor_index: Tensor_index from which `tensor` comes from.
-
-      Raises:
-          ValueError: if a cycle is detected.
-      """
-      node = layer._inbound_nodes[node_index]  # pylint: disable=protected-access
-
-      # Prevent cycles.
-      if node in nodes_in_progress:
-        raise ValueError('The tensor ' + str(tensor) + ' at layer "' +
-                         layer.name + '" is part of a cycle.')
-
-      # Don't repeat work for shared subgraphs
-      if node in finished_nodes:
-        return
-
-      node_key = _make_node_key(layer.name, node_index)
-      # Update network_nodes.
-      network_nodes.add(node_key)
-
-      # Store the traversal order for layer sorting.
-      if layer not in layer_indices:
-        layer_indices[layer] = len(layer_indices)
-
-      nodes_in_progress.add(node)
-
-      # Propagate to all previous tensors connected to this node.
-      for i in range(len(node.inbound_layers)):
-        x = node.input_tensors[i]
-        layer = node.inbound_layers[i]
-        node_index = node.node_indices[i]
-        tensor_index = node.tensor_indices[i]
-        build_map_of_graph(x, finished_nodes, nodes_in_progress, layer,
-                           node_index, tensor_index)
-
-      finished_nodes.add(node)
-      nodes_in_progress.remove(node)
-      nodes_in_decreasing_depth.append(node)
-
-    finished_nodes = set()
-    nodes_in_progress = set()
-    for x in self.outputs:
-      layer, node_index, tensor_index = x._keras_history  # pylint: disable=protected-access
-      build_map_of_graph(x, finished_nodes, nodes_in_progress,
-                         layer=layer,
-                         node_index=node_index,
-                         tensor_index=tensor_index)
-
-    for node in reversed(nodes_in_decreasing_depth):
-      # If the depth is not set, the node has no outbound nodes (depth 0).
-      depth = nodes_depths.setdefault(node, 0)
-
-      # Update the depth of the corresponding layer
-      previous_depth = layers_depths.get(node.outbound_layer, 0)
-      # If we've seen this layer before at a higher depth,
-      # we should use that depth instead of the node depth.
-      # This is necessary for shared layers that have inputs at different
-      # depth levels in the graph.
-      depth = max(depth, previous_depth)
-      layers_depths[node.outbound_layer] = depth
-      nodes_depths[node] = depth
-
-      # Update the depth of inbound nodes.
-      # The "depth" of a node is the max of the depths
-      # of all layers it is connected to.
-      for i in range(len(node.inbound_layers)):
-        inbound_layer = node.inbound_layers[i]
-        node_index = node.node_indices[i]
-        inbound_node = inbound_layer._inbound_nodes[node_index]  # pylint: disable=protected-access
-        previous_depth = nodes_depths.get(inbound_node, 0)
-        nodes_depths[inbound_node] = max(depth + 1, previous_depth)
-
-    # Build a dict {depth: list of nodes with this depth}
-    nodes_by_depth = {}
-    for node, depth in nodes_depths.items():
-      if depth not in nodes_by_depth:
-        nodes_by_depth[depth] = []
-      nodes_by_depth[depth].append(node)
-
-    # Build a dict {depth: list of layers with this depth}
-    layers_by_depth = {}
-    for layer, depth in layers_depths.items():
-      if depth not in layers_by_depth:
-        layers_by_depth[depth] = []
-      layers_by_depth[depth].append(layer)
-
-    # Get sorted list of layer depths.
-    depth_keys = list(layers_by_depth.keys())
-    depth_keys.sort(reverse=True)
-
-    # Set self.layers and self._layers_by_depth.
-    layers = []
-    for depth in depth_keys:
-      layers_for_depth = layers_by_depth[depth]
-      # GraphNetwork.layers needs to have a deterministic order:
-      # here we order them by traversal order.
-      layers_for_depth.sort(key=lambda x: layer_indices[x])
-      layers.extend(layers_for_depth)
-    self._layers = layers
-    self._layers_by_depth = layers_by_depth
-
-    # Get sorted list of node depths.
-    depth_keys = list(nodes_by_depth.keys())
-    depth_keys.sort(reverse=True)
-
-    # Check that all tensors required are computable.
-    # computable_tensors: all tensors in the graph
-    # that can be computed from the inputs provided.
-    computable_tensors = []
-    for x in self.inputs:
-      computable_tensors.append(x)
-
-    layers_with_complete_input = []  # To provide a better error msg.
-    for depth in depth_keys:
-      for node in nodes_by_depth[depth]:
-        layer = node.outbound_layer
-        if layer:
-          for x in node.input_tensors:
-            if x not in computable_tensors:
-              raise ValueError('Graph disconnected: '
-                               'cannot obtain value for tensor ' + str(x) +
-                               ' at layer "' + layer.name + '". '
-                               'The following previous layers '
-                               'were accessed without issue: ' +
-                               str(layers_with_complete_input))
-          for x in node.output_tensors:
-            computable_tensors.append(x)
-          layers_with_complete_input.append(layer.name)
-
-    # Keep track of the network's nodes.
-    self._network_nodes = network_nodes
-    self._nodes_by_depth = nodes_by_depth
-
-    # Ensure name unicity, which will be crucial for serialization
-    # (since serialized nodes refer to layers by their name).
-    all_names = [layer.name for layer in self.layers]
-    for name in all_names:
-      if all_names.count(name) != 1:
-        raise ValueError('The name "' + name + '" is used ' +
-                         str(all_names.count(name)) + ' times in the model. '
-                         'All layer names should be unique.')
-
-    # Layer parameters.
-    # The new network starts with a single inbound node
-    # for its inputs, and no outbound nodes.
-    self._outbound_nodes = []  # Will be appended to by future calls to __call__
-    self._inbound_nodes = [
-    ]  # Will be appended to below, and by future calls to __call__
-    # Create the node linking internal inputs to internal outputs.
-    base.Node(
-        outbound_layer=self,
-        inbound_layers=[],
-        node_indices=[],
-        tensor_indices=[],
-        input_tensors=self.inputs,
-        output_tensors=self.outputs)
-
-  @property
-  def layers(self):
-    return self._layers
-
-  def get_layer(self, name=None, index=None):
-    """Retrieves a layer based on either its name (unique) or index.
-
-    Indices are based on order of horizontal graph traversal (bottom-up).
-
-    Arguments:
-        name: String, name of layer.
-        index: Integer, index of layer.
-
-    Returns:
-        A layer instance.
-
-    Raises:
-        ValueError: In case of invalid layer name or index.
-    """
-    # TODO(fchollet): We could build a dictionary based on layer names
-    # since they are constant, but we have not done that yet.
-    if index is not None:
-      if len(self.layers) <= index:
-        raise ValueError('Was asked to retrieve layer at index ' + str(index) +
-                         ' but model only has ' + str(len(self.layers)) +
-                         ' layers.')
-      else:
-        return self.layers[index]
-    else:
-      if not name:
-        raise ValueError('Provide either a layer name or layer index.')
-    for layer in self.layers:
-      if layer.name == name:
-        return layer
-    raise ValueError('No such layer: ' + name)
-
-  @property
-  def stateful(self):
-    return any([(hasattr(layer, 'stateful') and layer.stateful)
-                for layer in self.layers])
-
-  @property
-  def updates(self):
-    """Retrieve the network's updates.
-
-    Will only include updates that are either
-    unconditional, or conditional on inputs to this model
-    (e.g. will not include updates that were created by layers of this model
-    outside of the model).
-
-    Effectively, `network.updates` behaves like `layer.updates`.
-
-    Concrete example:
-
-    ```python
-      bn = keras.layers.BatchNormalization()
-      x1 = keras.layers.Input(shape=(10,))
-      _ = bn(x1)  # This creates 2 updates.
-
-      x2 = keras.layers.Input(shape=(10,))
-      y2 = bn(x2)  # This creates 2 more updates.
-
-      # The BN layer has now 4 updates.
-      self.assertEqual(len(bn.updates), 4)
-
-      # Let's create a model from x2 to y2.
-      model = keras.models.Model(x2, y2)
-
-      # The model does not list all updates from its underlying layers,
-      # but only the updates that are relevant to it. Updates created by layers
-      # outside of the model are discarded.
-      self.assertEqual(len(model.updates), 2)
-
-      # If you keep calling the model, you append to its updates, just like
-      # what happens for a layer.
-      x3 = keras.layers.Input(shape=(10,))
-      y3 = model(x3)
-      self.assertEqual(len(model.updates), 4)
-
-      # But if you call the inner BN layer independently, you don't affect
-      # the model's updates.
-      x4 = keras.layers.Input(shape=(10,))
-      _ = bn(x4)
-      self.assertEqual(len(model.updates), 4)
-    ```
-
-    Returns:
-        A list of update ops.
-    """
-    if context.in_eager_mode():
-      return []
-
-    if not self.trainable and not self.stateful:
-      return []
-
-    updates = []
-    for layer in self.layers:
-      updates += layer.updates
-
-    # `updates` might contain irrelevant updates, so it needs to be filtered
-    # with respect to inputs the model has been called on.
-    relevant_inputs = self.inputs or []
-    for i in range(1, len(self._inbound_nodes)):
-      inputs = self.get_input_at(i)
-      if isinstance(inputs, list):
-        relevant_inputs += inputs
-      else:
-        relevant_inputs.append(inputs)
-    reachable = layers_util.get_reachable_from_inputs(relevant_inputs, updates)
-    relevant_conditional_updates = [x for x in updates if x in reachable]
-    unconditional_updates = [
-        x for x in updates if x._unconditional_update]  # pylint: disable=protected-access
-    # A layer could be used multiple times in a nested structure,
-    # so the updates list must be de-duped.
-    return list(set(
-        relevant_conditional_updates + unconditional_updates + self._updates))
-
-  @property
-  def losses(self):
-    """Retrieve the network's losses.
-
-    Will only include losses that are either
-    unconditional, or conditional on inputs to this model
-    (e.g. will not include losses that depend on tensors
-    that aren't inputs to this model).
-
-    Returns:
-        A list of loss tensors.
-    """
-    losses = []
-    for layer in self.layers:
-      losses += layer.losses
-    if context.in_eager_mode():
-      return losses
-
-    relevant_inputs = self.inputs or []
-    for i in range(1, len(self._inbound_nodes)):
-      inputs = self.get_input_at(i)
-      if isinstance(inputs, list):
-        relevant_inputs += inputs
-      else:
-        relevant_inputs.append(inputs)
-    reachable = layers_util.get_reachable_from_inputs(relevant_inputs, losses)
-    relevant_conditional_losses = [x for x in losses if x in reachable]
-    unconditional_losses = [
-        x for x in losses if x._unconditional_loss]  # pylint: disable=protected-access
-    return list(set(
-        relevant_conditional_losses + unconditional_losses + self._losses))
-
-  @property
-  def trainable_weights(self):
-    if not self.trainable:
-      return []
-    weights = []
-    for layer in self.layers:
-      weights += layer.trainable_weights
-    return weights
-
-  @property
-  def non_trainable_weights(self):
-    weights = []
-    for layer in self.layers:
-      weights += layer.non_trainable_weights
-    if not self.trainable:
-      trainable_weights = []
-      for layer in self.layers:
-        trainable_weights += layer.trainable_weights
-      return trainable_weights + weights
-    return weights
-
-  @property
-  def input_spec(self):
-    """Gets the network's input specs.
-
-    Returns:
-        A list of `InputSpec` instances (one per input to the model)
-            or a single instance if the model has only one input.
-    """
-    # If not a graph network, can't assume anything.
-    if not self._is_graph_network:
-      return None
-
-    specs = []
-    for layer in self._input_layers:
-      if layer.input_spec is None:
-        specs.append(None)
-      else:
-        if not isinstance(layer.input_spec, list):
-          raise TypeError('Layer ' + layer.name +
-                          ' has an input_spec attribute that '
-                          'is not a list. We expect a list. '
-                          'Found input_spec = ' + str(layer.input_spec))
-        specs += layer.input_spec
-    if len(specs) == 1:
-      return specs[0]
-    return specs
-
-  def call(self, inputs, mask=None):
-    """Call the model on new inputs.
-
-    In this case `call` just reapplies
-    all ops in the graph to the new inputs
-    (e.g. build a new computational graph from the provided inputs).
-
-    Arguments:
-        inputs: A tensor or list of tensors.
-        mask: A mask or list of masks. A mask can be
-            either a tensor or None (no mask).
-
-    Returns:
-        A tensor if there is a single output, or
-        a list of tensors if there is more than one output.
-    """
-    inputs = nest.flatten(inputs)
-    if mask is None:
-      masks = [None for _ in range(len(inputs))]
-    else:
-      masks = nest.flatten(mask)
-
-    if context.in_graph_mode():
-      # Try to retrieve cached outputs if the layer has already been called
-      # on these exact inputs.
-      cache_key = (layers_util.object_list_uid(inputs)
-                   + '_' + layers_util.object_list_uid(masks))
-      if cache_key in self._output_tensor_cache:
-        # Cache hit.
-        return self._output_tensor_cache[cache_key]
-    # Actually apply the network graph to the new inputs.
-    outputs, _ = self._run_internal_graph(inputs, masks)
-    return outputs
-
-  def compute_output_shape(self, input_shape):
-    if not self._is_graph_network:
-      raise NotImplementedError
-
-    if isinstance(input_shape, list):
-      input_shapes = []
-      for shape in input_shape:
-        if shape is not None:
-          input_shapes.append(tuple(tensor_shape.TensorShape(shape).as_list()))
-        else:
-          input_shapes.append(None)
-    else:
-      if input_shape is not None:
-        input_shapes = [tuple(tensor_shape.TensorShape(input_shape).as_list())]
-      else:
-        input_shapes = [None]
-
-    if len(input_shapes) != len(self._input_layers):
-      raise ValueError('Invalid input_shape argument ' + str(input_shape) +
-                       ': model has ' + str(len(self._input_layers)) +
-                       ' tensor inputs.')
-
-    cache_key = layers_util.object_list_uid(input_shapes)
-    if cache_key not in self._output_shape_cache:
-      # Cache miss. We have to run the network graph manually (recursive calls
-      # to `compute_output_shape`).
-      layers_to_output_shapes = {}
-      for i in range(len(input_shapes)):
-        layer = self._input_layers[i]
-        input_shape = input_shapes[i]
-        # It's an input layer: then `compute_output_shape` is identity,
-        # and there is only one node and one tensor output.
-        shape_key = layer.name + '_0_0'
-        layers_to_output_shapes[shape_key] = input_shape
-
-      depth_keys = list(self._nodes_by_depth.keys())
-      depth_keys.sort(reverse=True)
-      # Iterate over nodes, by depth level.
-      if len(depth_keys) > 1:
-        for depth in depth_keys:
-          nodes = self._nodes_by_depth[depth]
-          for node in nodes:
-            # This is always a single layer, never a list.
-            layer = node.outbound_layer
-            if layer in self._input_layers:
-              # We've already covered the input layers
-              # a few lines above.
-              continue
-            # Potentially redundant list,
-            # same size as node.input_tensors.
-            input_shapes = []
-            for j in range(len(node.inbound_layers)):
-              inbound_layer = node.inbound_layers[j]
-              node_index = node.node_indices[j]
-              tensor_index = node.tensor_indices[j]
-              shape_key = inbound_layer.name + '_%s_%s' % (node_index,
-                                                           tensor_index)
-              input_shape = layers_to_output_shapes[shape_key]
-              input_shapes.append(input_shape)
-
-            if len(input_shapes) == 1:
-              output_shape = layer.compute_output_shape(input_shapes[0])
-            else:
-              output_shape = layer.compute_output_shape(input_shapes)
-            if isinstance(output_shape, list):
-              output_shapes = [
-                  tuple(tensor_shape.TensorShape(shape).as_list())
-                  for shape in output_shape
-              ]
-            else:
-              output_shapes = [
-                  tuple(tensor_shape.TensorShape(output_shape).as_list())
-              ]
-
-            node_index = layer._inbound_nodes.index(node)  # pylint: disable=protected-access
-            for j in range(len(output_shapes)):
-              shape_key = layer.name + '_%s_%s' % (node_index, j)
-              layers_to_output_shapes[shape_key] = output_shapes[j]
-
-        # Read final output shapes from layers_to_output_shapes.
-        output_shapes = []
-        for i in range(len(self._output_layers)):
-          layer, node_index, tensor_index = self._output_coordinates[i]
-          shape_key = layer.name + '_%s_%s' % (node_index, tensor_index)
-          output_shapes.append(layers_to_output_shapes[shape_key])
-        # Store in cache.
-        self._output_shape_cache[cache_key] = output_shapes
-    else:
-      # Cache hit.
-      output_shapes = self._output_shape_cache[cache_key]
-
-    if isinstance(output_shapes, list):
-      if len(output_shapes) == 1:
-        return tensor_shape.TensorShape(output_shapes[0])
-      else:
-        return [tensor_shape.TensorShape(shape) for shape in output_shapes]
-    else:
-      return tensor_shape.TensorShape(output_shapes)
-
-  def _run_internal_graph(self, inputs, masks=None):
-    """Computes output tensors for new inputs.
-
-    # Note:
-        - Expects `inputs` to be a list (potentially with 1 element).
-        - Can be run on non-Keras tensors.
-
-    Arguments:
-        inputs: List of tensors
-        masks: List of masks (tensors or None).
-
-    Returns:
-        A tuple `(output_tensors, output_masks)`.
-    """
-    # Note: masking support is relevant mainly for Keras.
-    # It cannot be factored out without having to fully reimplement the network
-    # calling logic on the Keras side. We choose to incorporate it in
-    # GraphNetwork because 1) it may be useful to fully support in tf.layers in
-    # the future and 2) Keras is a major user of GraphNetwork.  If you don't
-    # use masking, it does not interfere with regular behavior at all and you
-    # can ignore it.
-    if masks is None:
-      masks = [None for _ in range(len(inputs))]
-
-    # Dictionary mapping reference tensors to tuples
-    # (computed tensor, compute mask)
-    # we assume a 1:1 mapping from tensor to mask
-    # TODO(fchollet): raise exception when a `.compute_mask()` call
-    # does not return a list the same size as `call`
-    tensor_map = {}
-    for x, y, mask in zip(self.inputs, inputs, masks):
-      tensor_map[str(id(x))] = (y, mask)
-
-    depth_keys = list(self._nodes_by_depth.keys())
-    depth_keys.sort(reverse=True)
-    for depth in depth_keys:
-      nodes = self._nodes_by_depth[depth]
-      for node in nodes:
-        # This is always a single layer, never a list.
-        layer = node.outbound_layer
-        reference_input_tensors = node.input_tensors
-        reference_output_tensors = node.output_tensors
-
-        # If all previous input tensors are available in tensor_map,
-        # then call node.outbound_layer on them.
-        computed_data = []  # List of tuples (input, mask).
-        for x in reference_input_tensors:
-          if str(id(x)) in tensor_map:
-            computed_data.append(tensor_map[str(id(x))])
-
-        if len(computed_data) == len(reference_input_tensors):
-          # Call layer (reapplying ops to new inputs).
-          with ops.name_scope(layer.name):
-            if node.arguments:
-              kwargs = node.arguments
-            else:
-              kwargs = {}
-            if len(computed_data) == 1:
-              computed_tensor, computed_mask = computed_data[0]
-              # Ensure mask propagation if applicable.
-              if 'mask' in estimator_util.fn_args(layer.call):
-                if 'mask' not in kwargs:
-                  kwargs['mask'] = computed_mask
-
-              output_tensors = nest.flatten(
-                  layer.call(computed_tensor, **kwargs))
-              if hasattr(layer, 'compute_mask'):
-                output_masks = nest.flatten(
-                    layer.compute_mask(computed_tensor, computed_mask))
-              else:
-                output_masks = [None for _ in range(len(output_tensors))]
-              computed_tensors = [computed_tensor]
-              computed_masks = [computed_mask]
-            else:
-              computed_tensors = [x[0] for x in computed_data]
-              computed_masks = [x[1] for x in computed_data]
-              if 'mask' in estimator_util.fn_args(layer.call):
-                if 'mask' not in kwargs:
-                  kwargs['mask'] = computed_masks
-              output_tensors = nest.flatten(
-                  layer.call(computed_tensors, **kwargs))
-              if hasattr(layer, 'compute_mask'):
-                output_masks = nest.flatten(
-                    layer.compute_mask(computed_tensors, computed_masks))
-              else:
-                output_masks = [None for _ in range(len(output_tensors))]
-
-            if context.in_graph_mode():
-              if layer.activity_regularizer is not None:
-                regularization_losses = [
-                    layer.activity_regularizer(x) for x in output_tensors
-                ]
-                # Apply activity regularizer if any:
-                layer.add_loss(regularization_losses, computed_tensors)
-
-          # Update tensor_map.
-          for x, y, mask in zip(reference_output_tensors, output_tensors,
-                                output_masks):
-            tensor_map[str(id(x))] = (y, mask)
-
-    output_tensors = []
-    output_masks = []
-    output_shapes = []
-    for x in self.outputs:
-      assert str(id(x)) in tensor_map, 'Could not compute output ' + str(x)
-      tensor, mask = tensor_map[str(id(x))]
-      output_shapes.append(layers_util.static_shape(x))
-      output_tensors.append(tensor)
-      output_masks.append(mask)
-
-    if len(output_tensors) == 1:
-      output_tensors = output_tensors[0]
-      if output_shapes is not None:
-        output_shapes = output_shapes[0]
-      if output_masks is not None:
-        output_masks = output_masks[0]
-
-    if context.in_graph_mode():
-      # Update cache;
-      # keys are based on ids on input tensors and inputs masks.
-      cache_key = (layers_util.object_list_uid(inputs)
-                   + '_' + layers_util.object_list_uid(masks))
-      self._output_tensor_cache[cache_key] = output_tensors
-      self._output_mask_cache[cache_key] = output_masks
-
-      if output_shapes is not None:
-        input_shapes = [layers_util.static_shape(x) for x in inputs]
-        cache_key = layers_util.object_list_uid(input_shapes)
-        self._output_shape_cache[cache_key] = output_shapes
-
-    return output_tensors, output_masks
-
-
-def _make_node_key(layer_name, node_index):
-  return layer_name + '_ib-' + str(node_index)
diff --git a/tensorflow/python/layers/network_test.py b/tensorflow/python/layers/network_test.py
deleted file mode 100644
index cc6e8ca9f41cd1f6aa0a3f64d7ce11ac24c04967..0000000000000000000000000000000000000000
--- a/tensorflow/python/layers/network_test.py
+++ /dev/null
@@ -1,633 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Tests for tf.layers.network."""
-
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-import numpy as np
-
-from tensorflow.python.eager import context
-from tensorflow.python.framework import constant_op
-from tensorflow.python.framework import test_util
-from tensorflow.python.layers import base as base_layers
-from tensorflow.python.layers import core as core_layers
-from tensorflow.python.layers import network as network_layers
-from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import math_ops
-from tensorflow.python.ops import sparse_ops
-from tensorflow.python.ops import state_ops
-from tensorflow.python.platform import test
-
-
-class BaseLayerCompatibilityTest(test.TestCase):
-
-  def test_get_updates(self):
-
-    class MyLayer(base_layers.Layer):
-
-      def build(self, input_shape):
-        self.a = self.add_variable('a',
-                                   (1, 1),
-                                   'float32',
-                                   trainable=False)
-        self.b = self.add_variable('b',
-                                   (1, 1),
-                                   'float32',
-                                   trainable=False)
-        self.add_update(state_ops.assign_add(self.a, [[1.]]))
-        self.built = True
-
-      def call(self, inputs):
-        self.add_update(state_ops.assign_add(self.a, inputs),
-                        inputs=True)
-        return inputs + 1
-
-    x1 = network_layers.Input(shape=(1,))
-    layer = MyLayer()
-    _ = layer.apply(x1)
-
-    self.assertEqual(len(layer.updates), 2)
-    self.assertEqual(len(layer.get_updates_for(x1)), 1)
-    self.assertEqual(len(layer.get_updates_for(None)), 1)
-
-    x2 = network_layers.Input(shape=(1,))
-    y2 = layer.apply(x2)
-
-    self.assertEqual(len(layer.updates), 3)
-    self.assertEqual(len(layer.get_updates_for(x1)), 1)
-    self.assertEqual(len(layer.get_updates_for(x2)), 1)
-    self.assertEqual(len(layer.get_updates_for(None)), 1)
-
-    network = network_layers.GraphNetwork(x2, y2)
-    self.assertEqual(len(network.updates), 2)
-    self.assertEqual(len(network.get_updates_for(x1)), 0)
-    self.assertEqual(len(network.get_updates_for(x2)), 1)
-    self.assertEqual(len(network.get_updates_for(None)), 1)
-
-    x3 = network_layers.Input(shape=(1,))
-    _ = layer.apply(x3)
-    self.assertEqual(len(network.updates), 2)
-
-    x4 = network_layers.Input(shape=(1,))
-    _ = network(x4)
-    self.assertEqual(len(network.updates), 3)
-    self.assertEqual(len(network.get_updates_for(x2)), 1)
-    self.assertEqual(len(network.get_updates_for(x4)), 1)
-    self.assertEqual(len(network.get_updates_for(None)), 1)
-
-    network.add_update(state_ops.assign_add(layer.a, [[1]]))
-    self.assertEqual(len(network.updates), 4)
-    self.assertEqual(len(network.get_updates_for(None)), 2)
-
-    network.add_update(state_ops.assign_add(layer.a, x4), inputs=True)
-    self.assertEqual(len(network.updates), 5)
-    self.assertEqual(len(network.get_updates_for(x4)), 2)
-
-  def test_get_losses(self):
-
-    class MyLayer(base_layers.Layer):
-
-      def build(self, input_shape):
-        self.a = self.add_variable('a',
-                                   (1, 1),
-                                   'float32',
-                                   trainable=False)
-        self.b = self.add_variable('b',
-                                   (1, 1),
-                                   'float32',
-                                   trainable=False)
-        self.add_loss(math_ops.reduce_sum(self.a))
-        self.built = True
-
-      def call(self, inputs):
-        self.add_loss(math_ops.reduce_sum(inputs),
-                      inputs=True)
-        return inputs + 1
-
-    x1 = network_layers.Input(shape=(1,))
-    layer = MyLayer()
-    _ = layer.apply(x1)
-
-    self.assertEqual(len(layer.losses), 2)
-    self.assertEqual(len(layer.get_losses_for(x1)), 1)
-    self.assertEqual(len(layer.get_losses_for(None)), 1)
-
-    x2 = network_layers.Input(shape=(1,))
-    y2 = layer.apply(x2)
-
-    self.assertEqual(len(layer.losses), 3)
-    self.assertEqual(len(layer.get_losses_for(x1)), 1)
-    self.assertEqual(len(layer.get_losses_for(x2)), 1)
-    self.assertEqual(len(layer.get_losses_for(None)), 1)
-
-    network = network_layers.GraphNetwork(x2, y2)
-    self.assertEqual(len(network.losses), 2)
-    self.assertEqual(len(network.get_losses_for(x1)), 0)
-    self.assertEqual(len(network.get_losses_for(x2)), 1)
-    self.assertEqual(len(network.get_losses_for(None)), 1)
-
-    x3 = network_layers.Input(shape=(1,))
-    _ = layer.apply(x3)
-    self.assertEqual(len(network.losses), 2)
-
-    x4 = network_layers.Input(shape=(1,))
-    _ = network(x4)
-    self.assertEqual(len(network.losses), 3)
-    self.assertEqual(len(network.get_losses_for(x2)), 1)
-    self.assertEqual(len(network.get_losses_for(x4)), 1)
-    self.assertEqual(len(network.get_losses_for(None)), 1)
-
-    network.add_loss(math_ops.reduce_sum(layer.a))
-    self.assertEqual(len(network.losses), 4)
-    self.assertEqual(len(network.get_losses_for(None)), 2)
-
-    network.add_loss(math_ops.reduce_sum(x4), inputs=True)
-    self.assertEqual(len(network.losses), 5)
-    self.assertEqual(len(network.get_losses_for(x4)), 2)
-
-  def testTopologicalAttributes(self):
-    # test layer attributes / methods related to cross-layer connectivity.
-    a = network_layers.Input(shape=(32,), name='input_a')
-    b = network_layers.Input(shape=(32,), name='input_b')
-
-    # test input, output, input_shape, output_shape
-    test_layer = core_layers.Dense(16, name='test_layer')
-    a_test = test_layer(a)
-    self.assertEqual(test_layer.input, a)
-    self.assertEqual(test_layer.output, a_test)
-    self.assertEqual(test_layer.input_shape, (None, 32))
-    self.assertEqual(test_layer.output_shape, (None, 16))
-
-    # test `get_*_at` methods
-    dense = core_layers.Dense(16, name='dense_1')
-    a_2 = dense(a)
-    b_2 = dense(b)
-
-    self.assertEqual(dense.get_input_at(0), a)
-    self.assertEqual(dense.get_input_at(1), b)
-    self.assertEqual(dense.get_output_at(0), a_2)
-    self.assertEqual(dense.get_output_at(1), b_2)
-    self.assertEqual(dense.get_input_shape_at(0), (None, 32))
-    self.assertEqual(dense.get_input_shape_at(1), (None, 32))
-    self.assertEqual(dense.get_output_shape_at(0), (None, 16))
-    self.assertEqual(dense.get_output_shape_at(1), (None, 16))
-
-    # Test invalid value for attribute retrieval.
-    with self.assertRaises(ValueError):
-      dense.get_input_at(2)
-    with self.assertRaises(AttributeError):
-      new_dense = core_layers.Dense(16)
-      _ = new_dense.input
-    with self.assertRaises(AttributeError):
-      new_dense = core_layers.Dense(16)
-      _ = new_dense.output
-    with self.assertRaises(AttributeError):
-      new_dense = core_layers.Dense(16)
-      _ = new_dense.output_shape
-    with self.assertRaises(AttributeError):
-      new_dense = core_layers.Dense(16)
-      _ = new_dense.input_shape
-    with self.assertRaises(AttributeError):
-      new_dense = core_layers.Dense(16)
-      a = network_layers.Input(shape=(3, 32))
-      a = network_layers.Input(shape=(5, 32))
-      a_2 = dense(a)
-      b_2 = dense(b)
-      _ = new_dense.input_shape
-    with self.assertRaises(AttributeError):
-      new_dense = core_layers.Dense(16)
-      a = network_layers.Input(shape=(3, 32))
-      a = network_layers.Input(shape=(5, 32))
-      a_2 = dense(a)
-      b_2 = dense(b)
-      _ = new_dense.output_shape
-
-  def testTopologicalAttributesMultiOutputLayer(self):
-
-    class PowersLayer(base_layers.Layer):
-
-      def call(self, inputs):
-        return [inputs**2, inputs**3]
-
-    x = network_layers.Input(shape=(32,))
-    test_layer = PowersLayer()
-    p1, p2 = test_layer(x)  # pylint: disable=not-callable
-
-    self.assertEqual(test_layer.input, x)
-    self.assertEqual(test_layer.output, [p1, p2])
-    self.assertEqual(test_layer.input_shape, (None, 32))
-    self.assertEqual(test_layer.output_shape, [(None, 32), (None, 32)])
-
-  def testTopologicalAttributesMultiInputLayer(self):
-
-    class AddLayer(base_layers.Layer):
-
-      def call(self, inputs):
-        assert len(inputs) == 2
-        return inputs[0] + inputs[1]
-
-    a = network_layers.Input(shape=(32,))
-    b = network_layers.Input(shape=(32,))
-    test_layer = AddLayer()
-    y = test_layer([a, b])  # pylint: disable=not-callable
-
-    self.assertEqual(test_layer.input, [a, b])
-    self.assertEqual(test_layer.output, y)
-    self.assertEqual(test_layer.input_shape, [(None, 32), (None, 32)])
-    self.assertEqual(test_layer.output_shape, (None, 32))
-
-
-class NetworkTest(test.TestCase):
-
-  def testBasicNetwork(self):
-    # minimum viable network
-    x = network_layers.Input(shape=(32,))
-    dense = core_layers.Dense(2)
-    y = dense(x)
-    network = network_layers.GraphNetwork(x, y, name='dense_network')
-
-    # test basic attributes
-    self.assertEqual(network.name, 'dense_network')
-    self.assertEqual(len(network.layers), 2)  # InputLayer + Dense
-    self.assertEqual(network.layers[1], dense)
-    self.assertEqual(network.weights, dense.weights)
-    self.assertEqual(network.trainable_weights, dense.trainable_weights)
-    self.assertEqual(network.non_trainable_weights, dense.non_trainable_weights)
-
-    # test callability on Input
-    x_2 = network_layers.Input(shape=(32,))
-    y_2 = network(x_2)
-    self.assertEqual(y_2.get_shape().as_list(), [None, 2])
-
-    # test callability on regular tensor
-    x_2 = array_ops.placeholder(dtype='float32', shape=(None, 32))
-    y_2 = network(x_2)
-    self.assertEqual(y_2.get_shape().as_list(), [None, 2])
-
-    # test network `trainable` attribute
-    network.trainable = False
-    self.assertEqual(network.weights, dense.weights)
-    self.assertEqual(network.trainable_weights, [])
-    self.assertEqual(network.non_trainable_weights,
-                     dense.trainable_weights + dense.non_trainable_weights)
-
-  def test_node_construction(self):
-    # test graph topology construction basics
-    a = network_layers.Input(shape=(32,), name='input_a')
-    b = network_layers.Input(shape=(32,), name='input_b')
-
-    self.assertEqual(a.get_shape().as_list(), [None, 32])
-    a_layer, a_node_index, a_tensor_index = a._keras_history
-    b_layer, _, _ = b._keras_history
-    self.assertEqual(len(a_layer._inbound_nodes), 1)
-    self.assertEqual(a_tensor_index, 0)
-    node = a_layer._inbound_nodes[a_node_index]
-    self.assertEqual(node.outbound_layer, a_layer)
-
-    self.assertEqual(node.inbound_layers, [])
-    self.assertEqual(node.input_tensors, [a])
-    self.assertEqual(node.input_shapes, [(None, 32)])
-    self.assertEqual(node.output_tensors, [a])
-    self.assertEqual(node.output_shapes, [(None, 32)])
-
-    dense = core_layers.Dense(16, name='dense_1')
-    dense(a)
-    dense(b)
-
-    self.assertEqual(len(dense._inbound_nodes), 2)
-    self.assertEqual(len(dense._outbound_nodes), 0)
-    self.assertEqual(dense._inbound_nodes[0].inbound_layers, [a_layer])
-    self.assertEqual(dense._inbound_nodes[0].outbound_layer, dense)
-    self.assertEqual(dense._inbound_nodes[1].inbound_layers, [b_layer])
-    self.assertEqual(dense._inbound_nodes[1].outbound_layer, dense)
-    self.assertEqual(dense._inbound_nodes[0].input_tensors, [a])
-    self.assertEqual(dense._inbound_nodes[1].input_tensors, [b])
-
-    # Test config
-    config_0 = dense._inbound_nodes[0].get_config()
-    self.assertEqual(config_0['outbound_layer'], dense.name)
-
-  def testMultiInputNetwork(self):
-    a = network_layers.Input(shape=(32,), name='input_a')
-    b = network_layers.Input(shape=(32,), name='input_b')
-
-    class AddLayer(base_layers.Layer):
-
-      def call(self, inputs):
-        assert len(inputs) == 2
-        return inputs[0] + inputs[1]
-
-    c = AddLayer()([a, b])  # pylint: disable=not-callable
-    network = network_layers.GraphNetwork([a, b], c)
-    self.assertEqual(len(network.layers), 3)  # 2 * InputLayer + AddLayer
-
-    # Test callability.
-    a2 = network_layers.Input(shape=(32,))
-    b2 = network_layers.Input(shape=(32,))
-    c2 = network([a2, b2])
-    self.assertEqual(c2.get_shape().as_list(), [None, 32])
-
-  def testMultiOutputNetwork(self):
-    x = network_layers.Input(shape=(32,))
-    y1 = core_layers.Dense(2)(x)
-    y2 = core_layers.Dense(3)(x)
-    network = network_layers.GraphNetwork(x, [y1, y2])
-
-    self.assertEqual(len(network.layers), 3)  # InputLayer + 2 * Dense
-
-    # Test callability.
-    x2 = network_layers.Input(shape=(32,))
-    outputs = network(x2)
-
-    self.assertEqual(type(outputs), list)
-    self.assertEqual(len(outputs), 2)
-    self.assertEqual(outputs[0].get_shape().as_list(), [None, 2])
-    self.assertEqual(outputs[1].get_shape().as_list(), [None, 3])
-
-  def testMultiInputMultiOutputNetworkSharedLayer(self):
-    a = network_layers.Input(shape=(32,), name='input_a')
-    b = network_layers.Input(shape=(32,), name='input_b')
-
-    dense = core_layers.Dense(2)
-
-    y1 = dense(a)
-    y2 = dense(b)
-    network = network_layers.GraphNetwork([a, b], [y1, y2])
-    self.assertEqual(len(network.layers), 3)  # 2 * InputLayer + Dense
-
-    # Test callability.
-    a2 = network_layers.Input(shape=(32,))
-    b2 = network_layers.Input(shape=(32,))
-    outputs = network([a2, b2])
-
-    self.assertEqual(type(outputs), list)
-    self.assertEqual(len(outputs), 2)
-    self.assertEqual(outputs[0].get_shape().as_list(), [None, 2])
-    self.assertEqual(outputs[1].get_shape().as_list(), [None, 2])
-
-  def testCrossDataFlows(self):
-    # Test the ability to have multi-output layers with outputs that get routed
-    # to separate layers
-
-    class PowersLayer(base_layers.Layer):
-
-      def call(self, inputs):
-        return [inputs**2, inputs**3]
-
-    x = network_layers.Input(shape=(32,))
-    p1, p2 = PowersLayer()(x)  # pylint: disable=not-callable
-    y1 = core_layers.Dense(2)(p1)
-    y2 = core_layers.Dense(3)(p2)
-    network = network_layers.GraphNetwork(x, [y1, y2])
-
-    self.assertEqual(len(network.layers), 4)  # InputLayer + 2 * Dense + PLayer
-
-    # Test callability.
-    x2 = network_layers.Input(shape=(32,))
-    outputs = network(x2)
-
-    self.assertEqual(type(outputs), list)
-    self.assertEqual(len(outputs), 2)
-    self.assertEqual(outputs[0].get_shape().as_list(), [None, 2])
-    self.assertEqual(outputs[1].get_shape().as_list(), [None, 3])
-
-  def testNetworkAttributes(self):
-    x = network_layers.Input(shape=(32,))
-    layer = core_layers.Dense(2, kernel_regularizer=lambda x: 0.01 * (x**2))
-    z = layer(x)
-    dense = core_layers.Dense(2, name='dense')
-    dense.add_update(state_ops.assign_add(layer.kernel, layer.kernel * 2.))
-    y = dense(z)
-    net = network_layers.GraphNetwork(x, y)
-
-    # losses
-    self.assertEqual(len(net.losses), 1)
-
-    # updates
-    self.assertEqual(len(net.updates), 1)
-
-    # get_layer
-    self.assertEqual(net.get_layer('dense'), dense)
-    self.assertEqual(net.get_layer(index=2), dense)
-    with self.assertRaises(ValueError):
-      net.get_layer('dense_unknown')
-    with self.assertRaises(ValueError):
-      net.get_layer()
-    with self.assertRaises(ValueError):
-      net.get_layer(index=4)
-
-    # input, output
-    self.assertEqual(net.input, x)
-    self.assertEqual(net.output, y)
-
-    # input_shape, output_shape
-    self.assertEqual(net.input_shape, (None, 32))
-    self.assertEqual(net.output_shape, (None, 2))
-
-    # get_*_at
-    self.assertEqual(net.get_input_at(0), x)
-    self.assertEqual(net.get_output_at(0), y)
-
-    # compute_output_shape
-    self.assertEqual(net.compute_output_shape((3, 32)).as_list(), [3, 2])
-
-  def testInvalidNetworks(self):
-    # redundant inputs
-    x = network_layers.Input(shape=(32,))
-    y = core_layers.Dense(2)(x)
-    with self.assertRaises(ValueError):
-      network_layers.GraphNetwork([x, x], y)
-
-    # inputs that don't come from Input
-    x = array_ops.placeholder(dtype='float32', shape=(None, 32))
-    y = core_layers.Dense(2)(x)
-    with self.assertRaises(ValueError):
-      network_layers.GraphNetwork(x, y)
-
-    # inputs that don't come from Input but have a layer history
-    x = network_layers.Input(shape=(32,))
-    x = core_layers.Dense(32)(x)
-    y = core_layers.Dense(2)(x)
-    with self.assertRaises(ValueError):
-      network_layers.GraphNetwork(x, y)
-
-    # outputs that don't come from layers
-    x = network_layers.Input(shape=(32,))
-    y = core_layers.Dense(2)(x)
-    y = 2 * y
-    with self.assertRaises(ValueError):
-      network_layers.GraphNetwork(x, y)
-
-    # disconnected graphs
-    x1 = network_layers.Input(shape=(32,))
-    x2 = network_layers.Input(shape=(32,))
-    y = core_layers.Dense(2)(x1)
-    with self.assertRaises(ValueError):
-      network_layers.GraphNetwork(x2, y)
-
-    # redundant layer names
-    x = network_layers.Input(shape=(32,))
-    z = core_layers.Dense(2, name='dense')(x)
-    y = core_layers.Dense(2, name='dense')(z)
-    with self.assertRaises(ValueError):
-      network_layers.GraphNetwork(x, y)
-
-  def testInputTensorWrapping(self):
-    x = array_ops.placeholder(dtype='float32', shape=(None, 32))
-    x = network_layers.Input(tensor=x)
-    y = core_layers.Dense(2)(x)
-    network_layers.GraphNetwork(x, y)
-
-  def testExplicitBatchSize(self):
-    x = network_layers.Input(shape=(32,), batch_size=3)
-    y = core_layers.Dense(2)(x)
-    self.assertEqual(y.get_shape().as_list(), [3, 2])
-
-  def testNetworkRecursion(self):
-    # test the ability of networks to be used as layers inside networks.
-    a = network_layers.Input(shape=(32,))
-    b = core_layers.Dense(2)(a)
-    net = network_layers.GraphNetwork(a, b)
-
-    c = network_layers.Input(shape=(32,))
-    d = net(c)
-
-    recursive_net = network_layers.GraphNetwork(c, d)
-    self.assertEqual(len(recursive_net.layers), 2)
-    self.assertEqual(recursive_net.layers[1], net)
-    self.assertEqual(len(recursive_net.weights), 2)
-
-    # test callability
-    x = array_ops.placeholder(dtype='float32', shape=(None, 32))
-    y = recursive_net(x)
-    self.assertEqual(y.get_shape().as_list(), [None, 2])
-
-  def testSparseInput(self):
-
-    class SparseSoftmax(base_layers.Layer):
-
-      def call(self, inputs):
-        return sparse_ops.sparse_softmax(inputs)
-
-    x = network_layers.Input(shape=(32,), sparse=True)
-    y = SparseSoftmax()(x)  # pylint: disable=not-callable
-    network = network_layers.GraphNetwork(x, y)
-
-    self.assertEqual(len(network.layers), 2)
-    self.assertEqual(network.layers[0].sparse, True)
-
-  def testMaskingSingleInput(self):
-
-    class MaskedLayer(base_layers.Layer):
-
-      def call(self, inputs, mask=None):
-        if mask is not None:
-          return inputs * mask
-        return inputs
-
-      def compute_mask(self, inputs, mask=None):
-        return array_ops.ones_like(inputs)
-
-    if context.in_graph_mode():
-      x = network_layers.Input(shape=(32,))
-      y = MaskedLayer()(x)  # pylint: disable=not-callable
-      network = network_layers.GraphNetwork(x, y)
-
-      # test callability on Input
-      x_2 = network_layers.Input(shape=(32,))
-      y_2 = network(x_2)
-      self.assertEqual(y_2.get_shape().as_list(), [None, 32])
-
-      # test callability on regular tensor
-      x_2 = array_ops.placeholder(dtype='float32', shape=(None, 32))
-      y_2 = network(x_2)
-      self.assertEqual(y_2.get_shape().as_list(), [None, 32])
-    else:
-      a = constant_op.constant([2] * 32)
-      mask = constant_op.constant([0, 1] * 16)
-      a._keras_mask = mask
-      b = MaskedLayer().apply(a)
-      self.assertTrue(hasattr(b, '_keras_mask'))
-      self.assertAllEqual(self.evaluate(array_ops.ones_like(mask)),
-                          self.evaluate(getattr(b, '_keras_mask')))
-      self.assertAllEqual(self.evaluate(a * mask), self.evaluate(b))
-
-
-class DeferredModeTest(test.TestCase):
-
-  def testDeferredTensorAttributes(self):
-    x = base_layers._DeferredTensor(shape=(None, 2), dtype='float32', name='x')
-    self.assertEqual(str(x),
-                     'DeferredTensor(\'x\', shape=(?, 2), dtype=float32)')
-    self.assertEqual(repr(x),
-                     '<_DeferredTensor \'x\' shape=(?, 2) dtype=float32>')
-
-  @test_util.run_in_graph_and_eager_modes()
-  def testSimpleNetworkBuilding(self):
-    inputs = network_layers.Input(shape=(32,))
-    if context.in_eager_mode():
-      self.assertIsInstance(inputs, base_layers._DeferredTensor)
-      self.assertEqual(inputs.dtype.name, 'float32')
-      self.assertEqual(inputs.shape.as_list(), [None, 32])
-
-    x = core_layers.Dense(2)(inputs)
-    if context.in_eager_mode():
-      self.assertIsInstance(x, base_layers._DeferredTensor)
-      self.assertEqual(x.dtype.name, 'float32')
-      self.assertEqual(x.shape.as_list(), [None, 2])
-
-    outputs = core_layers.Dense(4)(x)
-    network = network_layers.GraphNetwork(inputs, outputs)
-    self.assertIsInstance(network, network_layers.GraphNetwork)
-
-    if context.in_eager_mode():
-      # It should be possible to call such a network on EagerTensors.
-      inputs = constant_op.constant(
-          np.random.random((10, 32)).astype('float32'))
-      outputs = network(inputs)
-      self.assertEqual(outputs.shape.as_list(), [10, 4])
-
-  @test_util.run_in_graph_and_eager_modes()
-  def testMultiIONetworkbuilding(self):
-    input_a = network_layers.Input(shape=(32,))
-    input_b = network_layers.Input(shape=(16,))
-    a = core_layers.Dense(16)(input_a)
-
-    class AddLayer(base_layers.Layer):
-
-      def call(self, inputs):
-        return inputs[0] + inputs[1]
-
-      def compute_output_shape(self, input_shape):
-        return input_shape[0]
-
-    c = AddLayer()([a, input_b])  # pylint: disable=not-callable
-    c = core_layers.Dense(2)(c)
-
-    network = network_layers.GraphNetwork([input_a, input_b], [a, c])
-    if context.in_eager_mode():
-      a_val = constant_op.constant(
-          np.random.random((10, 32)).astype('float32'))
-      b_val = constant_op.constant(
-          np.random.random((10, 16)).astype('float32'))
-      outputs = network([a_val, b_val])
-      self.assertEqual(len(outputs), 2)
-      self.assertEqual(outputs[0].shape.as_list(), [10, 16])
-      self.assertEqual(outputs[1].shape.as_list(), [10, 2])
-
-if __name__ == '__main__':
-  test.main()
diff --git a/tensorflow/python/layers/utils.py b/tensorflow/python/layers/utils.py
index 1bbf4e6dffd3415ba246e26cd92923df8116edab..3b156c36a2ff35fb9e05af1406d7b3f6cf883394 100644
--- a/tensorflow/python/layers/utils.py
+++ b/tensorflow/python/layers/utils.py
@@ -20,9 +20,11 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.eager import context
 from tensorflow.python.ops import variables
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.framework import ops
+from tensorflow.python.framework import smart_cond as smart_module
 from tensorflow.python.framework import tensor_util
 from tensorflow.python.util import nest
 
@@ -178,67 +180,56 @@ def deconv_output_length(input_length, filter_size, padding, stride):
   return input_length
 
 
-def smart_cond(pred, fn1, fn2, name=None):
-  """Return either `fn1()` or `fn2()` based on the boolean predicate `pred`.
+def smart_cond(pred, true_fn=None, false_fn=None, name=None):
+  """Return `true_fn()` if the predicate `pred` is true, else `false_fn()`.
 
-  If `pred` is a bool or has a constant value, we return either `fn1()`
-  or `fn2()`, otherwise we use `tf.cond` to dynamically route to both.
+  If `pred` is a bool or has a constant value, we return either `true_fn()`
+  or `false_fn()`, otherwise we use `tf.cond` to dynamically route to both.
 
   Arguments:
-    pred: A scalar determining whether to return the result of `fn1` or `fn2`.
-    fn1: The callable to be performed if pred is true.
-    fn2: The callable to be performed if pred is false.
+    pred: A scalar determining whether to return the result of `true_fn` or
+      `false_fn`.
+    true_fn: The callable to be performed if pred is true.
+    false_fn: The callable to be performed if pred is false.
     name: Optional name prefix when using `tf.cond`.
 
   Returns:
-    Tensors returned by the call to either `fn1` or `fn2`.
+    Tensors returned by the call to either `true_fn` or `false_fn`.
 
   Raises:
-    TypeError: If `fn1` or `fn2` is not callable.
+    TypeError: If `true_fn` or `false_fn` is not callable.
   """
-  if not callable(fn1):
-    raise TypeError('`fn1` must be callable.')
-  if not callable(fn2):
-    raise TypeError('`fn2` must be callable.')
-
-  pred_value = constant_value(pred)
-  if pred_value is not None:
-    if pred_value:
-      return fn1()
-    else:
-      return fn2()
-  else:
-    return control_flow_ops.cond(pred, true_fn=fn1, false_fn=fn2, name=name)
+  if isinstance(pred, variables.Variable):
+    return control_flow_ops.cond(
+        pred, true_fn=true_fn, false_fn=false_fn, name=name)
+  return smart_module.smart_cond(
+      pred, true_fn=true_fn, false_fn=false_fn, name=name)
 
 
 def constant_value(pred):
   """Return the bool value for `pred`, or None if `pred` had a dynamic value.
 
-  Arguments:
-    pred: A scalar, either a Python bool or a TensorFlow boolean variable
-      or tensor, or the Python integer 1 or 0.
+    Arguments:
+      pred: A scalar, either a Python bool or a TensorFlow boolean variable
+        or tensor, or the Python integer 1 or 0.
 
-  Returns:
-    True or False if `pred` has a constant boolean value, None otherwise.
+    Returns:
+      True or False if `pred` has a constant boolean value, None otherwise.
 
-  Raises:
-    TypeError: If `pred` is not a Variable, Tensor or bool.
-  """
+    Raises:
+      TypeError: If `pred` is not a Variable, Tensor, bool, or the Python
+        integer 1 or 0.
+    """
   # Allow integer booleans.
-  if pred == 0:
-    pred = False
-  elif pred == 1:
-    pred = True
-
-  if isinstance(pred, bool):
-    pred_value = pred
-  elif isinstance(pred, variables.Variable):
-    pred_value = None
-  elif isinstance(pred, ops.Tensor):
-    pred_value = tensor_util.constant_value(pred)
-  else:
-    raise TypeError('`pred` must be a Tensor, a Variable, or a Python bool.')
-  return pred_value
+  if isinstance(pred, int):
+    if pred == 1:
+      pred = True
+    elif pred == 0:
+      pred = False
+
+  if isinstance(pred, variables.Variable):
+    return None
+  return smart_module.smart_constant_value(pred)
 
 
 def object_list_uid(object_list):
diff --git a/tensorflow/python/lib/core/py_func.cc b/tensorflow/python/lib/core/py_func.cc
index e0422ef80add42307268be2743e668eb8c8acb68..343415b2645e00003e51fad18cbb1ec602db472d 100644
--- a/tensorflow/python/lib/core/py_func.cc
+++ b/tensorflow/python/lib/core/py_func.cc
@@ -79,10 +79,11 @@ Status MakeArgTuple(const PyCall* call, PyObject** tuple) {
     const Tensor& t = call->ins[i];
     if (call->eager) {
       if (call->gpu) {
-        arg = EagerTensorFromHandle(new TFE_TensorHandle(t, call->device));
+        arg = EagerTensorFromHandle(
+            new TFE_TensorHandle(t, call->device, call->device));
       } else {
         // TFE_TensorHandle assumes that CPU is identified by `nullptr`.
-        arg = EagerTensorFromHandle(new TFE_TensorHandle(t, nullptr));
+        arg = EagerTensorFromHandle(new TFE_TensorHandle(t, nullptr, nullptr));
       }
       if (arg == nullptr) {
         return errors::Internal("Unable to procure EagerTensor from Tensor.");
diff --git a/tensorflow/python/ops/accumulate_n_benchmark.py b/tensorflow/python/ops/accumulate_n_benchmark.py
index c58d36f39705ecf0f24214ce4ba4574e70a93e77..a709066cae4da2811b3e98d2e93bf44ec12dcee6 100644
--- a/tensorflow/python/ops/accumulate_n_benchmark.py
+++ b/tensorflow/python/ops/accumulate_n_benchmark.py
@@ -39,7 +39,7 @@ from tensorflow.python.platform import test
 class AccumulateNBenchmark(test.Benchmark):
 
   def _AccumulateNTemplate(self, inputs, init, shape, validate_shape):
-    var = gen_state_ops._temporary_variable(
+    var = gen_state_ops.temporary_variable(
         shape=shape, dtype=inputs[0].dtype.base_dtype)
     ref = state_ops.assign(var, init, validate_shape=validate_shape)
     update_ops = [
@@ -47,8 +47,7 @@ class AccumulateNBenchmark(test.Benchmark):
             ref, tensor, use_locking=True).op for tensor in inputs
     ]
     with ops.control_dependencies(update_ops):
-      return gen_state_ops._destroy_temporary_variable(
-          ref, var_name=var.op.name)
+      return gen_state_ops.destroy_temporary_variable(ref, var_name=var.op.name)
 
   def _AccumulateNInitializedWithFirst(self, inputs):
     return self._AccumulateNTemplate(
@@ -60,7 +59,7 @@ class AccumulateNBenchmark(test.Benchmark):
   def _AccumulateNInitializedWithMerge(self, inputs):
     return self._AccumulateNTemplate(
         inputs,
-        init=array_ops.zeros_like(gen_control_flow_ops._merge(inputs)[0]),
+        init=array_ops.zeros_like(gen_control_flow_ops.merge(inputs)[0]),
         shape=tensor_shape.vector(0),
         validate_shape=False)
 
diff --git a/tensorflow/python/ops/array_grad.py b/tensorflow/python/ops/array_grad.py
index 9745d38dc23dba806a2d0dd2ef588a5a950aa05c..925cf8ef32b70658801a98ed6bdd2bb6046ce14e 100644
--- a/tensorflow/python/ops/array_grad.py
+++ b/tensorflow/python/ops/array_grad.py
@@ -139,7 +139,6 @@ def _ConcatGradHelper(op, grad, start_value_index, end_value_index, dim_index):
       # on CPUs and a Maxwell TitanX.  A speedup was seen in a large majority of
       # cases when switching implementations at N=16, but it is possible that
       # there will be a small number of performance regressions.
-      # pylint: disable=protected-access
       if len(sizes) > 16:
         # extract the size of each input along the concat dimension
         sizes = array_ops.squeeze(
@@ -148,10 +147,9 @@ def _ConcatGradHelper(op, grad, start_value_index, end_value_index, dim_index):
                 [1, -1]))
         out_grads = array_ops.split(grad, sizes, non_neg_concat_dim)
       else:
-        offset = gen_array_ops._concat_offset(non_neg_concat_dim, sizes)
+        offset = gen_array_ops.concat_offset(non_neg_concat_dim, sizes)
         for (begin, size) in zip(offset, sizes):
           out_grads.append(array_ops.slice(grad, begin, size))
-      # pylint: enable=protected-access
   elif isinstance(grad, ops.IndexedSlices):
     # Using mod here for convenience since concat_dim is already verified
     # in concat implementation to be within the allowed [-rank, rank) range.
@@ -627,9 +625,7 @@ def _ReverseSequenceGrad(op, grad):
 @ops.RegisterGradient("Reverse")
 def _ReverseGrad(op, grad):
   reverse_dims = op.inputs[1]
-  # pylint: disable=protected-access
-  return gen_array_ops._reverse(grad, reverse_dims), None
-  # pylint: enable=protected-access
+  return gen_array_ops.reverse(grad, reverse_dims), None
 
 
 @ops.RegisterGradient("ReverseV2")
@@ -700,17 +696,13 @@ ops.NotDifferentiable("OneHot")
 @ops.RegisterGradient("MirrorPad")
 def _MirrorPadGrad(op, grad):
   mode = op.get_attr("mode")
-  # pylint: disable=protected-access
-  return [gen_array_ops._mirror_pad_grad(grad, op.inputs[1], mode=mode), None]
-  # pylint: enable=protected-access
+  return [gen_array_ops.mirror_pad_grad(grad, op.inputs[1], mode=mode), None]
 
 
 @ops.RegisterGradient("MirrorPadGrad")
 def _MirrorPadGradGrad(op, grad):
   mode = op.get_attr("mode")
-  # pylint: disable=protected-access
-  return [gen_array_ops._mirror_pad(grad, op.inputs[1], mode=mode), None]
-  # pylint: enable=protected-access
+  return [gen_array_ops.mirror_pad(grad, op.inputs[1], mode=mode), None]
 
 
 @ops.RegisterGradient("QuantizeAndDequantize")
diff --git a/tensorflow/python/ops/array_ops.py b/tensorflow/python/ops/array_ops.py
index ad409ad7e5a152bbc4312e1d16f324bb8be71c33..e0bcac0641f85dd6a625c6fdb9997e4ac49d693e 100644
--- a/tensorflow/python/ops/array_ops.py
+++ b/tensorflow/python/ops/array_ops.py
@@ -134,7 +134,10 @@ def identity(input, name=None):  # pylint: disable=redefined-builtin
     input = ops.convert_to_tensor(input)
     in_device = input.device
     # TODO(ashankar): Does 'identity' need to invoke execution callbacks?
-    if context.context().device_name != in_device:
+    context_device = context.context().device_name
+    if not context_device:
+      context_device = "/job:localhost/replica:0/task:0/device:CPU:0"
+    if context_device != in_device:
       return input._copy()  # pylint: disable=protected-access
     return input
 
@@ -195,7 +198,7 @@ def expand_dims(input, axis=None, name=None, dim=None):
     if axis is not None:
       raise ValueError("can't specify both 'dim' and 'axis'")
     axis = dim
-  return gen_array_ops._expand_dims(input, axis, name)
+  return gen_array_ops.expand_dims(input, axis, name)
 
 
 # pylint: enable=redefined-builtin,protected-access
@@ -208,28 +211,25 @@ def expand_dims(input, axis=None, name=None, dim=None):
     "This op will be removed after the deprecation date. "
     "Please switch to tf.setdiff1d().")
 def listdiff(x, y, out_idx=None, name=None):
-  return gen_array_ops._list_diff(x, y, out_idx, name)
+  return gen_array_ops.list_diff(x, y, out_idx, name)
 
 
-listdiff.__doc__ = gen_array_ops._list_diff.__doc__ + "\n" + listdiff.__doc__
+listdiff.__doc__ = gen_array_ops.list_diff.__doc__ + "\n" + listdiff.__doc__
 
 # pylint: enable=protected-access
 
 
-# pylint: disable=undefined-variable,protected-access
+# pylint: disable=undefined-variable
 @tf_export("setdiff1d")
 def setdiff1d(x, y, index_dtype=dtypes.int32, name=None):
-  return gen_array_ops._list_diff(x, y, index_dtype, name)
+  return gen_array_ops.list_diff(x, y, index_dtype, name)
 
 
-setdiff1d.__doc__ = gen_array_ops._list_diff.__doc__
-
-# pylint: enable=protected-access
+setdiff1d.__doc__ = gen_array_ops.list_diff.__doc__
 
 
 @tf_export("broadcast_dynamic_shape")
 def broadcast_dynamic_shape(shape_x, shape_y):
-  # pylint: disable=protected-access
   """Returns the broadcasted dynamic shape between `shape_x` and `shape_y`.
 
   Args:
@@ -239,8 +239,7 @@ def broadcast_dynamic_shape(shape_x, shape_y):
   Returns:
     A rank 1 integer `Tensor` representing the broadcasted shape.
   """
-  return gen_array_ops._broadcast_args(shape_x, shape_y)
-  # pylint: enable=protected-access
+  return gen_array_ops.broadcast_args(shape_x, shape_y)
 
 
 @tf_export("broadcast_static_shape")
@@ -386,16 +385,26 @@ def size_internal(input, name=None, optimize=True, out_type=dtypes.int32):
   Returns:
     A `Tensor` of type `out_type`. Defaults to `tf.int32`.
   """
+  if context.in_eager_mode() and not isinstance(
+      input, (sparse_tensor.SparseTensor,
+              sparse_tensor.SparseTensorValue)):
+    size_ = 1
+    for dim in ops.convert_to_tensor(input)._shape_tuple():  # pylint: disable=protected-access
+      size_ *= dim
+    return size_
   with ops.name_scope(name, "Size", [input]) as name:
     if isinstance(input, (sparse_tensor.SparseTensor,
                           sparse_tensor.SparseTensorValue)):
-      return gen_math_ops._prod(
+      return gen_math_ops.prod(
           gen_math_ops.cast(input.dense_shape, out_type), 0, name=name)
     else:
       input_tensor = ops.convert_to_tensor(input)
       input_shape = input_tensor.get_shape()
-      if optimize and input_shape.is_fully_defined():
-        return constant(input_shape.num_elements(), out_type, name=name)
+      if optimize:
+        if input_shape.is_fully_defined():
+          return constant(input_shape.num_elements(), out_type, name=name)
+        if input_shape.dims and any(dim == 0 for dim in input_shape.dims):
+          return constant(0, out_type, name=name)
       return gen_array_ops.size(input, name=name, out_type=out_type)
 
 
@@ -605,7 +614,7 @@ def slice(input_, begin, size, name=None):
 
   Note that @{tf.Tensor.__getitem__} is typically a more pythonic way to
   perform slices, as it allows you to write `foo[3:7, :-2]` instead of
-  `tf.slice([3, 0], [4, foo.get_shape()[1]-2])`.
+  `tf.slice(foo, [3, 0], [4, foo.get_shape()[1]-2])`.
 
   `begin` is zero-based; `size` is one-based. If `size[i]` is -1,
   all remaining elements in dimension i are included in the
@@ -879,7 +888,7 @@ def parallel_stack(values, name="parallel_stack"):
     output_shape = tensor_shape.TensorShape([len(values)])
     output_shape = output_shape.concatenate(value_shape)
     # expand_dims converts concat to stack.
-    return gen_array_ops._parallel_concat(
+    return gen_array_ops.parallel_concat(
         [expand_dims(value, 0) for value in values], shape=output_shape)
 
 
@@ -937,7 +946,7 @@ def stack(values, axis=0, name="stack"):
       raise ValueError("axis = %d not in [%d, %d)" % (axis, -expanded_num_dims,
                                                       expanded_num_dims))
 
-  return gen_array_ops._pack(values, axis=axis, name=name)
+  return gen_array_ops.pack(values, axis=axis, name=name)
 
 
 # pylint: disable=invalid-name
@@ -981,7 +990,7 @@ def _autopacking_helper(list_or_tuple, dtype, name):
           # convertible-to-tensor types, such as numpy arrays.
           elems_as_tensors.append(
               constant_op.constant(elem, dtype=dtype, name=str(i)))
-      return gen_array_ops._pack(elems_as_tensors, name=scope)
+      return gen_array_ops.pack(elems_as_tensors, name=scope)
     else:
       return converted_elems
 
@@ -1076,7 +1085,7 @@ def unstack(value, num=None, axis=0, name="unstack"):
       num = value_shape[axis].value
   if num is None:
     raise ValueError("Cannot infer num from shape %s" % value_shape)
-  return gen_array_ops._unpack(value, num=num, axis=axis, name=name)
+  return gen_array_ops.unpack(value, num=num, axis=axis, name=name)
 
 
 @tf_export("concat")
@@ -1173,7 +1182,7 @@ def concat(values, axis, name="concat"):
           dtype=dtypes.int32).get_shape().assert_is_compatible_with(
               tensor_shape.scalar())
       return identity(values[0], name=scope)
-  return gen_array_ops._concat_v2(values=values, axis=axis, name=name)
+  return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
 
 
 @tf_export("boolean_mask")
@@ -1241,8 +1250,7 @@ def boolean_mask(tensor, mask, name="boolean_mask", axis=None):
     axis = 0 if axis is None else axis
     shape_tensor[axis:axis + ndims_mask].assert_is_compatible_with(shape_mask)
 
-    leading_size = gen_math_ops._prod(
-        shape(tensor)[axis:axis + ndims_mask], [0])
+    leading_size = gen_math_ops.prod(shape(tensor)[axis:axis + ndims_mask], [0])
     tensor = reshape(tensor,
                      concat([
                          shape(tensor)[:axis], [leading_size],
@@ -1306,10 +1314,22 @@ def unique(x, out_idx=dtypes.int32, name=None):
   # period (3 weeks) pass.
   # TODO(yongtang): The documentation should also
   # be updated when switch  to v2.
-  return gen_array_ops._unique(x, out_idx, name)
+  return gen_array_ops.unique(x, out_idx, name)
+
+
+unique.__doc__ = gen_array_ops.unique.__doc__
+
+
+@tf_export("unique_with_counts")
+def unique_with_counts(x, out_idx=dtypes.int32, name=None):
+  # TODO(yongtang): switch to v2 once the API deprecation
+  # period (3 weeks) passes.
+  # TODO(yongtang): The documentation should also
+  # be updated when switching to v2.
+  return gen_array_ops.unique_with_counts(x, out_idx, name)
 
 
-unique.__doc__ = gen_array_ops._unique.__doc__
+unique_with_counts.__doc__ = gen_array_ops.unique_with_counts.__doc__
 
 
 @tf_export("split")
@@ -1363,20 +1383,18 @@ def split(value, num_or_size_splits, axis=0, num=None, name="split"):
   """
   size_splits = ops.convert_to_tensor(num_or_size_splits)
   if size_splits._rank() == 0 and size_splits.dtype.is_integer:
-    return gen_array_ops._split(
+    return gen_array_ops.split(
         axis=axis, num_split=num_or_size_splits, value=value, name=name)
 
   if num is None:
-    num = size_splits._shape_tuple()[0]
+    size_splits_shape = size_splits._shape_tuple()
+    if size_splits_shape:
+      num = size_splits_shape[0]
     if num is None:
       raise ValueError("Cannot infer num from shape %s" % num_or_size_splits)
 
-  return gen_array_ops._split_v(
-      value=value,
-      size_splits=size_splits,
-      axis=axis,
-      num_split=num,
-      name=name)
+  return gen_array_ops.split_v(
+      value=value, size_splits=size_splits, axis=axis, num_split=num, name=name)
 
 
 @tf_export("transpose")
@@ -1390,6 +1408,14 @@ def transpose(a, perm=None, name="transpose", conjugate=False):
   `a.dtype` is either `complex64` or `complex128` then the values of `a`
   are conjugated and transposed.
 
+  @compatibility(numpy)
+  In `numpy`, transposes are memory-efficient constant-time operations as they
+  simply return a new view of the same data with adjusted `strides`.
+
+  TensorFlow does not support strides, so `transpose` returns a new tensor with
+  the items permuted.
+  @end_compatibility
+
   For example:
 
   ```python
@@ -1438,7 +1464,7 @@ def transpose(a, perm=None, name="transpose", conjugate=False):
   """
   with ops.name_scope(name, "transpose", [a]) as name:
     transpose_fn = (
-        gen_array_ops._conjugate_transpose
+        gen_array_ops.conjugate_transpose
         if (conjugate and a.dtype.is_complex) else gen_array_ops.transpose)
     if perm is None:
       rank = gen_array_ops.rank(a)
@@ -1490,6 +1516,14 @@ def matrix_transpose(a, name="matrix_transpose", conjugate=False):
   tf.matmul(matrix, tf.matrix_transpose(b))
   ```
 
+  @compatibility(numpy)
+  In `numpy`, transposes are memory-efficient constant-time operations as they
+  simply return a new view of the same data with adjusted `strides`.
+
+  TensorFlow does not support strides, so `matrix_transpose` returns a new
+  tensor with the items permuted.
+  @end_compatibility
+
   Args:
     a: A `Tensor` with `rank >= 2`.
     name: A name for the operation (optional).
@@ -1608,7 +1642,7 @@ def zeros_like(tensor, dtype=None, name=None, optimize=True):
         return zeros(
             shape_internal(tensor, optimize=optimize), dtype=dtype, name=name)
       with ops.device(tensor.device):
-        return gen_array_ops._zeros_like(tensor, name=name)
+        return gen_array_ops.zeros_like(tensor, name=name)
 
     # For now, variant types must be created via zeros_like; as we need to
     # pass the input variant object to the proper zeros callback.
@@ -1623,7 +1657,7 @@ def zeros_like(tensor, dtype=None, name=None, optimize=True):
       return zeros(
           shape_internal(tensor, optimize=optimize), dtype=dtype, name=name)
     else:
-      return gen_array_ops._zeros_like(tensor, name=name)
+      return gen_array_ops.zeros_like(tensor, name=name)
 
 
 @tf_export("ones_like")
@@ -1744,7 +1778,7 @@ def placeholder(dtype, shape=None, name=None):
     raise RuntimeError("tf.placeholder() is not compatible with "
                        "eager execution.")
 
-  return gen_array_ops._placeholder(dtype=dtype, shape=shape, name=name)
+  return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
 
 
 # pylint: disable=redefined-outer-name
@@ -1888,15 +1922,15 @@ def pad(tensor, paddings, mode="CONSTANT", name=None, constant_values=0):  # pyl
     # TODO(rjryan): Once the forward compatibility period (3 weeks) have passed
     # remove the "Pad" fallback here.
     if constant_values != 0:
-      result = gen_array_ops._pad_v2(
+      result = gen_array_ops.pad_v2(
           tensor, paddings, constant_values, name=name)
     else:
-      result = gen_array_ops._pad(tensor, paddings, name=name)
+      result = gen_array_ops.pad(tensor, paddings, name=name)
   elif mode == "REFLECT":
-    result = gen_array_ops._mirror_pad(
+    result = gen_array_ops.mirror_pad(
         tensor, paddings, mode="REFLECT", name=name)
   elif mode == "SYMMETRIC":
-    result = gen_array_ops._mirror_pad(
+    result = gen_array_ops.mirror_pad(
         tensor, paddings, mode="SYMMETRIC", name=name)
   else:
     raise ValueError("Unknown padding mode: %s" % mode)
@@ -2126,7 +2160,7 @@ def edit_distance(hypothesis, truth, normalize=True, name="edit_distance"):
                             sparse_tensor.SparseTensorValue)):
     raise TypeError("Truth must be a SparseTensor.")
 
-  return gen_array_ops._edit_distance(
+  return gen_array_ops.edit_distance(
       hypothesis.indices,
       hypothesis.values,
       hypothesis.dense_shape,
@@ -2263,7 +2297,7 @@ def space_to_batch(input, paddings, block_size, name=None):  # pylint: disable=r
   return result
 
 
-space_to_batch.__doc__ = gen_array_ops._space_to_batch.__doc__
+space_to_batch.__doc__ = gen_array_ops.space_to_batch.__doc__
 
 
 @tf_export("space_to_depth")
@@ -2293,7 +2327,7 @@ def batch_to_space(input, crops, block_size, name=None):  # pylint: disable=rede
   return result
 
 
-batch_to_space.__doc__ = gen_array_ops._batch_to_space.__doc__
+batch_to_space.__doc__ = gen_array_ops.batch_to_space.__doc__
 
 
 @tf_export("one_hot")
@@ -2437,8 +2471,8 @@ def one_hot(indices,
       raise TypeError("dtype {0} of on_value does not match "
                       "dtype {1} of off_value".format(on_dtype, off_dtype))
 
-    return gen_array_ops._one_hot(indices, depth, on_value, off_value, axis,
-                                  name)
+    return gen_array_ops.one_hot(indices, depth, on_value, off_value, axis,
+                                 name)
 
 
 def _all_dimensions(x):
@@ -2566,7 +2600,7 @@ def squeeze(input, axis=None, name=None, squeeze_dims=None):
     axis = squeeze_dims
   if np.isscalar(axis):
     axis = [axis]
-  return gen_array_ops._squeeze(input, axis, name)
+  return gen_array_ops.squeeze(input, axis, name)
 
 
 @tf_export("where")
@@ -2617,7 +2651,7 @@ def where(condition, x=None, y=None, name=None):
           condition, preferred_dtype=dtypes.bool, name="condition")
       return gen_array_ops.where(condition=condition, name=name)
   elif x is not None and y is not None:
-    return gen_math_ops._select(condition=condition, x=x, y=y, name=name)
+    return gen_math_ops.select(condition=condition, x=x, y=y, name=name)
   else:
     raise ValueError("x and y must both be non-None or both be None.")
 
diff --git a/tensorflow/python/ops/batch_norm_benchmark.py b/tensorflow/python/ops/batch_norm_benchmark.py
index c2ee2b383231333239c6e2d4e874a0ad1cdf493e..5d68b47aeaef3a90973387ecd5b265eef1e96a5f 100644
--- a/tensorflow/python/ops/batch_norm_benchmark.py
+++ b/tensorflow/python/ops/batch_norm_benchmark.py
@@ -41,9 +41,8 @@ def batch_norm_op(tensor, mean, variance, beta, gamma, scale):
   # _batch_norm_with_global_normalization is deprecated in v9
   ops.get_default_graph().graph_def_versions.producer = 8
   # pylint: disable=protected-access
-  return gen_nn_ops._batch_norm_with_global_normalization(tensor, mean,
-                                                          variance, beta, gamma,
-                                                          0.001, scale)
+  return gen_nn_ops._batch_norm_with_global_normalization(
+      tensor, mean, variance, beta, gamma, 0.001, scale)
   # pylint: enable=protected-access
 
 
diff --git a/tensorflow/python/ops/candidate_sampling_ops.py b/tensorflow/python/ops/candidate_sampling_ops.py
index 220ef1754d2e1a2d54a8962148b47806df48e98f..9ea1ea9c92c9b016a3f9126c89ee4dc1e73c9f27 100644
--- a/tensorflow/python/ops/candidate_sampling_ops.py
+++ b/tensorflow/python/ops/candidate_sampling_ops.py
@@ -77,7 +77,7 @@ def uniform_candidate_sampler(true_classes, num_true, num_sampled, unique,
       of each of `sampled_candidates`.
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._uniform_candidate_sampler(
+  return gen_candidate_sampling_ops.uniform_candidate_sampler(
       true_classes, num_true, num_sampled, unique, range_max, seed=seed1,
       seed2=seed2, name=name)
 
@@ -136,7 +136,7 @@ def log_uniform_candidate_sampler(true_classes, num_true, num_sampled, unique,
       of each of `sampled_candidates`.
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._log_uniform_candidate_sampler(
+  return gen_candidate_sampling_ops.log_uniform_candidate_sampler(
       true_classes, num_true, num_sampled, unique, range_max, seed=seed1,
       seed2=seed2, name=name)
 
@@ -193,7 +193,7 @@ def learned_unigram_candidate_sampler(true_classes, num_true, num_sampled,
 
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._learned_unigram_candidate_sampler(
+  return gen_candidate_sampling_ops.learned_unigram_candidate_sampler(
       true_classes, num_true, num_sampled, unique, range_max, seed=seed1,
       seed2=seed2, name=name)
 
@@ -283,7 +283,7 @@ def fixed_unigram_candidate_sampler(true_classes,
 
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._fixed_unigram_candidate_sampler(
+  return gen_candidate_sampling_ops.fixed_unigram_candidate_sampler(
       true_classes, num_true, num_sampled, unique, range_max,
       vocab_file=vocab_file, distortion=distortion,
       num_reserved_ids=num_reserved_ids, num_shards=num_shards, shard=shard,
@@ -321,7 +321,7 @@ def all_candidate_sampler(true_classes, num_true, num_sampled, unique,
       of each of `sampled_candidates`. All returned values are 1.0.
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._all_candidate_sampler(
+  return gen_candidate_sampling_ops.all_candidate_sampler(
       true_classes, num_true, num_sampled, unique, seed=seed1, seed2=seed2,
       name=name)
 
@@ -370,6 +370,6 @@ def compute_accidental_hits(true_classes, sampled_candidates, num_true,
 
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_candidate_sampling_ops._compute_accidental_hits(
+  return gen_candidate_sampling_ops.compute_accidental_hits(
       true_classes, sampled_candidates, num_true, seed=seed1, seed2=seed2,
       name=name)
diff --git a/tensorflow/python/ops/check_ops.py b/tensorflow/python/ops/check_ops.py
index 0fd6e29a49c8e4e31e244bfbbfca525d72e4d811..64567ac54ae43acf6f8b674c46525db7a6c4fab7 100644
--- a/tensorflow/python/ops/check_ops.py
+++ b/tensorflow/python/ops/check_ops.py
@@ -334,9 +334,9 @@ def assert_equal(x, y, data=None, summarize=None, message=None, name=None):
     @compatibility{eager} returns None
 
   Raises:
-    InvalidArgumentError if the check can be performed immediately and
-    `x == y` is False. The check can be performed immediately during
-    eager execution or if `x` and `y` are statically known.
+    InvalidArgumentError: if the check can be performed immediately and
+      `x == y` is False. The check can be performed immediately during eager
+      execution or if `x` and `y` are statically known.
   """
   message = message or ''
   with ops.name_scope(name, 'assert_equal', [x, y, data]):
diff --git a/tensorflow/python/ops/confusion_matrix.py b/tensorflow/python/ops/confusion_matrix.py
index e4ce2ab28a15f82e80194ab17ef939411982076a..b9a93c3bedfff1f398e3b42cedf02a2f0a3ddd5c 100644
--- a/tensorflow/python/ops/confusion_matrix.py
+++ b/tensorflow/python/ops/confusion_matrix.py
@@ -99,19 +99,16 @@ def confusion_matrix(labels, predictions, num_classes=None, dtype=dtypes.int32,
                      name=None, weights=None):
   """Computes the confusion matrix from predictions and labels.
 
-  Calculate the Confusion Matrix for a pair of prediction and
-  label 1-D int arrays.
-
   The matrix columns represent the prediction labels and the rows represent the
   real labels. The confusion matrix is always a 2-D array of shape `[n, n]`,
   where `n` is the number of valid labels for a given classification task. Both
   prediction and labels must be 1-D arrays of the same shape in order for this
   function to work.
 
-  If `num_classes` is None, then `num_classes` will be set to the one plus
-  the maximum value in either predictions or labels.
-  Class labels are expected to start at 0. E.g., if `num_classes` was
-  three, then the possible labels would be `[0, 1, 2]`.
+  If `num_classes` is `None`, then `num_classes` will be set to one plus the
+  maximum value in either predictions or labels. Class labels are expected to
+  start at 0. For example, if `num_classes` is 3, then the possible labels
+  would be `[0, 1, 2]`.
 
   If `weights` is not `None`, then each prediction contributes its
   corresponding weight to the total value of the confusion matrix cell.
@@ -141,8 +138,9 @@ def confusion_matrix(labels, predictions, num_classes=None, dtype=dtypes.int32,
     weights: An optional `Tensor` whose shape matches `predictions`.
 
   Returns:
-    A k X k matrix representing the confusion matrix, where k is the number of
-    possible labels in the classification task.
+    A `Tensor` of type `dtype` with shape `[n, n]` representing the confusion
+    matrix, where `n` is the number of possible labels in the classification
+    task.
 
   Raises:
     ValueError: If both predictions and labels are not 1-D vectors and have
@@ -188,7 +186,7 @@ def confusion_matrix(labels, predictions, num_classes=None, dtype=dtypes.int32,
       weights = math_ops.cast(weights, dtype)
 
     shape = array_ops.stack([num_classes, num_classes])
-    indices = array_ops.transpose(array_ops.stack([labels, predictions]))
+    indices = array_ops.stack([labels, predictions], axis=1)
     values = (array_ops.ones_like(predictions, dtype)
               if weights is None else weights)
     cm_sparse = sparse_tensor.SparseTensor(
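Reviewer note on the `indices` change above: stacking the two 1-D inputs along `axis=1` should give the same `[n, 2]` list of `(label, prediction)` pairs as stacking along axis 0 and then transposing, so the behavior is expected to be unchanged. The sketch below illustrates this in plain Python, without TensorFlow. The helper names `stack_axis0`, `transpose`, and `stack_axis1` are hypothetical stand-ins for the `array_ops` calls, and `confusion_matrix` here is a simplified dense version that assumes non-empty integer inputs and no weights.

```python
def stack_axis0(vectors):
    """Stand-in for stack(..., axis=0): k vectors of length n -> shape [k, n]."""
    return [list(v) for v in vectors]


def transpose(matrix):
    """Stand-in for a 2-D transpose: shape [k, n] -> shape [n, k]."""
    return [list(column) for column in zip(*matrix)]


def stack_axis1(vectors):
    """Stand-in for stack(..., axis=1): k vectors of length n -> shape [n, k]."""
    length = len(vectors[0])
    return [[v[i] for v in vectors] for i in range(length)]


def confusion_matrix(labels, predictions, num_classes=None):
    """Simplified dense confusion matrix: rows are labels, columns predictions."""
    if num_classes is None:
        # One plus the maximum value in either input, as the docstring states.
        num_classes = max(max(labels), max(predictions)) + 1
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for label, prediction in stack_axis1([labels, predictions]):
        matrix[label][prediction] += 1
    return matrix


labels = [1, 2, 4]
predictions = [2, 2, 4]

# Old formulation: stack along axis 0, then transpose.
old_indices = transpose(stack_axis0([labels, predictions]))
# New formulation: stack along axis 1 directly.
new_indices = stack_axis1([labels, predictions])
```

Both formulations produce one `[label, prediction]` pair per example, and the new one skips the intermediate `[2, n]` array.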
diff --git a/tensorflow/python/ops/control_flow_grad.py b/tensorflow/python/ops/control_flow_grad.py
index 97b57177b29986a006df992f4c0c2b79e11467aa..21354b5ae8ff1724bbb2539aff370b3df6da2598 100644
--- a/tensorflow/python/ops/control_flow_grad.py
+++ b/tensorflow/python/ops/control_flow_grad.py
@@ -28,7 +28,6 @@ from tensorflow.python.ops import math_ops
 # go/tf-wildcard-import
 # pylint: disable=wildcard-import,undefined-variable
 from tensorflow.python.ops.control_flow_ops import *
-from tensorflow.python.ops.gen_control_flow_ops import *
 # pylint: enable=wildcard-import
 
 
diff --git a/tensorflow/python/ops/control_flow_ops.py b/tensorflow/python/ops/control_flow_ops.py
index 179f38f035962a98682dd8789127d71c3b372f63..689f7cdc8feaf71944a305c45ee80ccad7daa3e0 100644
--- a/tensorflow/python/ops/control_flow_ops.py
+++ b/tensorflow/python/ops/control_flow_ops.py
@@ -44,6 +44,7 @@ See the @{$python/control_flow_ops} guide.
 @@add_check_numerics_ops
 @@Assert
 @@Print
+@@timestamp
 """
 # pylint: disable=g-bad-name
 from __future__ import absolute_import
@@ -328,7 +329,7 @@ def exit(data, name=None):  # pylint: disable=redefined-builtin
   data = ops.internal_convert_to_tensor_or_indexed_slices(data, as_ref=True)
   if isinstance(data, ops.Tensor):
     if data.dtype._is_ref_dtype:  # pylint: disable=protected-access
-      return gen_control_flow_ops._ref_exit(data, name)
+      return gen_control_flow_ops.ref_exit(data, name)
     else:
       return gen_control_flow_ops._exit(data, name)
   else:
@@ -370,17 +371,17 @@ def switch(data, pred, dtype=None, name=None):
         data, dtype=dtype, name="data", as_ref=True)
     pred = ops.convert_to_tensor(pred, name="pred")
     if isinstance(data, ops.Tensor):
-      return gen_control_flow_ops._switch(data, pred, name=name)
+      return gen_control_flow_ops.switch(data, pred, name=name)
     else:
       if not isinstance(data, (ops.IndexedSlices, sparse_tensor.SparseTensor)):
         raise TypeError("Type %s not supported" % type(data))
       val, ind = data.values, data.indices
-      val_f, val_t = gen_control_flow_ops._switch(val, pred, name=name)
-      ind_f, ind_t = gen_control_flow_ops._switch(ind, pred, name="indices")
+      val_f, val_t = gen_control_flow_ops.switch(val, pred, name=name)
+      ind_f, ind_t = gen_control_flow_ops.switch(ind, pred, name="indices")
       if isinstance(data, ops.IndexedSlices):
         dense_shape = data.dense_shape
         if dense_shape is not None:
-          dense_shape_f, dense_shape_t = gen_control_flow_ops._switch(
+          dense_shape_f, dense_shape_t = gen_control_flow_ops.switch(
               dense_shape, pred, name="dense_shape")
         else:
           dense_shape_f, dense_shape_t = None, None
@@ -388,7 +389,7 @@ def switch(data, pred, dtype=None, name=None):
                 ops.IndexedSlices(val_t, ind_t, dense_shape_t))
       else:
         dense_shape = data.dense_shape
-        dense_shape_f, dense_shape_t = gen_control_flow_ops._switch(
+        dense_shape_f, dense_shape_t = gen_control_flow_ops.switch(
             data.dense_shape, pred, name="dense_shape")
         return (sparse_tensor.SparseTensor(ind_f, val_f, dense_shape_f),
                 sparse_tensor.SparseTensor(ind_t, val_t, dense_shape_t))
@@ -472,15 +473,15 @@ def merge(inputs, name=None):
     ]
     if all([isinstance(v, ops.Tensor) for v in inputs]):
       if all([v.dtype._is_ref_dtype for v in inputs]):  # pylint: disable=protected-access
-        return gen_control_flow_ops._ref_merge(inputs, name)
+        return gen_control_flow_ops.ref_merge(inputs, name)
       else:
-        return gen_control_flow_ops._merge(inputs, name)
+        return gen_control_flow_ops.merge(inputs, name)
     elif all([isinstance(v, sparse_tensor.SparseTensor) for v in inputs]):
       # Only handle the case when all inputs are SparseTensor.
       values, _ = merge([inp.values for inp in inputs], name=name)
-      indices, chosen_index = gen_control_flow_ops._merge(
+      indices, chosen_index = gen_control_flow_ops.merge(
           [inp.indices for inp in inputs], name="indices")
-      dense_shape, _ = gen_control_flow_ops._merge(
+      dense_shape, _ = gen_control_flow_ops.merge(
           [inp.dense_shape for inp in inputs], name="dense_shape")
       return (sparse_tensor.SparseTensor(indices, values, dense_shape),
               chosen_index)
@@ -488,13 +489,13 @@ def merge(inputs, name=None):
       # For now convert all the inputs as IndexedSlices.
       inputs = math_ops._as_indexed_slices_list(inputs, optimize=False)
       values, _ = merge([inp.values for inp in inputs], name=name)
-      indices, chosen_index = gen_control_flow_ops._merge(
+      indices, chosen_index = gen_control_flow_ops.merge(
           [inp.indices for inp in inputs], name="indices")
       if any(inp.dense_shape is not None for inp in inputs):
         if any(inp.dense_shape is None for inp in inputs):
           raise ValueError("Either all merged IndexedSlices must have a "
                            "dense_shape, or none must have a dense_shape.")
-        dense_shape, _ = gen_control_flow_ops._merge(
+        dense_shape, _ = gen_control_flow_ops.merge(
             [inp.dense_shape for inp in inputs], name="dense_shape")
       else:
         dense_shape = None
@@ -1014,10 +1015,8 @@ class GradLoopState(object):
         else:
           max_size = GetMaxSizeFromNestedMaximumIterations(
               value, self.forward_context)
-        # pylint: disable=protected-access
-        acc = gen_data_flow_ops._stack_v2(
+        acc = gen_data_flow_ops.stack_v2(
             max_size=max_size, elem_type=value.dtype.base_dtype, name="f_acc")
-        # pylint: enable=protected-access
       if curr_ctxt:
         curr_ctxt.Exit()
 
@@ -1030,10 +1029,8 @@ class GradLoopState(object):
       if value_ctxt == self.forward_context:
         # value is not nested in the forward context.
         self.forward_context.Enter()
-        # pylint: disable=protected-access
-        push = gen_data_flow_ops._stack_push_v2(
+        push = gen_data_flow_ops.stack_push_v2(
             enter_acc, value, swap_memory=swap_enabled)
-        # pylint: enable=protected-access
         self.forward_context.Exit()
         # Protect stack push and order it before forward_index.
         self.forward_index.op._add_control_input(push.op)
@@ -1045,18 +1042,14 @@ class GradLoopState(object):
           # The special case for creating a zero tensor for a dead
           # branch of a switch. See ControlFlowState.ZerosLike().
           value_ctxt.outer_context.Enter()
-          # pylint: disable=protected-access
-          push = gen_data_flow_ops._stack_push_v2(
+          push = gen_data_flow_ops.stack_push_v2(
               enter_acc, value, swap_memory=swap_enabled)
-          # pylint: enable=protected-access
           value_ctxt.outer_context.Exit()
           push.op._set_control_flow_context(value_ctxt)
         else:
           value_ctxt.Enter()
-          # pylint: disable=protected-access
-          push = gen_data_flow_ops._stack_push_v2(
+          push = gen_data_flow_ops.stack_push_v2(
               enter_acc, value, swap_memory=swap_enabled)
-          # pylint: enable=protected-access
           value_ctxt.Exit()
         # Protect stack push and order it before forward_sync.
         self.forward_sync._add_control_input(push.op)
@@ -1103,10 +1096,8 @@ class GradLoopState(object):
           pred = cond_ctxt.pred
         branch = (1 - cond_ctxt.branch) if dead_branch else cond_ctxt.branch
         history_value = _SwitchRefOrTensor(history_value, pred)[branch]
-      # pylint: disable=protected-access
-      pop = gen_data_flow_ops._stack_pop_v2(history_value,
-                                            value.dtype.base_dtype)
-      # pylint: enable=protected-access
+      pop = gen_data_flow_ops.stack_pop_v2(history_value,
+                                           value.dtype.base_dtype)
       pop.set_shape(value.get_shape())
       self.grad_context.Exit()
     parallel_iterations = self.grad_context.parallel_iterations
@@ -1717,8 +1708,15 @@ class CondContext(ControlFlowContext):
     self._pivot = g.as_graph_element(
         ops.prepend_name_scope(context_def.pivot_name, import_scope))
     self._branch = context_def.branch
-    super(CondContext, self).__init__(
-        values_def=context_def.values_def, import_scope=import_scope)
+    super(CondContext, self).__init__(values_def=context_def.values_def,
+                                      import_scope=import_scope)
+    # The predicate and pivot ops appear in self._values, but don't have self
+    # set as their control context. The __init__ call above will set self for
+    # all values, so manually override the predicate and pivot contexts here.
+    # pylint: disable=protected-access
+    self._pred.op._set_control_flow_context(self.outer_context)
+    self._pivot.op._set_control_flow_context(self.outer_context)
+    # pylint: enable=protected-access
 
   @property
   def pred(self):
@@ -1766,13 +1764,9 @@ class CondContext(ControlFlowContext):
       context_def.branch = self._branch
       context_def.values_def.MergeFrom(super(CondContext, self)._to_values_def(
           export_scope))
-      # TODO(b/72868227): enable this once the corresponding control_flow.proto
-      # changes have been checked in (they aren't checked in and this is
-      # disabled for now to ensure forwards compatibility).
-      if False:  # pylint: disable=using-constant-test
-        for nested in self._nested_contexts:
-          nested_def = context_def.nested_contexts.add()
-          nested.to_control_flow_context_def(nested_def)
+      for nested in self._nested_contexts:
+        nested_def = context_def.nested_contexts.add()
+        nested.to_control_flow_context_def(nested_def)
 
       return context_def
     else:
@@ -1784,14 +1778,10 @@ class CondContext(ControlFlowContext):
     ret = CondContext(context_def=context_def,
                       import_scope=import_scope)
 
-    # TODO(b/72868227): remove "if hasattr(...)" once the corresponding
-    # control_flow.proto changes have been checked in (they aren't checked in
-    # and this is here for now to ensure forwards compatibility).
-    if hasattr(context_def, "nested_contexts"):
-      ret.Enter()
-      for nested_def in context_def.nested_contexts:
-        from_control_flow_context_def(nested_def)
-      ret.Exit()
+    ret.Enter()
+    for nested_def in context_def.nested_contexts:
+      from_control_flow_context_def(nested_def, import_scope=import_scope)
+    ret.Exit()
     return ret
 
   def to_control_flow_context_def(self, context_def, export_scope=None):
@@ -1835,8 +1825,6 @@ class CondContext(ControlFlowContext):
       # pylint: disable=protected-access
       op._add_control_input(self._pivot.op)
       # pylint: enable=protected-access
-      for x in op.outputs:
-        self._values.add(x.name)
     else:
       for index in range(len(op.inputs)):
         x = op.inputs[index]
@@ -1847,13 +1835,20 @@ class CondContext(ControlFlowContext):
           # pylint: enable=protected-access
       # Remove any external control dependency on this op.
       self._RemoveExternalControlEdges(op)
-      for x in op.outputs:
-        self._values.add(x.name)
       # pylint: disable=protected-access
       if op.graph._is_function(op.type) or op.type == "SymbolicGradient":
         op._add_control_input(self._pivot.op)
       # pylint: enable=protected-access
 
+    # Mark op's outputs as seen by this context and any outer contexts.
+    output_names = [x.name for x in op.outputs]
+    ctxt = self
+    while ctxt is not None:
+      # pylint: disable=protected-access
+      ctxt._values.update(output_names)
+      ctxt = ctxt._outer_context
+      # pylint: enable=protected-access
+
     if self._outer_context or not util.IsLoopExit(op):
       op.graph.prevent_fetching(op)
 
@@ -2104,10 +2099,7 @@ def cond(pred,
     # Only add non-nested conds to the collection. Any nested control flow will
     # be encapsulated in the root context.
     assert context_t.outer_context == context_f.outer_context
-    # TODO(b/72868227): remove "if True..." once the corresponding
-    # control_flow.proto changes have been checked in (they aren't checked in
-    # and this is disabled for now to ensure forwards compatibility).
-    if True or context_t.outer_context is None:
+    if context_t.outer_context is None:
       ops.add_to_collection(ops.GraphKeys.COND_CONTEXT, context_t)
       ops.add_to_collection(ops.GraphKeys.COND_CONTEXT, context_f)
 
@@ -2330,13 +2322,9 @@ class WhileContext(ControlFlowContext):
       context_def.values_def.MergeFrom(
           super(WhileContext, self)._to_values_def(
               export_scope=export_scope))
-      # TODO(b/72868227): remove "if True..." once the corresponding
-      # control_flow.proto changes have been checked in (they aren't checked in
-      # and this is disabled for now to ensure forwards compatibility).
-      if False:  # pylint: disable=using-constant-test
-        for nested in self._nested_contexts:
-          nested_def = context_def.nested_contexts.add()
-          nested.to_control_flow_context_def(nested_def)
+      for nested in self._nested_contexts:
+        nested_def = context_def.nested_contexts.add()
+        nested.to_control_flow_context_def(nested_def)
 
       return context_def
     else:
@@ -2358,14 +2346,10 @@ class WhileContext(ControlFlowContext):
     """
     ret = WhileContext(context_def=context_def,
                        import_scope=import_scope)
-    # TODO(b/72868227): remove "if hasattr(...)" once the corresponding
-    # control_flow.proto changes have been checked in (they aren't checked in
-    # and this is disabled for now to ensure forwards compatibility).
-    if hasattr(context_def, "nested_contexts"):
-      ret.Enter()
-      for nested_def in context_def.nested_contexts:
-        from_control_flow_context_def(nested_def, import_scope=import_scope)
-      ret.Exit()
+    ret.Enter()
+    for nested_def in context_def.nested_contexts:
+      from_control_flow_context_def(nested_def, import_scope=import_scope)
+    ret.Exit()
     return ret
 
   def GetWhileContext(self):
@@ -3119,24 +3103,24 @@ def while_loop(cond,
       c, b, loop_vars=[i0, m0],
       shape_invariants=[i0.get_shape(), tf.TensorShape([None, 2])])
   ```
-  
-  Example which demonstrates non-strict semantics: In the following 
-  example, the final value of the counter `i` does not depend on `x`. So 
-  the `while_loop` can increment the counter parallel to updates of `x`. 
+
+  Example which demonstrates non-strict semantics: In the following
+  example, the final value of the counter `i` does not depend on `x`. So
+  the `while_loop` can increment the counter parallel to updates of `x`.
   However, because the loop counter at one loop iteration depends
   on the value at the previous iteration, the loop counter itself cannot
-  be incremented in parallel. Hence if we just want the final value of the 
-  counter (which we print on the line `print(sess.run(i))`), then  
-  `x` will never be incremented, but the counter will be updated on a 
+  be incremented in parallel. Hence if we just want the final value of the
+  counter (which we print on the line `print(sess.run(i))`), then
+  `x` will never be incremented, but the counter will be updated on a
   single thread. Conversely, if we want the value of the output (which we
-  print on the line `print(sess.run(out).shape)`), then the counter may be 
-  incremented on its own thread, while `x` can be incremented in 
-  parallel on a separate thread. In the extreme case, it is conceivable 
-  that the thread incrementing the counter runs until completion before 
-  `x` is incremented even a single time. The only thing that can never 
-  happen is that the thread updating `x` can never get ahead of the 
-  counter thread because the thread incrementing `x` depends on the value 
-  of the counter. 
+  print on the line `print(sess.run(out).shape)`), then the counter may be
+  incremented on its own thread, while `x` can be incremented in
+  parallel on a separate thread. In the extreme case, it is conceivable
+  that the thread incrementing the counter runs until completion before
+  `x` is incremented even a single time. The only thing that can never
+  happen is that the thread updating `x` gets ahead of the
+  counter thread because the thread incrementing `x` depends on the value
+  of the counter.
   ```python
   import tensorflow as tf
 
@@ -3148,13 +3132,13 @@ def while_loop(cond,
   with tf.Session() as sess:
       print(sess.run(i))  # prints [0] ... [9999]
 
-      # The following line may increment the counter and x in parallel. 
-      # The counter thread may get ahead of the other thread, but not the 
-      # other way around. So you may see things like 
+      # The following line may increment the counter and x in parallel.
+      # The counter thread may get ahead of the other thread, but not the
+      # other way around. So you may see things like
       # [9996] x:[9987]
-      # meaning that the counter thread is on iteration 9996, 
+      # meaning that the counter thread is on iteration 9996,
       # while the other thread is on iteration 9987
-      print(sess.run(out).shape)  
+      print(sess.run(out).shape)
   ```
 
   """
@@ -3210,10 +3194,7 @@ def while_loop(cond,
         swap_memory=swap_memory)
     # Only add non-nested loops to the collection. Any nested control flow will
     # be encapsulated in the root context.
-    # TODO(b/72868227): enable condition once the corresponding
-    # control_flow.proto changes have been checked in (they aren't checked in
-    # and this is disabled for now to ensure forwards compatibility).
-    if True or loop_context.outer_context is None:
+    if loop_context.outer_context is None:
       ops.add_to_collection(ops.GraphKeys.WHILE_CONTEXT, loop_context)
     result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
     if maximum_iterations is not None:
@@ -3415,7 +3396,12 @@ def tuple(tensors, name=None, control_inputs=None):  # pylint: disable=redefined
   if context.in_eager_mode():
     return tensors
   with ops.name_scope(name, "tuple", tensors) as name:
-    gating_ops = [t.op for t in tensors if t is not None]
+    tensors = [t if (isinstance(t, ops.Operation)
+                     or tensor_util.is_tensor(t)
+                     or t is None)
+               else ops.convert_to_tensor(t) for t in tensors]
+    gating_ops = [t if isinstance(t, ops.Operation) else t.op for t in tensors
+                  if t is not None]
     if control_inputs:
       for c in control_inputs:
         if isinstance(c, ops.Tensor):
@@ -3431,8 +3417,11 @@ def tuple(tensors, name=None, control_inputs=None):  # pylint: disable=redefined
     gate = group(*gating_ops)
     tpl = []
     for t in tensors:
-      if t is not None:
+      if tensor_util.is_tensor(t):
         tpl.append(with_dependencies([gate], t))
+      elif isinstance(t, ops.Operation):
+        with ops.control_dependencies([gate]):
+          tpl.append(group(t))
       else:
         tpl.append(None)
     return tpl
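Reviewer note on the `CondContext.AddOp` hunk in this file: output names are now recorded in the current context and in every enclosing context, not only in the innermost one. A minimal sketch of that walk up the context chain follows. The `Context` class and its `mark_outputs_seen` method are hypothetical simplifications; only the `_values` and `_outer_context` attribute names are taken from the diff.

```python
class Context(object):
    """Hypothetical stand-in for a control flow context."""

    def __init__(self, outer_context=None):
        self._values = set()                 # names of tensors seen by this context
        self._outer_context = outer_context  # enclosing context, or None at the root

    def mark_outputs_seen(self, output_names):
        """Records output_names in this context and in all outer contexts."""
        ctxt = self
        while ctxt is not None:
            ctxt._values.update(output_names)
            ctxt = ctxt._outer_context


root = Context()
middle = Context(outer_context=root)
inner = Context(outer_context=middle)
sibling = Context(outer_context=root)

# The tensor name below is made up for illustration.
inner.mark_outputs_seen(["cond/cond/add:0"])
```

After the call, `inner`, `middle`, and `root` all contain the name, while `sibling` does not, because the walk only follows the chain of enclosing contexts.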
diff --git a/tensorflow/python/ops/ctc_ops.py b/tensorflow/python/ops/ctc_ops.py
index 83da6739db673644f59fda3044769b18b2138fbc..4b57e2de790af13499bc73cfcfa98e999eab1603 100644
--- a/tensorflow/python/ops/ctc_ops.py
+++ b/tensorflow/python/ops/ctc_ops.py
@@ -148,7 +148,7 @@ def ctc_loss(labels, inputs, sequence_length,
   if not time_major:
     inputs = array_ops.transpose(inputs, [1, 0, 2])  # (B,T,N) => (T,B,N)
 
-  loss, _ = gen_ctc_ops._ctc_loss(
+  loss, _ = gen_ctc_ops.ctc_loss(
       inputs,
       labels.indices,
       labels.values,
@@ -224,7 +224,7 @@ def ctc_greedy_decoder(inputs, sequence_length, merge_repeated=True):
         sequence found, the negative of the sum of the greatest logit at each
         timeframe.
   """
-  outputs = gen_ctc_ops._ctc_greedy_decoder(
+  outputs = gen_ctc_ops.ctc_greedy_decoder(
       inputs, sequence_length, merge_repeated=merge_repeated)
   (decoded_ix, decoded_val, decoded_shape, log_probabilities) = outputs
   return ([sparse_tensor.SparseTensor(decoded_ix, decoded_val, decoded_shape)],
@@ -272,7 +272,7 @@ def ctc_beam_search_decoder(inputs, sequence_length, beam_width=100,
   """
 
   decoded_ixs, decoded_vals, decoded_shapes, log_probabilities = (
-      gen_ctc_ops._ctc_beam_search_decoder(
+      gen_ctc_ops.ctc_beam_search_decoder(
           inputs, sequence_length, beam_width=beam_width, top_paths=top_paths,
           merge_repeated=merge_repeated))
 
diff --git a/tensorflow/python/ops/data_flow_ops.py b/tensorflow/python/ops/data_flow_ops.py
index 03ed537cfcf27151a0200d7a17f63b1a2bc7ba1a..052caffd4936a37cbd9954ba7b358da2979e8eb5 100644
--- a/tensorflow/python/ops/data_flow_ops.py
+++ b/tensorflow/python/ops/data_flow_ops.py
@@ -342,10 +342,10 @@ class QueueBase(object):
         val.get_shape().assert_is_compatible_with(shape)
 
       if self._queue_ref.dtype == _dtypes.resource:
-        return gen_data_flow_ops._queue_enqueue_v2(
+        return gen_data_flow_ops.queue_enqueue_v2(
             self._queue_ref, vals, name=scope)
       else:
-        return gen_data_flow_ops._queue_enqueue(
+        return gen_data_flow_ops.queue_enqueue(
             self._queue_ref, vals, name=scope)
 
   def enqueue_many(self, vals, name=None):
@@ -387,7 +387,7 @@ class QueueBase(object):
             val.get_shape().with_rank_at_least(1)[0])
         val.get_shape()[1:].assert_is_compatible_with(shape)
 
-      return gen_data_flow_ops._queue_enqueue_many_v2(
+      return gen_data_flow_ops.queue_enqueue_many_v2(
           self._queue_ref, vals, name=scope)
 
   def _dequeue_return_value(self, tensors):
@@ -436,10 +436,10 @@ class QueueBase(object):
     if name is None:
       name = "%s_Dequeue" % self._name
     if self._queue_ref.dtype == _dtypes.resource:
-      ret = gen_data_flow_ops._queue_dequeue_v2(
+      ret = gen_data_flow_ops.queue_dequeue_v2(
           self._queue_ref, self._dtypes, name=name)
     else:
-      ret = gen_data_flow_ops._queue_dequeue(
+      ret = gen_data_flow_ops.queue_dequeue(
           self._queue_ref, self._dtypes, name=name)
 
     # NOTE(mrry): Not using a shape function because we need access to
@@ -479,7 +479,7 @@ class QueueBase(object):
     if name is None:
       name = "%s_DequeueMany" % self._name
 
-    ret = gen_data_flow_ops._queue_dequeue_many_v2(
+    ret = gen_data_flow_ops.queue_dequeue_many_v2(
         self._queue_ref, n=n, component_types=self._dtypes, name=name)
 
     # NOTE(mrry): Not using a shape function because we need access to
@@ -523,7 +523,7 @@ class QueueBase(object):
     if name is None:
       name = "%s_DequeueUpTo" % self._name
 
-    ret = gen_data_flow_ops._queue_dequeue_up_to_v2(
+    ret = gen_data_flow_ops.queue_dequeue_up_to_v2(
         self._queue_ref, n=n, component_types=self._dtypes, name=name)
 
     # NOTE(mrry): Not using a shape function because we need access to
@@ -560,12 +560,12 @@ class QueueBase(object):
     if name is None:
       name = "%s_Close" % self._name
     if self._queue_ref.dtype == _dtypes.resource:
-      return gen_data_flow_ops._queue_close_v2(
+      return gen_data_flow_ops.queue_close_v2(
           self._queue_ref,
           cancel_pending_enqueues=cancel_pending_enqueues,
           name=name)
     else:
-      return gen_data_flow_ops._queue_close(
+      return gen_data_flow_ops.queue_close(
           self._queue_ref,
           cancel_pending_enqueues=cancel_pending_enqueues,
           name=name)
@@ -601,9 +601,9 @@ class QueueBase(object):
     if name is None:
       name = "%s_Size" % self._name
     if self._queue_ref.dtype == _dtypes.resource:
-      return gen_data_flow_ops._queue_size_v2(self._queue_ref, name=name)
+      return gen_data_flow_ops.queue_size_v2(self._queue_ref, name=name)
     else:
-      return gen_data_flow_ops._queue_size(self._queue_ref, name=name)
+      return gen_data_flow_ops.queue_size(self._queue_ref, name=name)
 
 
 @tf_export("RandomShuffleQueue")
@@ -683,7 +683,7 @@ class RandomShuffleQueue(QueueBase):
       # the id of the last op created.)
       string = (str(seed1) + shared_name).encode("utf-8")
       seed2 = int(hashlib.md5(string).hexdigest()[:8], 16) & 0x7FFFFFFF
-    queue_ref = gen_data_flow_ops._random_shuffle_queue_v2(
+    queue_ref = gen_data_flow_ops.random_shuffle_queue_v2(
         component_types=dtypes,
         shapes=shapes,
         capacity=capacity,
@@ -748,7 +748,7 @@ class FIFOQueue(QueueBase):
     dtypes = _as_type_list(dtypes)
     shapes = _as_shape_list(shapes, dtypes)
     names = _as_name_list(names, dtypes)
-    queue_ref = gen_data_flow_ops._fifo_queue_v2(
+    queue_ref = gen_data_flow_ops.fifo_queue_v2(
         component_types=dtypes,
         shapes=shapes,
         capacity=capacity,
@@ -827,7 +827,7 @@ class PaddingFIFOQueue(QueueBase):
                        "but received %d dtypes and %d shapes." % (len(dtypes),
                                                                   len(shapes)))
 
-    queue_ref = gen_data_flow_ops._padding_fifo_queue_v2(
+    queue_ref = gen_data_flow_ops.padding_fifo_queue_v2(
         component_types=dtypes,
         shapes=shapes,
         capacity=capacity,
@@ -895,7 +895,7 @@ class PriorityQueue(QueueBase):
     types = _as_type_list(types)
     shapes = _as_shape_list(shapes, types)
 
-    queue_ref = gen_data_flow_ops._priority_queue_v2(
+    queue_ref = gen_data_flow_ops.priority_queue_v2(
         component_types=types,
         shapes=shapes,
         capacity=capacity,
@@ -985,7 +985,7 @@ class Barrier(object):
     else:
       self._shapes = [tensor_shape.unknown_shape() for _ in self._types]
 
-    self._barrier_ref = gen_data_flow_ops._barrier(
+    self._barrier_ref = gen_data_flow_ops.barrier(
         component_types=self._types,
         shapes=self._shapes,
         shared_name=shared_name,
@@ -1026,7 +1026,7 @@ class Barrier(object):
     """
     if name is None:
       name = "%s_BarrierInsertMany" % self._name
-    return gen_data_flow_ops._barrier_insert_many(
+    return gen_data_flow_ops.barrier_insert_many(
         self._barrier_ref, keys, values, component_index, name=name)
 
   def take_many(self,
@@ -1073,7 +1073,7 @@ class Barrier(object):
     """
     if name is None:
       name = "%s_BarrierTakeMany" % self._name
-    ret = gen_data_flow_ops._barrier_take_many(
+    ret = gen_data_flow_ops.barrier_take_many(
         self._barrier_ref,
         num_elements,
         self._types,
@@ -1122,7 +1122,7 @@ class Barrier(object):
     """
     if name is None:
       name = "%s_BarrierClose" % self._name
-    return gen_data_flow_ops._barrier_close(
+    return gen_data_flow_ops.barrier_close(
         self._barrier_ref,
         cancel_pending_enqueues=cancel_pending_enqueues,
         name=name)
@@ -1139,7 +1139,7 @@ class Barrier(object):
     """
     if name is None:
       name = "%s_BarrierReadySize" % self._name
-    return gen_data_flow_ops._barrier_ready_size(self._barrier_ref, name=name)
+    return gen_data_flow_ops.barrier_ready_size(self._barrier_ref, name=name)
 
   def incomplete_size(self, name=None):
     """Compute the number of incomplete elements in the given barrier.
@@ -1153,7 +1153,7 @@ class Barrier(object):
     """
     if name is None:
       name = "%s_BarrierIncompleteSize" % self._name
-    return gen_data_flow_ops._barrier_incomplete_size(
+    return gen_data_flow_ops.barrier_incomplete_size(
         self._barrier_ref, name=name)
 
 
diff --git a/tensorflow/python/ops/distributions/beta.py b/tensorflow/python/ops/distributions/beta.py
index be4ef550dddc4f393f3d81730be59fc0def47500..469bcadb8ea3a0ec2a85d3a72c0ca5ba08796856 100644
--- a/tensorflow/python/ops/distributions/beta.py
+++ b/tensorflow/python/ops/distributions/beta.py
@@ -304,11 +304,10 @@ class Beta(distribution.Distribution):
     if not self.validate_args:
       return x
     return control_flow_ops.with_dependencies([
-        check_ops.assert_positive(
-            x,
-            message="sample must be positive"),
+        check_ops.assert_positive(x, message="sample must be positive"),
         check_ops.assert_less(
-            x, array_ops.ones([], self.dtype),
+            x,
+            array_ops.ones([], self.dtype),
             message="sample must be less than `1`."),
     ], x)
 
diff --git a/tensorflow/python/ops/distributions/multinomial.py b/tensorflow/python/ops/distributions/multinomial.py
index 26b5c5aef98fc11b07a8c8357e7ec37819587da9..4ae67a009b0a4052f6e23e2e42262bb7c42f1c14 100644
--- a/tensorflow/python/ops/distributions/multinomial.py
+++ b/tensorflow/python/ops/distributions/multinomial.py
@@ -238,7 +238,7 @@ class Multinomial(distribution.Distribution):
     n_draws = math_ops.cast(self.total_count, dtype=dtypes.int32)
     k = self.event_shape_tensor()[0]
 
-    # boardcast the total_count and logits to same shape
+    # broadcast the total_count and logits to same shape
     n_draws = array_ops.ones_like(
         self.logits[..., 0], dtype=n_draws.dtype) * n_draws
     logits = array_ops.ones_like(
diff --git a/tensorflow/python/ops/distributions/special_math.py b/tensorflow/python/ops/distributions/special_math.py
index bed4cbb2c1a43b6952861f4fab82957229e23c9c..1d605c5dfcca9b709a9178ccbe56619f6a92f869 100644
--- a/tensorflow/python/ops/distributions/special_math.py
+++ b/tensorflow/python/ops/distributions/special_math.py
@@ -213,7 +213,7 @@ def _ndtri(p):
 
   # Compute x for p <= exp(-2): x = z - log(z)/z - (1/z) P(1/z) / Q(1/z),
   # where z = sqrt(-2. * log(p)), and P/Q are chosen between two different
-  # arrays based on wether p < exp(-32).
+  # arrays based on whether p < exp(-32).
   z = math_ops.sqrt(-2. * math_ops.log(sanitized_mcp))
   first_term = z - math_ops.log(z) / z
   second_term_small_p = (_create_polynomial(1. / z, p2)
diff --git a/tensorflow/python/ops/distributions/uniform.py b/tensorflow/python/ops/distributions/uniform.py
index 3580af18f241d777c81340f1c565074914838029..e0c554442f9590403e20eee5e6a26996100ab92d 100644
--- a/tensorflow/python/ops/distributions/uniform.py
+++ b/tensorflow/python/ops/distributions/uniform.py
@@ -45,11 +45,12 @@ class Uniform(distribution.Distribution):
   Z = b - a
   ```
 
-  where:
-  * `low = a`,
-  * `high = b`,
-  * `Z` is the normalizing constant, and,
-  * `I[predicate]` is the [indicator function](
+  where
+
+  - `low = a`,
+  - `high = b`,
+  - `Z` is the normalizing constant, and
+  - `I[predicate]` is the [indicator function](
     https://en.wikipedia.org/wiki/Indicator_function) for `predicate`.
 
   The parameters `low` and `high` must be shaped in a way that supports
diff --git a/tensorflow/python/ops/distributions/util.py b/tensorflow/python/ops/distributions/util.py
index 0a3000ef5ca0decf8aba641e704406b0cf8780af..0fe6aa30f945dc7682a53fa6495823288cf111b7 100644
--- a/tensorflow/python/ops/distributions/util.py
+++ b/tensorflow/python/ops/distributions/util.py
@@ -1060,9 +1060,7 @@ def reduce_weighted_logsumexp(
     wx_over_max_absw_x = (
         math_ops.sign(w) * math_ops.exp(log_absw_x - max_log_absw_x))
     sum_wx_over_max_absw_x = math_ops.reduce_sum(
-        wx_over_max_absw_x,
-        axis=axis,
-        keepdims=keep_dims)
+        wx_over_max_absw_x, axis=axis, keepdims=keep_dims)
     if not keep_dims:
       max_log_absw_x = array_ops.squeeze(max_log_absw_x, axis)
     sgn = math_ops.sign(sum_wx_over_max_absw_x)
@@ -1180,8 +1178,7 @@ def process_quadrature_grid_and_probs(
     grid = ops.convert_to_tensor(grid, name="grid", dtype=dtype)
     probs = ops.convert_to_tensor(probs, name="unnormalized_probs",
                                   dtype=dtype)
-    probs /= linalg_ops.norm(probs, ord=1, axis=-1, keepdims=True,
-                             name="probs")
+    probs /= linalg_ops.norm(probs, ord=1, axis=-1, keepdims=True, name="probs")
 
     def _static_event_size(x):
       """Returns the static size of a specific dimension or `None`."""
diff --git a/tensorflow/python/ops/functional_ops.py b/tensorflow/python/ops/functional_ops.py
index ac03d30fcd2e65f032937d9259bc8fff18626619..09a0e345f2529cb0ecd365313135d2e58149487e 100644
--- a/tensorflow/python/ops/functional_ops.py
+++ b/tensorflow/python/ops/functional_ops.py
@@ -41,7 +41,7 @@ from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.ops.gen_functional_ops import *
 # pylint: enable=wildcard-import
 # pylint: disable=unused-import
-from tensorflow.python.ops.gen_functional_ops import _symbolic_gradient
+from tensorflow.python.ops.gen_functional_ops import symbolic_gradient
 # pylint: enable=unused-import
 from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
diff --git a/tensorflow/python/ops/gradients_impl.py b/tensorflow/python/ops/gradients_impl.py
index 1418c0b10fb60601e7c3024891b89aadb53e6873..be610143951e1f7f1ade08ee708861694a436d25 100644
--- a/tensorflow/python/ops/gradients_impl.py
+++ b/tensorflow/python/ops/gradients_impl.py
@@ -86,17 +86,19 @@ def _IndexedSlicesToTensor(value, dtype=None, name=None, as_ref=False):
         % str(value))
   # TODO(mrry): Consider adding static shape information to
   # IndexedSlices, to avoid using numpy here.
-  dense_shape_value = tensor_util.constant_value(value.dense_shape)
-  if dense_shape_value is not None:
-    num_elements = np.prod(dense_shape_value)
-    if num_elements >= _LARGE_SPARSE_NUM_ELEMENTS:
+  if context.in_graph_mode():
+    dense_shape_value = tensor_util.constant_value(value.dense_shape)
+    if dense_shape_value is not None:
+      num_elements = np.prod(dense_shape_value)
+      if num_elements >= _LARGE_SPARSE_NUM_ELEMENTS:
+        warnings.warn(
+            "Converting sparse IndexedSlices to a dense Tensor with %d "
+            "elements. This may consume a large amount of memory." %
+            num_elements)
+    else:
       warnings.warn(
-          "Converting sparse IndexedSlices to a dense Tensor with %d elements. "
-          "This may consume a large amount of memory." % num_elements)
-  else:
-    warnings.warn(
-        "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
-        "This may consume a large amount of memory.")
+          "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
+          "This may consume a large amount of memory.")
   return math_ops.unsorted_segment_sum(
       value.values, value.indices, value.dense_shape[0], name=name)
 
@@ -354,7 +356,7 @@ def _SymGrad(op, out_grads):
   for k in op.node_def.attr:
     f.attr[k].CopyFrom(op.node_def.attr[k])
   # pylint: disable=protected-access
-  in_grads = functional_ops._symbolic_gradient(input=f_in, Tout=f_types, f=f)
+  in_grads = functional_ops.symbolic_gradient(input=f_in, Tout=f_types, f=f)
   # pylint: enable=protected-access
   return in_grads
 
diff --git a/tensorflow/python/ops/hidden_ops.txt b/tensorflow/python/ops/hidden_ops.txt
index f6ef6f3f3da4389a16a84fa0b3570d3cd1262472..9b8172bf2639cca0efb663ff4075b36d6f4f2245 100644
--- a/tensorflow/python/ops/hidden_ops.txt
+++ b/tensorflow/python/ops/hidden_ops.txt
@@ -32,6 +32,8 @@ TileGrad  # Exported through array_grad instead of array_ops.
 ZerosLike  # TODO(josh11b): Use this instead of the Python version.
 Unique
 UniqueV2
+UniqueWithCounts
+UniqueWithCountsV2
 Unpack
 
 # candidate_sampling_ops
diff --git a/tensorflow/python/ops/histogram_ops.py b/tensorflow/python/ops/histogram_ops.py
index 6a975160b0698270dfc9ce9140e8b3ff633cdb9e..4a1ef54fb50013881aa832f83674ac66ecccd9bc 100644
--- a/tensorflow/python/ops/histogram_ops.py
+++ b/tensorflow/python/ops/histogram_ops.py
@@ -141,5 +141,7 @@ def histogram_fixed_width(values,
   """
   with ops.name_scope(name, 'histogram_fixed_width',
                       [values, value_range, nbins]) as name:
-    return gen_math_ops._histogram_fixed_width(  # pylint: disable=protected-access
+    # pylint: disable=protected-access
+    return gen_math_ops._histogram_fixed_width(
         values, value_range, nbins, dtype=dtype, name=name)
+    # pylint: enable=protected-access
diff --git a/tensorflow/python/ops/image_grad.py b/tensorflow/python/ops/image_grad.py
index 093843cd5bc0b7c2281a0c9ddf52d93ea3faede3..9f43e3f1466d900ae6d39f3b9ef48043421cb777 100644
--- a/tensorflow/python/ops/image_grad.py
+++ b/tensorflow/python/ops/image_grad.py
@@ -41,12 +41,10 @@ def _ResizeNearestNeighborGrad(op, grad):
   else:
     image_shape = array_ops.shape(image)[1:3]
 
-  # pylint: disable=protected-access
-  grads = gen_image_ops._resize_nearest_neighbor_grad(
+  grads = gen_image_ops.resize_nearest_neighbor_grad(
       grad,
       image_shape,
       align_corners=op.get_attr("align_corners"))
-  # pylint: enable=protected-access
   return [grads, None]
 
 
@@ -61,10 +59,8 @@ def _ResizeBilinearGrad(op, grad):
   Returns:
     The gradients w.r.t. the input.
   """
-  # pylint: disable=protected-access
-  grad0 = gen_image_ops._resize_bilinear_grad(
+  grad0 = gen_image_ops.resize_bilinear_grad(
       grad, op.inputs[0], align_corners=op.get_attr("align_corners"))
-  # pylint: enable=protected-access
   return [grad0, None]
 
 
@@ -82,10 +78,8 @@ def _ResizeBicubicGrad(op, grad):
   allowed_types = [dtypes.float32, dtypes.float64]
   grad0 = None
   if op.inputs[0].dtype in allowed_types:
-    # pylint: disable=protected-access
-    grad0 = gen_image_ops._resize_bicubic_grad(
+    grad0 = gen_image_ops.resize_bicubic_grad(
         grad, op.inputs[0], align_corners=op.get_attr("align_corners"))
-    # pylint: enable=protected-access
   return [grad0, None]
 
 
diff --git a/tensorflow/python/ops/image_ops_impl.py b/tensorflow/python/ops/image_ops_impl.py
index 0c0e92d5b00b36f2fbd800afc046faa1fc77b95c..8c472642dcc4940c1fbe579a8e34c7e06ce96402 100644
--- a/tensorflow/python/ops/image_ops_impl.py
+++ b/tensorflow/python/ops/image_ops_impl.py
@@ -166,6 +166,7 @@ def _Assert3DImage(image):
   return control_flow_ops.with_dependencies(
       _Check3DImage(image, require_static=False), image)
 
+
 def _AssertAtLeast3DImage(image):
   """Assert that we are working with a properly shaped image.
 
@@ -183,10 +184,11 @@ def _AssertAtLeast3DImage(image):
       If the shape of `image` could be verified statically, `image` is
       returned unchanged, otherwise there will be a control dependency
       added that asserts the correct dynamic shape.
-    """
+  """
   return control_flow_ops.with_dependencies(
       _CheckAtLeast3DImage(image, require_static=False), image)
 
+
 def _CheckAtLeast3DImage(image, require_static=True):
   """Assert that we are working with properly shaped image.
 
@@ -326,7 +328,7 @@ def flip_left_right(image):
   Raises:
     ValueError: if the shape of `image` not supported.
   """
-  with ops.name_scope(None, 'flip_left_right', [image]) as scope:
+  with ops.name_scope(None, 'flip_left_right', [image]):
     image = ops.convert_to_tensor(image, name='image')
     image = _AssertAtLeast3DImage(image)
     shape = image.get_shape()
@@ -356,7 +358,7 @@ def flip_up_down(image):
   Raises:
     ValueError: if the shape of `image` not supported.
   """
-  with ops.name_scope(None, 'flip_up_down', [image]) as scope:
+  with ops.name_scope(None, 'flip_up_down', [image]):
     image = ops.convert_to_tensor(image, name='image')
     image = _AssertAtLeast3DImage(image)
     shape = image.get_shape()
@@ -412,23 +414,25 @@ def _rot90_3D(image, k, name_scope):
     A 3-D tensor of the same type and shape as `image`.
 
   """
+
   def _rot90():
-    return array_ops.transpose(array_ops.reverse_v2(image, [1]),
-                               [1, 0, 2])
+    return array_ops.transpose(array_ops.reverse_v2(image, [1]), [1, 0, 2])
+
   def _rot180():
     return array_ops.reverse_v2(image, [0, 1])
+
   def _rot270():
-    return array_ops.reverse_v2(array_ops.transpose(image, [1, 0, 2]),
-                                [1])
-  cases = [(math_ops.equal(k, 1), _rot90),
-           (math_ops.equal(k, 2), _rot180),
+    return array_ops.reverse_v2(array_ops.transpose(image, [1, 0, 2]), [1])
+
+  cases = [(math_ops.equal(k, 1), _rot90), (math_ops.equal(k, 2), _rot180),
            (math_ops.equal(k, 3), _rot270)]
 
-  result = control_flow_ops.case(cases, default=lambda: image, exclusive=True,
-                                 name=name_scope)
+  result = control_flow_ops.case(
+      cases, default=lambda: image, exclusive=True, name=name_scope)
   result.set_shape([None, None, image.get_shape()[2]])
   return result
 
+
 def _rot90_4D(images, k, name_scope):
   """Rotate batch of images counter-clockwise by 90 degrees `k` times.
 
@@ -442,21 +446,20 @@ def _rot90_4D(images, k, name_scope):
     A 4-D tensor of the same type and shape as `images`.
 
   """
+
   def _rot90():
-    return array_ops.transpose(array_ops.reverse_v2(images, [2]),
-                               [0, 2, 1, 3])
+    return array_ops.transpose(array_ops.reverse_v2(images, [2]), [0, 2, 1, 3])
+
   def _rot180():
     return array_ops.reverse_v2(images, [1, 2])
   def _rot270():
-    return array_ops.reverse_v2(array_ops.transpose(images, [0, 2, 1, 3]),
-                                [2])
+    return array_ops.reverse_v2(array_ops.transpose(images, [0, 2, 1, 3]), [2])
 
-  cases = [(math_ops.equal(k, 1), _rot90),
-           (math_ops.equal(k, 2), _rot180),
+  cases = [(math_ops.equal(k, 1), _rot90), (math_ops.equal(k, 2), _rot180),
            (math_ops.equal(k, 3), _rot270)]
 
-  result = control_flow_ops.case(cases, default=lambda: images, exclusive=True,
-                                 name=name_scope)
+  result = control_flow_ops.case(
+      cases, default=lambda: images, exclusive=True, name=name_scope)
   shape = result.get_shape()
   result.set_shape([shape[0], None, None, shape[3]])
   return result
@@ -480,7 +483,7 @@ def transpose_image(image):
   Raises:
     ValueError: if the shape of `image` not supported.
   """
-  with ops.name_scope(None, 'transpose_image', [image]) as scope:
+  with ops.name_scope(None, 'transpose_image', [image]):
     image = ops.convert_to_tensor(image, name='image')
     image = _AssertAtLeast3DImage(image)
     shape = image.get_shape()
@@ -1110,10 +1113,8 @@ def adjust_contrast(images, contrast_factor):
     orig_dtype = images.dtype
     flt_images = convert_image_dtype(images, dtypes.float32)
 
-    # pylint: disable=protected-access
-    adjusted = gen_image_ops._adjust_contrastv2(
+    adjusted = gen_image_ops.adjust_contrastv2(
         flt_images, contrast_factor=contrast_factor, name=name)
-    # pylint: enable=protected-access
 
     return convert_image_dtype(adjusted, orig_dtype, saturate=True)
 
@@ -1727,7 +1728,7 @@ def sample_distorted_bounding_box(image_size,
       Provide as input to `tf.image.draw_bounding_boxes`.
   """
   with ops.name_scope(name, 'sample_distorted_bounding_box'):
-    return gen_image_ops._sample_distorted_bounding_box_v2(  # pylint: disable=protected-access
+    return gen_image_ops.sample_distorted_bounding_box_v2(
         image_size,
         bounding_boxes,
         seed=seed,
@@ -1781,10 +1782,8 @@ def non_max_suppression(boxes,
   """
   with ops.name_scope(name, 'non_max_suppression'):
     iou_threshold = ops.convert_to_tensor(iou_threshold, name='iou_threshold')
-    # pylint: disable=protected-access
-    return gen_image_ops._non_max_suppression_v2(boxes, scores, max_output_size,
-                                                 iou_threshold)
-    # pylint: enable=protected-access
+    return gen_image_ops.non_max_suppression_v2(boxes, scores, max_output_size,
+                                                iou_threshold)
 
 
 _rgb_to_yiq_kernel = [[0.299, 0.59590059,
diff --git a/tensorflow/python/ops/image_ops_test.py b/tensorflow/python/ops/image_ops_test.py
index d944b803f27efa8c6733db62dd4c7d3c3b7af91e..b8c4b27c162acdd86d88da641ff8afffaa5a9e6a 100644
--- a/tensorflow/python/ops/image_ops_test.py
+++ b/tensorflow/python/ops/image_ops_test.py
@@ -943,8 +943,9 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
       self.assertAllEqual(y_tf, x_np)
 
   def testInvolutionLeftRightWithBatch(self):
-    x_np = np.array([[[1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3]]],
-                    dtype=np.uint8).reshape([2, 2, 3, 1])
+    x_np = np.array(
+        [[[1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3]]],
+        dtype=np.uint8).reshape([2, 2, 3, 1])
     with self.test_session(use_gpu=True):
       x_tf = constant_op.constant(x_np, shape=x_np.shape)
       y = image_ops.flip_left_right(image_ops.flip_left_right(x_tf))
@@ -963,10 +964,12 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
       self.assertAllEqual(y_tf, y_np)
 
   def testLeftRightWithBatch(self):
-    x_np = np.array([[[1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3]]],
-                    dtype=np.uint8).reshape([2, 2, 3, 1])
-    y_np = np.array([[[3, 2, 1], [3, 2, 1]], [[3, 2, 1], [3, 2, 1]]],
-                    dtype=np.uint8).reshape([2, 2, 3, 1])
+    x_np = np.array(
+        [[[1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3]]],
+        dtype=np.uint8).reshape([2, 2, 3, 1])
+    y_np = np.array(
+        [[[3, 2, 1], [3, 2, 1]], [[3, 2, 1], [3, 2, 1]]],
+        dtype=np.uint8).reshape([2, 2, 3, 1])
 
     with self.test_session(use_gpu=True):
       x_tf = constant_op.constant(x_np, shape=x_np.shape)
@@ -974,7 +977,6 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
       y_tf = y.eval()
       self.assertAllEqual(y_tf, y_np)
 
-
   def testRandomFlipLeftRight(self):
     x_np = np.array([[1, 2, 3], [1, 2, 3]], dtype=np.uint8).reshape([2, 3, 1])
     y_np = np.array([[3, 2, 1], [3, 2, 1]], dtype=np.uint8).reshape([2, 3, 1])
@@ -1013,8 +1015,9 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
       self.assertAllEqual(y_tf, x_np)
 
   def testInvolutionUpDownWithBatch(self):
-    x_np = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
-                    dtype=np.uint8).reshape([2, 2, 3, 1])
+    x_np = np.array(
+        [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
+        dtype=np.uint8).reshape([2, 2, 3, 1])
 
     with self.test_session(use_gpu=True):
       x_tf = constant_op.constant(x_np, shape=x_np.shape)
@@ -1034,10 +1037,12 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
       self.assertAllEqual(y_tf, y_np)
 
   def testUpDownWithBatch(self):
-    x_np = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
-                    dtype=np.uint8).reshape([2, 2, 3, 1])
-    y_np = np.array([[[4, 5, 6], [1, 2, 3]], [[10, 11, 12], [7, 8, 9]]],
-                    dtype=np.uint8).reshape([2, 2, 3, 1])
+    x_np = np.array(
+        [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
+        dtype=np.uint8).reshape([2, 2, 3, 1])
+    y_np = np.array(
+        [[[4, 5, 6], [1, 2, 3]], [[10, 11, 12], [7, 8, 9]]],
+        dtype=np.uint8).reshape([2, 2, 3, 1])
 
     with self.test_session(use_gpu=True):
       x_tf = constant_op.constant(x_np, shape=x_np.shape)
@@ -1081,8 +1086,9 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
       self.assertAllEqual(y_tf, x_np)
 
   def testInvolutionTransposeWithBatch(self):
-    x_np = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
-                    dtype=np.uint8).reshape([2, 2, 3, 1])
+    x_np = np.array(
+        [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
+        dtype=np.uint8).reshape([2, 2, 3, 1])
 
     with self.test_session(use_gpu=True):
       x_tf = constant_op.constant(x_np, shape=x_np.shape)
@@ -1102,11 +1108,13 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
       self.assertAllEqual(y_tf, y_np)
 
   def testTransposeWithBatch(self):
-    x_np = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
-                    dtype=np.uint8).reshape([2, 2, 3, 1])
+    x_np = np.array(
+        [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
+        dtype=np.uint8).reshape([2, 2, 3, 1])
 
-    y_np = np.array([[[1, 4], [2, 5], [3, 6]], [[7, 10], [8, 11], [9, 12]]],
-                    dtype=np.uint8).reshape([2, 3, 2, 1])
+    y_np = np.array(
+        [[[1, 4], [2, 5], [3, 6]], [[7, 10], [8, 11], [9, 12]]],
+        dtype=np.uint8).reshape([2, 3, 2, 1])
 
     with self.test_session(use_gpu=True):
       x_tf = constant_op.constant(x_np, shape=x_np.shape)
@@ -1121,8 +1129,8 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
     p_unknown_dims_4 = array_ops.placeholder(
         dtypes.uint8, shape=[None, None, None, None])
     p_unknown_width = array_ops.placeholder(dtypes.uint8, shape=[64, None, 3])
-    p_unknown_batch = array_ops.placeholder(dtypes.uint8,
-                                            shape=[None, 64, 64, 3])
+    p_unknown_batch = array_ops.placeholder(
+        dtypes.uint8, shape=[None, 64, 64, 3])
     p_wrong_rank = array_ops.placeholder(dtypes.uint8, shape=[None, None])
     p_zero_dim = array_ops.placeholder(dtypes.uint8, shape=[64, 0, 3])
 
@@ -1156,7 +1164,8 @@ class FlipTransposeRotateTest(test_util.TensorFlowTestCase):
         op(p_wrong_rank)
 
     for op in [
-        image_ops.random_flip_left_right, image_ops.random_flip_up_down,
+        image_ops.random_flip_left_right,
+        image_ops.random_flip_up_down,
     ]:
       with self.assertRaisesRegexp(ValueError, "must be three-dimensional"):
         op(p_wrong_rank)
@@ -2025,7 +2034,8 @@ class SelectDistortedCropBoxTest(test_util.TensorFlowTestCase):
       bounding_box = constant_op.constant(
           [[[0.0, 0.0, 1.0, 1.0]]],
           shape=[1, 1, 4],
-          dtype=dtypes.float32,)
+          dtype=dtypes.float32,
+      )
       begin, end, bbox_for_drawing = image_ops.sample_distorted_bounding_box(
           image_size=image_size,
           bounding_boxes=bounding_box,
@@ -2040,6 +2050,7 @@ class SelectDistortedCropBoxTest(test_util.TensorFlowTestCase):
       end = end.eval()
       bbox_for_drawing = bbox_for_drawing.eval()
 
+
 class ResizeImagesTest(test_util.TensorFlowTestCase):
 
   OPTIONS = [
@@ -3289,12 +3300,11 @@ class NonMaxSuppressionTest(test_util.TensorFlowTestCase):
 
     # The boxes is of shape [num_boxes, 4], and the scores is
     # of shape [num_boxes]. So an error will thrown.
-    with self.assertRaisesRegexp(
-        ValueError, 'Dimensions must be equal, but are 1 and 2'):
+    with self.assertRaisesRegexp(ValueError,
+                                 "Dimensions must be equal, but are 1 and 2"):
       boxes = constant_op.constant([[0.0, 0.0, 1.0, 1.0]])
       scores = constant_op.constant([0.9, 0.75])
-      selected_indices = image_ops.non_max_suppression(
-          boxes, scores, 3, 0.5)
+      selected_indices = image_ops.non_max_suppression(boxes, scores, 3, 0.5)
 
     # The scores should be 1D of shape [num_boxes].
     with self.assertRaisesRegexp(ValueError,
diff --git a/tensorflow/python/ops/io_ops.py b/tensorflow/python/ops/io_ops.py
index 5e70b3186f382a0c795b1795b2db27bb2058ee41..7c782c12a535add30247cdd0d3489b238fc43343 100644
--- a/tensorflow/python/ops/io_ops.py
+++ b/tensorflow/python/ops/io_ops.py
@@ -111,10 +111,10 @@ def _save(filename, tensor_names, tensors, tensor_slices=None, name="save"):
     An Operation that saves the tensors.
   """
   if tensor_slices is None:
-    return gen_io_ops._save(filename, tensor_names, tensors, name=name)
+    return gen_io_ops.save(filename, tensor_names, tensors, name=name)
   else:
-    return gen_io_ops._save_slices(filename, tensor_names, tensor_slices,
-                                   tensors, name=name)
+    return gen_io_ops.save_slices(filename, tensor_names, tensor_slices,
+                                  tensors, name=name)
 
 
 def _restore_slice(file_pattern, tensor_name, shape_and_slice, tensor_type,
@@ -136,7 +136,7 @@ def _restore_slice(file_pattern, tensor_name, shape_and_slice, tensor_type,
     A tensor of type "tensor_type".
   """
   base_type = dtypes.as_dtype(tensor_type).base_dtype
-  return gen_io_ops._restore_slice(
+  return gen_io_ops.restore_slice(
       file_pattern, tensor_name, shape_and_slice, base_type,
       preferred_shard, name=name)
 
@@ -208,12 +208,12 @@ class ReaderBase(object):
     else:
       queue_ref = queue.queue_ref
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_read_v2(self._reader_ref, queue_ref, name=name)
+      return gen_io_ops.reader_read_v2(self._reader_ref, queue_ref, name=name)
     else:
       # For compatibility with pre-resource queues, create a ref(string) tensor
       # which can be looked up as the same queue by a resource manager.
-      old_queue_op = gen_data_flow_ops._fake_queue(queue_ref)
-      return gen_io_ops._reader_read(self._reader_ref, old_queue_op, name=name)
+      old_queue_op = gen_data_flow_ops.fake_queue(queue_ref)
+      return gen_io_ops.reader_read(self._reader_ref, old_queue_op, name=name)
 
   def read_up_to(self, queue, num_records,  # pylint: disable=invalid-name
                  name=None):
@@ -240,18 +240,18 @@ class ReaderBase(object):
     else:
       queue_ref = queue.queue_ref
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_read_up_to_v2(self._reader_ref,
-                                              queue_ref,
-                                              num_records,
-                                              name=name)
+      return gen_io_ops.reader_read_up_to_v2(self._reader_ref,
+                                             queue_ref,
+                                             num_records,
+                                             name=name)
     else:
       # For compatibility with pre-resource queues, create a ref(string) tensor
       # which can be looked up as the same queue by a resource manager.
-      old_queue_op = gen_data_flow_ops._fake_queue(queue_ref)
-      return gen_io_ops._reader_read_up_to(self._reader_ref,
-                                           old_queue_op,
-                                           num_records,
-                                           name=name)
+      old_queue_op = gen_data_flow_ops.fake_queue(queue_ref)
+      return gen_io_ops.reader_read_up_to(self._reader_ref,
+                                          old_queue_op,
+                                          num_records,
+                                          name=name)
 
   def num_records_produced(self, name=None):
     """Returns the number of records this reader has produced.
@@ -267,11 +267,11 @@ class ReaderBase(object):
 
     """
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_num_records_produced_v2(self._reader_ref,
-                                                        name=name)
+      return gen_io_ops.reader_num_records_produced_v2(self._reader_ref,
+                                                       name=name)
     else:
-      return gen_io_ops._reader_num_records_produced(self._reader_ref,
-                                                     name=name)
+      return gen_io_ops.reader_num_records_produced(self._reader_ref,
+                                                    name=name)
 
   def num_work_units_completed(self, name=None):
     """Returns the number of work units this reader has finished processing.
@@ -283,11 +283,11 @@ class ReaderBase(object):
       An int64 Tensor.
     """
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_num_work_units_completed_v2(self._reader_ref,
-                                                            name=name)
+      return gen_io_ops.reader_num_work_units_completed_v2(self._reader_ref,
+                                                           name=name)
     else:
-      return gen_io_ops._reader_num_work_units_completed(self._reader_ref,
-                                                         name=name)
+      return gen_io_ops.reader_num_work_units_completed(self._reader_ref,
+                                                        name=name)
 
   def serialize_state(self, name=None):
     """Produce a string tensor that encodes the state of a reader.
@@ -302,9 +302,9 @@ class ReaderBase(object):
       A string Tensor.
     """
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_serialize_state_v2(self._reader_ref, name=name)
+      return gen_io_ops.reader_serialize_state_v2(self._reader_ref, name=name)
     else:
-      return gen_io_ops._reader_serialize_state(self._reader_ref, name=name)
+      return gen_io_ops.reader_serialize_state(self._reader_ref, name=name)
 
   def restore_state(self, state, name=None):
     """Restore a reader to a previously saved state.
@@ -321,11 +321,10 @@ class ReaderBase(object):
       The created Operation.
     """
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_restore_state_v2(
+      return gen_io_ops.reader_restore_state_v2(
           self._reader_ref, state, name=name)
     else:
-      return gen_io_ops._reader_restore_state(
-          self._reader_ref, state, name=name)
+      return gen_io_ops.reader_restore_state(self._reader_ref, state, name=name)
 
   @property
   def supports_serialize(self):
@@ -342,9 +341,9 @@ class ReaderBase(object):
       The created Operation.
     """
     if self._reader_ref.dtype == dtypes.resource:
-      return gen_io_ops._reader_reset_v2(self._reader_ref, name=name)
+      return gen_io_ops.reader_reset_v2(self._reader_ref, name=name)
     else:
-      return gen_io_ops._reader_reset(self._reader_ref, name=name)
+      return gen_io_ops.reader_reset(self._reader_ref, name=name)
 
 
 ops.NotDifferentiable("ReaderRead")
@@ -377,7 +376,7 @@ class WholeFileReader(ReaderBase):
     Args:
       name: A name for the operation (optional).
     """
-    rr = gen_io_ops._whole_file_reader_v2(name=name)
+    rr = gen_io_ops.whole_file_reader_v2(name=name)
     super(WholeFileReader, self).__init__(rr, supports_serialize=True)
 
 
@@ -406,8 +405,8 @@ class TextLineReader(ReaderBase):
         to skip from the beginning of every file.
       name: A name for the operation (optional).
     """
-    rr = gen_io_ops._text_line_reader_v2(skip_header_lines=skip_header_lines,
-                                         name=name)
+    rr = gen_io_ops.text_line_reader_v2(skip_header_lines=skip_header_lines,
+                                        name=name)
     super(TextLineReader, self).__init__(rr)
 
 
@@ -444,7 +443,7 @@ class FixedLengthRecordReader(ReaderBase):
       name: A name for the operation (optional).
       encoding: The type of encoding for the file. Defaults to none.
     """
-    rr = gen_io_ops._fixed_length_record_reader_v2(
+    rr = gen_io_ops.fixed_length_record_reader_v2(
         record_bytes=record_bytes,
         header_bytes=header_bytes,
         footer_bytes=footer_bytes,
@@ -480,7 +479,7 @@ class TFRecordReader(ReaderBase):
     compression_type = python_io.TFRecordOptions.get_compression_type_string(
         options)
 
-    rr = gen_io_ops._tf_record_reader_v2(
+    rr = gen_io_ops.tf_record_reader_v2(
         name=name, compression_type=compression_type)
     super(TFRecordReader, self).__init__(rr)
 
@@ -506,7 +505,7 @@ class LMDBReader(ReaderBase):
       name: A name for the operation (optional).
       options: A LMDBRecordOptions object (optional).
     """
-    rr = gen_io_ops._lmdb_reader(name=name)
+    rr = gen_io_ops.lmdb_reader(name=name)
     super(LMDBReader, self).__init__(rr)
 
 
@@ -534,7 +533,7 @@ class IdentityReader(ReaderBase):
     Args:
       name: A name for the operation (optional).
     """
-    rr = gen_io_ops._identity_reader_v2(name=name)
+    rr = gen_io_ops.identity_reader_v2(name=name)
     super(IdentityReader, self).__init__(rr, supports_serialize=True)
 
 
diff --git a/tensorflow/python/ops/linalg/linalg_impl.py b/tensorflow/python/ops/linalg/linalg_impl.py
index d5bd916f80d8a03e5423c43d1ca039bc4dceff5e..2be2d5a3d4bbb15c73038dea10e1d7f2cbb17bde 100644
--- a/tensorflow/python/ops/linalg/linalg_impl.py
+++ b/tensorflow/python/ops/linalg/linalg_impl.py
@@ -31,18 +31,16 @@ band_part = array_ops.matrix_band_part
 cholesky = linalg_ops.cholesky
 cholesky_solve = linalg_ops.cholesky_solve
 det = linalg_ops.matrix_determinant
-# pylint: disable=protected-access
-slogdet = gen_linalg_ops._log_matrix_determinant
-# pylint: disable=protected-access
+slogdet = gen_linalg_ops.log_matrix_determinant
 diag = array_ops.matrix_diag
 diag_part = array_ops.matrix_diag_part
 eigh = linalg_ops.self_adjoint_eig
 eigvalsh = linalg_ops.self_adjoint_eigvals
 einsum = special_math_ops.einsum
-expm = gen_linalg_ops._matrix_exponential
+expm = gen_linalg_ops.matrix_exponential
 eye = linalg_ops.eye
 inv = linalg_ops.matrix_inverse
-logm = gen_linalg_ops._matrix_logarithm
+logm = gen_linalg_ops.matrix_logarithm
 lstsq = linalg_ops.matrix_solve_ls
 norm = linalg_ops.norm
 qr = linalg_ops.qr
diff --git a/tensorflow/python/ops/linalg/linear_operator_diag.py b/tensorflow/python/ops/linalg/linear_operator_diag.py
index b3ec3d5b7cf45ac0b2672eea9a4586b2c3295897..e180e830263c44fb5ae290d307f1ef80106c31d5 100644
--- a/tensorflow/python/ops/linalg/linear_operator_diag.py
+++ b/tensorflow/python/ops/linalg/linear_operator_diag.py
@@ -67,7 +67,7 @@ class LinearOperatorDiag(linear_operator.LinearOperator):
   operator = LinearOperatorDiag(diag)
 
   # Create a shape [2, 1, 4, 2] vector.  Note that this shape is compatible
-  # since the batch dimensions, [2, 1], are brodcast to
+  # since the batch dimensions, [2, 1], are broadcast to
   # operator.batch_shape = [2, 3].
   y = tf.random_normal(shape=[2, 1, 4, 2])
   x = operator.solve(y)
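
Note on the `LinearOperatorDiag` docstring hunk above: the batch dimensions `[2, 1]` of `y` are compatible with `operator.batch_shape = [2, 3]` under NumPy-style broadcasting, where a dimension of size 1 stretches to match the other operand. The pure-Python sketch below illustrates that rule only. `broadcast_batch_shape` is a hypothetical helper written for this note, not a TensorFlow API.

```python
def broadcast_batch_shape(a, b):
    """Broadcast two batch shapes under NumPy-style rules (hypothetical helper)."""
    result = []
    # Walk both shapes from the trailing dimension; a missing dimension counts as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and 1 not in (da, db):
            raise ValueError("incompatible batch shapes: %r vs %r" % (a, b))
        # A size-1 dimension stretches to the size of the other operand.
        result.append(db if da == 1 else da)
    return result[::-1]


# Batch dims of y ([2, 1]) against operator.batch_shape ([2, 3]).
print(broadcast_batch_shape([2, 1], [2, 3]))  # [2, 3]
```

Shapes such as `[2, 4]` and `[2, 3]` would raise instead, since neither mismatched dimension is 1.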
diff --git a/tensorflow/python/ops/linalg_ops.py b/tensorflow/python/ops/linalg_ops.py
index 9803eed6aefe072cbe0841dff2de3f640a440dd5..37470e00d7f11b66c7db0785a530721b07c4b859 100644
--- a/tensorflow/python/ops/linalg_ops.py
+++ b/tensorflow/python/ops/linalg_ops.py
@@ -248,7 +248,7 @@ def matrix_solve_ls(matrix, rhs, l2_regularizer=0.0, fast=True, name=None):
     and l2_regularizer != 0 due to poor accuracy.
   """
 
-  # pylint: disable=protected-access,long-lambda
+  # pylint: disable=long-lambda
   def _use_composite_impl(fast, tensor_shape):
     """Determines whether to use the composite or specialized CPU kernel.
 
@@ -323,9 +323,8 @@ def matrix_solve_ls(matrix, rhs, l2_regularizer=0.0, fast=True, name=None):
   if _use_composite_impl(fast, tensor_shape):
     return _composite_impl(matrix, rhs, l2_regularizer)
   else:
-    return gen_linalg_ops._matrix_solve_ls(
+    return gen_linalg_ops.matrix_solve_ls(
         matrix, rhs, l2_regularizer, fast=fast, name=name)
-  # pylint: enable=protected-access
 
 
 @tf_export('self_adjoint_eig', 'linalg.eigh')
@@ -346,8 +345,7 @@ def self_adjoint_eig(tensor, name=None):
     v: Eigenvectors. Shape is `[..., N, N]`. The columns of the inner most
       matrices contain eigenvectors of the corresponding matrices in `tensor`
   """
-  # pylint: disable=protected-access
-  e, v = gen_linalg_ops._self_adjoint_eig_v2(tensor, compute_v=True, name=name)
+  e, v = gen_linalg_ops.self_adjoint_eig_v2(tensor, compute_v=True, name=name)
   return e, v
 
 
@@ -369,8 +367,7 @@ def self_adjoint_eigvals(tensor, name=None):
     e: Eigenvalues. Shape is `[..., N]`. The vector `e[..., :]` contains the `N`
       eigenvalues of `tensor[..., :, :]`.
   """
-  # pylint: disable=protected-access
-  e, _ = gen_linalg_ops._self_adjoint_eig_v2(tensor, compute_v=False, name=name)
+  e, _ = gen_linalg_ops.self_adjoint_eig_v2(tensor, compute_v=False, name=name)
   return e
 
 
@@ -435,10 +432,8 @@ def svd(tensor, full_matrices=False, compute_uv=True, name=None):
   ````
   @end_compatibility
   """
-  # pylint: disable=protected-access
-  s, u, v = gen_linalg_ops._svd(
+  s, u, v = gen_linalg_ops.svd(
       tensor, compute_uv=compute_uv, full_matrices=full_matrices, name=name)
-  # pylint: enable=protected-access
   if compute_uv:
     return math_ops.real(s), u, v
   else:
diff --git a/tensorflow/python/ops/logging_ops.py b/tensorflow/python/ops/logging_ops.py
index eadbc1b7c3b6e66aa76c9afd860b2274ac1976ae..a7ea7dc6e100e809caebed5f03027c4d694cfdd0 100644
--- a/tensorflow/python/ops/logging_ops.py
+++ b/tensorflow/python/ops/logging_ops.py
@@ -170,7 +170,7 @@ def image_summary(tag, tensor, max_images=3, collections=None, name=None):
     buffer.
   """
   with ops.name_scope(name, "ImageSummary", [tag, tensor]) as scope:
-    val = gen_logging_ops._image_summary(
+    val = gen_logging_ops.image_summary(
         tag=tag, tensor=tensor, max_images=max_images, name=scope)
     _Collect(val, collections, [ops.GraphKeys.SUMMARIES])
   return val
@@ -226,11 +226,12 @@ def audio_summary(tag,
   with ops.name_scope(name, "AudioSummary", [tag, tensor]) as scope:
     sample_rate = ops.convert_to_tensor(sample_rate, dtype=dtypes.float32,
                                         name="sample_rate")
-    val = gen_logging_ops._audio_summary_v2(tag=tag,
-                                            tensor=tensor,
-                                            max_outputs=max_outputs,
-                                            sample_rate=sample_rate,
-                                            name=scope)
+    val = gen_logging_ops.audio_summary_v2(
+        tag=tag,
+        tensor=tensor,
+        max_outputs=max_outputs,
+        sample_rate=sample_rate,
+        name=scope)
     _Collect(val, collections, [ops.GraphKeys.SUMMARIES])
   return val
 
@@ -263,7 +264,7 @@ def merge_summary(inputs, collections=None, name=None):
     buffer resulting from the merging.
   """
   with ops.name_scope(name, "MergeSummary", inputs):
-    val = gen_logging_ops._merge_summary(inputs=inputs, name=name)
+    val = gen_logging_ops.merge_summary(inputs=inputs, name=name)
     _Collect(val, collections, [])
   return val
 
@@ -356,3 +357,4 @@ ops.NotDifferentiable("AudioSummary")
 ops.NotDifferentiable("AudioSummaryV2")
 ops.NotDifferentiable("MergeSummary")
 ops.NotDifferentiable("ScalarSummary")
+ops.NotDifferentiable("Timestamp")
diff --git a/tensorflow/python/ops/lookup_ops.py b/tensorflow/python/ops/lookup_ops.py
index f539a7bb68da57e31746bc80fb25339a03a4fafe..baf7cc19fa72602ba8a8d74e090d262681aa5638 100644
--- a/tensorflow/python/ops/lookup_ops.py
+++ b/tensorflow/python/ops/lookup_ops.py
@@ -196,9 +196,7 @@ class InitializableLookupTableBase(LookupInterface):
     """
     with ops.name_scope(name, "%s_Size" % self._name,
                         [self._table_ref]) as scope:
-      # pylint: disable=protected-access
-      return gen_lookup_ops._lookup_table_size_v2(self._table_ref, name=scope)
-      # pylint: enable=protected-access
+      return gen_lookup_ops.lookup_table_size_v2(self._table_ref, name=scope)
 
   def lookup(self, keys, name=None):
     """Looks up `keys` in a table, outputs the corresponding values.
@@ -227,10 +225,8 @@ class InitializableLookupTableBase(LookupInterface):
     with ops.name_scope(name, "%s_Lookup" % self._name,
                         (self._table_ref, key_tensor,
                          self._default_value)) as scope:
-      # pylint: disable=protected-access
-      values = gen_lookup_ops._lookup_table_find_v2(
+      values = gen_lookup_ops.lookup_table_find_v2(
           self._table_ref, key_tensor, self._default_value, name=scope)
-      # pylint: enable=protected-access
 
     values.set_shape(key_tensor.get_shape())
     if isinstance(keys, sparse_tensor.SparseTensor):
@@ -274,13 +270,11 @@ class HashTable(InitializableLookupTableBase):
     """
     with ops.name_scope(name, "hash_table", (initializer,
                                              default_value)) as scope:
-      # pylint: disable=protected-access
-      table_ref = gen_lookup_ops._hash_table_v2(
+      table_ref = gen_lookup_ops.hash_table_v2(
           shared_name=shared_name,
           key_dtype=initializer.key_dtype,
           value_dtype=initializer.value_dtype,
           name=scope)
-      # pylint: enable=protected-access
 
       super(HashTable, self).__init__(table_ref, default_value, initializer)
 
@@ -352,10 +346,8 @@ class KeyValueTensorInitializer(TableInitializerBase):
     with ops.name_scope(
         self._name, values=(table.table_ref, self._keys,
                             self._values)) as scope:
-      # pylint: disable=protected-access
-      init_op = gen_lookup_ops._initialize_table_v2(
+      init_op = gen_lookup_ops.initialize_table_v2(
           table.table_ref, self._keys, self._values, name=scope)
-      # pylint: enable=protected-access
     ops.add_to_collection(ops.GraphKeys.TABLE_INITIALIZERS, init_op)
     return init_op
 
@@ -518,8 +510,7 @@ class TextFileInitializer(TableInitializerBase):
                         (table.table_ref,)) as scope:
       filename = ops.convert_to_tensor(
           self._filename, dtypes.string, name="asset_filepath")
-      # pylint: disable=protected-access
-      init_op = gen_lookup_ops._initialize_table_from_text_file_v2(
+      init_op = gen_lookup_ops.initialize_table_from_text_file_v2(
           table.table_ref,
           filename,
           self._key_index,
@@ -527,7 +518,6 @@ class TextFileInitializer(TableInitializerBase):
           -1 if self._vocab_size is None else self._vocab_size,
           self._delimiter,
           name=scope)
-      # pylint: enable=protected-access
     ops.add_to_collection(ops.GraphKeys.TABLE_INITIALIZERS, init_op)
     # If the filename tensor is anything other than a string constant (e.g., if
     # it is a placeholder) then it does not make sense to track it as an asset.
diff --git a/tensorflow/python/ops/losses/losses_impl.py b/tensorflow/python/ops/losses/losses_impl.py
index 8e003fb7ac6462fb611a020e86b06b5987af9546..7386976e93fbb82f38550f50429af878fadda813 100644
--- a/tensorflow/python/ops/losses/losses_impl.py
+++ b/tensorflow/python/ops/losses/losses_impl.py
@@ -18,6 +18,7 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
 
+from tensorflow.python.eager import context
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import confusion_matrix
@@ -88,6 +89,14 @@ def _safe_div(numerator, denominator, name="value"):
   Returns:
     The element-wise value of the numerator divided by the denominator.
   """
+  if isinstance(denominator, float):
+    if denominator == 0.0:
+      return ops.convert_to_tensor(0.0, dtype=numerator.dtype)
+    return math_ops.div(numerator, denominator)
+  if context.in_eager_mode() and denominator._rank() == 0:  # pylint: disable=protected-access
+    if math_ops.equal(denominator, 0.0):
+      return ops.convert_to_tensor(0.0, dtype=numerator.dtype)
+    return math_ops.div(numerator, denominator)
   return array_ops.where(
       math_ops.greater(denominator, 0),
       math_ops.div(numerator, array_ops.where(
@@ -134,6 +143,10 @@ def _num_present(losses, weights, per_batch=False):
       `per_batch` is `True`, the value is returned as a tensor of size
       `[batch_size]`. Otherwise, a single scalar tensor is returned.
   """
+  if ((isinstance(weights, float) and weights != 0.0) or
+      (context.in_eager_mode() and weights._rank() == 0  # pylint: disable=protected-access
+       and not math_ops.equal(weights, 0.0))):
+    return _num_elements(losses)
   with ops.name_scope(None, "num_present", (losses, weights)) as scope:
     weights = math_ops.to_float(weights)
     present = array_ops.where(
@@ -143,8 +156,10 @@ def _num_present(losses, weights, per_batch=False):
     present = weights_broadcast_ops.broadcast_weights(present, losses)
     if per_batch:
       return math_ops.reduce_sum(
-          present, axis=math_ops.range(1, array_ops.rank(present)),
-          keepdims=True, name=scope)
+          present,
+          axis=math_ops.range(1, array_ops.rank(present)),
+          keepdims=True,
+          name=scope)
     return math_ops.reduce_sum(present, name=scope)
 
 
@@ -421,8 +436,12 @@ def huber_loss(labels, predictions, weights=1.0, delta=1.0, scope=None,
     # expression when abs_error == delta is 0 (for tf.maximum it would be 1).
     # This is necessary to avoid doubling the gradient, since there is already a
     # nonzero contribution to the gradient from the quadratic term.
-    linear = (abs_error - quadratic)
-    losses = 0.5 * quadratic * quadratic + delta * linear
+    linear = math_ops.subtract(abs_error, quadratic)
+    losses = math_ops.add(
+        math_ops.multiply(
+            ops.convert_to_tensor(0.5, dtype=quadratic.dtype),
+            math_ops.multiply(quadratic, quadratic)),
+        math_ops.multiply(delta, linear))
     return compute_weighted_loss(
         losses, weights, scope, loss_collection, reduction=reduction)
 
@@ -542,7 +561,8 @@ def mean_pairwise_squared_error(
       reduction_indices = math_ops.range(1, array_ops.rank(diffs))
 
       sum_squares_diff_per_batch = math_ops.reduce_sum(
-          math_ops.square(diffs), reduction_indices=reduction_indices,
+          math_ops.square(diffs),
+          reduction_indices=reduction_indices,
           keepdims=True)
       num_present_per_batch = _num_present(diffs, weights, per_batch=True)
 
@@ -634,7 +654,7 @@ def sigmoid_cross_entropy(
 
   Args:
     multi_class_labels: `[batch_size, num_classes]` target integer labels in
-      `(0, 1)`.
+      `{0, 1}`.
     logits: Float `[batch_size, num_classes]` logits outputs of the network.
     weights: Optional `Tensor` whose rank is either 0, or the same rank as
       `labels`, and must be broadcastable to `labels` (i.e., all dimensions must
@@ -731,7 +751,6 @@ def softmax_cross_entropy(
     losses = nn.softmax_cross_entropy_with_logits_v2(
         labels=onehot_labels, logits=logits, name="xentropy")
 
-
     return compute_weighted_loss(
         losses, weights, scope, loss_collection, reduction=reduction)
 
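
Note on the `huber_loss` hunk in `losses_impl.py` above: the rewrite replaces the operator expression `0.5 * quadratic * quadratic + delta * linear` with explicit `math_ops` calls, and the arithmetic is unchanged. The scalar sketch below spells out that arithmetic in plain Python. It assumes `quadratic` is the absolute error clipped at `delta`, a step that lies outside the hunk shown here. `huber` is a hypothetical name used only for this note.

```python
def huber(label, prediction, delta=1.0):
    """Scalar Huber loss mirroring the expression in the hunk (illustrative only)."""
    abs_error = abs(prediction - label)
    # Assumption: the quadratic part is the error clipped at delta.
    quadratic = min(abs_error, delta)
    # Whatever exceeds delta is penalised linearly.
    linear = abs_error - quadratic
    return 0.5 * quadratic * quadratic + delta * linear


print(huber(0.0, 0.5))  # 0.125, quadratic region
print(huber(0.0, 3.0))  # 2.5, linear region: 0.5 * 1 + 1 * 2
```

For errors no larger than `delta` the loss is the usual half squared error. Beyond `delta` it grows linearly with slope `delta`.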
diff --git a/tensorflow/python/ops/math_grad.py b/tensorflow/python/ops/math_grad.py
index 9e7f37d80fdd71e84516ab450d145d79519ae47a..f1cfa9ded5c5faebf26d7984a8d9d01b5779d163 100644
--- a/tensorflow/python/ops/math_grad.py
+++ b/tensorflow/python/ops/math_grad.py
@@ -35,6 +35,12 @@ def _safe_shape_div(x, y):
   return x // math_ops.maximum(y, 1)
 
 
+@ops.RegisterGradient("ArgMax")
+def _ArgMaxGrad(op, grad):
+  del op, grad
+  return [None, None]
+
+
 @ops.RegisterGradient("Sum")
 def _SumGrad(op, grad):
   """Gradient for Sum."""
@@ -382,16 +388,14 @@ def _NegGrad(_, grad):
 def _InvGrad(op, grad):
   """Returns -grad * (1 / x^2)."""
   y = op.outputs[0]  # y = 1 / x
-  # pylint: disable=protected-access
-  return gen_math_ops._reciprocal_grad(y, grad)
+  return gen_math_ops.reciprocal_grad(y, grad)
 
 
 @ops.RegisterGradient("Reciprocal")
 def _ReciprocalGrad(op, grad):
   """Returns -grad * (1 / x^2)."""
   y = op.outputs[0]  # y = 1 / x
-  # pylint: disable=protected-access
-  return gen_math_ops._reciprocal_grad(y, grad)
+  return gen_math_ops.reciprocal_grad(y, grad)
 
 
 @ops.RegisterGradient("InvGrad")
@@ -401,8 +405,7 @@ def _InvGradGrad(op, grad):
   with ops.control_dependencies([grad]):
     ca = math_ops.conj(op.inputs[0])
     cg = math_ops.conj(grad)
-    # pylint: disable=protected-access
-    return cg * -2.0 * b * ca, gen_math_ops._reciprocal_grad(ca, grad)
+    return cg * -2.0 * b * ca, gen_math_ops.reciprocal_grad(ca, grad)
 
 
 @ops.RegisterGradient("ReciprocalGrad")
@@ -412,8 +415,7 @@ def _ReciprocalGradGrad(op, grad):
   with ops.control_dependencies([grad]):
     ca = math_ops.conj(op.inputs[0])
     cg = math_ops.conj(grad)
-    # pylint: disable=protected-access
-    return cg * -2.0 * b * ca, gen_math_ops._reciprocal_grad(ca, grad)
+    return cg * -2.0 * b * ca, gen_math_ops.reciprocal_grad(ca, grad)
 
 
 @ops.RegisterGradient("Square")
@@ -428,9 +430,7 @@ def _SquareGrad(op, grad):
 @ops.RegisterGradient("Sqrt")
 def _SqrtGrad(op, grad):
   y = op.outputs[0]  # y = x^(1/2)
-  # pylint: disable=protected-access
-  return gen_math_ops._sqrt_grad(y, grad)
-  # pylint: enable=protected-access
+  return gen_math_ops.sqrt_grad(y, grad)
 
 
 @ops.RegisterGradient("SqrtGrad")
@@ -446,9 +446,7 @@ def _SqrtGradGrad(op, grad):
 def _RsqrtGrad(op, grad):
   """Returns -0.5 * grad * conj(y)^3."""
   y = op.outputs[0]  # y = x^(-1/2)
-  # pylint: disable=protected-access
-  return gen_math_ops._rsqrt_grad(y, grad)
-  # pylint: enable=protected-access
+  return gen_math_ops.rsqrt_grad(y, grad)
 
 
 @ops.RegisterGradient("RsqrtGrad")
@@ -460,8 +458,7 @@ def _RsqrtGradGrad(op, grad):
     ca = math_ops.conj(a)
     cg = math_ops.conj(grad)
     grad_a = -1.5 * cg * b * math_ops.square(ca)
-    # pylint: disable=protected-access
-    grad_b = gen_math_ops._rsqrt_grad(ca, grad)
+    grad_b = gen_math_ops.rsqrt_grad(ca, grad)
     return grad_a, grad_b
 
 
@@ -526,8 +523,7 @@ def _TanhGrad(op, grad):
   y = op.outputs[0]  # y = tanh(x)
   with ops.control_dependencies([grad]):
     y = math_ops.conj(y)
-    # pylint: disable=protected-access
-    return gen_math_ops._tanh_grad(y, grad)
+    return gen_math_ops.tanh_grad(y, grad)
 
 
 @ops.RegisterGradient("Asinh")
@@ -565,8 +561,7 @@ def _TanhGradGrad(op, grad):
   with ops.control_dependencies([grad]):
     a = math_ops.conj(op.inputs[0])
     b = math_ops.conj(op.inputs[1])
-    # pylint: disable=protected-access
-    return grad * -2.0 * b * a, gen_math_ops._tanh_grad(a, grad)
+    return grad * -2.0 * b * a, gen_math_ops.tanh_grad(a, grad)
 
 
 @ops.RegisterGradient("Erf")
@@ -714,8 +709,7 @@ def _SigmoidGrad(op, grad):
   y = op.outputs[0]  # y = sigmoid(x)
   with ops.control_dependencies([grad]):
     y = math_ops.conj(y)
-    # pylint: disable=protected-access
-    return gen_math_ops._sigmoid_grad(y, grad)
+    return gen_math_ops.sigmoid_grad(y, grad)
 
 
 @ops.RegisterGradient("SigmoidGrad")
@@ -724,8 +718,7 @@ def _SigmoidGradGrad(op, grad):
     a = math_ops.conj(op.inputs[0])
     b = math_ops.conj(op.inputs[1])
     gb = grad * b
-    # pylint: disable=protected-access
-    return gb - 2.0 * gb * a, gen_math_ops._sigmoid_grad(a, grad)
+    return gb - 2.0 * gb * a, gen_math_ops.sigmoid_grad(a, grad)
 
 
 @ops.RegisterGradient("Sign")
@@ -872,7 +865,7 @@ def _MulGrad(op, grad):
   if (isinstance(grad, ops.Tensor) and
       _ShapesFullySpecifiedAndEqual(x, y, grad) and
       grad.dtype in (dtypes.int32, dtypes.float32)):
-    return gen_math_ops._mul(grad, y), gen_math_ops._mul(grad, x)
+    return gen_math_ops.mul(grad, y), gen_math_ops.mul(grad, x)
   assert x.dtype.base_dtype == y.dtype.base_dtype, (x.dtype, " vs. ", y.dtype)
   sx = array_ops.shape(x)
   sy = array_ops.shape(y)
@@ -880,9 +873,9 @@ def _MulGrad(op, grad):
   x = math_ops.conj(x)
   y = math_ops.conj(y)
   return (array_ops.reshape(
-      math_ops.reduce_sum(gen_math_ops._mul(grad, y), rx), sx),
+      math_ops.reduce_sum(gen_math_ops.mul(grad, y), rx), sx),
           array_ops.reshape(
-              math_ops.reduce_sum(gen_math_ops._mul(x, grad), ry), sy))
+              math_ops.reduce_sum(gen_math_ops.mul(x, grad), ry), sy))
   # pylint: enable=protected-access
 
 
@@ -1056,20 +1049,18 @@ def _MatMulGrad(op, grad):
   t_b = op.get_attr("transpose_b")
   a = math_ops.conj(op.inputs[0])
   b = math_ops.conj(op.inputs[1])
-  # pylint: disable=protected-access
   if not t_a and not t_b:
-    grad_a = gen_math_ops._mat_mul(grad, b, transpose_b=True)
-    grad_b = gen_math_ops._mat_mul(a, grad, transpose_a=True)
+    grad_a = gen_math_ops.mat_mul(grad, b, transpose_b=True)
+    grad_b = gen_math_ops.mat_mul(a, grad, transpose_a=True)
   elif not t_a and t_b:
-    grad_a = gen_math_ops._mat_mul(grad, b)
-    grad_b = gen_math_ops._mat_mul(grad, a, transpose_a=True)
+    grad_a = gen_math_ops.mat_mul(grad, b)
+    grad_b = gen_math_ops.mat_mul(grad, a, transpose_a=True)
   elif t_a and not t_b:
-    grad_a = gen_math_ops._mat_mul(b, grad, transpose_b=True)
-    grad_b = gen_math_ops._mat_mul(a, grad)
+    grad_a = gen_math_ops.mat_mul(b, grad, transpose_b=True)
+    grad_b = gen_math_ops.mat_mul(a, grad)
   elif t_a and t_b:
-    grad_a = gen_math_ops._mat_mul(b, grad, transpose_a=True, transpose_b=True)
-    grad_b = gen_math_ops._mat_mul(grad, a, transpose_a=True, transpose_b=True)
-  # pylint: enable=protected-access
+    grad_a = gen_math_ops.mat_mul(b, grad, transpose_a=True, transpose_b=True)
+    grad_b = gen_math_ops.mat_mul(grad, a, transpose_a=True, transpose_b=True)
   return grad_a, grad_b
 
 
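
Note on the `_MatMulGrad` hunk above: in the branch with no transposes, the gradients are `grad_a = grad * b^T` and `grad_b = a^T * grad`. The pure-Python sketch below checks that case on small real matrices. The helpers `matmul`, `transpose` and `matmul_grad` are hypothetical stand-ins for this note, not the generated ops. The conjugation applied in the real code is omitted because it is a no-op for real inputs.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul_grad(a, b, grad):
    """Gradients for c = a @ b with no transposes (real inputs only)."""
    grad_a = matmul(grad, transpose(b))  # mat_mul(grad, b, transpose_b=True)
    grad_b = matmul(transpose(a), grad)  # mat_mul(a, grad, transpose_a=True)
    return grad_a, grad_b


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
grad = [[1.0, 1.0], [1.0, 1.0]]  # upstream gradient of all ones
grad_a, grad_b = matmul_grad(a, b, grad)
print(grad_a)  # [[11.0, 15.0], [11.0, 15.0]]
print(grad_b)  # [[4.0, 4.0], [6.0, 6.0]]
```

With an all-ones upstream gradient, each entry of `grad_a` is a row sum of `b` and each entry of `grad_b` is a column sum of `a`, which matches differentiating the sum of all entries of `a @ b` by hand.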
diff --git a/tensorflow/python/ops/math_ops.py b/tensorflow/python/ops/math_ops.py
index 4c7dc9559fa47dd68c9b461d72d92238ab7a38ee..56d58016b83e3a7859c8ee20ce241cc043cc2698 100644
--- a/tensorflow/python/ops/math_ops.py
+++ b/tensorflow/python/ops/math_ops.py
@@ -89,8 +89,6 @@ See the @{$python/math_ops} guide.
 @@matrix_inverse
 @@cholesky
 @@cholesky_solve
-@@matrix_exponential
-@@matrix_logarithm
 @@matrix_solve
 @@matrix_triangular_solve
 @@matrix_solve_ls
@@ -161,14 +159,11 @@ from tensorflow.python.framework import ops
 from tensorflow.python.framework import sparse_tensor
 from tensorflow.python.framework import tensor_shape
 from tensorflow.python.ops import array_ops
-from tensorflow.python.ops import gen_control_flow_ops
 from tensorflow.python.ops import gen_data_flow_ops
 from tensorflow.python.ops import gen_math_ops
 from tensorflow.python.ops import gen_nn_ops
 from tensorflow.python.ops import gen_sparse_ops
 from tensorflow.python.ops import gen_spectral_ops
-from tensorflow.python.ops import gen_state_ops
-from tensorflow.python.ops import state_ops
 from tensorflow.python.platform import tf_logging as logging
 # go/tf-wildcard-import
 # pylint: disable=wildcard-import
@@ -267,7 +262,7 @@ def abs(x, name=None):  # pylint: disable=redefined-builtin
   with ops.name_scope(name, "Abs", [x]) as name:
     if isinstance(x, sparse_tensor.SparseTensor):
       if x.values.dtype.is_complex:
-        x_abs = gen_math_ops._complex_abs(
+        x_abs = gen_math_ops.complex_abs(
             x.values, Tout=x.values.dtype.real_dtype, name=name)
         return sparse_tensor.SparseTensor(
             indices=x.indices, values=x_abs, dense_shape=x.dense_shape)
@@ -277,7 +272,7 @@ def abs(x, name=None):  # pylint: disable=redefined-builtin
     else:
       x = ops.convert_to_tensor(x, name="x")
       if x.dtype.is_complex:
-        return gen_math_ops._complex_abs(x, Tout=x.dtype.real_dtype, name=name)
+        return gen_math_ops.complex_abs(x, Tout=x.dtype.real_dtype, name=name)
       return gen_math_ops._abs(x, name=name)
 
 
@@ -286,7 +281,7 @@ def abs(x, name=None):  # pylint: disable=redefined-builtin
 
 # pylint: disable=redefined-builtin
 def _bucketize(input, boundaries, name=None):
-  return gen_math_ops._bucketize(input=input, boundaries=boundaries, name=name)
+  return gen_math_ops.bucketize(input=input, boundaries=boundaries, name=name)
 
 
 # pylint: enable=redefined-builtin
@@ -329,10 +324,10 @@ def divide(x, y, name=None):
 
 @tf_export("multiply")
 def multiply(x, y, name=None):
-  return gen_math_ops._mul(x, y, name)
+  return gen_math_ops.mul(x, y, name)
 
 
-multiply.__doc__ = gen_math_ops._mul.__doc__.replace("Mul", "`tf.multiply`")
+multiply.__doc__ = gen_math_ops.mul.__doc__.replace("Mul", "`tf.multiply`")
 
 
 # TODO(aselle): put deprecation in after another round of global code changes
@@ -340,19 +335,19 @@ multiply.__doc__ = gen_math_ops._mul.__doc__.replace("Mul", "`tf.multiply`")
     "2016-12-30",
     "`tf.mul(x, y)` is deprecated, please use `tf.multiply(x, y)` or `x * y`")
 def _mul(x, y, name=None):
-  return gen_math_ops._mul(x, y, name)
+  return gen_math_ops.mul(x, y, name)
 
 
 _mul.__doc__ = (
-    gen_math_ops._mul.__doc__ + ("" if _mul.__doc__ is None else _mul.__doc__))
+    gen_math_ops.mul.__doc__ + ("" if _mul.__doc__ is None else _mul.__doc__))
 
 
 @tf_export("subtract")
 def subtract(x, y, name=None):
-  return gen_math_ops._sub(x, y, name)
+  return gen_math_ops.sub(x, y, name)
 
 
-subtract.__doc__ = gen_math_ops._sub.__doc__.replace("`Sub`", "`tf.subtract`")
+subtract.__doc__ = gen_math_ops.sub.__doc__.replace("`Sub`", "`tf.subtract`")
 
 
 # TODO(aselle): put deprecation in after another round of global code changes
@@ -360,11 +355,11 @@ subtract.__doc__ = gen_math_ops._sub.__doc__.replace("`Sub`", "`tf.subtract`")
     "2016-12-30",
     "`tf.sub(x, y)` is deprecated, please use `tf.subtract(x, y)` or `x - y`")
 def _sub(x, y, name=None):
-  return gen_math_ops._sub(x, y, name)
+  return gen_math_ops.sub(x, y, name)
 
 
 _sub.__doc__ = (
-    gen_math_ops._sub.__doc__ + ("" if _sub.__doc__ is None else _sub.__doc__))
+    gen_math_ops.sub.__doc__ + ("" if _sub.__doc__ is None else _sub.__doc__))
 
 
 # pylint: disable=g-docstring-has-escape
@@ -384,11 +379,11 @@ def negative(x, name=None):
   """
   with ops.name_scope(name, "Neg", [x]) as name:
     if isinstance(x, sparse_tensor.SparseTensor):
-      x_neg = gen_math_ops._neg(x.values, name=name)
+      x_neg = gen_math_ops.neg(x.values, name=name)
       return sparse_tensor.SparseTensor(
           indices=x.indices, values=x_neg, dense_shape=x.dense_shape)
     else:
-      return gen_math_ops._neg(x, name=name)
+      return gen_math_ops.neg(x, name=name)
 
 
 # pylint: enable=g-docstring-has-escape
@@ -904,7 +899,41 @@ def to_bfloat16(x, name="ToBFloat16"):
   return cast(x, dtypes.bfloat16, name=name)
 
 
-ops.Tensor._override_operator("__neg__", gen_math_ops._neg)
+@tf_export("to_complex64")
+def to_complex64(x, name="ToComplex64"):
+  """Casts a tensor to type `complex64`.
+
+  Args:
+    x: A `Tensor` or `SparseTensor`.
+    name: A name for the operation (optional).
+
+  Returns:
+    A `Tensor` or `SparseTensor` with same shape as `x` with type `complex64`.
+
+  Raises:
+    TypeError: If `x` cannot be cast to `complex64`.
+  """
+  return cast(x, dtypes.complex64, name=name)
+
+
+@tf_export("to_complex128")
+def to_complex128(x, name="ToComplex128"):
+  """Casts a tensor to type `complex128`.
+
+  Args:
+    x: A `Tensor` or `SparseTensor`.
+    name: A name for the operation (optional).
+
+  Returns:
+    A `Tensor` or `SparseTensor` with same shape as `x` with type `complex128`.
+
+  Raises:
+    TypeError: If `x` cannot be cast to `complex128`.
+  """
+  return cast(x, dtypes.complex128, name=name)
+
+
+ops.Tensor._override_operator("__neg__", gen_math_ops.neg)
 ops.Tensor._override_operator("__abs__", abs)
 # __invert__ corresponds to the ~ operator.  Here we follow the numpy convention
 # ~ marks an elementwise bit-wise inverse.  This is only implemented for boolean
@@ -1033,7 +1062,7 @@ def _truediv_python3(x, y, name=None):
     if dtype is not None:
       x = cast(x, dtype)
       y = cast(y, dtype)
-    return gen_math_ops._real_div(x, y, name=name)
+    return gen_math_ops.real_div(x, y, name=name)
 
 
 def _div_python2(x, y, name=None):
@@ -1056,9 +1085,9 @@ def _div_python2(x, y, name=None):
       raise TypeError("x and y must have the same dtype, got %r != %r" %
                       (x_dtype, y_dtype))
     if x_dtype.is_floating or x_dtype.is_complex:
-      return gen_math_ops._real_div(x, y, name=name)
+      return gen_math_ops.real_div(x, y, name=name)
     else:
-      return gen_math_ops._floor_div(x, y, name=name)
+      return gen_math_ops.floor_div(x, y, name=name)
 
 
 @tf_export("truediv")
@@ -1116,7 +1145,7 @@ def div(x, y, name=None):
 
 
 # TODO(aselle): This should be removed
-mod = gen_math_ops._floor_mod
+mod = gen_math_ops.floor_mod
 
 
 # TODO(aselle): Deprecate this once all internal functionality uses
@@ -1149,22 +1178,22 @@ def floordiv(x, y, name=None):
     TypeError: If the inputs are complex.
   """
   with ops.name_scope(name, "floordiv", [x, y]) as name:
-    return gen_math_ops._floor_div(x, y, name=name)
+    return gen_math_ops.floor_div(x, y, name=name)
 
 
-realdiv = gen_math_ops._real_div
-truncatediv = gen_math_ops._truncate_div
+realdiv = gen_math_ops.real_div
+truncatediv = gen_math_ops.truncate_div
 # TODO(aselle): Rename this to floordiv when we can.
-floor_div = gen_math_ops._floor_div
-truncatemod = gen_math_ops._truncate_mod
-floormod = gen_math_ops._floor_mod
+floor_div = gen_math_ops.floor_div
+truncatemod = gen_math_ops.truncate_mod
+floormod = gen_math_ops.floor_mod
 
 
 def _mul_dispatch(x, y, name=None):
   """Dispatches cwise mul for "Dense*Dense" and "Dense*Sparse"."""
   is_tensor_y = isinstance(y, ops.Tensor)
   if is_tensor_y:
-    return gen_math_ops._mul(x, y, name=name)
+    return gen_math_ops.mul(x, y, name=name)
   else:
     assert isinstance(y, sparse_tensor.SparseTensor)  # Case: Dense * Sparse.
     new_vals = gen_sparse_ops.sparse_dense_cwise_mul(y.indices, y.values,
@@ -1183,12 +1212,12 @@ _OverrideBinaryOperatorHelper(gen_sparse_ops.sparse_dense_cwise_mul, "mul",
                               sparse_tensor.SparseTensor)
 
 _OverrideBinaryOperatorHelper(gen_math_ops.add, "add")
-_OverrideBinaryOperatorHelper(gen_math_ops._sub, "sub")
+_OverrideBinaryOperatorHelper(gen_math_ops.sub, "sub")
 _OverrideBinaryOperatorHelper(_mul_dispatch, "mul")
 _OverrideBinaryOperatorHelper(_div_python2, "div")
 _OverrideBinaryOperatorHelper(_truediv_python3, "truediv")
 _OverrideBinaryOperatorHelper(floordiv, "floordiv")
-_OverrideBinaryOperatorHelper(gen_math_ops._floor_mod, "mod")
+_OverrideBinaryOperatorHelper(gen_math_ops.floor_mod, "mod")
 _OverrideBinaryOperatorHelper(pow, "pow")
 
 
@@ -1298,9 +1327,9 @@ def _ReductionDims(x, axis, reduction_indices):
     return axis
   else:
     # Fast path: avoid creating Rank and Range ops if ndims is known.
-    if isinstance(x, ops.Tensor) and x.get_shape().ndims is not None:
+    if isinstance(x, ops.Tensor) and x._rank() is not None:  # pylint: disable=protected-access
       return constant_op.constant(
-          np.arange(x.get_shape().ndims), dtype=dtypes.int32)
+          np.arange(x._rank()), dtype=dtypes.int32)  # pylint: disable=protected-access
     if (isinstance(x, sparse_tensor.SparseTensor) and
         x.dense_shape.get_shape().is_fully_defined()):
       rank = x.dense_shape.get_shape()[0].value  # sparse.dense_shape is 1-D.
@@ -1510,7 +1539,7 @@ def reduce_mean(input_tensor,
   if keepdims is None:
     keepdims = False
   return _may_reduce_to_scalar(keepdims, axis, reduction_indices,
-                               gen_math_ops._mean(
+                               gen_math_ops.mean(
                                    input_tensor,
                                    _ReductionDims(input_tensor, axis,
                                                   reduction_indices),
@@ -1560,7 +1589,7 @@ def reduce_prod(input_tensor,
   if keepdims is None:
     keepdims = False
   return _may_reduce_to_scalar(keepdims, axis, reduction_indices,
-                               gen_math_ops._prod(
+                               gen_math_ops.prod(
                                    input_tensor,
                                    _ReductionDims(input_tensor, axis,
                                                   reduction_indices),
@@ -2029,7 +2058,7 @@ def matmul(a,
       if transpose_b:
         b = conj(b)
         adjoint_b = True
-      return gen_math_ops._batch_mat_mul(
+      return gen_math_ops.batch_mat_mul(
           a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
 
     # Neither matmul nor sparse_matmul support adjoint, so we conjugate
@@ -2066,13 +2095,13 @@ def matmul(a,
         ret = cast(ret, dtypes.bfloat16)
       return ret
     else:
-      return gen_math_ops._mat_mul(
+      return gen_math_ops.mat_mul(
           a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
 
 
 _OverrideBinaryOperatorHelper(matmul, "matmul")
 
-sparse_matmul = gen_math_ops._sparse_mat_mul
+sparse_matmul = gen_math_ops.sparse_mat_mul
 
 
 @ops.RegisterStatistics("MatMul", "flops")
@@ -2177,7 +2206,7 @@ def add_n(inputs, name=None):
     if name:
       return array_ops.identity(inputs[0], name=name)
     return inputs[0]
-  return gen_math_ops._add_n(inputs, name=name)
+  return gen_math_ops.add_n(inputs, name=name)
 
 
 @tf_export("accumulate_n")
@@ -2187,14 +2216,12 @@ def accumulate_n(inputs, shape=None, tensor_dtype=None, name=None):
   Optionally, pass `shape` and `tensor_dtype` for shape and type checking,
   otherwise, these are inferred.
 
-  NOTE: This operation is not differentiable and cannot be used if inputs depend
-  on trainable variables. Please use `tf.add_n` for such cases.
+  `tf.accumulate_n` performs the same operation as `tf.add_n`, but does not
+  wait for all of its inputs to be ready before beginning to sum. This can
+  save memory if inputs are ready at different times, since minimum temporary
+  storage is proportional to the output size rather than the input size.
 
-  Aside from differentiability, `tf.accumulate_n` performs the same operation as
-  `tf.add_n`, but does not wait for all of its inputs to be ready before
-  beginning to sum. This can save memory if inputs are ready at different times,
-  since minimum temporary storage is proportional to the output size rather than
-  the inputs size.
+  `accumulate_n` is differentiable (but was not prior to TensorFlow 1.7).
 
   For example:
 
@@ -2204,8 +2231,9 @@ def accumulate_n(inputs, shape=None, tensor_dtype=None, name=None):
   tf.accumulate_n([a, b, a])  # [[7, 4], [6, 14]]
 
   # Explicitly pass shape and type
-  tf.accumulate_n([a, b, a], shape=[2, 2], tensor_dtype=tf.int32)  # [[7,  4],
-                                                                   #  [6, 14]]
+  tf.accumulate_n([a, b, a], shape=[2, 2], tensor_dtype=tf.int32)
+                                                                 # [[7,  4],
+                                                                 #  [6, 14]]
   ```
 
   Args:
@@ -2221,20 +2249,17 @@ def accumulate_n(inputs, shape=None, tensor_dtype=None, name=None):
     ValueError: If `inputs` don't all have same shape and dtype or the shape
     cannot be inferred.
   """
-  if context.in_eager_mode():
-    # TODO(apassos) remove this once the lifetime of eager variables gets
-    # addressed.
-    raise ValueError("accumulate_n not supported in eager mode")
+  def _input_error():
+    return ValueError(
+        "inputs must be a list of at least one Tensor with the "
+        "same dtype and shape")
   if not inputs or not isinstance(inputs, (list, tuple)):
-    raise ValueError("inputs must be a list of at least one Tensor with the "
-                     "same dtype and shape")
+    raise _input_error()
   inputs = ops.convert_n_to_tensor_or_indexed_slices(inputs)
   if not all(isinstance(x, ops.Tensor) for x in inputs):
-    raise ValueError("inputs must be a list of at least one Tensor with the "
-                     "same dtype and shape")
+    raise _input_error()
   if not all(x.dtype == inputs[0].dtype for x in inputs):
-    raise ValueError("inputs must be a list of at least one Tensor with the "
-                     "same dtype and shape")
+    raise _input_error()
   if shape is not None:
     shape = tensor_shape.as_shape(shape)
   else:
@@ -2242,27 +2267,31 @@ def accumulate_n(inputs, shape=None, tensor_dtype=None, name=None):
   for input_tensor in inputs:
     if isinstance(input_tensor, ops.Tensor):
       shape = shape.merge_with(input_tensor.get_shape())
-  if tensor_dtype is None:
-    tensor_dtype = inputs[0].dtype
-  if tensor_dtype != inputs[0].dtype:
-    raise TypeError("tensor_dtype is {}, but input is of type {}".format(
-        tensor_dtype, inputs[0].dtype))
-  if len(inputs) == 1:
+
+  # tensor_dtype is for safety only; the op's output type is computed in C++.
+  if tensor_dtype is not None and tensor_dtype != inputs[0].dtype:
+    raise TypeError("tensor_dtype is {}, but input is of type {}"
+                    .format(tensor_dtype, inputs[0].dtype))
+
+  if len(inputs) == 1 and name is None:
     return inputs[0]
-  with ops.name_scope(name, "AccumulateN", inputs) as name:
-    var = gen_state_ops._temporary_variable(
-        shape=tensor_shape.vector(0), dtype=tensor_dtype)
-    with ops.colocate_with(var):
-      zeros = array_ops.zeros_like(gen_control_flow_ops._merge(inputs)[0])
-      zeros.set_shape(shape)
-      ref = state_ops.assign(var, zeros, validate_shape=False)
-      update_ops = [
-          state_ops.assign_add(ref, input_tensor, use_locking=True)
-          for input_tensor in inputs
-      ]
-      with ops.control_dependencies(update_ops):
-        return gen_state_ops._destroy_temporary_variable(
-            ref, var_name=var.op.name, name=name)
+  elif len(inputs) == 1 and name is not None:
+    return array_ops.identity(inputs[0], name=name)
+  elif context.in_eager_mode():
+    # TemporaryVariable is not currently supported in eager mode; fall back
+    # to AddN for now.
+    # TODO(frreiss): Remove this once the lifetime of eager variables gets
+    # addressed.
+    return add_n(inputs, name=name)
+  else:
+    return gen_math_ops.accumulate_nv2(inputs, name=name, shape=shape)
+
+
+@ops.RegisterGradient("AccumulateNV2")
+def _accumulate_n_grad(op, grad):
+  """Same as gradient for AddN. Copies the gradient to all inputs."""
+  # Not broadcasting.
+  return [grad] * len(op.inputs)
 
 
 @tf_export("nn.sigmoid", "sigmoid")
@@ -2285,7 +2314,7 @@ def sigmoid(x, name=None):
   """
   with ops.name_scope(name, "Sigmoid", [x]) as name:
     x = ops.convert_to_tensor(x, name="x")
-    return gen_math_ops._sigmoid(x, name=name)
+    return gen_math_ops.sigmoid(x, name=name)
 
 
 @tf_export("log_sigmoid")
@@ -2304,7 +2333,7 @@ def log_sigmoid(x, name=None):
   """
   with ops.name_scope(name, "LogSigmoid", [x]) as name:
     x = ops.convert_to_tensor(x, name="x")
-    return gen_math_ops._neg(gen_nn_ops.softplus(-x), name=name)
+    return gen_math_ops.neg(gen_nn_ops.softplus(-x), name=name)
 
 
 @tf_export("nn.tanh", "tanh")
@@ -2321,11 +2350,11 @@ def tanh(x, name=None):
   """
   with ops.name_scope(name, "Tanh", [x]) as name:
     if isinstance(x, sparse_tensor.SparseTensor):
-      x_tanh = gen_math_ops._tanh(x.values, name=name)
+      x_tanh = gen_math_ops.tanh(x.values, name=name)
       return sparse_tensor.SparseTensor(
           indices=x.indices, values=x_tanh, dense_shape=x.dense_shape)
     else:
-      return gen_math_ops._tanh(x, name=name)
+      return gen_math_ops.tanh(x, name=name)
 
 
 @tf_export("bincount")
@@ -2514,7 +2543,7 @@ def conj(x, name=None):
   with ops.name_scope(name, "Conj", [x]) as name:
     x = ops.convert_to_tensor(x, name="x")
     if x.dtype.is_complex or x.dtype == dtypes.variant:
-      return gen_math_ops._conj(x, name=name)
+      return gen_math_ops.conj(x, name=name)
     elif x.dtype.is_floating or x.dtype.is_integer:
       return x
     else:
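
Reviewer note on the `AccumulateNV2` gradient registered in the math_ops.py hunks above: the sum of N same-shaped inputs has an identity Jacobian with respect to each input, so the incoming gradient is simply copied to every input, exactly as for `AddN`. The following is a minimal NumPy sketch of that behavior, not the TensorFlow implementation. The helper names are hypothetical, and the values of `a` and `b` are assumptions chosen so that the sum reproduces the `[[7, 4], [6, 14]]` result quoted in the docstring.

```python
import numpy as np


def accumulate_n_forward(inputs):
    """Sums same-shaped arrays one at a time into a single buffer.

    Hypothetical stand-in for the op: only one output-sized buffer is
    needed, regardless of how many inputs there are.
    """
    acc = np.zeros_like(inputs[0])
    for x in inputs:
        acc += x
    return acc


def accumulate_n_grad(num_inputs, grad):
    """Mirrors the registered gradient: each input receives `grad` as-is."""
    return [grad] * num_inputs


# Assumed example inputs (not shown in the hunk itself).
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 0], [0, 6]])

total = accumulate_n_forward([a, b, a])
grads = accumulate_n_grad(3, np.ones_like(total))
```

No broadcasting is involved because all inputs are required to share one shape, which is why the gradient needs no reduction step.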
diff --git a/tensorflow/python/ops/metrics_impl.py b/tensorflow/python/ops/metrics_impl.py
index 44c2f304cf9245539e42da2ce54260990de980e0..043c0e30cd8476b1a91e136df60edfbedf85ab24 100644
--- a/tensorflow/python/ops/metrics_impl.py
+++ b/tensorflow/python/ops/metrics_impl.py
@@ -1247,13 +1247,8 @@ def mean_tensor(values,
     with ops.control_dependencies([values]):
       update_count_op = state_ops.assign_add(count, num_values)
 
-    def compute_mean(total, count, name):
-      non_zero_count = math_ops.maximum(
-          count, array_ops.ones_like(count), name=name)
-      return math_ops.truediv(total, non_zero_count, name=name)
-
-    mean_t = compute_mean(total, count, 'value')
-    update_op = compute_mean(update_total_op, update_count_op, 'update_op')
+    mean_t = _safe_div(total, count, 'value')
+    update_op = _safe_div(update_total_op, update_count_op, 'update_op')
 
     if metrics_collections:
       ops.add_to_collections(metrics_collections, mean_t)
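
Reviewer note on the metrics_impl.py hunk above: the removed `compute_mean` helper clamps the count to at least 1 before dividing, while its replacement `_safe_div` is defined elsewhere and is not visible in this diff. The NumPy sketch below assumes `_safe_div` yields 0 wherever the denominator is not positive. Under that assumption the two agree whenever `total` is 0 for every entry whose `count` is 0, which is the expected state of an accumulator that has seen no values. The function names here are illustrative only.

```python
import numpy as np


def compute_mean_old(total, count):
    """The removed helper: divide by max(count, 1)."""
    non_zero_count = np.maximum(count, np.ones_like(count))
    return total / non_zero_count


def safe_div_assumed(total, count):
    """Assumed semantics of `_safe_div`: 0 where count <= 0."""
    total = np.asarray(total, dtype=np.float64)
    count = np.asarray(count, dtype=np.float64)
    out = np.zeros_like(total)
    # Entries where the condition is False keep the zeros already in `out`.
    np.divide(total, count, out=out, where=count > 0)
    return out


total = np.array([6.0, 0.0, 3.0])
count = np.array([3.0, 0.0, 2.0])

old_mean = compute_mean_old(total, count)
new_mean = safe_div_assumed(total, count)
```

If `total` could be nonzero while `count` is 0, the two helpers would differ (the old one returns `total`, the assumed new one returns 0), so that case is worth confirming against the real `_safe_div`.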
diff --git a/tensorflow/python/ops/nn_batchnorm_test.py b/tensorflow/python/ops/nn_batchnorm_test.py
index eebfb17085a568f48769f6df7dddd3ae2f799efc..3ac2c8eb17ef31b46638ce50e0e9f9705adce189 100644
--- a/tensorflow/python/ops/nn_batchnorm_test.py
+++ b/tensorflow/python/ops/nn_batchnorm_test.py
@@ -57,7 +57,6 @@ class BatchNormalizationTest(test.TestCase):
     test_util.set_producer_version(ops.get_default_graph(), 8)
     return gen_nn_ops._batch_norm_with_global_normalization(
         x, m, v, beta, gamma, epsilon, scale_after_normalization)
-    # pylint: enable=protected-access
 
   def _tfBatchNormV1BW(self, x, m, v, beta, gamma, epsilon,
                        scale_after_normalization):
@@ -223,7 +222,7 @@ class BatchNormalizationTest(test.TestCase):
         for scale_after_normalization in [True, False]:
           # _batch_norm_with_global_normalization_grad is deprecated in v9
           test_util.set_producer_version(ops.get_default_graph(), 8)
-          grad = gen_nn_ops._batch_norm_with_global_normalization_grad(
+          grad = gen_nn_ops.batch_norm_with_global_normalization_grad(
               x, m, v, gamma, backprop, epsilon, scale_after_normalization)
           dx, dm, dv, db, dg = grad
           self.assertEqual(grad.dx, dx)
diff --git a/tensorflow/python/ops/nn_grad.py b/tensorflow/python/ops/nn_grad.py
index dc24b821a5580e3581f153f3cbf63ad2868b8a18..5582daf2daf5df49538caaf011f73dfea2a37f8e 100644
--- a/tensorflow/python/ops/nn_grad.py
+++ b/tensorflow/python/ops/nn_grad.py
@@ -150,7 +150,7 @@ def _Conv3DBackpropFilterGrad(op, grad):
 
 @ops.RegisterGradient("AvgPool3D")
 def _AvgPool3DGrad(op, grad):
-  return gen_nn_ops._avg_pool3d_grad(
+  return gen_nn_ops.avg_pool3d_grad(
       array_ops.shape(op.inputs[0]),
       grad,
       ksize=op.get_attr("ksize"),
@@ -172,7 +172,7 @@ def _AvgPool3DGradGrad(op, grad):
 
 @ops.RegisterGradient("MaxPool3D")
 def _MaxPool3DGrad(op, grad):
-  return gen_nn_ops._max_pool3d_grad(
+  return gen_nn_ops.max_pool3d_grad(
       op.inputs[0],
       op.outputs[0],
       grad,
@@ -188,7 +188,7 @@ def _MaxPool3DGradGrad(op, grad):
       shape=array_ops.shape(op.inputs[0]), dtype=op.inputs[0].dtype),
           array_ops.zeros(
               shape=array_ops.shape(op.inputs[1]), dtype=op.inputs[1].dtype),
-          gen_nn_ops._max_pool3d_grad_grad(
+          gen_nn_ops.max_pool3d_grad_grad(
               op.inputs[0],
               op.inputs[1],
               grad,
@@ -204,7 +204,7 @@ def _MaxPool3DGradGradGrad(op, grad):
       shape=array_ops.shape(op.inputs[0]), dtype=op.inputs[0].dtype),
           array_ops.zeros(
               shape=array_ops.shape(op.inputs[1]), dtype=op.inputs[1].dtype),
-          gen_nn_ops._max_pool3d_grad(
+          gen_nn_ops.max_pool3d_grad(
               op.inputs[0],
               op.inputs[1],
               grad,
@@ -352,13 +352,13 @@ def _BiasAddGradV1(unused_bias_op, received_grad):
 
 @ops.RegisterGradient("Relu")
 def _ReluGrad(op, grad):
-  return gen_nn_ops._relu_grad(grad, op.outputs[0])
+  return gen_nn_ops.relu_grad(grad, op.outputs[0])
 
 
 @ops.RegisterGradient("EluGrad")
 def _EluGradGrad(op, grad):
   elu_x = op.inputs[1]
-  return (gen_nn_ops._elu_grad(grad, op.outputs[0]),
+  return (gen_nn_ops.elu_grad(grad, op.outputs[0]),
           array_ops.where(elu_x < 0, grad * op.inputs[0],
                           array_ops.zeros(
                               shape=array_ops.shape(elu_x), dtype=elu_x.dtype)))
@@ -368,63 +368,63 @@ def _EluGradGrad(op, grad):
 def _SeluGradGrad(op, grad):
   x = op.inputs[1]
   scale_alpha = 1.7580993408473768599402175208123
-  return (gen_nn_ops._elu_grad(grad, op.outputs[0]),
+  return (gen_nn_ops.elu_grad(grad, op.outputs[0]),
           array_ops.where(x < 0.,
-                          gen_nn_ops._elu_grad(grad,
-                                               op.outputs[0] + scale_alpha),
+                          gen_nn_ops.elu_grad(grad,
+                                              op.outputs[0] + scale_alpha),
                           array_ops.zeros(
                               shape=array_ops.shape(x), dtype=x.dtype)))
 
 
 @ops.RegisterGradient("Relu6")
 def _Relu6Grad(op, grad):
-  return gen_nn_ops._relu6_grad(grad, op.outputs[0])  # pylint: disable=protected-access
+  return gen_nn_ops.relu6_grad(grad, op.outputs[0])
 
 
 @ops.RegisterGradient("Relu6Grad")
 def _Relu6GradGrad(op, grad):
   x = op.inputs[1]
-  return (gen_nn_ops._relu6_grad(grad, x),
+  return (gen_nn_ops.relu6_grad(grad, x),
           array_ops.zeros(shape=array_ops.shape(x), dtype=x.dtype))
 
 
 @ops.RegisterGradient("Elu")
 def _EluGrad(op, grad):
-  return gen_nn_ops._elu_grad(grad, op.outputs[0])
+  return gen_nn_ops.elu_grad(grad, op.outputs[0])
 
 
 @ops.RegisterGradient("Selu")
 def _SeluGrad(op, grad):
-  return gen_nn_ops._selu_grad(grad, op.outputs[0])
+  return gen_nn_ops.selu_grad(grad, op.outputs[0])
 
 
 @ops.RegisterGradient("Softplus")
 def _SoftplusGrad(op, grad):
-  return gen_nn_ops._softplus_grad(grad, op.inputs[0])
+  return gen_nn_ops.softplus_grad(grad, op.inputs[0])
 
 
 @ops.RegisterGradient("SoftplusGrad")
 def _SoftplusGradGrad(op, grad):
   # Let:
   #   y = tf.nn.softplus(x)
-  #   dx = gen_nn_ops._softplus_grad(dy, x) = dy / (1 + exp(-x))
+  #   dx = gen_nn_ops.softplus_grad(dy, x) = dy / (1 + exp(-x))
   # This op computes (ddy, d2x) from op.inputs == [dy, x] and grad == ddx.
   dy, x = op.inputs
   with ops.control_dependencies([grad]):
-    ddy = gen_nn_ops._softplus_grad(grad, x)  # pylint: disable=protected-access
+    ddy = gen_nn_ops.softplus_grad(grad, x)
     d2x = grad * dy / (math_ops.exp(-x) + 2.0 + math_ops.exp(x))
     return (ddy, d2x)
 
 
 @ops.RegisterGradient("Softsign")
 def _SoftsignGrad(op, grad):
-  return gen_nn_ops._softsign_grad(grad, op.inputs[0])
+  return gen_nn_ops.softsign_grad(grad, op.inputs[0])
 
 
 @ops.RegisterGradient("ReluGrad")
 def _ReluGradGrad(op, grad):
   x = op.inputs[1]
-  return (gen_nn_ops._relu_grad(grad, x),
+  return (gen_nn_ops.relu_grad(grad, x),
           array_ops.zeros(shape=array_ops.shape(x), dtype=x.dtype))
 
 
@@ -565,14 +565,14 @@ def _LRNGrad(op, grad):
   alpha = op.get_attr("alpha")
   beta = op.get_attr("beta")
   return [
-      gen_nn_ops._lrn_grad(grad, op.inputs[0], op.outputs[0], depth_radius,
-                           bias, alpha, beta)
+      gen_nn_ops.lrn_grad(grad, op.inputs[0], op.outputs[0], depth_radius, bias,
+                          alpha, beta)
   ]
 
 
 @ops.RegisterGradient("AvgPool")
 def _AvgPoolGrad(op, grad):
-  return gen_nn_ops._avg_pool_grad(
+  return gen_nn_ops.avg_pool_grad(
       array_ops.shape(op.inputs[0]),
       grad,
       op.get_attr("ksize"),
@@ -584,7 +584,7 @@ def _AvgPoolGrad(op, grad):
 @ops.RegisterGradient("AvgPoolGrad")
 def _AvgPoolGradGrad(op, grad):
   return (array_ops.stop_gradient(op.inputs[0]),
-          gen_nn_ops._avg_pool(
+          gen_nn_ops.avg_pool(
               grad,
               op.get_attr("ksize"),
               op.get_attr("strides"),
@@ -594,7 +594,7 @@ def _AvgPoolGradGrad(op, grad):
 
 @ops.RegisterGradient("MaxPool")
 def _MaxPoolGrad(op, grad):
-  return gen_nn_ops._max_pool_grad(
+  return gen_nn_ops.max_pool_grad(
       op.inputs[0],
       op.outputs[0],
       grad,
@@ -620,7 +620,7 @@ def _MaxPoolGradV2(op, grad):
 
 @ops.RegisterGradient("MaxPoolWithArgmax")
 def _MaxPoolGradWithArgmax(op, grad, unused_argmax_grad):
-  return gen_nn_ops._max_pool_grad_with_argmax(
+  return gen_nn_ops.max_pool_grad_with_argmax(
       op.inputs[0],
       grad,
       op.outputs[1],
@@ -635,7 +635,7 @@ def _MaxPoolGradGrad(op, grad):
       shape=array_ops.shape(op.inputs[0]), dtype=op.inputs[0].dtype),
           array_ops.zeros(
               shape=array_ops.shape(op.inputs[1]), dtype=op.inputs[1].dtype),
-          gen_nn_ops._max_pool_grad_grad(
+          gen_nn_ops.max_pool_grad_grad(
               op.inputs[0],
               op.inputs[1],
               grad,
@@ -669,7 +669,7 @@ def _MaxPoolGradGradGrad(op, grad):
       shape=array_ops.shape(op.inputs[0]), dtype=op.inputs[0].dtype),
           array_ops.zeros(
               shape=array_ops.shape(op.inputs[1]), dtype=op.inputs[1].dtype),
-          gen_nn_ops._max_pool_grad(
+          gen_nn_ops.max_pool_grad(
               op.inputs[0],
               op.inputs[1],
               grad,
@@ -696,8 +696,7 @@ def _FractionalMaxPoolGrad(op, grad_0, unused_grad_1, unused_grad_2):
   Returns:
     Input backprop for FractionalMaxPool op.
   """
-  # pylint: disable=protected-access
-  return gen_nn_ops._fractional_max_pool_grad(
+  return gen_nn_ops.fractional_max_pool_grad(
       op.inputs[0], op.outputs[0], grad_0, op.outputs[1], op.outputs[2],
       op.get_attr("overlapping"))
 
@@ -719,10 +718,9 @@ def _FractionalAvgPoolGrad(op, grad_0, unused_grad_1, unused_grad_2):
   Returns:
     Input backprop for FractionalAvgPool op.
   """
-  # pylint: disable=protected-access
-  return gen_nn_ops._fractional_avg_pool_grad(op.inputs[0].get_shape(), grad_0,
-                                              op.outputs[1], op.outputs[2],
-                                              op.get_attr("overlapping"))
+  return gen_nn_ops.fractional_avg_pool_grad(op.inputs[0].get_shape(), grad_0,
+                                             op.outputs[1], op.outputs[2],
+                                             op.get_attr("overlapping"))
 
 
 @ops.RegisterGradient("BatchNormWithGlobalNormalization")
@@ -746,7 +744,7 @@ def _BatchNormWithGlobalNormalizationGrad(op, grad):
         last dimension.
     dg: Backprop for gamma, which is (grad * ((x - m) * rsqrt(v + epsilon)))
   """
-  dx, dm, dv, db, dg = gen_nn_ops._batch_norm_with_global_normalization_grad(
+  dx, dm, dv, db, dg = gen_nn_ops.batch_norm_with_global_normalization_grad(
       op.inputs[0], op.inputs[1], op.inputs[2], op.inputs[4], grad,
       op.get_attr("variance_epsilon"), op.get_attr("scale_after_normalization"))
   return dx, dm, dv, db, dg
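
Reviewer note on `_SoftplusGradGrad` in the nn_grad.py hunks above: the comment states `dx = dy / (1 + exp(-x))`, and the code returns `d2x = grad * dy / (exp(-x) + 2 + exp(x))`. The denominator follows from differentiating the sigmoid, since `sigmoid(x) * (1 - sigmoid(x)) = 1 / (exp(-x) + 2 + exp(x))`. The NumPy sketch below checks both returned terms against central finite differences. It is an independent sanity check with hypothetical helper names, not TensorFlow code.

```python
import numpy as np


def softplus_grad(dy, x):
    """dx = dy / (1 + exp(-x)), as in the comment of _SoftplusGradGrad."""
    return dy / (1.0 + np.exp(-x))


def softplus_grad_grad(ddx, dy, x):
    """Returns (ddy, d2x) using the same formulas as the diff."""
    ddy = softplus_grad(ddx, x)
    d2x = ddx * dy / (np.exp(-x) + 2.0 + np.exp(x))
    return ddy, d2x


x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
dy = np.array([0.3, -1.2, 2.0, 0.7, 1.5])
ddx = np.ones_like(x)  # upstream gradient of all ones

ddy, d2x = softplus_grad_grad(ddx, dy, x)

# Central differences of dx with respect to dy and x (elementwise op,
# so each element can be perturbed independently).
eps = 1e-6
numeric_ddy = (softplus_grad(dy + eps, x) - softplus_grad(dy - eps, x)) / (2 * eps)
numeric_d2x = (softplus_grad(dy, x + eps) - softplus_grad(dy, x - eps)) / (2 * eps)
```

With an upstream gradient of ones, `ddy` and `d2x` should match the two numerical derivatives to within finite-difference error.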
diff --git a/tensorflow/python/ops/nn_impl.py b/tensorflow/python/ops/nn_impl.py
index 5fa5708114fd5cda6afbca78fa0debf68f0252cc..9d6f65dbbfd322100cd1047757e3e69df1d328d4 100644
--- a/tensorflow/python/ops/nn_impl.py
+++ b/tensorflow/python/ops/nn_impl.py
@@ -888,12 +888,10 @@ def fused_batch_norm(
   # TODO(reedwm): In a few weeks, switch to using the V2 version exclusively. We
   # currently only use the V2 version for float16 inputs, which is not supported
   # by the V1 version.
-  # pylint: disable=protected-access
   if x.dtype == dtypes.float16 or x.dtype == dtypes.bfloat16:
-    fused_batch_norm_func = gen_nn_ops._fused_batch_norm_v2
+    fused_batch_norm_func = gen_nn_ops.fused_batch_norm_v2
   else:
-    fused_batch_norm_func = gen_nn_ops._fused_batch_norm
-  # pylint: enable=protected-access
+    fused_batch_norm_func = gen_nn_ops._fused_batch_norm  # pylint: disable=protected-access
   y, batch_mean, batch_var, _, _ = fused_batch_norm_func(
       x,
       scale,
@@ -1345,4 +1343,4 @@ def sampled_softmax_loss(weights,
   sampled_losses = nn_ops.softmax_cross_entropy_with_logits(
       labels=labels, logits=logits)
   # sampled_losses is a [batch_size] tensor.
-  return sampled_losses
+  return sampled_losses
\ No newline at end of file
diff --git a/tensorflow/python/ops/nn_ops.py b/tensorflow/python/ops/nn_ops.py
index 6ab839a503adb228b4c79cce4dc34f58ae017fad..a0d500afce7c3a98e1e6ab8d5e80bd8748af6b0b 100644
--- a/tensorflow/python/ops/nn_ops.py
+++ b/tensorflow/python/ops/nn_ops.py
@@ -1481,7 +1481,6 @@ def conv3d_transpose(
         name=name)
 
 
-# pylint: disable=protected-access
 @tf_export("nn.bias_add")
 def bias_add(value, bias, data_format=None, name=None):
   """Adds `bias` to `value`.
@@ -1506,10 +1505,9 @@ def bias_add(value, bias, data_format=None, name=None):
   with ops.name_scope(name, "BiasAdd", [value, bias]) as name:
     value = ops.convert_to_tensor(value, name="input")
     bias = ops.convert_to_tensor(bias, dtype=value.dtype, name="bias")
-    return gen_nn_ops._bias_add(value, bias, data_format=data_format, name=name)
+    return gen_nn_ops.bias_add(value, bias, data_format=data_format, name=name)
 
 
-# pylint: disable=protected-access
 def bias_add_v1(value, bias, name=None):
   """Adds `bias` to `value`.
 
@@ -1534,7 +1532,7 @@ def bias_add_v1(value, bias, name=None):
   with ops.name_scope(name, "BiasAddV1", [value, bias]) as name:
     value = ops.convert_to_tensor(value, name="input")
     bias = ops.convert_to_tensor(bias, dtype=value.dtype, name="bias")
-    return gen_nn_ops._bias_add_v1(value, bias, name=name)
+    return gen_nn_ops.bias_add_v1(value, bias, name=name)
 
 
 @tf_export("nn.crelu")
@@ -1580,7 +1578,7 @@ def relu6(features, name=None):
   """
   with ops.name_scope(name, "Relu6", [features]) as name:
     features = ops.convert_to_tensor(features, name="features")
-    return gen_nn_ops._relu6(features, name=name)
+    return gen_nn_ops.relu6(features, name=name)
 
 
 @tf_export("nn.leaky_relu")
@@ -1645,7 +1643,7 @@ def _softmax(logits, compute_op, dim=-1, name=None):
   Args:
     logits: A non-empty `Tensor`. Must be one of the following types: `half`,
       `float32`, `float64`.
-    compute_op: Either gen_nn_ops._softmax or gen_nn_ops._log_softmax
+    compute_op: Either gen_nn_ops.softmax or gen_nn_ops.log_softmax
     dim: The dimension softmax would be performed on. The default is -1 which
       indicates the last dimension.
     name: A name for the operation (optional).
@@ -1739,7 +1737,7 @@ def softmax(logits, axis=None, name=None, dim=None):
   axis = deprecation.deprecated_argument_lookup("axis", axis, "dim", dim)
   if axis is None:
     axis = -1
-  return _softmax(logits, gen_nn_ops._softmax, axis, name)
+  return _softmax(logits, gen_nn_ops.softmax, axis, name)
 
 
 @tf_export("nn.log_softmax")
@@ -1769,7 +1767,7 @@ def log_softmax(logits, axis=None, name=None, dim=None):
   axis = deprecation.deprecated_argument_lookup("axis", axis, "dim", dim)
   if axis is None:
     axis = -1
-  return _softmax(logits, gen_nn_ops._log_softmax, axis, name)
+  return _softmax(logits, gen_nn_ops.log_softmax, axis, name)
 
 
 def _ensure_xent_args(name, sentinel, labels, logits):
@@ -1871,7 +1869,7 @@ def softmax_cross_entropy_with_logits_v2(
     # Do the actual op computation.
     # The second output tensor contains the gradients.  We use it in
     # _CrossEntropyGrad() in nn_grad but not here.
-    cost, unused_backprop = gen_nn_ops._softmax_cross_entropy_with_logits(
+    cost, unused_backprop = gen_nn_ops.softmax_cross_entropy_with_logits(
         precise_logits, labels, name=name)
 
     # The output cost shape should be the input minus dim.
@@ -2038,7 +2036,7 @@ def sparse_softmax_cross_entropy_with_logits(
                        (labels_static_shape.ndims, logits.get_shape().ndims))
     # Check if no reshapes are required.
     if logits.get_shape().ndims == 2:
-      cost, _ = gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
+      cost, _ = gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
           precise_logits, labels, name=name)
       if logits.dtype == dtypes.float16:
         return math_ops.cast(cost, dtypes.float16)
@@ -2051,7 +2049,7 @@ def sparse_softmax_cross_entropy_with_logits(
     labels = array_ops.reshape(labels, [-1])
     # The second output tensor contains the gradients.  We use it in
     # _CrossEntropyGrad() in nn_grad but not here.
-    cost, _ = gen_nn_ops._sparse_softmax_cross_entropy_with_logits(
+    cost, _ = gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
         precise_logits, labels, name=name)
     cost = array_ops.reshape(cost, labels_shape)
     cost.set_shape(labels_static_shape)
@@ -2086,7 +2084,7 @@ def avg_pool(value, ksize, strides, padding, data_format="NHWC", name=None):
   """
   with ops.name_scope(name, "AvgPool", [value]) as name:
     value = ops.convert_to_tensor(value, name="input")
-    return gen_nn_ops._avg_pool(
+    return gen_nn_ops.avg_pool(
         value,
         ksize=ksize,
         strides=strides,
@@ -2116,12 +2114,13 @@ def max_pool(value, ksize, strides, padding, data_format="NHWC", name=None):
   """
   with ops.name_scope(name, "MaxPool", [value]) as name:
     value = ops.convert_to_tensor(value, name="input")
-    return gen_nn_ops._max_pool(value,
-                                ksize=ksize,
-                                strides=strides,
-                                padding=padding,
-                                data_format=data_format,
-                                name=name)
+    return gen_nn_ops.max_pool(
+        value,
+        ksize=ksize,
+        strides=strides,
+        padding=padding,
+        data_format=data_format,
+        name=name)
 
 
 @ops.RegisterStatistics("Conv2D", "flops")
@@ -2214,6 +2213,7 @@ def xw_plus_b_v1(x, weights, biases, name=None):  # pylint: disable=invalid-name
     mm = math_ops.matmul(x, weights)
     return bias_add_v1(mm, biases, name=name)
 
+
 def _get_noise_shape(x, noise_shape):
   # If noise_shape is none return immediately.
   if noise_shape is None:
@@ -2227,8 +2227,7 @@ def _get_noise_shape(x, noise_shape):
   except (TypeError, ValueError):
     return noise_shape
 
-  if (x.shape.dims is not None and
-      len(x.shape.dims) == len(noise_shape_.dims)):
+  if x.shape.dims is not None and len(x.shape.dims) == len(noise_shape_.dims):
     new_dims = []
     for i, dim in enumerate(x.shape.dims):
       if noise_shape_.dims[i].value is None and dim.value is not None:
@@ -2239,6 +2238,7 @@ def _get_noise_shape(x, noise_shape):
 
   return noise_shape
 
+
 @tf_export("nn.dropout")
 def dropout(x, keep_prob, noise_shape=None, seed=None, name=None):  # pylint: disable=invalid-name
   """Computes dropout.
@@ -2330,7 +2330,7 @@ def top_k(input, k=1, sorted=True, name=None):  # pylint: disable=redefined-buil
     values: The `k` largest elements along each last dimensional slice.
     indices: The indices of `values` within the last dimension of `input`.
   """
-  return gen_nn_ops._top_kv2(input, k=k, sorted=sorted, name=name)
+  return gen_nn_ops.top_kv2(input, k=k, sorted=sorted, name=name)
 
 
 def nth_element(input, n, reverse=False, name=None):  # pylint: disable=redefined-builtin
@@ -2649,4 +2649,4 @@ def in_top_k(predictions, targets, k, name=None):
     A `Tensor` of type `bool`. Computed Precision at `k` as a `bool Tensor`.
   """
   with ops.name_scope(name, "in_top_k"):
-    return gen_nn_ops._in_top_kv2(predictions, targets, k, name=name)
+    return gen_nn_ops.in_top_kv2(predictions, targets, k, name=name)
diff --git a/tensorflow/python/ops/parsing_ops.py b/tensorflow/python/ops/parsing_ops.py
index b0315ceee268be8ac1813dae5a262a7d9496e154..075b38d743d13329e646c0b268e938b5c5704e47 100644
--- a/tensorflow/python/ops/parsing_ops.py
+++ b/tensorflow/python/ops/parsing_ops.py
@@ -700,8 +700,7 @@ def _parse_example_raw(serialized,
     # Finally, convert dense_shapes to TensorShapeProto
     dense_shapes = [shape.as_proto() for shape in dense_shapes]
 
-    # pylint: disable=protected-access
-    outputs = gen_parsing_ops._parse_example(
+    outputs = gen_parsing_ops.parse_example(
         serialized=serialized,
         names=names,
         dense_defaults=dense_defaults_vec,
@@ -710,7 +709,6 @@ def _parse_example_raw(serialized,
         dense_keys=dense_keys,
         dense_shapes=dense_shapes,
         name=name)
-    # pylint: enable=protected-access
 
     (sparse_indices, sparse_values, sparse_shapes, dense_values) = outputs
 
@@ -1132,8 +1130,7 @@ def _parse_single_sequence_example_raw(serialized,
     feature_list_dense_shapes = [tensor_shape.as_shape(shape).as_proto()
                                  for shape in feature_list_dense_shapes]
 
-    # pylint: disable=protected-access
-    outputs = gen_parsing_ops._parse_single_sequence_example(
+    outputs = gen_parsing_ops.parse_single_sequence_example(
         serialized=serialized,
         debug_name=debug_name,
         context_dense_defaults=context_dense_defaults_vec,
@@ -1149,7 +1146,6 @@ def _parse_single_sequence_example_raw(serialized,
         feature_list_dense_missing_assumed_empty=(
             feature_list_dense_missing_assumed_empty),
         name=name)
-    # pylint: enable=protected-access
 
     (context_sparse_indices, context_sparse_values,
      context_sparse_shapes, context_dense_values,
@@ -1182,7 +1178,6 @@ def _parse_single_sequence_example_raw(serialized,
 @tf_export("decode_csv")
 def decode_csv(records, record_defaults, field_delim=",",
                use_quote_delim=True, name=None, na_value=""):
-  # pylint: disable=protected-access
   """Convert CSV records to tensors. Each column maps to one tensor.
 
   RFC 4180 format is expected for the CSV records.
@@ -1211,11 +1206,13 @@ def decode_csv(records, record_defaults, field_delim=",",
     Each tensor will have the same shape as records.
   """
   # TODO(martinwicke), remove the wrapper when new Python API generator is done.
-  return gen_parsing_ops._decode_csv(
-      records=records, record_defaults=record_defaults,
-      field_delim=field_delim, use_quote_delim=use_quote_delim,
-      na_value=na_value, name=name)
-  # pylint: enable=protected-access
+  return gen_parsing_ops.decode_csv(
+      records=records,
+      record_defaults=record_defaults,
+      field_delim=field_delim,
+      use_quote_delim=use_quote_delim,
+      na_value=na_value,
+      name=name)
 
 
 # TODO(b/70890287): Combine the implementation of this op and
@@ -1391,7 +1388,6 @@ def _parse_single_example_v2_raw(serialized, sparse_keys, sparse_types,
     # Finally, convert dense_shapes to TensorShapeProto
     dense_shapes = [shape.as_proto() for shape in dense_shapes]
 
-    # pylint: disable=protected-access
     outputs = gen_parsing_ops.parse_single_example(
         serialized=serialized,
         dense_defaults=dense_defaults_vec,
@@ -1401,7 +1397,6 @@ def _parse_single_example_v2_raw(serialized, sparse_keys, sparse_types,
         dense_keys=dense_keys,
         dense_shapes=dense_shapes,
         name=name)
-    # pylint: enable=protected-access
 
     (sparse_indices, sparse_values, sparse_shapes, dense_values) = outputs
 
diff --git a/tensorflow/python/ops/random_ops.py b/tensorflow/python/ops/random_ops.py
index 2c86358d21b1c280b8d7ade625fd4b7a44c5de26..db8159579a21d9b98b06b6172f8d0df7e8ff95ca 100644
--- a/tensorflow/python/ops/random_ops.py
+++ b/tensorflow/python/ops/random_ops.py
@@ -43,7 +43,6 @@ def _ShapeTensor(shape):
   return ops.convert_to_tensor(shape, dtype=dtype, name="shape")
 
 
-# pylint: disable=protected-access
 @tf_export("random_normal")
 def random_normal(shape,
                   mean=0.0,
@@ -74,7 +73,7 @@ def random_normal(shape,
     mean_tensor = ops.convert_to_tensor(mean, dtype=dtype, name="mean")
     stddev_tensor = ops.convert_to_tensor(stddev, dtype=dtype, name="stddev")
     seed1, seed2 = random_seed.get_seed(seed)
-    rnd = gen_random_ops._random_standard_normal(
+    rnd = gen_random_ops.random_standard_normal(
         shape_tensor, dtype, seed=seed1, seed2=seed2)
     mul = rnd * stddev_tensor
     value = math_ops.add(mul, mean_tensor, name=name)
@@ -126,7 +125,7 @@ def parameterized_truncated_normal(shape,
     minvals_tensor = ops.convert_to_tensor(minvals, dtype=dtype, name="minvals")
     maxvals_tensor = ops.convert_to_tensor(maxvals, dtype=dtype, name="maxvals")
     seed1, seed2 = random_seed.get_seed(seed)
-    rnd = gen_random_ops._parameterized_truncated_normal(
+    rnd = gen_random_ops.parameterized_truncated_normal(
         shape_tensor,
         means_tensor,
         stddevs_tensor,
@@ -171,7 +170,7 @@ def truncated_normal(shape,
     mean_tensor = ops.convert_to_tensor(mean, dtype=dtype, name="mean")
     stddev_tensor = ops.convert_to_tensor(stddev, dtype=dtype, name="stddev")
     seed1, seed2 = random_seed.get_seed(seed)
-    rnd = gen_random_ops._truncated_normal(
+    rnd = gen_random_ops.truncated_normal(
         shape_tensor, dtype, seed=seed1, seed2=seed2)
     mul = rnd * stddev_tensor
     value = math_ops.add(mul, mean_tensor, name=name)
@@ -237,11 +236,10 @@ def random_uniform(shape,
     maxval = ops.convert_to_tensor(maxval, dtype=dtype, name="max")
     seed1, seed2 = random_seed.get_seed(seed)
     if dtype.is_integer:
-      return gen_random_ops._random_uniform_int(
+      return gen_random_ops.random_uniform_int(
           shape, minval, maxval, seed=seed1, seed2=seed2, name=name)
     else:
-      rnd = gen_random_ops._random_uniform(
-          shape, dtype, seed=seed1, seed2=seed2)
+      rnd = gen_random_ops.random_uniform(shape, dtype, seed=seed1, seed2=seed2)
       return math_ops.add(rnd * (maxval - minval), minval, name=name)
 
 
@@ -275,7 +273,7 @@ def random_shuffle(value, seed=None, name=None):
     dimension.
   """
   seed1, seed2 = random_seed.get_seed(seed)
-  return gen_random_ops._random_shuffle(
+  return gen_random_ops.random_shuffle(
       value, seed=seed1, seed2=seed2, name=name)
 
 
@@ -420,7 +418,7 @@ def random_gamma(shape,
     seed1, seed2 = random_seed.get_seed(seed)
     return math_ops.maximum(
         np.finfo(dtype.as_numpy_dtype).tiny,
-        gen_random_ops._random_gamma(
+        gen_random_ops.random_gamma(
             shape, alpha_broadcast, seed=seed1, seed2=seed2) / beta)
 
 ops.NotDifferentiable("RandomGamma")
diff --git a/tensorflow/python/ops/resource_variable_ops.py b/tensorflow/python/ops/resource_variable_ops.py
index 09d349fc2db61a09649a801a5d4784522b969d38..6c5d692e820783278e1580137d60db1a680b35ee 100644
--- a/tensorflow/python/ops/resource_variable_ops.py
+++ b/tensorflow/python/ops/resource_variable_ops.py
@@ -21,6 +21,7 @@ from __future__ import print_function
 
 from tensorflow.core.framework import attr_value_pb2
 from tensorflow.core.framework import variable_pb2
+from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.eager import context
 from tensorflow.python.eager import tape
 from tensorflow.python.framework import dtypes
@@ -117,8 +118,7 @@ class EagerResourceDeleter(object):
 
   def __del__(self):
     # Resources follow object-identity when executing eagerly, so it is safe to
-    # delete the resource we have a handle to. Each Graph has a unique container
-    # name, which prevents resource sharing.
+    # delete the resource we have a handle to.
     try:
       # This resource was created in eager mode. However, this destructor may be
       # running in graph mode (especially during unit tests). To clean up
@@ -385,9 +385,6 @@ class ResourceVariable(variables.Variable):
                   shared_name=handle_name,
                   name=name,
                   graph_mode=self._in_graph_mode)
-              self._handle_device = (
-                  self._handle.device if self._in_graph_mode else
-                  context.get_default_context().device_name)
               self._shape = initial_value.get_shape()
           else:
             initial_value = initial_value()
@@ -400,9 +397,6 @@ class ResourceVariable(variables.Variable):
                 shared_name=handle_name,
                 name=name,
                 graph_mode=False)
-            self._handle_device = (
-                self._handle.device if self._in_graph_mode else
-                context.get_default_context().device_name)
             self._shape = initial_value.get_shape()
         # pylint: enable=protected-access
 
@@ -426,8 +420,6 @@ class ResourceVariable(variables.Variable):
               shared_name=handle_name,
               name=name,
               graph_mode=self._in_graph_mode)
-          self._handle_device = (self._handle.device if self._in_graph_mode else
-                                 context.get_default_context().device_name)
           self._shape = initial_value.get_shape()
 
         self._initial_value = initial_value if self._in_graph_mode else None
@@ -450,7 +442,7 @@ class ResourceVariable(variables.Variable):
           with ops.name_scope("Read"), ops.colocate_with(self._handle):
             # Manually assign reads to the handle's device to avoid log
             # messages.
-            with ops.device(self._handle_device):
+            with ops.device(self._handle.device):
               value = self._read_variable_op()
             self._graph_element = value
             if caching_device is not None:
@@ -490,7 +482,7 @@ class ResourceVariable(variables.Variable):
       # cycles being uncollectable, and means that no __del__ will be defined at
       # all in graph mode.
       self._handle_deleter = EagerResourceDeleter(
-          handle=self._handle, handle_device=self._handle_device)
+          handle=self._handle, handle_device=self._handle.device)
 
   def _init_from_proto(self, variable_def, import_scope=None):
     """Initializes from `VariableDef` proto."""
@@ -508,7 +500,6 @@ class ResourceVariable(variables.Variable):
             variable_def.variable_name, import_scope=import_scope))
     self._shape = tensor_shape.TensorShape(
         self._handle.op.get_attr("shape"))
-    self._handle_device = self._handle.device
     self._handle_name = self._handle.name
     self._initializer_op = g.as_graph_element(
         ops.prepend_name_scope(
@@ -535,7 +526,8 @@ class ResourceVariable(variables.Variable):
       self._save_slice_info = None
     self._caching_device = None
     self._dtype = dtypes.as_dtype(self._handle.op.get_attr("dtype"))
-    self._graph_element = self.value()
+    self._graph_element = g.get_tensor_by_name(
+        self._handle.op.name + "/Read/ReadVariableOp:0")
     self._constraint = None
 
   def __nonzero__(self):
@@ -552,7 +544,7 @@ class ResourceVariable(variables.Variable):
   @property
   def device(self):
     """The device this variable is on."""
-    return self._handle_device
+    return self._handle.device
 
   @property
   def graph(self):
@@ -586,7 +578,7 @@ class ResourceVariable(variables.Variable):
     if self._cached_value is not None:
       return self._cached_value
     with ops.colocate_with(None, ignore_existing=True):
-      with ops.device(self._handle_device):
+      with ops.device(self._handle.device):
         return self._read_variable_op()
 
   def _as_graph_element(self):
@@ -683,7 +675,7 @@ class ResourceVariable(variables.Variable):
     """
     with ops.name_scope("Read"):
       # Ensure we read the variable in the same device as the handle.
-      with ops.device(self._handle_device):
+      with ops.device(self._handle.device):
         value = self._read_variable_op()
     # Return an identity so it can get placed on whatever device the context
     # specifies instead of the device where the variable is.
@@ -789,37 +781,83 @@ class ResourceVariable(variables.Variable):
 
   __array_priority__ = 100
 
-  def assign_sub(self, delta, use_locking=None, name=None):
+  def assign_sub(self, delta, use_locking=None, name=None, read_value=True):
+    """Subtracts a value from this variable.
+
+    Args:
+      delta: A `Tensor`. The value to subtract from this variable.
+      use_locking: If `True`, use locking during the operation.
+      name: The name to use for the operation.
+      read_value: A `bool`. Whether to read and return the new value of the
+          variable or not.
+
+    Returns:
+      If `read_value` is `True`, this method will return the new value of the
+      variable after the assignment has completed. Otherwise, when in graph mode
+      it will return the `Operation` that does the assignment, and when in eager
+      mode it will return `None`.
+    """
     # TODO(apassos): this here and below is not atomic. Consider making it
     # atomic if there's a way to do so without a performance cost for those who
     # don't need it.
-    return self._lazy_read(gen_resource_variable_ops.assign_sub_variable_op(
-        self.handle,
-        ops.convert_to_tensor(delta, dtype=self.dtype),
-        name=name))
+    assign_sub_op = gen_resource_variable_ops.assign_sub_variable_op(
+        self.handle, ops.convert_to_tensor(delta, dtype=self.dtype), name=name)
+    if read_value:
+      return self._lazy_read(assign_sub_op)
+    return assign_sub_op
 
-  def assign_add(self, delta, use_locking=None, name=None):
-    return self._lazy_read(gen_resource_variable_ops.assign_add_variable_op(
-        self.handle,
-        ops.convert_to_tensor(delta, dtype=self.dtype),
-        name=name))
+  def assign_add(self, delta, use_locking=None, name=None, read_value=True):
+    """Adds a value to this variable.
+
+    Args:
+      delta: A `Tensor`. The value to add to this variable.
+      use_locking: If `True`, use locking during the operation.
+      name: The name to use for the operation.
+      read_value: A `bool`. Whether to read and return the new value of the
+          variable or not.
+
+    Returns:
+      If `read_value` is `True`, this method will return the new value of the
+      variable after the assignment has completed. Otherwise, when in graph mode
+      it will return the `Operation` that does the assignment, and when in eager
+      mode it will return `None`.
+    """
+    assign_add_op = gen_resource_variable_ops.assign_add_variable_op(
+        self.handle, ops.convert_to_tensor(delta, dtype=self.dtype), name=name)
+    if read_value:
+      return self._lazy_read(assign_add_op)
+    return assign_add_op
 
   def _lazy_read(self, op):
     if hasattr(self, "_trainable") and self._trainable:
       tape.watch_variable(self)
     return _UnreadVariable(
-        self._handle, self.dtype, self._handle_device, self._shape,
-        self._in_graph_mode,
+        self._handle, self.dtype, self._shape, self._in_graph_mode,
         self._handle_deleter if not self._in_graph_mode else None, op)
 
-  def assign(self, value, use_locking=None, name=None):
+  def assign(self, value, use_locking=None, name=None, read_value=True):
+    """Assigns a new value to this variable.
+
+    Args:
+      value: A `Tensor`. The new value for this variable.
+      use_locking: If `True`, use locking during the assignment.
+      name: The name to use for the assignment.
+      read_value: A `bool`. Whether to read and return the new value of the
+          variable or not.
+
+    Returns:
+      If `read_value` is `True`, this method will return the new value of the
+      variable after the assignment has completed. Otherwise, when in graph mode
+      it will return the `Operation` that does the assignment, and when in eager
+      mode it will return `None`.
+    """
     value_tensor = ops.convert_to_tensor(value, dtype=self.dtype)
     self._shape.assert_is_compatible_with(value_tensor.shape)
-    return self._lazy_read(
-        gen_resource_variable_ops.assign_variable_op(
-            self.handle,
-            value_tensor,
-            name=name))
+    assign_op = gen_resource_variable_ops.assign_variable_op(
+        self.handle, value_tensor, name=name)
+    if read_value:
+      return self._lazy_read(assign_op)
+    return assign_op
 
   def _strided_slice_assign(self, begin, end, strides, value, name, begin_mask,
                             end_mask, ellipsis_mask, new_axis_mask,
@@ -895,6 +933,9 @@ class ResourceVariable(variables.Variable):
                        "Tensor object.")
 
 
+pywrap_tensorflow.TFE_Py_RegisterResourceVariableType(ResourceVariable)
+
+
 def _dense_var_to_tensor(var, dtype=None, name=None, as_ref=False):
   return var._dense_var_to_tensor(dtype=dtype, name=name, as_ref=as_ref)  # pylint: disable=protected-access
 
@@ -905,7 +946,7 @@ class _UnreadVariable(ResourceVariable):
   Pretends to be the tensor if anyone looks.
   """
 
-  def __init__(self, handle, dtype, handle_device,  # pylint: disable=super-init-not-called
+  def __init__(self, handle, dtype,  # pylint: disable=super-init-not-called
                shape, in_graph_mode, deleter, parent_op):
     # We do not call super init on purpose.
     self._trainable = False
@@ -913,7 +954,6 @@ class _UnreadVariable(ResourceVariable):
     self._graph_key = ops.get_default_graph()._graph_key  # pylint: disable=protected-access
     self._in_graph_mode = in_graph_mode
     self._handle = handle
-    self._handle_device = handle_device
     self._shape = shape
     self._initial_value = None
     if isinstance(self._handle, ops.EagerTensor):
diff --git a/tensorflow/python/ops/rnn_cell_impl.py b/tensorflow/python/ops/rnn_cell_impl.py
index 923348ea44e18a87e09fe1c0424f0323eb967e3d..3ae1d1184d57bccee90f6a55a90b247f10bf7df3 100644
--- a/tensorflow/python/ops/rnn_cell_impl.py
+++ b/tensorflow/python/ops/rnn_cell_impl.py
@@ -46,6 +46,7 @@ from tensorflow.python.ops import tensor_array_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.ops import variables as tf_variables
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpointable
 from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
 
@@ -1187,6 +1188,12 @@ class MultiRNNCell(RNNCell):
           "cells must be a list or tuple, but saw: %s." % cells)
 
     self._cells = cells
+    for cell_number, cell in enumerate(self._cells):
+      # Add Checkpointable dependencies on these cells so their variables get
+      # saved with this object when using object-based saving.
+      if isinstance(cell, checkpointable.CheckpointableBase):
+        # TODO(allenl): Track down non-Checkpointable callers.
+        self._track_checkpointable(cell, name="cell-%d" % (cell_number,))
     self._state_is_tuple = state_is_tuple
     if not state_is_tuple:
       if any(nest.is_sequence(c.state_size) for c in self._cells):
diff --git a/tensorflow/python/ops/script_ops.py b/tensorflow/python/ops/script_ops.py
index dcf1bffaf2cdea2395c6db54727a1e2ef71161c0..01f0b816849fafc4d9621a22588c1d9118a3906f 100644
--- a/tensorflow/python/ops/script_ops.py
+++ b/tensorflow/python/ops/script_ops.py
@@ -33,6 +33,7 @@ from tensorflow.python.eager import context
 from tensorflow.python.framework import function
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import gen_script_ops
+from tensorflow.python.util import nest
 from tensorflow.python.util.tf_export import tf_export
 
 
@@ -218,18 +219,16 @@ def _internal_py_func(func, inp, Tout, stateful=None, eager=False, name=None):
   graph._cleanup_py_funcs_used_in_graph.append(cleanup)
   # pylint: enable=protected-access
 
-  # pylint: disable=protected-access
   if eager:
-    result = gen_script_ops._eager_py_func(
+    result = gen_script_ops.eager_py_func(
         input=inp, token=token, Tout=Tout, name=name)
   else:
     if stateful:
-      result = gen_script_ops._py_func(
+      result = gen_script_ops.py_func(
           input=inp, token=token, Tout=Tout, name=name)
     else:
-      result = gen_script_ops._py_func_stateless(
+      result = gen_script_ops.py_func_stateless(
           input=inp, token=token, Tout=Tout, name=name)
-  # pylint: enable=protected-access
   return result if is_list_or_tuple else result[0]
 
 
@@ -318,6 +317,12 @@ def py_func(func, inp, Tout, stateful=True, name=None):
   Returns:
     A list of `Tensor` or a single `Tensor` which `func` computes.
   """
+  if context.in_eager_mode():
+    result = func(*[x.numpy() for x in inp])
+    result = nest.flatten(result)
+
+    return [x if x is None else ops.convert_to_tensor(x) for x in result]
+
   return _internal_py_func(
       func=func, inp=inp, Tout=Tout, stateful=stateful, eager=False, name=name)
 
diff --git a/tensorflow/python/ops/session_ops.py b/tensorflow/python/ops/session_ops.py
index cedd36c1deed541adcf601ff9447345e2279e8f9..ad38845153c94e9bb31e6e3ee05ebed0a4313efc 100644
--- a/tensorflow/python/ops/session_ops.py
+++ b/tensorflow/python/ops/session_ops.py
@@ -16,7 +16,6 @@
 """Tensor Handle Operations. See the @{$python/session_ops} guide.
 
 @@get_session_handle
-@@get_session_handle_v2
 @@get_session_tensor
 @@delete_session_tensor
 """
@@ -182,7 +181,7 @@ def get_session_handle(data, name=None):
 
   # Colocate this operation with data.
   with ops.colocate_with(data):
-    return gen_data_flow_ops._get_session_handle(data, name=name)  # pylint: disable=protected-access
+    return gen_data_flow_ops.get_session_handle(data, name=name)
 
 
 @tf_export("get_session_tensor")
@@ -222,7 +221,7 @@ def get_session_tensor(handle, dtype, name=None):
   with ops.device(handle_device):
     holder = array_ops.placeholder(dtypes.string)
     _register_handle_feeder(holder.graph, holder, dtype)
-    tensor = gen_data_flow_ops._get_session_tensor(holder, dtype, name=name)
+    tensor = gen_data_flow_ops.get_session_tensor(holder, dtype, name=name)
   return (holder, tensor)
 
 
@@ -246,7 +245,7 @@ def delete_session_tensor(handle, name=None):
   handle_device = TensorHandle._get_device_name(handle)
   with ops.device(handle_device):
     holder = array_ops.placeholder(dtypes.string)
-    deleter = gen_data_flow_ops._delete_session_tensor(holder, name=name)
+    deleter = gen_data_flow_ops.delete_session_tensor(holder, name=name)
   return (holder, deleter)
 
 
@@ -268,7 +267,7 @@ def _get_handle_reader(graph, handle, dtype):
     with graph.as_default(), graph.device(handle_device):
       holder = array_ops.placeholder(dtypes.string)
       _register_handle_feeder(holder.graph, holder, dtype)
-      reader = gen_data_flow_ops._get_session_tensor(holder, dtype)
+      reader = gen_data_flow_ops.get_session_tensor(holder, dtype)
     result = (holder, reader)
     graph._handle_readers[graph_key] = result
   return result
@@ -289,7 +288,7 @@ def _get_handle_mover(graph, feeder, handle):
     # Create mover if we haven't done it.
     holder, reader = _get_handle_reader(graph, handle, dtype)
     with graph.as_default(), graph.device(feeder.op.device):
-      mover = gen_data_flow_ops._get_session_handle(reader)  # pylint: disable=protected-access
+      mover = gen_data_flow_ops.get_session_handle(reader)
     result = (holder, mover)
     graph._handle_movers[graph_key] = result
   return result
@@ -303,7 +302,7 @@ def _get_handle_deleter(graph, deleter_key, handle):
     handle_device = TensorHandle._get_device_name(handle)
     with graph.as_default(), graph.device(handle_device):
       holder = array_ops.placeholder(dtypes.string)
-      deleter = gen_data_flow_ops._delete_session_tensor(holder)
+      deleter = gen_data_flow_ops.delete_session_tensor(holder)
     result = (holder, deleter)
     graph._handle_deleters[deleter_key] = result
   return result
diff --git a/tensorflow/python/ops/sparse_grad.py b/tensorflow/python/ops/sparse_grad.py
index 5295e7d21c2b5810422ec36f5aced63c9039feca..97353d6c747cb7e4d3c1fa92ad61af24fb17de91 100644
--- a/tensorflow/python/ops/sparse_grad.py
+++ b/tensorflow/python/ops/sparse_grad.py
@@ -88,10 +88,8 @@ def _SparseAddGrad(op, *grads):
   # the non-zero elements of the sum, and we will peek into `sum_indices` in the
   # gradient op.
 
-  # pylint: disable=protected-access
-  a_val_grad, b_val_grad = gen_sparse_ops._sparse_add_grad(val_grad, a_indices,
-                                                           b_indices,
-                                                           sum_indices)
+  a_val_grad, b_val_grad = gen_sparse_ops.sparse_add_grad(
+      val_grad, a_indices, b_indices, sum_indices)
   a_val_grad.set_shape(op.inputs[1].get_shape())
   b_val_grad.set_shape(op.inputs[4].get_shape())
   # (a_indices, a_values, a_shape, b_indices, b_values, b_shape, thresh)
@@ -151,7 +149,7 @@ def _SparseTensorDenseMatMulGrad(op, grad):
                               "complex gradients.")
 
   # gradient w.r.t. dense
-  b_grad = gen_sparse_ops._sparse_tensor_dense_mat_mul(  # pylint: disable=protected-access
+  b_grad = gen_sparse_ops.sparse_tensor_dense_mat_mul(
       a_indices, a_values, a_shape, grad, adjoint_a=not adj_a)
   if adj_b:
     b_grad = array_ops.transpose(b_grad)
@@ -278,8 +276,7 @@ def _SparseFillEmptyRowsGrad(op, unused_grad_output_indices, output_grad_values,
   """Gradients for SparseFillEmptyRows."""
   reverse_index_map = op.outputs[3]
 
-  # pylint: disable=protected-access
-  d_values, d_default_value = gen_sparse_ops._sparse_fill_empty_rows_grad(
+  d_values, d_default_value = gen_sparse_ops.sparse_fill_empty_rows_grad(
       reverse_index_map=reverse_index_map, grad_values=output_grad_values)
 
   # d_indices, d_values, d_dense_shape, d_default_value.
diff --git a/tensorflow/python/ops/sparse_ops.py b/tensorflow/python/ops/sparse_ops.py
index 0fbbf5a805f1439d85ad53f02bdb665c04248606..c580052c32c8b61467b857af3d237be41718c1a1 100644
--- a/tensorflow/python/ops/sparse_ops.py
+++ b/tensorflow/python/ops/sparse_ops.py
@@ -234,7 +234,7 @@ def sparse_concat(axis,
     ]
 
   output_ind, output_val, output_shape = (
-      gen_sparse_ops._sparse_concat(inds, vals, shapes, axis, name=name))
+      gen_sparse_ops.sparse_concat(inds, vals, shapes, axis, name=name))
 
   return sparse_tensor.SparseTensor(output_ind, output_val, output_shape)
 
@@ -302,8 +302,8 @@ def sparse_add(a, b, thresh=0):
     thresh = ops.convert_to_tensor(
         thresh, dtype=a.values.dtype.real_dtype.base_dtype, name="thresh")
     output_ind, output_val, output_shape = (
-        gen_sparse_ops._sparse_add(a.indices, a.values, a.dense_shape,
-                                   b.indices, b.values, b.dense_shape, thresh))
+        gen_sparse_ops.sparse_add(a.indices, a.values, a.dense_shape,
+                                  b.indices, b.values, b.dense_shape, thresh))
 
     # Attempt to get output_shape statically.
     a.get_shape().assert_is_compatible_with(b.get_shape())
@@ -317,8 +317,8 @@ def sparse_add(a, b, thresh=0):
     # swap to make `a` the SparseTensor.
     if isinstance(b, sparse_classes):
       a, b = b, a
-    return gen_sparse_ops._sparse_tensor_dense_add(a.indices, a.values,
-                                                   a.dense_shape, b)
+    return gen_sparse_ops.sparse_tensor_dense_add(a.indices, a.values,
+                                                  a.dense_shape, b)
 
 
 def _sparse_cross(inputs, name=None):
@@ -402,7 +402,7 @@ def _sparse_cross_internal(inputs,
                            num_buckets=0,
                            hash_key=None,
                            name=None):
-  """See gen_sparse_ops._sparse_cross."""
+  """See gen_sparse_ops.sparse_cross."""
   if not isinstance(inputs, list):
     raise TypeError("Inputs must be a list")
   if not all(
@@ -432,7 +432,7 @@ def _sparse_cross_internal(inputs,
       dense_inputs[i] = math_ops.to_int64(dense_inputs[i])
       internal_type = dtypes.int64
 
-  indices_out, values_out, shape_out = gen_sparse_ops._sparse_cross(
+  indices_out, values_out, shape_out = gen_sparse_ops.sparse_cross(
       indices=indices,
       values=values,
       shapes=shapes,
@@ -511,7 +511,7 @@ def sparse_reorder(sp_input, name=None):
   sp_input = _convert_to_sparse_tensor(sp_input)
 
   reordered_ind, reordered_val = (
-      gen_sparse_ops._sparse_reorder(
+      gen_sparse_ops.sparse_reorder(
           sp_input.indices, sp_input.values, sp_input.dense_shape, name=name))
 
   if sp_input.get_shape().is_fully_defined():
@@ -575,7 +575,7 @@ def sparse_reshape(sp_input, shape, name=None):
   shape = math_ops.cast(shape, dtype=dtypes.int64)
 
   with ops.name_scope(name, "SparseReshape", [sp_input]) as name:
-    reshaped_ind, reshaped_shape = gen_sparse_ops._sparse_reshape(
+    reshaped_ind, reshaped_shape = gen_sparse_ops.sparse_reshape(
         sp_input.indices, sp_input.dense_shape, shape, name=name)
 
     reshaped_shape_const = tensor_util.constant_value(shape)
@@ -671,7 +671,7 @@ def sparse_split(keyword_required=KeywordRequired(),
   sp_input = _convert_to_sparse_tensor(sp_input)
 
   output_inds, output_vals, output_shapes = (
-      gen_sparse_ops._sparse_split(
+      gen_sparse_ops.sparse_split(
           axis,
           sp_input.indices,
           sp_input.values,
@@ -782,7 +782,7 @@ def sparse_to_dense(sparse_indices,
     Dense `Tensor` of shape `output_shape`.  Has the same type as
     `sparse_values`.
   """
-  return gen_sparse_ops._sparse_to_dense(
+  return gen_sparse_ops.sparse_to_dense(
       sparse_indices,
       output_shape,
       sparse_values,
@@ -1412,7 +1412,7 @@ def sparse_fill_empty_rows(sp_input, default_value, name=None):
     default_value = ops.convert_to_tensor(
         default_value, dtype=sp_input.values.dtype)
     (output_indices, output_values, empty_row_indicator,
-     unused_reverse_index_map) = gen_sparse_ops._sparse_fill_empty_rows(
+     unused_reverse_index_map) = gen_sparse_ops.sparse_fill_empty_rows(
          indices=sp_input.indices,
          values=sp_input.values,
          dense_shape=sp_input.dense_shape,
@@ -1441,7 +1441,7 @@ def serialize_sparse(sp_input, name=None, out_type=dtypes.string):
   """
   sp_input = _convert_to_sparse_tensor(sp_input)
 
-  return gen_sparse_ops._serialize_sparse(
+  return gen_sparse_ops.serialize_sparse(
       sp_input.indices,
       sp_input.values,
       sp_input.dense_shape,
@@ -1476,7 +1476,7 @@ def serialize_many_sparse(sp_input, name=None, out_type=dtypes.string):
   """
   sp_input = _convert_to_sparse_tensor(sp_input)
 
-  return gen_sparse_ops._serialize_many_sparse(
+  return gen_sparse_ops.serialize_many_sparse(
       sp_input.indices,
       sp_input.values,
       sp_input.dense_shape,
@@ -1541,7 +1541,7 @@ def deserialize_sparse(serialized_sparse, dtype, rank=None, name=None):
 
   """
   output_indices, output_values, output_shape = (
-      gen_sparse_ops._deserialize_sparse(serialized_sparse, dtype, name=name))
+      gen_sparse_ops.deserialize_sparse(serialized_sparse, dtype, name=name))
 
   # Feed rank data back in, if available
   output_indices.set_shape([None, rank])
@@ -1610,7 +1610,7 @@ def deserialize_many_sparse(serialized_sparse, dtype, rank=None, name=None):
     All of the serialized `SparseTensor`s must have had the same rank and type.
   """
   output_indices, output_values, output_shape = (
-      gen_sparse_ops._deserialize_many_sparse(
+      gen_sparse_ops.deserialize_many_sparse(
           serialized_sparse, dtype, name=name))
 
   # Feed rank data back in, if available
@@ -1828,7 +1828,7 @@ def sparse_tensor_dense_matmul(sp_a,
   with ops.name_scope(name, "SparseTensorDenseMatMul",
                       [sp_a.indices, sp_a.values, b]) as name:
     b = ops.convert_to_tensor(b, name="b")
-    return gen_sparse_ops._sparse_tensor_dense_mat_mul(
+    return gen_sparse_ops.sparse_tensor_dense_mat_mul(
         a_indices=sp_a.indices,
         a_values=sp_a.values,
         a_shape=sp_a.dense_shape,
@@ -2046,7 +2046,7 @@ def _add_sparse_to_tensors_map(sp_input,
   """
   sp_input = _convert_to_sparse_tensor(sp_input)
 
-  return gen_sparse_ops._add_sparse_to_tensors_map(
+  return gen_sparse_ops.add_sparse_to_tensors_map(
       sp_input.indices,
       sp_input.values,
       sp_input.dense_shape,
@@ -2086,7 +2086,7 @@ def _add_many_sparse_to_tensors_map(sp_input,
   """
   sp_input = _convert_to_sparse_tensor(sp_input)
 
-  return gen_sparse_ops._add_many_sparse_to_tensors_map(
+  return gen_sparse_ops.add_many_sparse_to_tensors_map(
       sp_input.indices,
       sp_input.values,
       sp_input.dense_shape,
@@ -2167,7 +2167,7 @@ def _take_many_sparse_from_tensors_map(sparse_map_op,
   with ops.colocate_with(sparse_map_op):
     shared_name = sparse_map_op.get_attr("shared_name") or sparse_map_op.name
     output_indices, output_values, output_shape = (
-        gen_sparse_ops._take_many_sparse_from_tensors_map(
+        gen_sparse_ops.take_many_sparse_from_tensors_map(
             sparse_handles,
             dtype=sparse_map_op.get_attr("T"),
             container=sparse_map_op.get_attr("container"),
diff --git a/tensorflow/python/ops/standard_ops.py b/tensorflow/python/ops/standard_ops.py
index f6d9111009dc4f6a58ac81e7071ed7fe406600fa..65b788c31abb9956ef2f623dccdee67def190e4c 100644
--- a/tensorflow/python/ops/standard_ops.py
+++ b/tensorflow/python/ops/standard_ops.py
@@ -60,6 +60,7 @@ from tensorflow.python.ops.io_ops import *
 from tensorflow.python.ops.linalg_ops import *
 from tensorflow.python.ops.logging_ops import Print
 from tensorflow.python.ops.logging_ops import get_summary_op
+from tensorflow.python.ops.logging_ops import timestamp
 from tensorflow.python.ops.lookup_ops import initialize_all_tables
 from tensorflow.python.ops.lookup_ops import tables_initializer
 from tensorflow.python.ops.manip_ops import *
@@ -185,7 +186,6 @@ _allowed_symbols_array_ops = [
     "quantize_and_dequantize",  # to-doc
 
     # TODO(drpng): legacy symbols to be removed.
-    "list_diff",  # Use tf.listdiff instead.
     "batch_matrix_diag",
     "batch_matrix_band_part",
     "batch_matrix_diag_part",
@@ -232,7 +232,7 @@ _allowed_symbols_clip_ops = [
     "global_norm",
 ]
 
-_allowed_symbols_image_ops = [
+_allowed_symbols_logging_ops = [
     # Documented in training.py.
     # We are not importing training.py to avoid complex dependencies.
     "audio_summary",
@@ -262,8 +262,8 @@ _allowed_symbols = (_allowed_symbols_array_ops +
                     _allowed_symbols_clip_ops +
                     _allowed_symbols_control_flow_ops +
                     _allowed_symbols_functional_ops +
-                    _allowed_symbols_image_ops +
                     _allowed_symbols_gradients +
+                    _allowed_symbols_logging_ops +
                     _allowed_symbols_math_ops +
                     _allowed_symbols_variable_scope_ops +
                     _allowed_symbols_misc +
diff --git a/tensorflow/python/ops/state_ops.py b/tensorflow/python/ops/state_ops.py
index 6c0a090d16bb328de40f02edf9865a0e0a62d385..fd4419640aa06a68e999daa244cc1ca6998dd7cb 100644
--- a/tensorflow/python/ops/state_ops.py
+++ b/tensorflow/python/ops/state_ops.py
@@ -99,8 +99,8 @@ def variable_op(shape, dtype, name="Variable", set_shape=True, container="",
   """Deprecated. Used variable_op_v2 instead."""
   if not set_shape:
     shape = tensor_shape.unknown_shape()
-  ret = gen_state_ops._variable(shape=shape, dtype=dtype, name=name,
-                                container=container, shared_name=shared_name)
+  ret = gen_state_ops.variable(shape=shape, dtype=dtype, name=name,
+                               container=container, shared_name=shared_name)
   # TODO(mrry): Move this to where it is used, so we can get rid of this op
   #   wrapper?
   if set_shape:
@@ -127,11 +127,12 @@ def variable_op_v2(shape, dtype, name="Variable", container="", shared_name=""):
   Returns:
     A variable tensor.
   """
-  return gen_state_ops._variable_v2(shape=shape,
-                                    dtype=dtype,
-                                    name=name,
-                                    container=container,
-                                    shared_name=shared_name)
+  return gen_state_ops.variable_v2(
+      shape=shape,
+      dtype=dtype,
+      name=name,
+      container=container,
+      shared_name=shared_name)
 
 
 def init_variable(v, init, name="init"):
diff --git a/tensorflow/python/ops/string_ops.py b/tensorflow/python/ops/string_ops.py
index b8c39d91b41790c6441594b175e8eaa03620e1ec..5bd75b9215fdbccd5882ea39c2b35ccbbe29d5b0 100644
--- a/tensorflow/python/ops/string_ops.py
+++ b/tensorflow/python/ops/string_ops.py
@@ -17,6 +17,7 @@
 
 See the @{$python/string_ops} guide.
 
+@@regex_replace
 @@string_to_hash_bucket_fast
 @@string_to_hash_bucket_strong
 @@string_to_hash_bucket
@@ -93,10 +94,8 @@ def string_split(source, delimiter=" ", skip_empty=True):  # pylint: disable=inv
   delimiter = ops.convert_to_tensor(delimiter, dtype=dtypes.string)
   source = ops.convert_to_tensor(source, dtype=dtypes.string)
 
-  # pylint: disable=protected-access
-  indices, values, shape = gen_string_ops._string_split(
+  indices, values, shape = gen_string_ops.string_split(
       source, delimiter=delimiter, skip_empty=skip_empty)
-  # pylint: enable=protected-access
   indices.set_shape([None, 2])
   values.set_shape([None])
   shape.set_shape([2])
@@ -141,6 +140,7 @@ def reduce_join(inputs, axis=None,
 reduce_join.__doc__ = deprecation.rewrite_argument_docstring(
     gen_string_ops.reduce_join.__doc__, "reduction_indices", "axis")
 
+ops.NotDifferentiable("RegexReplace")
 ops.NotDifferentiable("StringToHashBucket")
 ops.NotDifferentiable("StringToHashBucketFast")
 ops.NotDifferentiable("StringToHashBucketStrong")
diff --git a/tensorflow/python/ops/summary_ops.py b/tensorflow/python/ops/summary_ops.py
index 7f4f4ce5ab4ee2bd309932cb81f05775996371d6..037bc9845a3f734f65b73b0c4b4ca19fb653731d 100644
--- a/tensorflow/python/ops/summary_ops.py
+++ b/tensorflow/python/ops/summary_ops.py
@@ -13,7 +13,6 @@
 # limitations under the License.
 # ==============================================================================
 """Summary Operations."""
-# pylint: disable=protected-access
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
@@ -74,7 +73,7 @@ def tensor_summary(name,
 
   with summary_op_util.summary_scope(
       name, family, values=[tensor]) as (tag, scope):
-    val = gen_logging_ops._tensor_summary_v2(
+    val = gen_logging_ops.tensor_summary_v2(
         tensor=tensor,
         tag=tag,
         name=scope,
diff --git a/tensorflow/python/ops/template.py b/tensorflow/python/ops/template.py
index 806fdd3da7aa6de01b7cd4d9d36dbf43f6139db6..70e8040512032c1aaa0cad6e1ad1b26a14a27059 100644
--- a/tensorflow/python/ops/template.py
+++ b/tensorflow/python/ops/template.py
@@ -26,6 +26,7 @@ from tensorflow.python.eager import function
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpointable
 from tensorflow.python.util import tf_contextlib
 from tensorflow.python.util import tf_decorator
 from tensorflow.python.util.deprecation import deprecated
@@ -230,7 +231,7 @@ def _skip_common_stack_elements(stacktrace, base_case):
   return stacktrace[-1:]
 
 
-class Template(object):
+class Template(checkpointable.CheckpointableBase):
   """Wrap a function to aid in variable sharing.
 
   Templates are functions that create variables the first time they are called
@@ -294,12 +295,115 @@ class Template(object):
     # which is not the same as whether the scope has been created.
     self._variables_created = False
 
+  @property
+  def _checkpoint_dependencies(self):
+    """Sanity checking for object-based saving.
+
+    Does not override Checkpointable dependency tracking, but checks that
+    variables accessible through Checkpointable dependencies on other `Template`
+    objects include all of the variable_scope-filtered `Template.variables`.
+
+    Returns:
+      A list of checkpointable.CheckpointableReference objects.
+    Raises:
+      ValueError: If this object is not compatible with object-based saving.
+    """
+    dependencies = super(Template, self)._checkpoint_dependencies
+    dependency_variables = []
+    for _, dependency in dependencies:
+      if isinstance(dependency, Template):
+        dependency_variables.extend(dependency.variables)
+      else:
+        dependency_variables.append(dependency)
+    dependency_variables = set(dependency_variables)
+    not_included_variables = []
+    for expected_variable in sorted(self.variables, key=lambda v: v.name):
+      if expected_variable not in dependency_variables:
+        not_included_variables.append(expected_variable)
+    if not_included_variables:
+      # Trying to save a Template which improperly tracks its variables.
+      raise ValueError(
+          ("The Template '%s' references variables which are not included via "
+           "object-based dependency tracking. Most likely a custom "
+           "getter/creator was registered which does not call Template's "
+           "custom variable creator (which is responsible for tracking "
+           "dependencies).\n\nExpected these variables to be dependencies: %s")
+          % (self, not_included_variables))
+    return dependencies
+
+  def _checkpointable_custom_creator(self, next_creator, name, initial_value,
+                                     checkpointable_parent=None, **kwargs):
+    """A variable creation hook which adds Checkpointable dependencies.
+
+    Set during the `Template`'s first wrapped function execution. Ensures that
+    (a) `Template` objects depend on `Template`s created inside them which
+    create variables, and (b) that any variables not in a more deeply nested
+    `Template` are added as dependencies directly.
+
+    The `checkpointable_parent` argument is passed between `Template` custom
+    creators but ignored when the variable object itself is created. This
+    argument indicates (if not `None`) that a more deeply nested `Template` has
+    already added the variable as a dependency, and that parent `Template`s
+    should add a dependency on that `Template` rather than on the variable
+    directly.
+
+    Args:
+      next_creator: See `variable_scope.variable_creator_scope`; the next
+        creator in the chain.
+      name: The (full, scope-influenced) name of the variable. The scope name
+        for the Template itself is stripped for the purposes of object-based
+        dependency tracking, but scopes within Templates are respected.
+      initial_value: See `variable_scope.variable_creator_scope`. Taken
+        explicitly so the argument can be re-named and used with
+        `Checkpointable._add_variable_with_custom_getter`.
+      checkpointable_parent: If not None, a more deeply nested Template object
+        to add a dependency on (rather than depending on the variable directly).
+      **kwargs: Passed through to the next creator.
+    Returns:
+      The output of `next_creator`: the fetched/created variable object.
+    """
+    def _call_next_creator_renaming_initializer(initializer, **inner_kwargs):
+      inner_kwargs.pop("name")  # Ignored; this is the scope-stripped name which
+                                # we don't want to propagate.
+      return next_creator(
+          initial_value=initializer,
+          name=name,
+          **inner_kwargs)
+    if name.startswith(self._variable_scope.name):
+      scope_stripped_name = name[len(self._variable_scope.name) + 1:]
+      if not checkpointable_parent:
+        return self._add_variable_with_custom_getter(
+            initializer=initial_value,
+            name=scope_stripped_name,
+            getter=_call_next_creator_renaming_initializer,
+            # Disable error checking for Checkpointable. Exceptions are instead
+            # raised if necessary when the object-based saver tries to
+            # save/restore the object.
+            overwrite=True,
+            checkpointable_parent=self,
+            **kwargs)
+      else:
+        self._track_checkpointable(
+            checkpointable_parent,
+            name=checkpointable_parent._variable_scope.name[  # pylint: disable=protected-access
+                len(self._variable_scope.name) + 1:],
+            overwrite=True)
+    return next_creator(name=name, initial_value=initial_value,
+                        checkpointable_parent=self, **kwargs)
+
   def _call_func(self, args, kwargs):
     try:
       vars_at_start = len(ops.get_collection(ops.GraphKeys.GLOBAL_VARIABLES))
       trainable_at_start = len(
           ops.get_collection(ops.GraphKeys.TRAINABLE_VARIABLES))
-      result = self._func(*args, **kwargs)
+      if self._variables_created:
+        result = self._func(*args, **kwargs)
+      else:
+        # The first time we run, restore variables if necessary (via
+        # Checkpointable).
+        with variable_scope.variable_creator_scope(
+            self._checkpointable_custom_creator):
+          result = self._func(*args, **kwargs)
 
       if self._variables_created:
         # Variables were previously created, implying this is not the first
@@ -557,12 +661,20 @@ class EagerTemplate(Template):
       # is created in __call__.
       variable_scope_name = None
     self._template_store = _EagerTemplateVariableStore(variable_scope_name)
+    self._variable_scope_context_manager = None
 
   def _call_func(self, args, kwargs):
     try:
       vars_at_start = self._template_store.variables()
       trainable_at_start = self._template_store.trainable_variables()
-      result = self._func(*args, **kwargs)
+      if self._variables_created:
+        result = self._func(*args, **kwargs)
+      else:
+        # The first time we run, restore variables if necessary (via
+        # Checkpointable).
+        with variable_scope.variable_creator_scope(
+            self._checkpointable_custom_creator):
+          result = self._func(*args, **kwargs)
 
       if self._variables_created:
         # Variables were previously created, implying this is not the first
@@ -611,8 +723,12 @@ class EagerTemplate(Template):
     # the variable scope is opened in order to ensure that templates nested at
     # the same level correctly uniquify lower variable scope names.
     if self._variable_scope:
-      with variable_scope.variable_scope(
-          self._variable_scope, reuse=variable_scope.AUTO_REUSE):
+      # Create a cache for the variable scope context manager the first time
+      # around so that we don't have to keep recreating it.
+      if not self._variable_scope_context_manager:
+        self._variable_scope_context_manager = variable_scope.variable_scope(
+            self._variable_scope, reuse=variable_scope.AUTO_REUSE)
+      with self._variable_scope_context_manager:
         with self._template_store.as_default():
           result = self._call_func(args, kwargs)
       return result
diff --git a/tensorflow/python/ops/tensor_array_ops.py b/tensorflow/python/ops/tensor_array_ops.py
index 3c08870146e447d84d4a5f620cbead633d94751f..6226f426be468a168c442e08e69a1fde2a65bdf1 100644
--- a/tensorflow/python/ops/tensor_array_ops.py
+++ b/tensorflow/python/ops/tensor_array_ops.py
@@ -148,7 +148,7 @@ class _GraphTensorArray(object):
         # will retroactively set the device value of this op.
         def create():
           """Create the TensorArray op."""
-          return gen_data_flow_ops._tensor_array_v3(
+          return gen_data_flow_ops.tensor_array_v3(
               dtype=dtype,
               size=size,
               element_shape=element_shape,
@@ -237,7 +237,7 @@ class _GraphTensorArray(object):
       flow = self.flow
     with ops.name_scope(name, "TensorArrayGrad", [self._handle]):
       with ops.colocate_with(self._handle):
-        g_handle, unused_flow = gen_data_flow_ops._tensor_array_grad_v3(
+        g_handle, unused_flow = gen_data_flow_ops.tensor_array_grad_v3(
             handle=self._handle, source=source, flow_in=flow, name=name)
         with ops.control_dependencies([g_handle]):
           flow = array_ops.identity(flow, name="gradient_flow")
@@ -252,7 +252,7 @@ class _GraphTensorArray(object):
 
   def read(self, index, name=None):
     """See TensorArray."""
-    value = gen_data_flow_ops._tensor_array_read_v3(
+    value = gen_data_flow_ops.tensor_array_read_v3(
         handle=self._handle,
         index=index,
         flow_in=self._flow,
@@ -270,7 +270,7 @@ class _GraphTensorArray(object):
       if self._infer_shape:
         self._merge_element_shape(value.shape)
       with self._maybe_colocate_with(value):
-        flow_out = gen_data_flow_ops._tensor_array_write_v3(
+        flow_out = gen_data_flow_ops.tensor_array_write_v3(
             handle=self._handle,
             index=index,
             value=value,
@@ -296,7 +296,7 @@ class _GraphTensorArray(object):
       element_shape = self._element_shape[0]
     else:
       element_shape = tensor_shape.TensorShape(None)
-    value = gen_data_flow_ops._tensor_array_gather_v3(
+    value = gen_data_flow_ops.tensor_array_gather_v3(
         handle=self._handle,
         indices=indices,
         flow_in=self._flow,
@@ -314,7 +314,7 @@ class _GraphTensorArray(object):
           tensor_shape.TensorShape(self._element_shape[0].dims[1:]))
     else:
       element_shape_except0 = tensor_shape.TensorShape(None)
-    value, _ = gen_data_flow_ops._tensor_array_concat_v3(
+    value, _ = gen_data_flow_ops.tensor_array_concat_v3(
         handle=self._handle,
         flow_in=self._flow,
         dtype=self._dtype,
@@ -341,7 +341,7 @@ class _GraphTensorArray(object):
       if self._infer_shape and context.in_graph_mode():
         self._merge_element_shape(value.shape[1:])
       with self._maybe_colocate_with(value):
-        flow_out = gen_data_flow_ops._tensor_array_scatter_v3(
+        flow_out = gen_data_flow_ops.tensor_array_scatter_v3(
             handle=self._handle,
             indices=indices,
             value=value,
@@ -370,7 +370,7 @@ class _GraphTensorArray(object):
               self._merge_element_shape(
                   tensor_shape.TensorShape([clengths[0]]).concatenate(
                       value.shape[1:]))
-        flow_out = gen_data_flow_ops._tensor_array_split_v3(
+        flow_out = gen_data_flow_ops.tensor_array_split_v3(
             handle=self._handle,
             value=value,
             lengths=lengths_64,
@@ -386,13 +386,13 @@ class _GraphTensorArray(object):
 
   def size(self, name=None):
     """See TensorArray."""
-    return gen_data_flow_ops._tensor_array_size_v3(
+    return gen_data_flow_ops.tensor_array_size_v3(
         handle=self._handle, flow_in=self.flow, name=name)
 
   @tf_should_use.should_use_result
   def close(self, name=None):
     """See TensorArray."""
-    return gen_data_flow_ops._tensor_array_close_v3(
+    return gen_data_flow_ops.tensor_array_close_v3(
         handle=self._handle, name=name)
 
 # pylint: enable=protected-access
diff --git a/tensorflow/python/ops/variables.py b/tensorflow/python/ops/variables.py
index 125922e296414ebbcd22918ed4ff8858d56274fe..643a3b7edc97b0c1d9d25a9715e074d12da6555a 100644
--- a/tensorflow/python/ops/variables.py
+++ b/tensorflow/python/ops/variables.py
@@ -37,7 +37,7 @@ from tensorflow.python.util.tf_export import tf_export
 
 
 @tf_export("Variable")
-class Variable(checkpointable.Checkpointable):
+class Variable(checkpointable.CheckpointableBase):
   """See the @{$variables$Variables How To} for a high level overview.
 
   A variable maintains state in the graph across calls to `run()`. You add a
@@ -307,6 +307,9 @@ class Variable(checkpointable.Checkpointable):
     if constraint is not None and not callable(constraint):
       raise ValueError("The `constraint` argument must be a callable.")
 
+    # Store the graph key so optimizers know how to only retrieve variables from
+    # this graph.
+    self._graph_key = ops.get_default_graph()._graph_key  # pylint: disable=protected-access
     if isinstance(initial_value, checkpointable.CheckpointInitialValue):
       self._maybe_initialize_checkpointable()
       self._update_uid = initial_value.checkpoint_position.restore_uid
@@ -792,17 +795,7 @@ class Variable(checkpointable.Checkpointable):
 
     setattr(Variable, operator, _run_op)
 
-  def _scatter_tensors_from_checkpoint(self, attributes):
-    """For implementing `Checkpointable`. Return an assignment op to run."""
-    if (len(attributes) != 1
-        or checkpointable.VARIABLE_VALUE_KEY not in attributes):
-      raise ValueError(
-          ("The variable %s was restored with unexpected values (expected one "
-           "with key %s, got %s)") % (
-               self, checkpointable.VARIABLE_VALUE_KEY, attributes))
-    return self.assign(attributes[checkpointable.VARIABLE_VALUE_KEY])
-
-  def _gather_tensors_for_checkpoint(self):
+  def _gather_saveables_for_checkpoint(self):
     """For implementing `Checkpointable`. This object is saveable on its own."""
     return {checkpointable.VARIABLE_VALUE_KEY: self}
 
diff --git a/tensorflow/python/pywrap_tfe.i b/tensorflow/python/pywrap_tfe.i
index 50f481d29e9d39bd12741b5f9e02b7201336134d..b481ddf5d4798aeed970d435234fd82de3b93a06 100644
--- a/tensorflow/python/pywrap_tfe.i
+++ b/tensorflow/python/pywrap_tfe.i
@@ -29,9 +29,12 @@ limitations under the License.
 %rename("%s") TFE_OpNameGetAttrType;
 %rename("%s") TFE_Py_InitEagerTensor;
 %rename("%s") TFE_Py_RegisterExceptionClass;
+%rename("%s") TFE_Py_RegisterBackwardFunctionGetter;
 %rename("%s") TFE_Py_RegisterFallbackExceptionClass;
+%rename("%s") TFE_Py_RegisterResourceVariableType;
 %rename("%s") TFE_Py_Execute;
 %rename("%s") TFE_Py_FastPathExecute;
+%rename("%s") TFE_Py_RecordGradient;
 %rename("%s") TFE_Py_UID;
 %rename("%s") TFE_Py_TapeSetNew;
 %rename("%s") TFE_Py_TapeSetRemove;
diff --git a/tensorflow/python/summary/summary.py b/tensorflow/python/summary/summary.py
index b80ad79074e85bdeae70148b2822c319c29468bc..7ff633a654ad969207fe864ee530892900258538 100644
--- a/tensorflow/python/summary/summary.py
+++ b/tensorflow/python/summary/summary.py
@@ -152,8 +152,7 @@ def image(name, tensor, max_outputs=3, collections=None, family=None):
   """
   with _summary_op_util.summary_scope(
       name, family, values=[tensor]) as (tag, scope):
-    # pylint: disable=protected-access
-    val = _gen_logging_ops._image_summary(
+    val = _gen_logging_ops.image_summary(
         tag=tag, tensor=tensor, max_images=max_outputs, name=scope)
     _summary_op_util.collect(val, collections, [_ops.GraphKeys.SUMMARIES])
   return val
@@ -237,10 +236,9 @@ def audio(name, tensor, sample_rate, max_outputs=3, collections=None,
   """
   with _summary_op_util.summary_scope(
       name, family=family, values=[tensor]) as (tag, scope):
-    # pylint: disable=protected-access
     sample_rate = _ops.convert_to_tensor(
         sample_rate, dtype=_dtypes.float32, name='sample_rate')
-    val = _gen_logging_ops._audio_summary_v2(
+    val = _gen_logging_ops.audio_summary_v2(
         tag=tag, tensor=tensor, max_outputs=max_outputs,
         sample_rate=sample_rate, name=scope)
     _summary_op_util.collect(val, collections, [_ops.GraphKeys.SUMMARIES])
@@ -286,8 +284,7 @@ def merge(inputs, collections=None, name=None):
         'Use tf.contrib.summary instead.')
   name = _summary_op_util.clean_tag(name)
   with _ops.name_scope(name, 'Merge', inputs):
-    # pylint: disable=protected-access
-    val = _gen_logging_ops._merge_summary(inputs=inputs, name=name)
+    val = _gen_logging_ops.merge_summary(inputs=inputs, name=name)
     _summary_op_util.collect(val, collections, [])
   return val
 
diff --git a/tensorflow/python/tools/freeze_graph.py b/tensorflow/python/tools/freeze_graph.py
index f7a578e52b00b6d97b3b314c7a2a08d9071c8f73..a52f325ddbcd90ad011c1c056965912b96f27aaa 100644
--- a/tensorflow/python/tools/freeze_graph.py
+++ b/tensorflow/python/tools/freeze_graph.py
@@ -109,7 +109,7 @@ def freeze_graph_with_def_protos(input_graph_def,
           input_meta_graph_def, clear_devices=True)
       restorer.restore(sess, input_checkpoint)
       if initializer_nodes:
-        sess.run(initializer_nodes.replace(' ', '').split(","))
+        sess.run(initializer_nodes.replace(" ", "").split(","))
     elif input_saved_model_dir:
       if saved_model_tags is None:
         saved_model_tags = []
@@ -130,27 +130,27 @@ def freeze_graph_with_def_protos(input_graph_def,
           var_list=var_list, write_version=checkpoint_version)
       saver.restore(sess, input_checkpoint)
       if initializer_nodes:
-        sess.run(initializer_nodes.replace(' ', '').split(","))
+        sess.run(initializer_nodes.replace(" ", "").split(","))
 
     variable_names_whitelist = (
-        variable_names_whitelist.replace(' ', '').split(",")
+        variable_names_whitelist.replace(" ", "").split(",")
         if variable_names_whitelist else None)
     variable_names_blacklist = (
-        variable_names_blacklist.replace(' ', '').split(",")
+        variable_names_blacklist.replace(" ", "").split(",")
         if variable_names_blacklist else None)
 
     if input_meta_graph_def:
       output_graph_def = graph_util.convert_variables_to_constants(
           sess,
           input_meta_graph_def.graph_def,
-          output_node_names.replace(' ', '').split(","),
+          output_node_names.replace(" ", "").split(","),
           variable_names_whitelist=variable_names_whitelist,
           variable_names_blacklist=variable_names_blacklist)
     else:
       output_graph_def = graph_util.convert_variables_to_constants(
           sess,
           input_graph_def,
-          output_node_names.replace(' ', '').split(","),
+          output_node_names.replace(" ", "").split(","),
           variable_names_whitelist=variable_names_whitelist,
           variable_names_blacklist=variable_names_blacklist)
 
@@ -252,7 +252,7 @@ def freeze_graph(input_graph,
       variable_names_blacklist,
       input_meta_graph_def,
       input_saved_model_dir,
-      saved_model_tags.replace(' ', '').split(","),
+      saved_model_tags.replace(" ", "").split(","),
       checkpoint_version=checkpoint_version)
 
 
diff --git a/tensorflow/python/tools/saved_model_cli.py b/tensorflow/python/tools/saved_model_cli.py
index 33f6debbcbecb652774c776be54323bbaa824822..b0e9e3e5ed2117937bbd275784c44aebd2ea2515 100644
--- a/tensorflow/python/tools/saved_model_cli.py
+++ b/tensorflow/python/tools/saved_model_cli.py
@@ -115,7 +115,7 @@ def _get_outputs_tensor_info_from_meta_graph_def(meta_graph_def,
                                                       signature_def_key).outputs
 
 
-def _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key):
+def _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key, indent=0):
   """Prints input and output TensorInfos.
 
   Prints the details of input and output TensorInfos for the SignatureDef mapped
@@ -126,6 +126,7 @@ def _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key):
     tag_set: Group of tag(s) of the MetaGraphDef, in string format, separated by
         ','. For tag-set contains multiple tags, all tags must be passed in.
     signature_def_key: A SignatureDef key string.
+    indent: How far (in increments of 2 spaces) to indent each line of output.
   """
   meta_graph_def = saved_model_utils.get_meta_graph_def(saved_model_dir,
                                                         tag_set)
@@ -134,29 +135,39 @@ def _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key):
   outputs_tensor_info = _get_outputs_tensor_info_from_meta_graph_def(
       meta_graph_def, signature_def_key)
 
-  print('The given SavedModel SignatureDef contains the following input(s):')
+  indent_str = "  " * indent
+  def in_print(s):
+    print(indent_str + s)
+
+  in_print('The given SavedModel SignatureDef contains the following input(s):')
   for input_key, input_tensor in sorted(inputs_tensor_info.items()):
-    print('inputs[\'%s\'] tensor_info:' % input_key)
-    _print_tensor_info(input_tensor)
+    in_print('  inputs[\'%s\'] tensor_info:' % input_key)
+    _print_tensor_info(input_tensor, indent + 1)
 
-  print('The given SavedModel SignatureDef contains the following output(s):')
+  in_print('The given SavedModel SignatureDef contains the following '
+           'output(s):')
   for output_key, output_tensor in sorted(outputs_tensor_info.items()):
-    print('outputs[\'%s\'] tensor_info:' % output_key)
-    _print_tensor_info(output_tensor)
+    in_print('  outputs[\'%s\'] tensor_info:' % output_key)
+    _print_tensor_info(output_tensor, indent + 1)
 
-  print('Method name is: %s' %
-        meta_graph_def.signature_def[signature_def_key].method_name)
+  in_print('Method name is: %s' %
+           meta_graph_def.signature_def[signature_def_key].method_name)
 
 
-def _print_tensor_info(tensor_info):
+def _print_tensor_info(tensor_info, indent=0):
   """Prints details of the given tensor_info.
 
   Args:
     tensor_info: TensorInfo object to be printed.
+    indent: How far (in increments of 2 spaces) to indent each line of output.
   """
-  print('    dtype: ' +
-        {value: key
-         for (key, value) in types_pb2.DataType.items()}[tensor_info.dtype])
+  indent_str = "  " * indent
+  def in_print(s):
+    print(indent_str + s)
+
+  in_print('    dtype: ' +
+           {value: key
+            for (key, value) in types_pb2.DataType.items()}[tensor_info.dtype])
   # Display shape as tuple.
   if tensor_info.tensor_shape.unknown_rank:
     shape = 'unknown_rank'
@@ -164,8 +175,8 @@ def _print_tensor_info(tensor_info):
     dims = [str(dim.size) for dim in tensor_info.tensor_shape.dim]
     shape = ', '.join(dims)
     shape = '(' + shape + ')'
-  print('    shape: ' + shape)
-  print('    name: ' + tensor_info.name)
+  in_print('    shape: ' + shape)
+  in_print('    name: ' + tensor_info.name)
 
 
 def _show_all(saved_model_dir):
@@ -186,7 +197,8 @@ def _show_all(saved_model_dir):
     signature_def_map = get_signature_def_map(saved_model_dir, tag_set)
     for signature_def_key in sorted(signature_def_map.keys()):
       print('\nsignature_def[\'' + signature_def_key + '\']:')
-      _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key)
+      _show_inputs_outputs(saved_model_dir, tag_set, signature_def_key,
+                           indent=1)
 
 
 def get_meta_graph_def(saved_model_dir, tag_set):
@@ -614,19 +626,19 @@ def create_parser():
   show_msg = (
       'Usage examples:\n'
       'To show all tag-sets in a SavedModel:\n'
-      '$saved_model_cli show --dir /tmp/saved_model\n'
+      '$saved_model_cli show --dir /tmp/saved_model\n\n'
       'To show all available SignatureDef keys in a '
       'MetaGraphDef specified by its tag-set:\n'
-      '$saved_model_cli show --dir /tmp/saved_model --tag_set serve\n'
+      '$saved_model_cli show --dir /tmp/saved_model --tag_set serve\n\n'
       'For a MetaGraphDef with multiple tags in the tag-set, all tags must be '
       'passed in, separated by \';\':\n'
       '$saved_model_cli show --dir /tmp/saved_model --tag_set serve,gpu\n\n'
       'To show all inputs and outputs TensorInfo for a specific'
       ' SignatureDef specified by the SignatureDef key in a'
       ' MetaGraph.\n'
-      '$saved_model_cli show --dir /tmp/saved_model --tag_set serve '
-      '--signature_def serving_default\n\n'
-      'To show all available information in the SavedModel\n:'
+      '$saved_model_cli show --dir /tmp/saved_model --tag_set serve'
+      ' --signature_def serving_default\n\n'
+      'To show all available information in the SavedModel:\n'
       '$saved_model_cli show --dir /tmp/saved_model --all')
   parser_show = subparsers.add_parser(
       'show',
@@ -658,12 +670,14 @@ def create_parser():
   run_msg = ('Usage example:\n'
              'To run input tensors from files through a MetaGraphDef and save'
              ' the output tensors to files:\n'
-             '$saved_model_cli show --dir /tmp/saved_model --tag_set serve '
-             '--signature_def serving_default '
-             '--inputs input1_key=/tmp/124.npz[x],input2_key=/tmp/123.npy '
-             '--input_exprs \'input3_key=np.ones(2)\' --input_examples '
-             '\'input4_key=[{"id":[26],"weights":[0.5, 0.5]}]\' '
-             '--outdir=/out\n\n'
+             '$saved_model_cli run --dir /tmp/saved_model --tag_set serve \\\n'
+             '   --signature_def serving_default \\\n'
+             '   --inputs input1_key=/tmp/124.npz[x],input2_key=/tmp/123.npy '
+             '\\\n'
+             '   --input_exprs \'input3_key=np.ones(2)\' \\\n'
+             '   --input_examples '
+             '\'input4_key=[{"id":[26],"weights":[0.5, 0.5]}]\' \\\n'
+             '   --outdir=/out\n\n'
              'For more information about input file format, please see:\n'
              'https://www.tensorflow.org/programmers_guide/saved_model_cli\n')
   parser_run = subparsers.add_parser(
diff --git a/tensorflow/python/tools/saved_model_cli_test.py b/tensorflow/python/tools/saved_model_cli_test.py
index d6cbc49ba1e08a6b808b228fb8d69fc14f36e3d2..f99c8448458078935fda477c6e4e15dde8d7d4ab 100644
--- a/tensorflow/python/tools/saved_model_cli_test.py
+++ b/tensorflow/python/tools/saved_model_cli_test.py
@@ -61,83 +61,84 @@ class SavedModelCLITestCase(test.TestCase):
     exp_out = """MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
 
 signature_def['classify_x2_to_y3']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x2:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['scores'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y3:0
-Method name is: tensorflow/serving/classify
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: x2:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['scores'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y3:0
+  Method name is: tensorflow/serving/classify
 
 signature_def['classify_x_to_y']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_STRING
-    shape: unknown_rank
-    name: tf_example:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['scores'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y:0
-Method name is: tensorflow/serving/classify
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_STRING
+        shape: unknown_rank
+        name: tf_example:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['scores'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y:0
+  Method name is: tensorflow/serving/classify
 
 signature_def['regress_x2_to_y3']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x2:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['outputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y3:0
-Method name is: tensorflow/serving/regress
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: x2:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['outputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y3:0
+  Method name is: tensorflow/serving/regress
 
 signature_def['regress_x_to_y']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_STRING
-    shape: unknown_rank
-    name: tf_example:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['outputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y:0
-Method name is: tensorflow/serving/regress
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_STRING
+        shape: unknown_rank
+        name: tf_example:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['outputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y:0
+  Method name is: tensorflow/serving/regress
 
 signature_def['regress_x_to_y2']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['inputs'] tensor_info:
-    dtype: DT_STRING
-    shape: unknown_rank
-    name: tf_example:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['outputs'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y2:0
-Method name is: tensorflow/serving/regress
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['inputs'] tensor_info:
+        dtype: DT_STRING
+        shape: unknown_rank
+        name: tf_example:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['outputs'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y2:0
+  Method name is: tensorflow/serving/regress
 
 signature_def['serving_default']:
-The given SavedModel SignatureDef contains the following input(s):
-inputs['x'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: x:0
-The given SavedModel SignatureDef contains the following output(s):
-outputs['y'] tensor_info:
-    dtype: DT_FLOAT
-    shape: (-1, 1)
-    name: y:0
-Method name is: tensorflow/serving/predict"""
+  The given SavedModel SignatureDef contains the following input(s):
+    inputs['x'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: x:0
+  The given SavedModel SignatureDef contains the following output(s):
+    outputs['y'] tensor_info:
+        dtype: DT_FLOAT
+        shape: (-1, 1)
+        name: y:0
+  Method name is: tensorflow/serving/predict"""
     # pylint: enable=line-too-long
+    self.maxDiff = None  # Produce a useful error msg if the comparison fails.
     self.assertMultiLineEqual(output, exp_out)
     self.assertEqual(err.getvalue().strip(), '')
 
@@ -193,11 +194,11 @@ Method name is: tensorflow/serving/predict"""
     output = out.getvalue().strip()
     expected_output = (
         'The given SavedModel SignatureDef contains the following input(s):\n'
-        'inputs[\'x\'] tensor_info:\n'
-        '    dtype: DT_FLOAT\n    shape: (-1, 1)\n    name: x:0\n'
+        '  inputs[\'x\'] tensor_info:\n'
+        '      dtype: DT_FLOAT\n      shape: (-1, 1)\n      name: x:0\n'
         'The given SavedModel SignatureDef contains the following output(s):\n'
-        'outputs[\'y\'] tensor_info:\n'
-        '    dtype: DT_FLOAT\n    shape: (-1, 1)\n    name: y:0\n'
+        '  outputs[\'y\'] tensor_info:\n'
+        '      dtype: DT_FLOAT\n      shape: (-1, 1)\n      name: y:0\n'
         'Method name is: tensorflow/serving/predict')
     self.assertEqual(output, expected_output)
     self.assertEqual(err.getvalue().strip(), '')
diff --git a/tensorflow/python/training/checkpoint_ops.py b/tensorflow/python/training/checkpoint_ops.py
index 7f92d94d2be369709608d36c109863b0ebfb7bbe..a6e9662b7305a00f1fcf03245685e93b756942d3 100644
--- a/tensorflow/python/training/checkpoint_ops.py
+++ b/tensorflow/python/training/checkpoint_ops.py
@@ -149,7 +149,7 @@ def _load_and_remap_matrix(ckpt_path,
   num_rows_present = num_rows_to_load
   if remap_rows:
     row_remapping, num_rows_present = (
-        gen_checkpoint_ops._generate_vocab_remapping(  # pylint: disable=protected-access
+        gen_checkpoint_ops.generate_vocab_remapping(
             new_vocab_file=new_row_vocab_file,
             old_vocab_file=old_row_vocab_file,
             new_vocab_offset=new_row_vocab_offset,
@@ -168,7 +168,7 @@ def _load_and_remap_matrix(ckpt_path,
   num_cols_present = new_col_vocab_size
   if remap_cols:
     col_remapping, num_cols_present = (
-        gen_checkpoint_ops._generate_vocab_remapping(  # pylint: disable=protected-access
+        gen_checkpoint_ops.generate_vocab_remapping(
             new_vocab_file=new_col_vocab_file,
             old_vocab_file=old_col_vocab_file,
             new_vocab_offset=0,  # Offset is unused for cols (no partitioning).
@@ -178,7 +178,7 @@ def _load_and_remap_matrix(ckpt_path,
       num_rows_to_load * new_col_vocab_size -
       num_rows_present * num_cols_present, 1
   ])
-  return_tensor = gen_checkpoint_ops._load_and_remap_matrix(  # pylint: disable=protected-access
+  return_tensor = gen_checkpoint_ops.load_and_remap_matrix(
       ckpt_path=ckpt_path,
       old_tensor_name=old_tensor_name,
       row_remapping=row_remapping,
diff --git a/tensorflow/python/training/checkpoint_utils.py b/tensorflow/python/training/checkpoint_utils.py
index fa3de6fad27b6cc773f9f2e86e9f95395eb7c285..e7f88de1d2290a49f3b7bdf47417016d7e7c9cea 100644
--- a/tensorflow/python/training/checkpoint_utils.py
+++ b/tensorflow/python/training/checkpoint_utils.py
@@ -23,6 +23,7 @@ import six
 from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import io_ops
+from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import state_ops
 from tensorflow.python.ops import variable_scope as vs
 from tensorflow.python.ops import variables
@@ -289,10 +290,20 @@ def _set_checkpoint_initializer(variable,
     name: Name of the operation.
   """
   base_type = variable.dtype.base_dtype
-  with ops.colocate_with(variable):
+  # Do not colocate with variable since RestoreV2 op only runs on CPU and
+  # colocation will force variable (and other ops that colocate with variable)
+  # to be on CPU as well. It is okay to place the variable's initializer op on
+  # CPU since it will only be run once at the start.
+  with ops.device(variable.device), ops.device("/cpu:0"):
     restore_op = io_ops.restore_v2(
         ckpt_file, [tensor_name], [slice_spec], [base_type], name=name)[0]
-    variable._initializer_op = state_ops.assign(variable, restore_op)  # pylint:disable=protected-access
+    if isinstance(variable, resource_variable_ops.ResourceVariable):
+      init_op = variable.assign(restore_op, read_value=False)
+    else:
+      init_op = state_ops.assign(variable, restore_op)
+    variable._initializer_op = init_op  # pylint:disable=protected-access
+    restore_op.set_shape(variable.shape)
+    variable._initial_value = restore_op  # pylint:disable=protected-access
 
 
 def _set_variable_or_list_initializer(variable_or_list, ckpt_file,
diff --git a/tensorflow/python/training/checkpoint_utils_test.py b/tensorflow/python/training/checkpoint_utils_test.py
index cd17faa040d5b85263b54bc53100b18f736a12e0..338436573a3355b77811542d3845e683c38e2521 100644
--- a/tensorflow/python/training/checkpoint_utils_test.py
+++ b/tensorflow/python/training/checkpoint_utils_test.py
@@ -26,6 +26,7 @@ from tensorflow.python.framework import errors_impl
 from tensorflow.python.framework import ops
 from tensorflow.python.ops import init_ops
 from tensorflow.python.ops import partitioned_variables
+from tensorflow.python.ops import resource_variable_ops
 from tensorflow.python.ops import variable_scope
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import test
@@ -145,6 +146,36 @@ class CheckpointsTest(test.TestCase):
         # Check that tensors are not explicitly in the graph.
         self.assertLess(len(str(session.graph.as_graph_def())), 29000)
 
+  def testInitialValueComesFromCheckpoint(self):
+    checkpoint_dir = self.get_temp_dir()
+    with self.test_session() as session:
+      v1, _, _, _ = _create_checkpoints(session, checkpoint_dir)
+
+    # New graph and session.
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as session:
+        with variable_scope.variable_scope(
+            "some_scope", initializer=init_ops.zeros_initializer()):
+          my1 = variable_scope.get_variable("my1", [1, 10])
+
+        # At this point, my1.initialized_value() will add ops that reference
+        # the zeros initializer of my1.
+        before = variables.Variable(my1.initialized_value(), name="before")
+
+        checkpoint_utils.init_from_checkpoint(checkpoint_dir, {"var1": my1})
+
+        # At this point, my1.initialized_value() will add ops that reference
+        # the newly set initializer of my1.
+        after = variables.Variable(my1.initialized_value(), name="after")
+
+        session.run(variables.global_variables_initializer())
+        self.assertAllEqual(session.run(my1), v1)
+        self.assertAllEqual(session.run(my1.initialized_value()), v1)
+        self.assertAllClose(session.run(before), [[0.0] * 10])
+        self.assertAllClose(session.run(after), v1)
+        with self.assertRaises(AssertionError):
+          self.assertAllClose(session.run(before), session.run(after))
+
   def testInitWithScopeDoesNotCaptureSuffixes(self):
     checkpoint_dir = self.get_temp_dir()
     with self.test_session() as session:
@@ -176,7 +207,9 @@ class CheckpointsTest(test.TestCase):
 
       checkpoint_utils.init_from_checkpoint(checkpoint_dir,
                                             {"useful_scope/": "useful_scope/"})
-      self.assertEqual(my4._initializer_op.op.inputs[1].device, "/job:ps")
+      # The initializer runs on the same task, but always on the CPU.
+      self.assertEqual(my4._initializer_op.op.inputs[1].device,
+                       "/job:ps/device:CPU:0")
 
   def testInitFromRootCheckpoint(self):
     checkpoint_dir = self.get_temp_dir()
@@ -332,6 +365,31 @@ class CheckpointsTest(test.TestCase):
           checkpoint_utils.init_from_checkpoint(checkpoint_dir,
                                                 {"useful_scope": "some_scope/"})
 
+  def testNoAdditionalReadOpsForResourceVariables(self):
+    checkpoint_dir = self.get_temp_dir()
+    with self.test_session() as session:
+      v1, _, _, _ = _create_checkpoints(session, checkpoint_dir)
+
+    # New graph and session.
+    with ops.Graph().as_default() as g:
+      with self.test_session(graph=g) as session:
+        my1 = resource_variable_ops.ResourceVariable([[0.0] * 10], name="my1")
+
+        with ops.name_scope("init_from_checkpoint"):
+          checkpoint_utils.init_from_checkpoint(checkpoint_dir, {"var1": my1})
+
+        # Basic sanity checks:
+        session.run(variables.global_variables_initializer())
+        self.assertAllEqual(session.run(my1), v1)
+
+    ops_in_init_from_checkpoint_scope = [
+        op for op in g.get_operations()
+        if (op.name.startswith("init_from_checkpoint/") and
+            not op.name.startswith("init_from_checkpoint/checkpoint_initializer"
+                                  ) and op.type != "AssignVariableOp")
+    ]
+    self.assertEqual(ops_in_init_from_checkpoint_scope, [])
+
 
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/training/checkpointable.py b/tensorflow/python/training/checkpointable.py
index c2fea0f40d49e2a8e157bc498d25e4044cae3899..92e8ff3308446a66154d31216d3a72162e4038b6 100644
--- a/tensorflow/python/training/checkpointable.py
+++ b/tensorflow/python/training/checkpointable.py
@@ -18,23 +18,22 @@ from __future__ import division
 from __future__ import print_function
 
 import collections
-import weakref
 
-from tensorflow.python import pywrap_tensorflow
 from tensorflow.python.eager import context
 from tensorflow.python.framework import dtypes
 from tensorflow.python.framework import ops
+from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import gen_io_ops as io_ops
 from tensorflow.python.util import nest
 
 # A key indicating a variable's value in an object's checkpointed Tensors
-# (Checkpointable._gather_tensors_for_checkpoint). If this is the only key and
+# (Checkpointable._gather_saveables_for_checkpoint). If this is the only key and
 # the object has no dependencies, then its value may be restored on object
 # creation (avoiding double assignment when executing eagerly).
 VARIABLE_VALUE_KEY = "VARIABLE_VALUE"
 
-_CheckpointableReference = collections.namedtuple(
-    "_CheckpointableReference",
+CheckpointableReference = collections.namedtuple(
+    "CheckpointableReference",
     [
         # The local name for this dependency.
         "name",
@@ -57,7 +56,7 @@ class CheckpointInitialValue(ops.Tensor):
   """
 
   def __init__(self, checkpoint_position, shape=None):
-    self.wrapped_value = checkpoint_position.restore_ops()[
+    self.wrapped_value = checkpoint_position.value_tensors()[
         VARIABLE_VALUE_KEY]
     if shape:
       # We need to set the static shape information on the initializer if
@@ -99,9 +98,8 @@ class _CheckpointPosition(object):
       # This object's correspondence with a checkpointed object is new, so
       # process deferred restorations for it and its dependencies.
       restore_ops = checkpointable._restore_from_checkpoint_position(self)  # pylint: disable=protected-access
-      session = self._checkpoint.session
-      if session:
-        session.run(restore_ops)
+      if restore_ops:
+        self._checkpoint.restore_ops.extend(restore_ops)
 
   def bind_object(self, checkpointable):
     """Set a checkpoint<->object correspondence and process slot variables.
@@ -120,13 +118,13 @@ class _CheckpointPosition(object):
       checkpoint.object_by_proto_id[self._proto_id] = checkpointable
       for deferred_slot_restoration in (
           checkpoint.deferred_slot_restorations.pop(self._proto_id, ())):
-        checkpointable._process_slot_restoration(  # pylint: disable=protected-access
+        checkpointable._create_or_restore_slot_variable(  # pylint: disable=protected-access
             slot_variable_position=_CheckpointPosition(
                 checkpoint=checkpoint,
                 proto_id=deferred_slot_restoration.slot_variable_id),
             variable=deferred_slot_restoration.original_variable,
             slot_name=deferred_slot_restoration.slot_name)
-      for slot_restoration in checkpoint.slot_restorations.get(
+      for slot_restoration in checkpoint.slot_restorations.pop(
           self._proto_id, ()):
         optimizer_object = checkpoint.object_by_proto_id.get(
             slot_restoration.optimizer_id, None)
@@ -140,7 +138,7 @@ class _CheckpointPosition(object):
                       slot_variable_id=slot_restoration.slot_variable_id,
                       slot_name=slot_restoration.slot_name))
         else:
-          optimizer_object._process_slot_restoration(  # pylint: disable=protected-access
+          optimizer_object._create_or_restore_slot_variable(  # pylint: disable=protected-access
               slot_variable_position=_CheckpointPosition(
                   checkpoint=checkpoint,
                   proto_id=slot_restoration.slot_variable_id),
@@ -169,22 +167,89 @@ class _CheckpointPosition(object):
             and attributes[0].name == VARIABLE_VALUE_KEY
             and not self.object_proto.children)
 
-  def restore_ops(self):
-    """Create restore ops for this object's attributes."""
-    restore_tensors = {}
+  def value_tensors(self):
+    """Create value `Tensor`s for this object's attributes.
+
+    Does not require that the Python object has been created. Used for
+    restore-on-create when executing eagerly.
+
+    Returns:
+      A dictionary mapping from object attribute names to `Tensor`s.
+    """
+    value_tensors = {}
     for serialized_tensor in self.object_proto.attributes:
       checkpoint_key = serialized_tensor.checkpoint_key
       dtype = self._checkpoint.dtype_map[checkpoint_key]
       base_type = dtype.base_dtype
       with ops.init_scope():
-        restore, = io_ops.restore_v2(
-            prefix=self._checkpoint.save_path,
-            tensor_names=[checkpoint_key],
-            shape_and_slices=[""],
-            dtypes=[base_type],
-            name="%s_checkpoint_read" % (serialized_tensor.name,))
-        restore_tensors[serialized_tensor.name] = restore
-      return restore_tensors
+        with ops.device("/cpu:0"):
+          # Run the restore itself on the CPU.
+          value, = io_ops.restore_v2(
+              prefix=self._checkpoint.save_path,
+              tensor_names=[checkpoint_key],
+              shape_and_slices=[""],
+              dtypes=[base_type],
+              name="%s_checkpoint_read" % (serialized_tensor.name,))
+        # Copy the value to the current device if necessary.
+        value_tensors[serialized_tensor.name] = array_ops.identity(value)
+    return value_tensors
+
+  def restore_ops(self):
+    """Create or fetch restore ops for this object's attributes.
+
+    Requires that the `Checkpointable` Python object has been bound to an object
+    ID in the checkpoint.
+
+    Returns:
+      A list of operations when graph building, or an empty list when executing
+      eagerly.
+    """
+    saveables = self.checkpointable._gather_saveables_for_checkpoint()  # pylint: disable=protected-access
+    # Name saveables based on the name this object had when it was checkpointed.
+    named_saveables = {}
+    restore_ops = []
+    in_graph_mode = context.in_graph_mode()
+    for serialized_tensor in self.object_proto.attributes:
+      saveable_object = saveables.get(serialized_tensor.name, None)
+      if saveable_object is None:
+        # Purposefully does not throw an exception if attributes have been added
+        # or deleted. Stores unused attributes so an exception can be raised if
+        # the user decides to check that everything in the checkpoint was
+        # loaded.
+        self._checkpoint.unused_attributes.setdefault(
+            self.checkpointable, []).append(serialized_tensor.name)
+        continue
+      if in_graph_mode:
+        existing_ops = self._checkpoint.restore_ops_by_name.get(
+            serialized_tensor.checkpoint_key, None)
+      else:
+        existing_ops = None
+      if existing_ops is None:
+        named_saveables[serialized_tensor.checkpoint_key] = saveable_object
+    if named_saveables:
+      validated_saveables = (
+          self._checkpoint.builder._ValidateAndSliceInputs(named_saveables))  # pylint: disable=protected-access
+      validated_names = set(saveable.name for saveable in validated_saveables)
+      if set(named_saveables.keys()) != validated_names:
+        raise AssertionError(
+            ("Saveable keys changed when validating. Got back %s, was "
+             "expecting %s") % (validated_names, named_saveables.keys()))
+      all_tensors = self._checkpoint.builder.bulk_restore(
+          filename_tensor=self._checkpoint.save_path,
+          saveables=validated_saveables, preferred_shard=-1,
+          restore_sequentially=False)
+      saveable_index = 0
+      for saveable in validated_saveables:
+        num_specs = len(saveable.specs)
+        saveable_tensors = all_tensors[
+            saveable_index:saveable_index + num_specs]
+        saveable_index += num_specs
+        restore_op = saveable.restore(saveable_tensors, restored_shapes=None)
+        if in_graph_mode:
+          assert saveable.name not in self._checkpoint.restore_ops_by_name
+          self._checkpoint.restore_ops_by_name[saveable.name] = restore_op
+          restore_ops.append(restore_op)
+    return restore_ops
 
   @property
   def checkpoint(self):
@@ -226,87 +291,13 @@ _SlotVariableRestoration = collections.namedtuple(
     ])
 
 
-class _Checkpoint(object):
-  """Holds the status of an object-based checkpoint load."""
-
-  def __init__(self, object_graph_proto, save_path, session):
-    """Specify the checkpoint being loaded.
-
-    Args:
-      object_graph_proto: The CheckpointableObjectGraph protocol buffer
-        associated with this checkpoint.
-      save_path: The path to the checkpoint, as returned by
-        `tf.train.latest_checkpoint`.
-      session: The session to evaluate assignment ops in. Should be None if
-        executing eagerly.
-
-    Raises:
-      ValueError: If `session` is not None and eager execution is enabled.
-    """
-    self.object_graph_proto = object_graph_proto
-    self.restore_uid = ops.uid()
-    # Dictionary mapping from an id in the protocol buffer flat array to
-    # Checkpointable Python objects. This mapping may be deferred if a
-    # checkpoint is restored before all dependencies have been tracked. Uses
-    # weak references so that partial restorations don't create reference cycles
-    # (as objects with deferred dependencies will generally have references to
-    # this object).
-    self.object_by_proto_id = weakref.WeakValueDictionary()
-    self.save_path = save_path
-    reader = pywrap_tensorflow.NewCheckpointReader(save_path)
-    self.dtype_map = reader.get_variable_to_dtype_map()
-    # A mapping from optimizer proto ids to lists of slot variables to be
-    # restored when the optimizer is tracked. Only includes slot variables whose
-    # regular variables have already been created, and only for optimizer
-    # objects which have not yet been created/tracked.
-    self.deferred_slot_restorations = {}
-    # A mapping from variable proto ids to lists of slot variables to be
-    # restored when the variable is created/tracked. These get shifted over to
-    # deferred_slot_restorations if the optimizer hasn't been created when that
-    # happens.
-    self.slot_restorations = {}
-    for node_index, node in enumerate(self.object_graph_proto.nodes):
-      for slot_reference in node.slot_variables:
-        # `node` refers to an `Optimizer`, since only these have slot variables.
-        self.slot_restorations.setdefault(
-            slot_reference.original_variable_node_id, []).append(
-                _SlotVariableRestoration(
-                    optimizer_id=node_index,
-                    slot_variable_id=slot_reference.slot_variable_node_id,
-                    slot_name=slot_reference.slot_name))
-    if session is not None and context.in_eager_mode():
-      raise ValueError(
-          "Passed a session %s when executing eagerly." % (session,))
-    self.session = session
-
-
-class Checkpointable(object):
-  """Manages dependencies on other objects.
-
-  `Checkpointable` objects may have dependencies: other `Checkpointable` objects
-  which should be saved if the object declaring the dependency is saved. A
-  correctly saveable program has a dependency graph such that if changing a
-  global variable affects an object (e.g. changes the behavior of any of its
-  methods) then there is a chain of dependencies from the influenced object to
-  the variable.
-
-  Dependency edges have names, and are created implicitly when a
-  `Checkpointable` object is assigned to an attribute of another
-  `Checkpointable` object. For example:
-
-  ```
-  obj = Checkpointable()
-  obj.v = ResourceVariable(0.)
-  ```
+class CheckpointableBase(object):
+  """Base class for `Checkpointable` objects without automatic dependencies.
 
-  The `Checkpointable` object `obj` now has a dependency named "v" on a
-  variable.
-
-  `Checkpointable` objects may specify `Tensor`s to be saved and restored
-  directly (e.g. a `Variable` indicating how to save itself) rather than through
-  dependencies on other objects. See
-  `Checkpointable._scatter_tensors_from_checkpoint` and
-  `Checkpointable._gather_tensors_for_checkpoint` for details.
+  This class has no __setattr__ override for performance reasons. Dependencies
+  must be added explicitly. Unless attribute assignment is performance-critical,
+  use `Checkpointable` instead. Use `CheckpointableBase` for `isinstance`
+  checks.
   """
 
   def _maybe_initialize_checkpointable(self):
@@ -314,14 +305,17 @@ class Checkpointable(object):
 
     Not __init__, since most objects will forget to call it.
     """
-    if hasattr(self, "_checkpoint_dependencies"):
+    if hasattr(self, "_unconditional_checkpoint_dependencies"):
       # __init__ already called. This check means that we don't need
       # Checkpointable.__init__() in the constructor of every TensorFlow object.
       return
-    # A list of _CheckpointableReference objects.
-    self._checkpoint_dependencies = []
+    # A list of CheckpointableReference objects. Some classes implementing
+    # `Checkpointable`, notably `Optimizer`s, may override the
+    # _checkpoint_dependencies property with conditional dependencies
+    # (e.g. based on the current graph when saving).
+    self._unconditional_checkpoint_dependencies = []
     # Maps names -> Checkpointable objects
-    self._dependency_names = {}
+    self._unconditional_dependency_names = {}
     # Restorations for other Checkpointable objects on which this object may
     # eventually depend.
     self._deferred_dependencies = {}  # local name -> _CheckpointPosition list
@@ -333,24 +327,36 @@ class Checkpointable(object):
           "initialization code was run.")
     self._update_uid = -1
 
-  def __setattr__(self, name, value):
-    """Support self.foo = checkpointable syntax."""
-    # Perform the attribute assignment, and potentially call other __setattr__
-    # overrides such as that for tf.keras.Model.
-    super(Checkpointable, self).__setattr__(name, value)
-    if isinstance(value, Checkpointable):
-      self._track_checkpointable(
-          value, name=name,
-          # Allow the user to switch the Checkpointable which is tracked by this
-          # name, since assigning a new variable to an attribute has
-          # historically been fine (e.g. Adam did this).
-          # TODO(allenl): Should this be a warning once Checkpointable save/load
-          # is usable?
-          overwrite=True)
+  @property
+  def _checkpoint_dependencies(self):
+    """All dependencies of this object.
+
+    May be overridden to include conditional dependencies.
+
+    Returns:
+      A list of `CheckpointableReference` objects indicating named
+      `Checkpointable` dependencies which should be saved along with this
+      object.
+    """
+    return self._unconditional_checkpoint_dependencies
+
+  def _lookup_dependency(self, name):
+    """Look up a dependency by name.
+
+    May be overridden to include conditional dependencies.
+
+    Args:
+      name: The local name of the dependency.
+    Returns:
+      A `Checkpointable` object, or `None` if no dependency by this name was
+      found.
+    """
+    return self._unconditional_dependency_names.get(name, None)
 
   def _add_variable_with_custom_getter(
       self, name, shape=None, dtype=dtypes.float32,
-      initializer=None, getter=None, **kwargs_for_getter):
+      initializer=None, getter=None, overwrite=False,
+      **kwargs_for_getter):
     """Restore-on-create for a variable be saved with this `Checkpointable`.
 
     If the user has requested that this object or another `Checkpointable` which
@@ -362,12 +368,11 @@ class Checkpointable(object):
       name: A name for the variable. Must be unique within this object.
       shape: The shape of the variable.
       dtype: The data type of the variable.
-
       initializer: The initializer to use. Ignored if there is a deferred
         restoration left over from a call to
         `_restore_from_checkpoint_position`.
-
       getter: The getter to wrap which actually fetches the variable.
+      overwrite: If True, disables unique name and type checks.
       **kwargs_for_getter: Passed to the getter.
 
     Returns:
@@ -377,17 +382,21 @@ class Checkpointable(object):
       ValueError: If the variable name is not unique.
     """
     self._maybe_initialize_checkpointable()
-    if name in self._dependency_names:
+    if not overwrite and self._lookup_dependency(name) is not None:
       raise ValueError(
           ("A variable named '%s' already exists in this Checkpointable, but "
            "Checkpointable._add_variable called to create another with "
            "that name. Variable names must be unique within a Checkpointable "
            "object.") % (name,))
-    # If this is a variable with a single Tensor stored in the checkpoint, we
-    # can set that value as an initializer rather than initializing and then
-    # assigning (when executing eagerly).
-    checkpoint_initializer = self._preload_simple_restoration(
-        name=name, shape=shape)
+    if context.in_eager_mode():
+      # If this is a variable with a single Tensor stored in the checkpoint, we
+      # can set that value as an initializer rather than initializing and then
+      # assigning (when executing eagerly). This call returns None if there is
+      # nothing to restore.
+      checkpoint_initializer = self._preload_simple_restoration(
+          name=name, shape=shape)
+    else:
+      checkpoint_initializer = None
     if (checkpoint_initializer is not None
         and not (
             isinstance(initializer, CheckpointInitialValue)
@@ -400,25 +409,22 @@ class Checkpointable(object):
       # effort" to set the initializer with the highest restore UID.
       initializer = checkpoint_initializer
       shape = None
-      checkpoint_position = checkpoint_initializer.checkpoint_position
-    else:
-      checkpoint_position = None
 
     new_variable = getter(
         name=name, shape=shape, dtype=dtype, initializer=initializer,
         **kwargs_for_getter)
 
-    if (checkpoint_position is not None
-        and hasattr(new_variable, "_update_uid")
-        and new_variable._update_uid == checkpoint_position.restore_uid):  # pylint: disable=protected-access
-      session = checkpoint_position.checkpoint.session
-      if session:
-        session.run(new_variable.initializer)
     # If we set an initializer and the variable processed it, tracking will not
     # assign again. It will add this variable to our dependencies, and if there
     # is a non-trivial restoration queued, it will handle that. This also
     # handles slot variables.
-    return self._track_checkpointable(new_variable, name=name)
+    if not overwrite or isinstance(new_variable, CheckpointableBase):
+      return self._track_checkpointable(new_variable, name=name,
+                                        overwrite=overwrite)
+    else:
+      # TODO(allenl): Some variable types are not yet supported. Remove this
+      # fallback once all get_variable() return types are Checkpointable.
+      return new_variable
 
   def _preload_simple_restoration(self, name, shape):
     """Return a dependency's value for restore-on-create.
@@ -462,13 +468,10 @@ class Checkpointable(object):
     Indicates that checkpoints for this object should include variables from
     `checkpointable`.
 
-    Variables in a checkpoint are mapped to `Checkpointable`s based on names if
-    provided when the checkpoint was written, but otherwise use the order those
-    `Checkpointable`s were declared as dependencies.
-
-    To avoid breaking existing checkpoints when modifying a class, neither
-    variable names nor dependency names (the names passed to
-    `track_checkpointable`) may change.
+    Variables in a checkpoint are mapped to `Checkpointable`s based on the names
+    provided when the checkpoint was written. To avoid breaking existing
+    checkpoints when modifying a class, neither variable names nor dependency
+    names (the names passed to `_track_checkpointable`) may change.
 
     Args:
       checkpointable: A `Checkpointable` which this object depends on.
@@ -487,13 +490,14 @@ class Checkpointable(object):
       ValueError: If another object is already tracked by this name.
     """
     self._maybe_initialize_checkpointable()
-    if not isinstance(checkpointable, Checkpointable):
+    if not isinstance(checkpointable, CheckpointableBase):
       raise TypeError(
           ("Checkpointable._track_checkpointable() passed type %s, not a "
            "Checkpointable.") % (type(checkpointable),))
-    new_reference = _CheckpointableReference(name=name, ref=checkpointable)
-    if (name in self._dependency_names
-        and self._dependency_names[name] is not checkpointable):
+    new_reference = CheckpointableReference(name=name, ref=checkpointable)
+    current_object = self._lookup_dependency(name)
+    if (current_object is not None
+        and current_object is not checkpointable):
       if not overwrite:
         raise ValueError(
             ("Called Checkpointable._track_checkpointable() with name='%s', "
@@ -501,19 +505,47 @@ class Checkpointable(object):
              "dependency. Names must be unique (or overwrite=True).") % (name,))
       # This is a weird thing to do, but we're not going to stop people from
       # using __setattr__.
-      for index, (old_name, _) in enumerate(self._checkpoint_dependencies):
+      for index, (old_name, _) in enumerate(
+          self._unconditional_checkpoint_dependencies):
         if name == old_name:
-          self._checkpoint_dependencies[index] = new_reference
+          self._unconditional_checkpoint_dependencies[index] = new_reference
     else:
-      self._checkpoint_dependencies.append(new_reference)
+      self._unconditional_checkpoint_dependencies.append(new_reference)
 
-    self._dependency_names[name] = checkpointable
-    deferred_dependency_list = self._deferred_dependencies.pop(name, None)
-    if deferred_dependency_list is not None:
-      for checkpoint_position in deferred_dependency_list:
-        checkpoint_position.restore(checkpointable=checkpointable)
+    self._unconditional_dependency_names[name] = checkpointable
+    self._handle_deferred_dependencies(name=name, checkpointable=checkpointable)
     return checkpointable
 
+  def _handle_deferred_dependencies(self, name, checkpointable):
+    """Pop and load any deferred checkpoint restores into `checkpointable`.
+
+    This method does not add a new dependency on `checkpointable`, but it does
+    check if any outstanding/deferred dependencies have been queued waiting for
+    this dependency to be added (matched based on `name`). If so,
+    `checkpointable` and its dependencies are restored. The restorations are
+    considered fulfilled and so are deleted.
+
+    `_track_checkpointable` is more appropriate for adding a
+    normal/unconditional dependency, and includes handling for deferred
+    restorations. This method allows objects such as `Optimizer` to use the same
+    restoration logic while managing conditional dependencies themselves, by
+    overriding `_checkpoint_dependencies` and `_lookup_dependency` to change the
+    object's dependencies based on the context it is saved/restored in (a single
+    optimizer instance can have state associated with multiple graphs).
+
+    Args:
+      name: The name of the dependency within this object (`self`), used to
+        match `checkpointable` with values saved in a checkpoint.
+      checkpointable: The Checkpointable object to restore (inheriting from
+        `CheckpointableBase`).
+    """
+    deferred_dependencies_list = self._deferred_dependencies.pop(name, ())
+    for checkpoint_position in sorted(
+        deferred_dependencies_list,
+        key=lambda restore: restore.checkpoint.restore_uid,
+        reverse=True):
+      checkpoint_position.restore(checkpointable)
+
   def _restore_from_checkpoint_position(self, checkpoint_position):
     """Restore this object and its dependencies (may be deferred)."""
     # Attempt a breadth-first traversal, since presumably the user has more
@@ -540,16 +572,16 @@ class Checkpointable(object):
     # need to actually restore the object. However, we should pass the
     # restoration on to our dependencies.
     if checkpoint.restore_uid > self._update_uid:
-      restore_op = self._scatter_tensors_from_checkpoint(
-          checkpoint_position.restore_ops())
+      restore_ops = checkpoint_position.restore_ops()
+      # TODO(allenl): Get a list of feeds for saving Python state
       self._update_uid = checkpoint.restore_uid
     else:
-      restore_op = ()
+      restore_ops = ()
     for child in checkpoint_position.object_proto.children:
       child_position = _CheckpointPosition(
           checkpoint=checkpoint,
           proto_id=child.node_id)
-      local_object = self._dependency_names.get(child.local_name, None)
+      local_object = self._lookup_dependency(child.local_name)
       if local_object is None:
         # We don't yet have a dependency registered with this name. Save it
         # in case we do.
@@ -562,23 +594,63 @@ class Checkpointable(object):
           # resolution order (shallowest paths first). The caller is responsible
           # for emptying visit_queue.
           visit_queue.append(child_position)
-    return restore_op
+    return restore_ops
 
-  def _scatter_tensors_from_checkpoint(self, attributes):
-    """Restores this object from a checkpoint.
+  def _gather_saveables_for_checkpoint(self):
+    """Returns a dictionary of values to checkpoint with this object.
 
-    Args:
-      attributes: A dictionary of Tensors, with key corresponding to those
-        returned from _gather_tensors_for_checkpoint.
-    Returns:
-      A restore op to run (if graph building).
-    """
-    if attributes:
-      raise AssertionError(
-          ("A Checkpointable object which was not expecting any data received "
-           "some from a checkpoint. (Got %s)") % (attributes,))
-    return ()  # No restore ops
+    Keys in the returned dictionary are local to this object and in a separate
+    namespace from dependencies. Values may either be `SaveableObject`s or
+    variables easily converted to `SaveableObject`s (as in `tf.train.Saver`'s
+    `var_list` constructor argument).
 
-  def _gather_tensors_for_checkpoint(self):
-    """Returns a dictionary of Tensors to save with this object."""
+    Returned values must be saved only by this object; if any value may be
+    shared, it should instead be a dependency. For example, variable objects
+    save their own values with the key `VARIABLE_VALUE_KEY`, but objects which
+    reference variables simply add a dependency.
+    """
     return {}
+
+
+class Checkpointable(CheckpointableBase):
+  """Manages dependencies on other objects.
+
+  `Checkpointable` objects may have dependencies: other `Checkpointable` objects
+  which should be saved if the object declaring the dependency is saved. A
+  correctly saveable program has a dependency graph such that if changing a
+  global variable affects an object (e.g. changes the behavior of any of its
+  methods) then there is a chain of dependencies from the influenced object to
+  the variable.
+
+  Dependency edges have names, and are created implicitly when a
+  `Checkpointable` object is assigned to an attribute of another
+  `Checkpointable` object. For example:
+
+  ```
+  obj = Checkpointable()
+  obj.v = ResourceVariable(0.)
+  ```
+
+  The `Checkpointable` object `obj` now has a dependency named "v" on a
+  variable.
+
+  `Checkpointable` objects may specify `Tensor`s to be saved and restored
+  directly (e.g. a `Variable` indicating how to save itself) rather than through
+  dependencies on other objects. See
+  `Checkpointable._gather_saveables_for_checkpoint` for details.
+  """
+
+  def __setattr__(self, name, value):
+    """Support self.foo = checkpointable syntax."""
+    # Perform the attribute assignment, and potentially call other __setattr__
+    # overrides such as that for tf.keras.Model.
+    super(Checkpointable, self).__setattr__(name, value)
+    if isinstance(value, CheckpointableBase):
+      self._track_checkpointable(
+          value, name=name,
+          # Allow the user to switch the Checkpointable which is tracked by this
+          # name, since assigning a new variable to an attribute has
+          # historically been fine (e.g. Adam did this).
+          # TODO(allenl): Should this be a warning once Checkpointable save/load
+          # is usable?
+          overwrite=True)
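Review note: the attribute-assignment tracking and the deferred-restore ordering introduced above can be illustrated with a small pure-Python sketch. Everything here (`ToyCheckpointable`, `ToyVariable`, the `(restore_uid, value)` tuples) is a hypothetical stand-in written for illustration; it is not the TensorFlow implementation and omits protos, sessions, and restore ops.

```python
class ToyCheckpointable(object):
    """Hypothetical stand-in for Checkpointable; not the TensorFlow class."""

    def __init__(self):
        # Bypass our own __setattr__ while creating bookkeeping state.
        object.__setattr__(self, "_deps", [])       # ordered (name, object) pairs
        object.__setattr__(self, "_dep_names", {})  # name -> object
        object.__setattr__(self, "_deferred", {})   # name -> [(restore_uid, value)]

    def __setattr__(self, name, value):
        # Assign first, then track, so `obj.v = other` creates a named edge.
        object.__setattr__(self, name, value)
        if isinstance(value, ToyCheckpointable):
            self._track(value, name, overwrite=True)

    def _track(self, obj, name, overwrite=False):
        current = self._dep_names.get(name)
        if current is not None and current is not obj:
            if not overwrite:
                raise ValueError("name %r is already tracked" % (name,))
            # Replace the existing edge in place, keeping its position.
            for index, (old_name, _) in enumerate(self._deps):
                if old_name == name:
                    self._deps[index] = (name, obj)
        elif current is None:
            # Simplification in this sketch: re-tracking the same object
            # under the same name does not add a second edge.
            self._deps.append((name, obj))
        self._dep_names[name] = obj
        self._handle_deferred(name, obj)
        return obj

    def _handle_deferred(self, name, obj):
        # Pop queued restores for this name; apply the newest first so that
        # older ones are skipped by the UID check in restore().
        pending = self._deferred.pop(name, ())
        for restore_uid, value in sorted(
                pending, key=lambda item: item[0], reverse=True):
            obj.restore(restore_uid, value)

    def restore(self, restore_uid, value):
        """Objects without their own state ignore restored values."""


class ToyVariable(ToyCheckpointable):
    """Hypothetical variable holding one Python value."""

    def __init__(self, value):
        super(ToyVariable, self).__init__()
        self.value = value
        self._update_uid = -1
        self.assignments = 0

    def restore(self, restore_uid, value):
        # Only restores newer than the last applied one take effect.
        if restore_uid > self._update_uid:
            self.value = value
            self._update_uid = restore_uid
            self.assignments += 1


root = ToyCheckpointable()
# Three restores were requested before the dependency "v" existed.
root._deferred["v"] = [(1, 10.0), (3, 30.0), (2, 20.0)]
root.v = ToyVariable(0.0)
first = root.v
root.v = ToyVariable(5.0)  # overwrite=True swaps the tracked object
```

In this sketch the first variable ends up with the value from the restore with the highest UID after a single assignment, which is the motivation for sorting in descending order.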
diff --git a/tensorflow/python/training/checkpointable_utils.py b/tensorflow/python/training/checkpointable_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..32123f87ef2d12497077ab0e2f7d4d4cad1ec5dd
--- /dev/null
+++ b/tensorflow/python/training/checkpointable_utils.py
@@ -0,0 +1,78 @@
+"""Utilities for saving/loading Checkpointable objects."""
+# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import weakref
+
+from tensorflow.python.framework import ops
+from tensorflow.python.training import checkpointable
+from tensorflow.python.training import saver as saver_lib
+
+
+class _Checkpoint(object):
+  """Holds the status of an object-based checkpoint load."""
+
+  def __init__(self, object_graph_proto, save_path, dtype_map=None):
+    """Specify the checkpoint being loaded.
+
+    Args:
+      object_graph_proto: The CheckpointableObjectGraph protocol buffer
+        associated with this checkpoint.
+      save_path: A string `Tensor`. The path to the checkpoint, as returned by
+        `tf.train.latest_checkpoint`.
+      dtype_map: When executing eagerly, specifies dtypes for creating slot
+        variables. None when graph building.
+    """
+    self.builder = saver_lib.BulkSaverBuilder()
+    self.object_graph_proto = object_graph_proto
+    self.restore_uid = ops.uid()
+    # Maps from objects to lists of attributes which were in the checkpoint but
+    # not loaded into any object, for error checking.
+    self.unused_attributes = weakref.WeakKeyDictionary()
+    # Dictionary mapping from an id in the protocol buffer flat array to
+    # Checkpointable Python objects. This mapping may be deferred if a
+    # checkpoint is restored before all dependencies have been tracked. Uses
+    # weak references so that partial restorations don't create reference cycles
+    # (as objects with deferred dependencies will generally have references to
+    # this object).
+    self.object_by_proto_id = weakref.WeakValueDictionary()
+    self.save_path = save_path
+    self.dtype_map = dtype_map
+    # When graph building, contains a list of ops to run to restore objects from
+    # this checkpoint.
+    self.restore_ops = []
+    self.restore_ops_by_name = {}
+    # A mapping from optimizer proto ids to lists of slot variables to be
+    # restored when the optimizer is tracked. Only includes slot variables whose
+    # regular variables have already been created, and only for optimizer
+    # objects which have not yet been created/tracked.
+    self.deferred_slot_restorations = {}
+    # A mapping from variable proto ids to lists of slot variables to be
+    # restored when the variable is created/tracked. These get shifted over to
+    # deferred_slot_restorations if the optimizer hasn't been created when that
+    # happens.
+    self.slot_restorations = {}
+    for node_index, node in enumerate(self.object_graph_proto.nodes):
+      for slot_reference in node.slot_variables:
+        # `node` refers to an `Optimizer`, since only these have slot variables.
+        self.slot_restorations.setdefault(
+            slot_reference.original_variable_node_id, []).append(
+                checkpointable._SlotVariableRestoration(  # pylint: disable=protected-access
+                    optimizer_id=node_index,
+                    slot_variable_id=slot_reference.slot_variable_node_id,
+                    slot_name=slot_reference.slot_name))
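Review note: the comment on `object_by_proto_id` relies on `weakref.WeakValueDictionary` dropping an entry once nothing else references the value. A minimal sketch of that standard-library behavior follows; `ToyObject` and the variable names are hypothetical and only stand in for Checkpointable Python objects.

```python
import gc
import weakref


class ToyObject(object):
    """Hypothetical stand-in for a Checkpointable Python object."""


# Maps a proto id to the Python object it was matched with, holding the
# object only weakly.
object_by_proto_id = weakref.WeakValueDictionary()

restored = ToyObject()
object_by_proto_id[0] = restored
present_while_alive = 0 in object_by_proto_id

# Dropping the last strong reference lets the entry disappear, so the
# mapping does not keep partially restored objects alive on its own.
del restored
gc.collect()
present_after_delete = 0 in object_by_proto_id
```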
diff --git a/tensorflow/python/training/device_setter.py b/tensorflow/python/training/device_setter.py
index 689088bb41edfd94a1d483ed2b5f7447e9e060e7..0e824d89e9f5444fd91c2f7123c9f93495d1a804 100644
--- a/tensorflow/python/training/device_setter.py
+++ b/tensorflow/python/training/device_setter.py
@@ -25,6 +25,15 @@ from tensorflow.python.platform import tf_logging as logging
 from tensorflow.python.training import server_lib
 from tensorflow.python.util.tf_export import tf_export
 
+# This is a tuple of PS ops used by tf.estimator.Estimator which should work in
+# almost all cases.
+STANDARD_PS_OPS = (
+    "Variable", "VariableV2", "AutoReloadVariable", "MutableHashTable",
+    "MutableHashTableV2", "MutableHashTableOfTensors",
+    "MutableHashTableOfTensorsV2", "MutableDenseHashTable",
+    "MutableDenseHashTableV2", "VarHandleOp"
+)
+
 
 class _RoundRobinStrategy(object):
   """Returns the next ps task index for placement in round-robin order.
diff --git a/tensorflow/python/training/ftrl.py b/tensorflow/python/training/ftrl.py
index 9d02e694db15637126f37ee5575638908b351def..4fa081fab72df62107cf4957d4ff68240ced9ee0 100644
--- a/tensorflow/python/training/ftrl.py
+++ b/tensorflow/python/training/ftrl.py
@@ -53,7 +53,7 @@ class FtrlOptimizer(optimizer.Optimizer):
       learning_rate: A float value or a constant float `Tensor`.
       learning_rate_power: A float value, must be less or equal to zero.
       initial_accumulator_value: The starting value for accumulators.
-        Only positive values are allowed.
+        Only zero or positive values are allowed.
       l1_regularization_strength: A float value, must be greater than or
         equal to zero.
       l2_regularization_strength: A float value, must be greater than or
@@ -84,9 +84,10 @@ class FtrlOptimizer(optimizer.Optimizer):
     """
     super(FtrlOptimizer, self).__init__(use_locking, name)
 
-    if initial_accumulator_value <= 0.0:
-      raise ValueError("initial_accumulator_value %f needs to be positive" %
-                       initial_accumulator_value)
+    if initial_accumulator_value < 0.0:
+      raise ValueError(
+          "initial_accumulator_value %f needs to be positive or zero" %
+          initial_accumulator_value)
     if learning_rate_power > 0.0:
       raise ValueError("learning_rate_power %f needs to be negative or zero" %
                        learning_rate_power)
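Review note: the hunk above relaxes the accumulator check from strictly positive to non-negative. The boundary can be sketched in isolation as below; `check_initial_accumulator_value` is a hypothetical helper written for this note, not a TensorFlow function.

```python
def check_initial_accumulator_value(initial_accumulator_value):
    """Hypothetical helper mirroring the relaxed check; not part of TensorFlow."""
    if initial_accumulator_value < 0.0:
        raise ValueError(
            "initial_accumulator_value %f needs to be positive or zero" %
            initial_accumulator_value)
    return initial_accumulator_value


# Zero is now accepted.
accepted_zero = check_initial_accumulator_value(0.0)

# Negative values are still rejected.
try:
    check_initial_accumulator_value(-0.1)
    rejected_negative = False
except ValueError:
    rejected_negative = True
```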
diff --git a/tensorflow/python/training/learning_rate_decay_test.py b/tensorflow/python/training/learning_rate_decay_test.py
index 1ce8c156a0b126f680bad62267f90e31a23febed..23b30632f6d9b70389090ac227a692081e523be1 100644
--- a/tensorflow/python/training/learning_rate_decay_test.py
+++ b/tensorflow/python/training/learning_rate_decay_test.py
@@ -43,8 +43,8 @@ class LRDecayTest(test_util.TensorFlowTestCase):
 
   def testStaircase(self):
     with self.test_session():
-      step = gen_state_ops._variable(shape=[], dtype=dtypes.int32,
-                                     name="step", container="", shared_name="")
+      step = gen_state_ops.variable(shape=[], dtype=dtypes.int32,
+                                    name="step", container="", shared_name="")
       assign_100 = state_ops.assign(step, 100)
       assign_1 = state_ops.assign(step, 1)
       assign_2 = state_ops.assign(step, 2)
@@ -264,8 +264,8 @@ class ExponentialDecayTest(test_util.TensorFlowTestCase):
     initial_lr = 0.1
     k = 10
     decay_rate = 0.96
-    step = gen_state_ops._variable(shape=[], dtype=dtypes.int32,
-                                   name="step", container="", shared_name="")
+    step = gen_state_ops.variable(
+        shape=[], dtype=dtypes.int32, name="step", container="", shared_name="")
     assign_step = state_ops.assign(step, 0)
     increment_step = state_ops.assign_add(step, 1)
     decayed_lr = learning_rate_decay.natural_exp_decay(initial_lr, step,
@@ -281,8 +281,8 @@ class ExponentialDecayTest(test_util.TensorFlowTestCase):
     initial_lr = 0.1
     k = 10
     decay_rate = 0.96
-    step = gen_state_ops._variable(shape=[], dtype=dtypes.int32,
-                                   name="step", container="", shared_name="")
+    step = gen_state_ops.variable(
+        shape=[], dtype=dtypes.int32, name="step", container="", shared_name="")
     assign_step = state_ops.assign(step, 0)
     increment_step = state_ops.assign_add(step, 1)
     decayed_lr = learning_rate_decay.natural_exp_decay(initial_lr,
@@ -304,8 +304,8 @@ class InverseDecayTest(test_util.TensorFlowTestCase):
     initial_lr = 0.1
     k = 10
     decay_rate = 0.96
-    step = gen_state_ops._variable(shape=[], dtype=dtypes.int32,
-                                   name="step", container="", shared_name="")
+    step = gen_state_ops.variable(
+        shape=[], dtype=dtypes.int32, name="step", container="", shared_name="")
     assign_step = state_ops.assign(step, 0)
     increment_step = state_ops.assign_add(step, 1)
     decayed_lr = learning_rate_decay.inverse_time_decay(initial_lr,
@@ -323,8 +323,8 @@ class InverseDecayTest(test_util.TensorFlowTestCase):
     initial_lr = 0.1
     k = 10
     decay_rate = 0.96
-    step = gen_state_ops._variable(shape=[], dtype=dtypes.int32,
-                                   name="step", container="", shared_name="")
+    step = gen_state_ops.variable(
+        shape=[], dtype=dtypes.int32, name="step", container="", shared_name="")
     assign_step = state_ops.assign(step, 0)
     increment_step = state_ops.assign_add(step, 1)
     decayed_lr = learning_rate_decay.inverse_time_decay(initial_lr,
diff --git a/tensorflow/python/training/moving_averages_test.py b/tensorflow/python/training/moving_averages_test.py
index 6efdeb286657e761a4c46634b9408121765a447b..6717811bbb0f05723a5ad0fbcbfba75249d0d43b 100644
--- a/tensorflow/python/training/moving_averages_test.py
+++ b/tensorflow/python/training/moving_averages_test.py
@@ -376,7 +376,7 @@ class ExponentialMovingAverageTest(test.TestCase):
     with ops.device("/job:dev_v0"):
       v0 = variables.Variable(10.0, name="v0")
     with ops.device("/job:dev_v1"):
-      v1 = gen_state_ops._variable(
+      v1 = gen_state_ops.variable(
           shape=[1],
           dtype=dtypes.float32,
           name="v1",
diff --git a/tensorflow/python/training/optimizer.py b/tensorflow/python/training/optimizer.py
index 762658175a018b1cc8fe8dcf6df4b525d2f9158b..ba7e087c5a63eedd2b785a9af75051db32297c53 100644
--- a/tensorflow/python/training/optimizer.py
+++ b/tensorflow/python/training/optimizer.py
@@ -98,6 +98,9 @@ class _RefVariableProcessor(_OptimizableVariable):
   def __init__(self, v):
     self._v = v
 
+  def __str__(self):
+    return "<_RefVariableProcessor(%s)>" % self._v
+
   def target(self):
     return self._v._ref()  # pylint: disable=protected-access
 
@@ -213,7 +216,11 @@ def _get_processor(v):
 
 
 @tf_export("train.Optimizer")
-class Optimizer(checkpointable.Checkpointable):
+class Optimizer(
+    # Optimizers inherit from CheckpointableBase rather than Checkpointable
+    # since they do most of their dependency management themselves (slot
+    # variables are special-cased, and non-slot variables are keyed to graphs).
+    checkpointable.CheckpointableBase):
   """Base class for optimizers.
 
   This class defines the API to add Ops to train a model.  You never use this
@@ -324,9 +331,18 @@ class Optimizer(checkpointable.Checkpointable):
     self._use_locking = use_locking
     self._name = name
     # Dictionary of slots.
-    #  {slot_name : { variable_to_train: slot_for_the_variable, ...}, ... }
+    #  {slot_name :
+    #      {_var_key(variable_to_train): slot_for_the_variable, ... },
+    #   ... }
     self._slots = {}
     self._non_slot_dict = {}
+    # For implementing Checkpointable. Stores information about how to restore
+    # slot variables which have not yet been created
+    # (checkpointable._CheckpointPosition objects).
+    #  {slot_name :
+    #      {_var_key(variable_to_train): [checkpoint_position, ... ], ... },
+    #   ... }
+    self._deferred_slot_restorations = {}
 
   def get_name(self):
     return self._name
@@ -633,7 +649,8 @@ class Optimizer(checkpointable.Checkpointable):
 
   def _create_non_slot_variable(self, initial_value, name, colocate_with):
     """Add an extra variable, not associated with a slot."""
-    if context.in_graph_mode():
+    in_graph_mode = context.in_graph_mode()
+    if in_graph_mode:
       graph = colocate_with.graph
     else:
       graph = None
@@ -641,12 +658,51 @@ class Optimizer(checkpointable.Checkpointable):
     key = (name, graph)
     v = self._non_slot_dict.get(key, None)
     if v is None:
+      self._maybe_initialize_checkpointable()
       with ops.colocate_with(colocate_with):
+        if not in_graph_mode:
+          restored_initial_value = self._preload_simple_restoration(
+              name=name, shape=None)
+          if restored_initial_value is not None:
+            initial_value = restored_initial_value
         v = variable_scope.variable(initial_value, name=name, trainable=False)
+        # Restore this variable by name if necessary, but don't add a
+        # Checkpointable dependency. Optimizers return the current graph's
+        # non-slot variables from _checkpoint_dependencies explicitly rather
+        # than unconditionally adding dependencies (since there may be multiple
+        # non-slot variables with the same name in different graphs, trying to
+        # save all of them would result in errors).
+        self._handle_deferred_dependencies(name=name, checkpointable=v)
       self._non_slot_dict[key] = v
 
     return v
 
+  @property
+  def _checkpoint_dependencies(self):
+    """From Checkpointable. Gather graph-specific non-slot variables to save."""
+    current_graph_non_slot_variables = []
+    current_graph_key = ops.get_default_graph()._graph_key  # pylint: disable=protected-access
+    for (name, _), variable_object in sorted(self._non_slot_dict.items(),
+                                             # Avoid comparing graphs
+                                             key=lambda item: item[0][0]):
+      if variable_object._graph_key == current_graph_key:  # pylint: disable=protected-access
+        current_graph_non_slot_variables.append(
+            checkpointable.CheckpointableReference(
+                name=name, ref=variable_object))
+    return (super(Optimizer, self)._checkpoint_dependencies
+            + current_graph_non_slot_variables)
+
+  def _lookup_dependency(self, name):
+    """From Checkpointable. Find a non-slot variable in the current graph."""
+    unconditional = super(Optimizer, self)._lookup_dependency(name)
+    if unconditional is not None:
+      return unconditional
+    if context.in_graph_mode():
+      graph = ops.get_default_graph()
+    else:
+      graph = None
+    return self._get_non_slot_variable(name, graph=graph)
+
   def _get_non_slot_variable(self, name, graph=None):
     return self._non_slot_dict.get((name, graph), None)
 
@@ -884,7 +940,11 @@ class Optimizer(checkpointable.Checkpointable):
     """
     named_slots = self._slot_dict(slot_name)
     if _var_key(var) not in named_slots:
-      named_slots[_var_key(var)] = slot_creator.create_slot(var, val, op_name)
+      new_slot_variable = slot_creator.create_slot(var, val, op_name)
+      self._restore_slot_variable(
+          slot_name=slot_name, variable=var,
+          slot_variable=new_slot_variable)
+      named_slots[_var_key(var)] = new_slot_variable
     return named_slots[_var_key(var)]
 
   def _get_or_make_slot_with_initializer(self, var, initializer, shape, dtype,
@@ -905,8 +965,12 @@ class Optimizer(checkpointable.Checkpointable):
     """
     named_slots = self._slot_dict(slot_name)
     if _var_key(var) not in named_slots:
-      named_slots[_var_key(var)] = slot_creator.create_slot_with_initializer(
+      new_slot_variable = slot_creator.create_slot_with_initializer(
           var, initializer, shape, dtype, op_name)
+      self._restore_slot_variable(
+          slot_name=slot_name, variable=var,
+          slot_variable=new_slot_variable)
+      named_slots[_var_key(var)] = new_slot_variable
     return named_slots[_var_key(var)]
 
   def _zeros_slot(self, var, slot_name, op_name):
@@ -923,12 +987,43 @@ class Optimizer(checkpointable.Checkpointable):
     """
     named_slots = self._slot_dict(slot_name)
     if _var_key(var) not in named_slots:
-      named_slots[_var_key(var)] = slot_creator.create_zeros_slot(var, op_name)
+      new_slot_variable = slot_creator.create_zeros_slot(var, op_name)
+      self._restore_slot_variable(
+          slot_name=slot_name, variable=var,
+          slot_variable=new_slot_variable)
+      named_slots[_var_key(var)] = new_slot_variable
     return named_slots[_var_key(var)]
 
-  def _process_slot_restoration(
+  # --------------
+  # For implementing the Checkpointable interface.
+  # --------------
+
+  def _restore_slot_variable(self, slot_name, variable, slot_variable):
+    """Restore a newly created slot variable's value."""
+    variable_key = _var_key(variable)
+    deferred_restorations = self._deferred_slot_restorations.get(
+        slot_name, {}).pop(variable_key, [])
+    # Iterate over restores, highest restore UID first to minimize the number
+    # of assignments.
+    deferred_restorations.sort(key=lambda position: position.restore_uid,
+                               reverse=True)
+    for checkpoint_position in deferred_restorations:
+      checkpoint_position.restore(slot_variable)
+
+  def _create_or_restore_slot_variable(
       self, slot_variable_position, slot_name, variable):
-    """Restore a slot variable's value (creating it if necessary).
+    """Restore a slot variable's value, possibly creating it.
+
+    Called when a variable which has an associated slot variable is created or
+    restored. When executing eagerly, we create the slot variable with a
+    restoring initializer.
+
+    No new variables are created when graph building. Instead,
+    _restore_slot_variable catches these after normal creation and adds restore
+    ops to the graph. This method is nonetheless important when graph building
+    for the case when a slot variable has already been created but `variable`
+    has just been added to a dependency graph (causing us to realize that the
+    slot variable needs to be restored).
 
     Args:
       slot_variable_position: A `checkpointable._CheckpointPosition` object
@@ -939,28 +1034,16 @@ class Optimizer(checkpointable.Checkpointable):
     named_slots = self._slot_dict(slot_name)
     variable_key = _var_key(variable)
     slot_variable = named_slots.get(variable_key, None)
-    if slot_variable is None:
-      if slot_variable_position.is_simple_variable():
-        initializer = checkpointable.CheckpointInitialValue(
-            checkpoint_position=slot_variable_position)
-        slot_variable = self._get_or_make_slot(
-            var=variable,
-            val=initializer,
-            slot_name=slot_name,
-            op_name=self._name)
-        if slot_variable._update_uid == slot_variable_position.restore_uid:  # pylint: disable=protected-access
-          # If our restoration was set (not given with custom getters), run
-          # it. Otherwise wait for the restore() call below to restore if
-          # necessary.
-          session = slot_variable_position.checkpoint.session
-          if session:
-            session.run(slot_variable.initializer)
-
-      else:
-        raise NotImplementedError(
-            "Currently only variables with no dependencies can be loaded as "
-            "slot variables. File a feature request if this limitation bothers "
-            "you. (Got %s)" % (slot_variable_position,))
+    if (slot_variable is None
+        and context.in_eager_mode()
+        and slot_variable_position.is_simple_variable()):
+      initializer = checkpointable.CheckpointInitialValue(
+          checkpoint_position=slot_variable_position)
+      slot_variable = self._get_or_make_slot(
+          var=variable,
+          val=initializer,
+          slot_name=slot_name,
+          op_name=self._name)
       # Slot variables are not owned by any one object (because we don't want to
       # save the slot variable if the optimizer is saved without the non-slot
       # variable, or if the non-slot variable is saved without the optimizer;
@@ -968,4 +1051,15 @@ class Optimizer(checkpointable.Checkpointable):
       # variable, variable)). So we don't _track_ slot variables anywhere, and
       # instead special-case this dependency and otherwise pretend it's a normal
       # graph.
-    slot_variable_position.restore(slot_variable)
+    if slot_variable is not None:
+      # If we've either made this slot variable, or if we've pulled out an
+      # existing slot variable, we should restore it.
+      slot_variable_position.restore(slot_variable)
+    else:
+      # We didn't make the slot variable. Defer restoring until it gets created
+      # normally. We keep a list rather than the one with the highest restore
+      # UID in case slot variables have their own dependencies, in which case
+      # those could differ between restores.
+      self._deferred_slot_restorations.setdefault(
+          slot_name, {}).setdefault(variable_key, []).append(
+              slot_variable_position)
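Review note: the deferred slot restoration bookkeeping in this file is a nested dictionary keyed by slot name and then by variable key. A pure-Python sketch of the defer-then-restore flow follows; `ToySlot`, `ToyPosition`, `defer`, and `restore_new_slot` are hypothetical names for illustration and do not reproduce `_CheckpointPosition` or the optimizer's actual methods.

```python
class ToySlot(object):
    """Hypothetical slot variable holding one value."""

    def __init__(self):
        self.value = 0.0
        self.update_uid = -1


class ToyPosition(object):
    """Hypothetical stand-in for a position within a loaded checkpoint."""

    def __init__(self, restore_uid, value):
        self.restore_uid = restore_uid
        self.value = value

    def restore(self, slot):
        # Skip restores that are not newer than what the slot already holds.
        if self.restore_uid > slot.update_uid:
            slot.value = self.value
            slot.update_uid = self.restore_uid


def defer(deferred, slot_name, variable_key, position):
    # {slot_name: {variable_key: [position, ...]}}
    deferred.setdefault(slot_name, {}).setdefault(
        variable_key, []).append(position)


def restore_new_slot(deferred, slot_name, variable_key, slot):
    pending = deferred.get(slot_name, {}).pop(variable_key, [])
    # Highest restore UID first, so later iterations are no-ops.
    pending.sort(key=lambda position: position.restore_uid, reverse=True)
    for position in pending:
        position.restore(slot)


deferred = {}
defer(deferred, "m", "w", ToyPosition(restore_uid=1, value=0.1))
defer(deferred, "m", "w", ToyPosition(restore_uid=2, value=0.2))

# The slot is created later, at which point queued restores are applied.
slot = ToySlot()
restore_new_slot(deferred, "m", "w", slot)
```

Keeping the whole list rather than only the newest position matches the rationale given in the comment above: slot variables may have their own dependencies that differ between restores.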
diff --git a/tensorflow/python/training/saver.py b/tensorflow/python/training/saver.py
index 3888e9bba42dc89055638ad0abe2b7e1a9f5b548..6c80562968d3201be4654f8cee5a9f94a4d1f104 100644
--- a/tensorflow/python/training/saver.py
+++ b/tensorflow/python/training/saver.py
@@ -50,6 +50,7 @@ from tensorflow.python.ops import string_ops
 from tensorflow.python.ops import variables
 from tensorflow.python.platform import gfile
 from tensorflow.python.platform import tf_logging as logging
+from tensorflow.python.training import checkpointable
 from tensorflow.python.training import training_util
 from tensorflow.python.training.checkpoint_state_pb2 import CheckpointState
 from tensorflow.python.util import compat
@@ -196,8 +197,8 @@ class BaseSaverBuilder(object):
       # Copy the restored tensor to the variable's device.
       with ops.device(self._var_device):
         restored_tensor = array_ops.identity(restored_tensor)
-      return resource_variable_ops.shape_safe_assign_variable_handle(
-          self.handle_op, self._var_shape, restored_tensor)
+        return resource_variable_ops.shape_safe_assign_variable_handle(
+            self.handle_op, self._var_shape, restored_tensor)
 
   def __init__(self, write_version=saver_pb2.SaverDef.V2):
     self._write_version = write_version
@@ -310,8 +311,7 @@ class BaseSaverBuilder(object):
     Returns:
       A string tensor.
     """
-    # pylint: disable=protected-access
-    return gen_io_ops._sharded_filename(filename_tensor, shard, num_shards)
+    return gen_io_ops.sharded_filename(filename_tensor, shard, num_shards)
 
   def _AddSaveOps(self, filename_tensor, saveables):
     """Add ops to save variables that are on the same shard.
@@ -420,8 +420,7 @@ class BaseSaverBuilder(object):
         sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
     # Return the sharded name for the save path.
     with ops.control_dependencies([x.op for x in sharded_saves]):
-      # pylint: disable=protected-access
-      return gen_io_ops._sharded_filespec(filename_tensor, num_shards_tensor)
+      return gen_io_ops.sharded_filespec(filename_tensor, num_shards_tensor)
 
   def _AddRestoreOps(self,
                      filename_tensor,
@@ -577,10 +576,18 @@ class BaseSaverBuilder(object):
           names_to_saveables[name].append(var)
         else:
           names_to_saveables[name] = [var]
+      elif (isinstance(var, checkpointable.CheckpointableBase)
+            and not isinstance(var, variables.Variable)):
+        names_to_saveables.update(
+            BaseSaverBuilder.OpListToDict(
+                list(var._gather_saveables_for_checkpoint().values())))
       else:
         if context.in_graph_mode():
           if convert_variable_to_tensor:
-            var = ops.internal_convert_to_tensor(var, as_ref=True)
+            if isinstance(var, resource_variable_ops.ResourceVariable):
+              var = var._graph_element  # pylint: disable=protected-access
+            else:
+              var = ops.internal_convert_to_tensor(var, as_ref=True)
             if not BaseSaverBuilder._IsVariable(var):
               raise TypeError("Variable to save is not a Variable: %s" % var)
           if var.op.type == "ReadVariableOp":
@@ -670,7 +677,10 @@ class BaseSaverBuilder(object):
                              "mode is enabled, type: %s." % type(op))
           saveable = BaseSaverBuilder.ResourceVariableSaveable(op, "", name)
         else:
-          variable = ops.internal_convert_to_tensor(op, as_ref=True)
+          if isinstance(op, resource_variable_ops.ResourceVariable):
+            variable = op._graph_element  # pylint: disable=protected-access
+          else:
+            variable = ops.internal_convert_to_tensor(op, as_ref=True)
           if not BaseSaverBuilder._IsVariable(variable):
             raise TypeError("names_to_saveables must be a dict mapping string "
                             "names to Tensors/Variables. Not a variable: %s" %
diff --git a/tensorflow/python/training/saver_test.py b/tensorflow/python/training/saver_test.py
index c5a6f49df599434ab3bc1a9fe3d85db6f824071e..794776544998bef81e3c6cac4815148976037ea5 100644
--- a/tensorflow/python/training/saver_test.py
+++ b/tensorflow/python/training/saver_test.py
@@ -53,6 +53,7 @@ from tensorflow.python.lib.io import file_io
 from tensorflow.python.ops import array_ops
 from tensorflow.python.ops import control_flow_ops
 from tensorflow.python.ops import data_flow_ops
+from tensorflow.python.ops import gradients_impl
 from tensorflow.python.ops import math_ops
 from tensorflow.python.ops import nn_ops
 from tensorflow.python.ops import partitioned_variables
@@ -66,6 +67,7 @@ from tensorflow.python.platform import gfile
 from tensorflow.python.platform import test
 from tensorflow.python.summary import summary
 from tensorflow.python.training import adam
+from tensorflow.python.training import checkpointable
 from tensorflow.python.training import gradient_descent
 from tensorflow.python.training import queue_runner_impl
 from tensorflow.python.training import saver as saver_module
@@ -260,6 +262,24 @@ class SaverTest(test.TestCase):
       save2.restore(sess, save_path)
       self.assertEquals(self.evaluate(v), [1])
 
+  def testNoAdditionalOpsAddedBySaverForResourceVariablesOutsideSaveScope(self):
+    with ops_lib.Graph().as_default() as g:
+      v = resource_variable_ops.ResourceVariable(1.0, name="v")
+      with ops_lib.name_scope("saver1"):
+        saver_module.Saver()
+      with ops_lib.name_scope("saver2"):
+        saver_module.Saver({"name": v})
+    ops_in_saver1_scope_but_not_save_scope = [
+        op for op in g.get_operations()
+        if (op.name.startswith("saver1/") and
+            not op.name.startswith("saver1/save/"))]
+    self.assertEqual(ops_in_saver1_scope_but_not_save_scope, [])
+    ops_in_saver2_scope_but_not_save_scope = [
+        op for op in g.get_operations()
+        if (op.name.startswith("saver2/") and
+            not op.name.startswith("saver2/save/"))]
+    self.assertEqual(ops_in_saver2_scope_but_not_save_scope, [])
+
   def testSaveCopyRestoreWithSaveRelativePaths(self):
     """Save, copy checkpoint dir and restore from copied dir.
 
@@ -2039,6 +2059,80 @@ class MetaGraphTest(test.TestCase):
     self._testGraphExtensionRestore(test_dir)
     self._testRestoreFromTrainGraphWithControlContext(test_dir)
 
+  def _testWhileLoopAndGradientSerDes(self, outer_body_fn):
+    # Build a while loop with `outer_body_fn`, export it, and verify that it can
+    # be imported and the gradient can be built and run correctly.
+
+    test_dir = self._get_test_dir("nested_control_flow")
+    filename = os.path.join(test_dir, "metafile")
+    saver_ckpt = os.path.join(test_dir, "saver.ckpt")
+
+    # Create while loop using `outer_body_fn`.
+    with ops_lib.Graph().as_default():
+      var = variables.Variable(0)
+      var_name = var.name
+      _, output = control_flow_ops.while_loop(lambda i, x: i < 5, outer_body_fn,
+                                              [0, var])
+      output_name = output.name
+      init_op = variables.global_variables_initializer()
+
+      # Generate a MetaGraphDef containing the while loop.
+      with session.Session() as sess:
+        sess.run(init_op)
+        sess.run(output)
+        saver = saver_module.Saver()
+        saver.save(sess, saver_ckpt)
+        saver.export_meta_graph(filename)
+
+      # Build and run the gradients of the while loop. We use this below to
+      # verify that the gradients are correct with an imported MetaGraphDef.
+      grad = gradients_impl.gradients([output], [var])
+      with session.Session() as sess:
+        sess.run(init_op)
+        expected_grad_value = sess.run(grad)
+
+    # Restore the MetaGraphDef into a new Graph.
+    with ops_lib.Graph().as_default():
+      with session.Session() as sess:
+        saver = saver_module.import_meta_graph(filename)
+        saver.restore(sess, saver_ckpt)
+
+      # Make sure we can still build gradients and get the same result.
+      var = ops_lib.get_default_graph().get_tensor_by_name(var_name)
+      output = ops_lib.get_default_graph().get_tensor_by_name(output_name)
+      grad = gradients_impl.gradients([output], [var])
+
+      init_op = variables.global_variables_initializer()
+
+      with session.Session() as sess:
+        sess.run(init_op)
+        actual_grad_value = sess.run(grad)
+        self.assertEqual(expected_grad_value, actual_grad_value)
+
+  def testNestedWhileLoopsSerDes(self):
+    # Test two simple nested while loops.
+    def body(i, x):
+      _, r = control_flow_ops.while_loop(lambda j, y: j < 3,
+                                         lambda j, y: (j + 1, y + x),
+                                         [0, 0])
+      return i + 1, x + r
+    self._testWhileLoopAndGradientSerDes(body)
+
+  def testNestedControlFlowSerDes(self):
+    # Test while loop in a cond in a while loop.
+    # pylint: disable=g-long-lambda
+    def body(i, x):
+      cond_result = control_flow_ops.cond(
+          i > 0,
+          lambda: control_flow_ops.while_loop(
+              lambda j, y: j < 3,
+              lambda j, y: (j + 1, y + x),
+              [0, 0])[1],
+          lambda: x)
+      return i + 1, cond_result
+    # pylint: enable=g-long-lambda
+    self._testWhileLoopAndGradientSerDes(body)
+
   def testStrippedOpListDef(self):
     with self.test_session():
       # Creates a graph.
@@ -2660,5 +2754,92 @@ class ScopedGraphTest(test.TestCase):
       self.assertEqual(2.0, var_dict2["variable2:0"].eval())
 
 
+class _OwnsAVariableSimple(checkpointable.CheckpointableBase):
+  """A Checkpointable object which can be saved using a tf.train.Saver."""
+
+  def __init__(self):
+    self.non_dep_variable = variable_scope.get_variable(
+        name="non_dep_variable", initializer=6., use_resource=True)
+
+  def _gather_saveables_for_checkpoint(self):
+    return {checkpointable.VARIABLE_VALUE_KEY: self.non_dep_variable}
+
+  # The Saver sorts by name before parsing, so we need a name property.
+  @property
+  def name(self):
+    return self.non_dep_variable.name
+
+
+class _MirroringSaveable(
+    saver_module.BaseSaverBuilder.ResourceVariableSaveable):
+
+  def __init__(self, primary_variable, mirrored_variable):
+    self._primary_variable = primary_variable
+    self._mirrored_variable = mirrored_variable
+    super(_MirroringSaveable, self).__init__(
+        self._primary_variable, "", self._primary_variable.name)
+
+  def restore(self, restored_tensors, restored_shapes):
+    """Restore the same value into both variables."""
+    tensor, = restored_tensors
+    return control_flow_ops.group(
+        self._primary_variable.assign(tensor),
+        self._mirrored_variable.assign(tensor))
+
+
+class _OwnsMirroredVariables(checkpointable.CheckpointableBase):
+  """A Checkpointable object which returns a more complex SaveableObject."""
+
+  def __init__(self):
+    self.non_dep_variable = variable_scope.get_variable(
+        name="non_dep_variable", initializer=6., use_resource=True)
+    self.mirrored = variable_scope.get_variable(
+        name="mirrored", initializer=15., use_resource=True)
+
+  def _gather_saveables_for_checkpoint(self):
+    saveable = _MirroringSaveable(
+        primary_variable=self.non_dep_variable,
+        mirrored_variable=self.mirrored)
+    return {checkpointable.VARIABLE_VALUE_KEY: saveable}
+
+  # The Saver sorts by name before parsing, so we need a name property.
+  @property
+  def name(self):
+    return self.non_dep_variable.name
+
+
+@test_util.with_c_api
+class CheckpointableCompatibilityTests(test.TestCase):
+
+  # TODO(allenl): Track down python3 reference cycles in these tests.
+  @test_util.run_in_graph_and_eager_modes()
+  def testNotSaveableButIsCheckpointable(self):
+    v = _OwnsAVariableSimple()
+    saver = saver_module.Saver(var_list=[v])
+    test_dir = self.get_temp_dir()
+    prefix = os.path.join(test_dir, "ckpt")
+    self.evaluate(v.non_dep_variable.assign(42.))
+    with self.test_session() as sess:
+      save_path = saver.save(sess, prefix)
+      self.evaluate(v.non_dep_variable.assign(43.))
+      saver.restore(sess, save_path)
+      self.assertEqual(42., self.evaluate(v.non_dep_variable))
+
+  @test_util.run_in_graph_and_eager_modes()
+  def testMoreComplexSaveableReturned(self):
+    v = _OwnsMirroredVariables()
+    saver = saver_module.Saver(var_list=[v])
+    test_dir = self.get_temp_dir()
+    prefix = os.path.join(test_dir, "ckpt")
+    self.evaluate(v.non_dep_variable.assign(42.))
+    with self.test_session() as sess:
+      save_path = saver.save(sess, prefix)
+      self.evaluate(v.non_dep_variable.assign(43.))
+      self.evaluate(v.mirrored.assign(44.))
+      saver.restore(sess, save_path)
+      self.assertEqual(42., self.evaluate(v.non_dep_variable))
+      self.assertEqual(42., self.evaluate(v.mirrored))
+
+
 if __name__ == "__main__":
   test.main()
diff --git a/tensorflow/python/training/saver_test_utils.py b/tensorflow/python/training/saver_test_utils.py
index 44b06b357ecbe4c8e330a2ccc49e83ddd4bf8c7d..0a8b7a09af92bc60faf82741a4e213996483cde3 100644
--- a/tensorflow/python/training/saver_test_utils.py
+++ b/tensorflow/python/training/saver_test_utils.py
@@ -35,7 +35,7 @@ class CheckpointedOp(object):
   # pylint: disable=protected-access
   def __init__(self, name, table_ref=None):
     if table_ref is None:
-      self.table_ref = gen_lookup_ops._mutable_hash_table_v2(
+      self.table_ref = gen_lookup_ops.mutable_hash_table_v2(
           key_dtype=dtypes.string, value_dtype=dtypes.float32, name=name)
     else:
       self.table_ref = table_ref
@@ -57,10 +57,10 @@ class CheckpointedOp(object):
       return CheckpointedOp.CustomSaveable(self, self.name)
 
   def insert(self, keys, values):
-    return gen_lookup_ops._lookup_table_insert_v2(self.table_ref, keys, values)
+    return gen_lookup_ops.lookup_table_insert_v2(self.table_ref, keys, values)
 
   def lookup(self, keys, default):
-    return gen_lookup_ops._lookup_table_find_v2(self.table_ref, keys, default)
+    return gen_lookup_ops.lookup_table_find_v2(self.table_ref, keys, default)
 
   def keys(self):
     return self._export()[0]
@@ -69,8 +69,8 @@ class CheckpointedOp(object):
     return self._export()[1]
 
   def _export(self):
-    return gen_lookup_ops._lookup_table_export_v2(self.table_ref, dtypes.string,
-                                                  dtypes.float32)
+    return gen_lookup_ops.lookup_table_export_v2(self.table_ref, dtypes.string,
+                                                 dtypes.float32)
 
   class CustomSaveable(saver_module.BaseSaverBuilder.SaveableObject):
     """A custom saveable for CheckpointedOp."""
@@ -86,6 +86,6 @@ class CheckpointedOp(object):
       super(CheckpointedOp.CustomSaveable, self).__init__(table, specs, name)
 
     def restore(self, restore_tensors, shapes):
-      return gen_lookup_ops._lookup_table_import_v2(
+      return gen_lookup_ops.lookup_table_import_v2(
           self.op.table_ref, restore_tensors[0], restore_tensors[1])
   # pylint: enable=protected-access
diff --git a/tensorflow/python/user_ops/user_ops.py b/tensorflow/python/user_ops/user_ops.py
index 17dbab706c9243c5f119dc82cc4428f03b90a18d..6f9b5d92bb2ea662c9a5af279f0fcf71f0efccc5 100644
--- a/tensorflow/python/user_ops/user_ops.py
+++ b/tensorflow/python/user_ops/user_ops.py
@@ -27,4 +27,4 @@ from tensorflow.python.ops.gen_user_ops import *  # pylint: disable=wildcard-imp
 
 def my_fact():
   """Example of overriding the generated code for an Op."""
-  return _gen_user_ops._fact()  # pylint: disable=protected-access
+  return _gen_user_ops.fact()
diff --git a/tensorflow/python/util/decorator_utils.py b/tensorflow/python/util/decorator_utils.py
index df259c7f7c29f9a4b674d3e980b33d6dcf323769..7b4363c0e40802779cf47c75c5a5e5a901da37e2 100644
--- a/tensorflow/python/util/decorator_utils.py
+++ b/tensorflow/python/util/decorator_utils.py
@@ -82,7 +82,7 @@ def add_notice_to_docstring(
     lines = _normalize_docstring(doc).splitlines()
     lines[0] += ' ' + suffix_str
 
-  notice = [''] + notice + [instructions]
+  notice = [''] + notice + ([instructions] if instructions else [])
 
   if len(lines) > 1:
     # Make sure that we keep our distance from the main body
diff --git a/tensorflow/python/util/tf_inspect.py b/tensorflow/python/util/tf_inspect.py
index c4168f7b1ac80976a957e96c79c72fe3b288d622..4ab8a72a83b466c38c50b1c76004e7a6fe942a04 100644
--- a/tensorflow/python/util/tf_inspect.py
+++ b/tensorflow/python/util/tf_inspect.py
@@ -46,8 +46,10 @@ def getargspec(object):  # pylint: disable=redefined-builtin
 
 
 def getfullargspec(obj):  # pylint: disable=redefined-builtin
-  """TFDecorator-aware replacement for inspect.getfullargspec and fallback to
-  inspect.getargspec in Python 2.
+  """TFDecorator-aware replacement for `inspect.getfullargspec`/`getargspec`.
+
+  This wrapper uses `inspect.getfullargspec` if available and falls back to
+  `inspect.getargspec` in Python 2.
 
   Args:
     obj: A callable, possibly decorated.
@@ -134,6 +136,11 @@ def getmembers(object, predicate=None):  # pylint: disable=redefined-builtin
   return _inspect.getmembers(object, predicate)
 
 
+def getmodule(object):  # pylint: disable=redefined-builtin
+  """TFDecorator-aware replacement for inspect.getmodule."""
+  return _inspect.getmodule(object)
+
+
 def getmro(cls):
   """TFDecorator-aware replacement for inspect.getmro."""
   return _inspect.getmro(cls)
@@ -144,6 +151,11 @@ def getsource(object):  # pylint: disable=redefined-builtin
   return _inspect.getsource(tf_decorator.unwrap(object)[1])
 
 
+def isbuiltin(object):  # pylint: disable=redefined-builtin
+  """TFDecorator-aware replacement for inspect.isbuiltin."""
+  return _inspect.isbuiltin(tf_decorator.unwrap(object)[1])
+
+
 def isclass(object):  # pylint: disable=redefined-builtin
   """TFDecorator-aware replacement for inspect.isclass."""
   return _inspect.isclass(tf_decorator.unwrap(object)[1])
diff --git a/tensorflow/python/util/tf_inspect_test.py b/tensorflow/python/util/tf_inspect_test.py
index a9e8ffb30c3392251c2bf7076e02aafd2338696b..129408449ebb45ac3a322f163a13b705cbb31f0c 100644
--- a/tensorflow/python/util/tf_inspect_test.py
+++ b/tensorflow/python/util/tf_inspect_test.py
@@ -124,6 +124,17 @@ class TfInspectTest(test.TestCase):
         inspect.getmembers(TestDecoratedClass),
         tf_inspect.getmembers(TestDecoratedClass))
 
+  def testGetModule(self):
+    self.assertEqual(
+        inspect.getmodule(TestDecoratedClass),
+        tf_inspect.getmodule(TestDecoratedClass))
+    self.assertEqual(
+        inspect.getmodule(test_decorated_function),
+        tf_inspect.getmodule(test_decorated_function))
+    self.assertEqual(
+        inspect.getmodule(test_undecorated_function),
+        tf_inspect.getmodule(test_undecorated_function))
+
   def testGetSource(self):
     expected = '''@test_decorator('decorator')
 def test_decorated_function_with_defaults(a, b=2, c='Hello'):
@@ -133,6 +144,19 @@ def test_decorated_function_with_defaults(a, b=2, c='Hello'):
     self.assertEqual(
         expected, tf_inspect.getsource(test_decorated_function_with_defaults))
 
+  def testIsBuiltin(self):
+    self.assertEqual(
+        tf_inspect.isbuiltin(TestDecoratedClass),
+        inspect.isbuiltin(TestDecoratedClass))
+    self.assertEqual(
+        tf_inspect.isbuiltin(test_decorated_function),
+        inspect.isbuiltin(test_decorated_function))
+    self.assertEqual(
+        tf_inspect.isbuiltin(test_undecorated_function),
+        inspect.isbuiltin(test_undecorated_function))
+    self.assertEqual(tf_inspect.isbuiltin(range), inspect.isbuiltin(range))
+    self.assertEqual(tf_inspect.isbuiltin(max), inspect.isbuiltin(max))
+
   def testIsClass(self):
     self.assertTrue(tf_inspect.isclass(TestDecoratedClass))
     self.assertFalse(tf_inspect.isclass(test_decorated_function))
diff --git a/tensorflow/stream_executor/blas.cc b/tensorflow/stream_executor/blas.cc
index da09d84921e2dd94942b3a62fe7366211c60aed1..31724cf6c9b97e45975b9e053459f7b8f5918dfa 100644
--- a/tensorflow/stream_executor/blas.cc
+++ b/tensorflow/stream_executor/blas.cc
@@ -79,6 +79,8 @@ string ComputationTypeString(ComputationType ty) {
       return "f32";
     case ComputationType::kF64:
       return "f64";
+    case ComputationType::kI32:
+      return "i32";
     case ComputationType::kComplexF32:
       return "complex f32";
     case ComputationType::kComplexF64:
@@ -88,6 +90,10 @@ string ComputationTypeString(ComputationType ty) {
   }
 }
 
+std::ostream& operator<<(std::ostream& os, ComputationType ty) {
+  return os << ComputationTypeString(ty);
+}
+
 }  // namespace blas
 }  // namespace gputools
 }  // namespace perftools
diff --git a/tensorflow/stream_executor/blas.h b/tensorflow/stream_executor/blas.h
index 072f08554688276a05d9be85718de8750bd874c2..c5f778a5c74519c0f35cea5d59aac3d0d4564c56 100644
--- a/tensorflow/stream_executor/blas.h
+++ b/tensorflow/stream_executor/blas.h
@@ -104,6 +104,8 @@ enum class ComputationType {
 // Converts a ComputationType to a string.
 string ComputationTypeString(ComputationType ty);
 
+std::ostream &operator<<(std::ostream &os, ComputationType ty);
+
 // Opaque identifier for an "algorithm" used by a blas routine.  This functions
 // as a hint to the blas library.
 typedef int64 AlgorithmType;
diff --git a/tensorflow/stream_executor/cuda/cuda_blas.cc b/tensorflow/stream_executor/cuda/cuda_blas.cc
index 44a3a745ad86dc24f632e4a36691fba06171c9fb..c563f8f931b0a5689268329386d1252f2a45bdd1 100644
--- a/tensorflow/stream_executor/cuda/cuda_blas.cc
+++ b/tensorflow/stream_executor/cuda/cuda_blas.cc
@@ -13,17 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-// Include cuBLAS headers early, and then set EIGEN_HAS_CUDA_FP16
-// if we have new enough CUDA (which we will only know after including
-// cuda.h). This ensures that Eigen's Half.h does not attempt to make its own
-// __half typedef if CUDA has already defined one (and conversely, that we do
-// not include <cuda_fp16.h> after Half.h has made its typedef).
-#include "cuda/include/cuda.h"
 #include "cuda/include/cublas_v2.h"
-
-#if CUDA_VERSION >= 7050
-#define EIGEN_HAS_CUDA_FP16
-#endif
+#include "cuda/include/cuda.h"
 
 #if CUDA_VERSION >= 8000
 #define SE_CUDA_DATA_HALF CUDA_R_16F
@@ -33,6 +24,34 @@ limitations under the License.
 
 #include "tensorflow/stream_executor/cuda/cuda_blas.h"
 
+// Both Eigen's Half.h and CUDA's cuda_fp16.h provide a typedef for __half, so
+// there are two ways to get that typedef:
+//
+// (1) Include cuda_fp16.h and define EIGEN_HAS_CUDA_FP16.
+// (2) Neither include cuda_fp16.h nor define EIGEN_HAS_CUDA_FP16.
+//
+// Due to issue b/73793421, when the first approach is used and NVCC compiles
+// this file, NVCC complains about a duplicate definition of
+// EIGEN_HAS_CUDA_FP16. On the other hand, when the second approach is used and
+// clang compiles this file, clang does not recognize __half because both its
+// definition and the EIGEN_HAS_CUDA_FP16 macro are missing.
+//
+// Because this file may be compiled with clang but is never compiled with
+// NVCC, we choose the first approach for CUDA < 9.0. For CUDA >= 9.0, we have
+// to use the second approach because the data member in the __half defined
+// by CUDA >= 9.0 is `__x` while Eigen expects it to be `x`.
+//
+// TODO(b/73793421): Remove the following code block to switch to the second
+// approach when the issue is fixed.
+#if CUDA_VERSION < 9000
+#include "cuda/include/cuda_fp16.h"
+#if CUDA_VERSION >= 7050
+#define EIGEN_HAS_CUDA_FP16
+#endif
+#endif
+
+#include "third_party/eigen3/Eigen/Core"
+
 #include <assert.h>
 #include <complex>
 
@@ -2256,6 +2275,14 @@ bool CUDABlas::DoBlasGemmWithAlgorithm(
     DeviceMemory<Eigen::half> *c, int ldc,
     blas::ComputationType computation_type, blas::AlgorithmType algorithm,
     blas::ProfileResult *output_profile_result) {
+  if (computation_type == blas::ComputationType::kF32) {
+    return DoBlasGemmWithAlgorithmImpl(
+        stream, transa, transb, m, n, k, static_cast<float>(alpha), a, lda, b,
+        ldb, static_cast<float>(beta), c, ldc, computation_type, algorithm,
+        output_profile_result);
+  }
+
+  CHECK_EQ(computation_type, blas::ComputationType::kF16);
   return DoBlasGemmWithAlgorithmImpl(
       stream, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc,
       computation_type, algorithm, output_profile_result);
diff --git a/tensorflow/stream_executor/cuda/cuda_dnn.cc b/tensorflow/stream_executor/cuda/cuda_dnn.cc
index 58b47067662b2595f53ca648dcab7a2a194039ab..61cf4ba7eac1f9482e3c1b179f35434a2a65d955 100644
--- a/tensorflow/stream_executor/cuda/cuda_dnn.cc
+++ b/tensorflow/stream_executor/cuda/cuda_dnn.cc
@@ -2793,7 +2793,7 @@ bool CudnnSupport::DoBatchNormalizationForwardImpl(
       parent_, scale_offset_desc, ToCudnnDataType(scale_data_type)};
   cudnnBatchNormMode_t mode = CUDNN_BATCHNORM_SPATIAL;
 #if CUDNN_VERSION >= 7000
-  if (BatchnormSpatialPersistentEnabled()) {
+  if (BatchnormSpatialPersistentEnabled() && is_training) {
     mode = CUDNN_BATCHNORM_SPATIAL_PERSISTENT;
   }
 #endif
diff --git a/tensorflow/tools/api/golden/tensorflow.-graph.pbtxt b/tensorflow/tools/api/golden/tensorflow.-graph.pbtxt
index 75361803a3991f380d6be2485cfd3d05fd1572e1..cdaeb55e30865e082054085f47d6a071ebf3affd 100644
--- a/tensorflow/tools/api/golden/tensorflow.-graph.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-graph.pbtxt
@@ -130,6 +130,10 @@ tf_class {
     name: "prevent_fetching"
     argspec: "args=[\'self\', \'op\'], varargs=None, keywords=None, defaults=None"
   }
+  member_method {
+    name: "switch_to_thread_local"
+    argspec: "args=[\'self\'], varargs=None, keywords=None, defaults=None"
+  }
   member_method {
     name: "unique_name"
     argspec: "args=[\'self\', \'name\', \'mark_as_used\'], varargs=None, keywords=None, defaults=[\'True\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt b/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt
index 069200065a4ebe530fcc0c2f61e944d34f916224..5a02bb2175e2d6ad71722799143090f2735c1a37 100644
--- a/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.-variable.pbtxt
@@ -1,7 +1,7 @@
 path: "tensorflow.Variable"
 tf_class {
   is_instance: "<class \'tensorflow.python.ops.variables.Variable\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "SaveSliceInfo"
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt
index 42de5c0c80023ad5bd7f33a564780060998307c1..0900adaf762df1415c8db63c3879ca2fabc28d9f 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-dataset.pbtxt
@@ -64,7 +64,7 @@ tf_class {
   }
   member_method {
     name: "list_files"
-    argspec: "args=[\'file_pattern\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'file_pattern\', \'shuffle\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
   member_method {
     name: "make_initializable_iterator"
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt
index e2fc8d6cb1d318cc50828f22e8e575cc28c7aaad..7b16ac90c925beb25e065d26e73ee2a54b06d9dc 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-fixed-length-record-dataset.pbtxt
@@ -65,7 +65,7 @@ tf_class {
   }
   member_method {
     name: "list_files"
-    argspec: "args=[\'file_pattern\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'file_pattern\', \'shuffle\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
   member_method {
     name: "make_initializable_iterator"
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt
index 9770389e5ef1e29a80ae1da2725d9862f6521ff9..9cf5f2ae2057ab4a16131527cf2ef2fa6ada28e5 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-t-f-record-dataset.pbtxt
@@ -17,7 +17,7 @@ tf_class {
   }
   member_method {
     name: "__init__"
-    argspec: "args=[\'self\', \'filenames\', \'compression_type\', \'buffer_size\'], varargs=None, keywords=None, defaults=[\'None\', \'None\'], "
+    argspec: "args=[\'self\', \'filenames\', \'compression_type\', \'buffer_size\', \'num_parallel_reads\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\'], "
   }
   member_method {
     name: "apply"
@@ -65,7 +65,7 @@ tf_class {
   }
   member_method {
     name: "list_files"
-    argspec: "args=[\'file_pattern\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'file_pattern\', \'shuffle\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
   member_method {
     name: "make_initializable_iterator"
diff --git a/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt b/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt
index 7263230c1c7182bb812cb2e433aedd415bcd16c7..8c3d6691439e619c906996a3ddaea4317c4a9597 100644
--- a/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.data.-text-line-dataset.pbtxt
@@ -65,7 +65,7 @@ tf_class {
   }
   member_method {
     name: "list_files"
-    argspec: "args=[\'file_pattern\'], varargs=None, keywords=None, defaults=None"
+    argspec: "args=[\'file_pattern\', \'shuffle\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
   member_method {
     name: "make_initializable_iterator"
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.export.-tensor-serving-input-receiver.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.export.-tensor-serving-input-receiver.pbtxt
new file mode 100644
index 0000000000000000000000000000000000000000..4fe92643bf9867765499d7bf475b9cdd1686aec5
--- /dev/null
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.export.-tensor-serving-input-receiver.pbtxt
@@ -0,0 +1,27 @@
+path: "tensorflow.estimator.export.TensorServingInputReceiver"
+tf_class {
+  is_instance: "<class \'tensorflow.python.estimator.export.export.TensorServingInputReceiver\'>"
+  is_instance: "<class \'tensorflow.python.estimator.export.export.TensorServingInputReceiver\'>"
+  is_instance: "<type \'tuple\'>"
+  member {
+    name: "features"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "receiver_tensors"
+    mtype: "<type \'property\'>"
+  }
+  member {
+    name: "receiver_tensors_alternatives"
+    mtype: "<type \'property\'>"
+  }
+  member_method {
+    name: "__init__"
+  }
+  member_method {
+    name: "count"
+  }
+  member_method {
+    name: "index"
+  }
+}
diff --git a/tensorflow/tools/api/golden/tensorflow.estimator.export.pbtxt b/tensorflow/tools/api/golden/tensorflow.estimator.export.pbtxt
index 4d0dddb3bc0305a28fab0c95c31e4869f5db0aa8..bd72f6cd79f7dffb9f0a7f8ae43751c4ecba939d 100644
--- a/tensorflow/tools/api/golden/tensorflow.estimator.export.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.estimator.export.pbtxt
@@ -20,6 +20,10 @@ tf_module {
     name: "ServingInputReceiver"
     mtype: "<type \'type\'>"
   }
+  member {
+    name: "TensorServingInputReceiver"
+    mtype: "<type \'type\'>"
+  }
   member_method {
     name: "build_parsing_serving_input_receiver_fn"
     argspec: "args=[\'feature_spec\', \'default_batch_size\'], varargs=None, keywords=None, defaults=[\'None\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt
index a13bfe0a920ce13dd9a91f106c9cbcbd185b0cc7..7be2f4f61f6b9637f372591e49efc0c93c7a8c0a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.-model.pbtxt
@@ -1,10 +1,10 @@
 path: "tensorflow.keras.Model"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.training.Model\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Network\'>"
-  is_instance: "<class \'tensorflow.python.layers.network.GraphNetwork\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.network.Network\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
@@ -140,7 +140,7 @@ tf_class {
   }
   member_method {
     name: "call"
-    argspec: "args=[\'self\', \'inputs\', \'mask\'], varargs=None, keywords=None, defaults=[\'None\'], "
+    argspec: "args=[\'self\', \'inputs\', \'training\', \'mask\'], varargs=None, keywords=None, defaults=[\'None\', \'None\'], "
   }
   member_method {
     name: "compile"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt
index fb6c8d70dd43eae60ea2fb86f0fc63c36d2b13ad..0f2428d77a537959cf2c46dfa350208abea8cb36 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.-sequential.pbtxt
@@ -1,11 +1,11 @@
 path: "tensorflow.keras.Sequential"
 tf_class {
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.models.Sequential\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.sequential.Sequential\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.training.Model\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Network\'>"
-  is_instance: "<class \'tensorflow.python.layers.network.GraphNetwork\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.network.Network\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
@@ -153,7 +153,7 @@ tf_class {
   }
   member_method {
     name: "call"
-    argspec: "args=[\'self\', \'inputs\', \'mask\'], varargs=None, keywords=None, defaults=[\'None\'], "
+    argspec: "args=[\'self\', \'inputs\'], varargs=None, keywords=kwargs, defaults=None"
   }
   member_method {
     name: "compile"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt
index f4ab075959906cdf350ec5d49dc86f928b7eb7ae..db8f626b98b70fd99f38e696aa16c72e74e86e25 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activation.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.Activation"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Activation\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt
index eb558cddafc3972127786353072767f0d53bf174..809b3a5430449176a0d7423ec7f4499ceb620890 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-activity-regularization.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.ActivityRegularization"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.ActivityRegularization\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt
index 770a107b664d7ab0a8aedf292a34d4258a201859..68d41bb6cc258ca87d4664ac0fb9d5649f89ebaf 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-add.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.Add"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge.Add\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt
index 0ce42b706ec20a8ea1cc83ec95cb64d9be2e5710..970b777e514194db4ac49fe58bea737b35436217 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-alpha-dropout.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.AlphaDropout"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.noise.AlphaDropout\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt
index d6c98fa225ce924bc8e20f8531516eaed4d32ffb..529c64ab293d596012aefd42e0695bd1eb7e44d1 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling1-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.AveragePooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt
index 754fd310c6d8ddb994db0590342b29f8cb7abd71..7e7c330d74fe3b71ecd0eb87e34719e47ae70784 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling2-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.AveragePooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt
index 9b62880c7931d151fb98cc1dc3149dcbd4dd103d..ada8466d7473072b1878861ab36ec40b07fa1914 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average-pooling3-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.AveragePooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt
index b371ad148cee16dd243869d929e0c1c002794682..2a5c1cd530a7a532f6cdd3c184f4ee7eb88d23d3 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-average.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.Average"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge.Average\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt
index 3e2aba55fd63326bb0e232fdce06f32884db7a0a..9a2cb29815d59f3761ea25e9ea36ff6489c85b88 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool1-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.AveragePooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt
index fb37308cce0124538648c3837e1e802794d7f1ae..f5e991ea42e5ee2723b64574d4598dc8463f1c8c 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool2-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.AveragePooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt
index 813470ffc7c87727eb0b958e54806f530399806a..31732214a62524017e39776cdfb9ab629746e8ae 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-avg-pool3-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.AveragePooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt
index e251ac18e511b58a49816126d9941b98e4f91088..422eddf10db6763e10405dba5537ca161d1b8994 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-batch-normalization.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.BatchNormalization"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.normalization.BatchNormalization\'>"
   is_instance: "<class \'tensorflow.python.layers.normalization.BatchNormalization\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt
index 699208a0b9b665b69f02edaa2b2d2aeed6a83b63..9053a37916314198842bc21b0608a9b69a64c264 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-bidirectional.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.Bidirectional"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.wrappers.Bidirectional\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.wrappers.Wrapper\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt
index ff08def0a08e5201bc01d61be3f2d66d712c384b..3d536d2182fc4480a2ee5fba177543ca21fbd5ac 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-concatenate.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.Concatenate"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge.Concatenate\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt
index 6db22ca0320519fd9c101456c9c9c0e26a9a11e0..6a7da1aef8db64ad11bb5a5ba357f33eeb99170b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv-l-s-t-m2-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional_recurrent.ConvLSTM2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional_recurrent.ConvRecurrent2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.Recurrent\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt
index 577f206e3510a9995d5d383ac440b4f68ea39fe5..801a0339720919f8b3f6beee0f045d58b2c0a371 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv1-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Conv1D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv1D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt
index 72924c32b43e5edb39938cc0cd909cffefa61be1..13352e264a5305190717bb973a3f2bce4d7f4fff 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d-transpose.pbtxt
@@ -4,8 +4,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv2DTranspose\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt
index 16be08d9b2bae8fe1faecf34c4d87ac9b9baf142..f400e4a15c362037e85ac375cee98bb5f6358669 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv2-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Conv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt
index 11e05f884d781166616a9c9a61dacbc8fdae6ae3..b3a9f573b8ba652d2544b21f36f65fe81a6ebb50 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d-transpose.pbtxt
@@ -4,8 +4,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv3DTranspose\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv3D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt
index 72b72d6b3b1e410dda0b0a529449f0135203fc1b..a9be09c0abd19aeb4df30116ef2befc3948bfbf4 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-conv3-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Conv3D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv3D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt
index ee93247f63ed700dc6058041bd0ea4ff5c879078..be1ef5eb928d16cc6bf78c289aa20d815c728b23 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution1-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Conv1D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv1D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt
index e5023287e5f38553f3553a37b5a908790072b5c7..30034f7eaf6d9073695353e5c8d9ead0cc8de7cc 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d-transpose.pbtxt
@@ -4,8 +4,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv2DTranspose\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt
index ba38cb7121c9d312e7ba9d7147bdc67673d1ad2e..189b38054c004facfeeff8ad2ae87848b89040f2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution2-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Conv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt
index 58724a1e1661609ef3c000c7ca1dfe9b3235acff..a76d85c629c1fe620dafd62a0f0e05e9009109e2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d-transpose.pbtxt
@@ -4,8 +4,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv3DTranspose\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv3D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt
index 98d52c430c659d0fc3e9299f7bede9190dad2fcf..782195d4ad5883d8c0ea6a657cc10258f2080a55 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-convolution3-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Conv3D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv3D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt
index 33b6ebe1af731f66f88a9493502f69049ab34b42..2cb7a39ea595e1ff699b96554cb135377d20a488 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping1-d.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.Cropping1D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Cropping1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt
index 4b241ebb0f68c270a9448b02138d44f82211f418..80803306992bba3b601824a93cb3086ef3947369 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping2-d.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.Cropping2D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Cropping2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt
index 1856a9ee21347ed6ca3dd592517eb644e205a5b7..678f40bbc23db15ff7c1138169478fb4412a449d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-cropping3-d.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.Cropping3D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.Cropping3D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt
index a8c37af31f649d28ca2ab7614178f2dee58c13fc..fac826109b6a32305ece86c4990f08afe2236ce8 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dense.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.Dense"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Dense\'>"
   is_instance: "<class \'tensorflow.python.layers.core.Dense\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt
index 07d3f023e54105c606b198c05750ffa78ee5d0c8..285d544af2d69d564afdec748598b39b6b95670f 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dot.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.Dot"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge.Dot\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt
index e2e21b5f123f63fa38cb0e344be9a12fc091f20b..b77976974cccb96fc2373c093d2bdf279560c46f 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-dropout.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.Dropout"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Dropout\'>"
   is_instance: "<class \'tensorflow.python.layers.core.Dropout\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt
index 92b9760d53e35d3e5066a730bb5cbda45492cc64..b07714d3f2d158496e0482f8611e55ea0fb0fd51 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-e-l-u.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.ELU"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.advanced_activations.ELU\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt
index 83c528b40117222ac2b3e85ad338459948d0aa8c..e67d4ddfc47077d62319ab097e5333a373cbfc80 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-embedding.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.Embedding"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.embeddings.Embedding\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt
index 73609752886c8c57a78f6bc02cc46d2c7ff6e996..b2a668e5a88d312656f48ddd0e9f7aa9f6306991 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-flatten.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.Flatten"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Flatten\'>"
   is_instance: "<class \'tensorflow.python.layers.core.Flatten\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt
index b329f1c46bb07ab7684dec6aaf45a20b98c27ed9..1fd3febad26df16576dedca1df7560bf230c08ec 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u-cell.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.GRUCell"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.GRUCell\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt
index c741d4d6e6cf8da9712e68f86abe64e2828823da..f5f41d879dcb840551c00a7272bbcfbe51dbee89 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-g-r-u.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GRU"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.GRU\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.RNN\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activation"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt
index 57596badf1881950270fa6d3c074afb65daaa8eb..f4f1a5d51c5d5689918af4facf907f79d9ca71ec 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-dropout.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.GaussianDropout"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.noise.GaussianDropout\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt
index 3829353cc3c195a750ad862707c5c8563e203fba..e502df5e177d422403d0643c18a9588afb9d9713 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-gaussian-noise.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.GaussianNoise"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.noise.GaussianNoise\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt
index e53e78a977b32eaf2e31867044aedd39ab2dd34f..9c8d5bfcd8966384230e7d5cdcc1cac53a0eab9a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling1-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalAveragePooling1D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalAveragePooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt
index 48fcd1044e06b2fe61aadb6c3675ce82197ff003..8dd65f1f248daaf120780f19050c45d297b7902e 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling2-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalAveragePooling2D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalAveragePooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt
index 66c06ed47289eb2d83d97778a7b13dab821722d2..5e30571cc730ee23767a044036b590460deec00b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-average-pooling3-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalAveragePooling3D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalAveragePooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling3D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt
index 4f2420f74ab3069952e4a44bf61e5e12b3e80ea3..ba90fa454696d1cb4e77d80a2dc77ff65def4714 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool1-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalAvgPool1D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalAveragePooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt
index 7912a6d933b851521358e0246d04688da410b909..8823857758307c208527b144c0cc73b566f2f115 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool2-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalAvgPool2D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalAveragePooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt
index d5b2d2c274ad97071497045271c0a595f8e0e062..500ced852ba6b19502769ba9052f2e364af7e283 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-avg-pool3-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalAvgPool3D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalAveragePooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling3D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt
index d88ff17eb6df7bbba7d3af4344fc8ddc367ae44c..cf2717ed46b56e639fb774c1e922648e1653ec0d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool1-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalMaxPool1D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalMaxPooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt
index c8cc5a0ddfdd54cbb47de922591a9842abf63396..a86ff1a46997f19b11e6ef03be432b45687a2df2 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool2-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalMaxPool2D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalMaxPooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt
index 7956c5a340d963cfd5976e8af56da222848a164a..e01cc7c1b09ad6a40380613d54b771c6a1c89c1c 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pool3-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalMaxPool3D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalMaxPooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling3D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt
index 0a7e16413dfbd80d448eb1bad5771915475d96b2..259c1fb37c787f5318570b7aca6935d2f0ed997f 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling1-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalMaxPooling1D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalMaxPooling1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt
index 6c8a58a996f5313ea48e395e7e443a7c21f198ee..0c41bf97f763f1e40e8fac714709ccac1483a00b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling2-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalMaxPooling2D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalMaxPooling2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt
index 7678ce8aab63fcfa76c0ac61346a723c1dfe1ee7..bec8817aa393ba2d8a6410408938402366cbb01d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-global-max-pooling3-d.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.GlobalMaxPooling3D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.GlobalMaxPooling3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling._GlobalPooling3D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt
index d46fd41a3f33002a9bbe755851278c9729ccd1d1..17be86222901c0f5a9a18c0e5f1c5bcac6c06a17 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-input-layer.pbtxt
@@ -1,9 +1,9 @@
 path: "tensorflow.keras.layers.InputLayer"
 tf_class {
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.InputLayer\'>"
-  is_instance: "<class \'tensorflow.python.layers.network.InputLayer\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.input_layer.InputLayer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt
index 3b171b137af699c9608494a17c5651b439fe4545..6d2a8c56196d9b3c80f570c7f1d3ac803253fff6 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m-cell.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.LSTMCell"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.LSTMCell\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt
index 29d9cf78ab5ed3bdd1a488359b59cf7171e7e051..490b5b618c65e28f1ae2e01e8d35e7f3973cc180 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-l-s-t-m.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.LSTM"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.LSTM\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.RNN\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activation"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt
index ca0144929942f7024a4e8bac5552bf0547ceb56d..21a65b838af35e2f540eacab823513e7bf54b434 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-lambda.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.Lambda"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Lambda\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt
index c52ad727545c0bf4f199714d71180eac3f1bf62a..127b04738e70c11b2dc1071cf174cf5de23c5133 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-layer.pbtxt
@@ -1,7 +1,8 @@
 path: "tensorflow.keras.layers.Layer"
 tf_class {
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt
index 8134fb738683b79764662d9ea7f721fe04751162..87e49f2ed5b5d73aee5e9aa2511485b1f3f4bcd9 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-leaky-re-l-u.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.LeakyReLU"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.advanced_activations.LeakyReLU\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt
index c5d452300947d7f74e7458e2a04bfdfabb1c1da2..1aa3aad3246b83931a47e69a4aa76fdf2b5aee22 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected1-d.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.LocallyConnected1D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.local.LocallyConnected1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt
index bcbed9241b525a953c8b499197facaefebe8cc44..5e9dc7d4774c651a186a4e320d0cfd088e87b6b3 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-locally-connected2-d.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.LocallyConnected2D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.local.LocallyConnected2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt
index 244e79b4ffe60ddd6aa56d2780d80dfd66c494a9..0d101e5b68cdb2cdf24ed472c724cfc885e3d95d 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-masking.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.Masking"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Masking\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt
index 56cbf5df785ef0e2614ea7e9e6cfe1335e148eec..c85cd49ac8ce2c1fc0759671865b7174cd1c1480 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool1-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.MaxPooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt
index 33c2d30e86f9cdc3fb9f4f498bfc2c94497fe2dd..4f59e330c92f96101c65a9a24f66196e84587ccb 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool2-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.MaxPooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt
index 94f91059b7a1e291c38fe0045accc6c03f226603..c0ea0eb0505d20e70d641f2a646a060d7dbfabda 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pool3-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.MaxPooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt
index 247230a6d68b8ea93a30a2f5846d8baaa78cb13e..ca37ae51314516ae67c7725eb2ccd3d25154e2ac 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling1-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.MaxPooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt
index 8d61b67e7ce9564d31b0bd904a58540d19c89172..3ede2378347f5eddb0e8fae775a0200ea484d3f8 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling2-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.MaxPooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt
index ad2e30802006e934730e5c75247e958329f7121c..d87e25a7ba8e7cce615431723b53a0106c2b5279 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-max-pooling3-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.pooling.MaxPooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt
index ff0db15f190675d533c50c277eb1cb60e0b95e55..e4df7b48ae6b41400375920a48ef8577bb69376e 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-maximum.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.Maximum"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge.Maximum\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt
index 1d3f33f04516345ee32f16befe0d7200d2cdad00..6bf7c77743c31b6d74df35d827e9d5bc9a25d303 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-multiply.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.Multiply"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge.Multiply\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.merge._Merge\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt
index c86bc49b22a8cc3e004a77f4a21594aacb2c665a..c14be132b7e406c99841576be8d8fa9ab99aa816 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-p-re-l-u.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.PReLU"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.advanced_activations.PReLU\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt
index 2043e1a1263f0f0745b7c6446cc670fd6b0f0000..72ffbceae01da900778dba1ec14e646aa17b39e5 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-permute.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.Permute"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Permute\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt
index ad539a7c4c5362500baef0a9c89d054762bbb47d..d3e780c8b22ed580f61ffc3d9b2bad7278391402 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-r-n-n.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.RNN"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.RNN\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt
index 4b0e98520a0dd86c085fa7345af445e1ae253d3b..a27980a9d17397e558a4b732e3dc332a0c1e8432 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-repeat-vector.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.RepeatVector"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.RepeatVector\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt
index 34bc71af8a26ff6e4d7c81a3877751df5209906f..67f991276c6908ff54fd516e84533542a5f60528 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-reshape.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.Reshape"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Reshape\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt
index dd67b76523cc50409516e29f963f59d039455bfd..fccea5e8af5ab81e712669ff1b2567d8bde8607e 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv1-d.pbtxt
@@ -4,8 +4,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.SeparableConv1D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._SeparableConv\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt
index 5d898fb2bd86b39cb8fab755382bb96cce231fa6..d20663bdb0bc2eea323d35b1e3d4d27122f50472 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-conv2-d.pbtxt
@@ -4,8 +4,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.SeparableConv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._SeparableConv\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt
index bf62c095e7cc3fbeac95919a0f9fdc545efd3d25..889fa0a1b58bbd3babd293b7b1b45915a9ee3ca4 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution1-d.pbtxt
@@ -4,8 +4,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.SeparableConv1D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._SeparableConv\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt
index c758d87993b3acba88a13c7bc9eaeee929a22652..c850f3fedc814b20f0f95cc3cf4fd5c973446b5b 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-separable-convolution2-d.pbtxt
@@ -4,8 +4,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.SeparableConv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._SeparableConv\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt
index 6e3cde3e3eaba4f9985411d66a220f7cdd4ee7ad..526d88ccba60eb25c68432e5baa03fd3a878f718 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n-cell.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.SimpleRNNCell"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.SimpleRNNCell\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt
index 6fafc77b947d0df11755e3136ed2e7a14c148081..7fddae34472411f49d42b4d65d12034d056ec818 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-simple-r-n-n.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.SimpleRNN"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.SimpleRNN\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.RNN\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activation"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt
index ee4b2fa39ed34a544ee800e9370e4f34c4a17041..5b9b62fc970238e49e6d4849285606d0a7908b23 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-softmax.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.Softmax"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.advanced_activations.Softmax\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt
index e4727072e375b9fc4dc99a1536eaaf3df5415369..769da30999993fad05ae0f7c04e256e6cf01a774 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout1-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.SpatialDropout1D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Dropout\'>"
   is_instance: "<class \'tensorflow.python.layers.core.Dropout\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt
index c5ff7043115ccdd3bc4a1147790b20feda410f65..fca2e42a1519fcf3a9f0ec996c50b148b2df05fd 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout2-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.SpatialDropout2D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Dropout\'>"
   is_instance: "<class \'tensorflow.python.layers.core.Dropout\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt
index 476a7f362cf88e234e964f6f6645ee4ed0cbaff8..36e8de09a967c5940bf8078234f5980a78ec8009 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-spatial-dropout3-d.pbtxt
@@ -3,8 +3,9 @@ tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.SpatialDropout3D\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.core.Dropout\'>"
   is_instance: "<class \'tensorflow.python.layers.core.Dropout\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt
index 3dde1e576918409b106649789443f910775e2f6c..a96f16fae99af9c30959d228202055e9aebfaf58 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-stacked-r-n-n-cells.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.StackedRNNCells"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.recurrent.StackedRNNCells\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt
index ef31c5443efa0c0e5a7a2e0a422d2a9c9c49baaf..e1cbd0e150ed890ae57c1725249d1340fc2cb663 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-thresholded-re-l-u.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.ThresholdedReLU"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.advanced_activations.ThresholdedReLU\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt
index 1e176d8d4b4eb010049f267be3d0683228a7782b..f0d35728fb1c42d563ff0598dd84da51a766a764 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-time-distributed.pbtxt
@@ -2,8 +2,9 @@ path: "tensorflow.keras.layers.TimeDistributed"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.wrappers.TimeDistributed\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.wrappers.Wrapper\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt
index a81b83be49e0073f242efc6890e419b4fe172ab2..74efaea6ddb22ec2fe9d41558978c183b0e06671 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling1-d.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.UpSampling1D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.UpSampling1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt
index 5403279d45ec7b93bae7907b891c659a043e96d0..dc5bd5fd5319f9bbd601a3c4083ae566b47e1aaa 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling2-d.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.UpSampling2D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.UpSampling2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt
index 96c337caf28d43fabd0b90df016f4e8ab0c408db..e01ccfb74aead591f1018cdcbb1c888767ecdb20 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-up-sampling3-d.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.UpSampling3D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.UpSampling3D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt
index ea3bb2f8f567c648cd8b3dfa6f179a108690b0f0..7e6f90f7623677244865ac285c134dc79f7b9b69 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-wrapper.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.Wrapper"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.wrappers.Wrapper\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt
index b81a4b1c50b22f13eacb521cfc8bc288bd40c81f..4d0d402dad442ccf52267f5ce40b05400afbfbc7 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding1-d.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.ZeroPadding1D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.ZeroPadding1D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt
index 1a26f2f3c9bbaa2aa567e76e1aafe14805ecff38..b353a529bcf8e543d334fee57fca26ebc83036a4 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding2-d.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.ZeroPadding2D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.ZeroPadding2D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt
index 310277fe67433fd870ae3d907984f402576925b2..9fe1256e616dbca4f35101df160dc55bc68bfa8a 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.layers.-zero-padding3-d.pbtxt
@@ -1,8 +1,9 @@
 path: "tensorflow.keras.layers.ZeroPadding3D"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.layers.convolutional.ZeroPadding3D\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt
index f85b328e34e6645b0fe0ade18df86411ec0f4e1f..8ccf15f9ab0fcfa59907ff05a962a84d3d86ccb4 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.models.-model.pbtxt
@@ -1,10 +1,10 @@
 path: "tensorflow.keras.models.Model"
 tf_class {
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.training.Model\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Network\'>"
-  is_instance: "<class \'tensorflow.python.layers.network.GraphNetwork\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.network.Network\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
@@ -140,7 +140,7 @@ tf_class {
   }
   member_method {
     name: "call"
-    argspec: "args=[\'self\', \'inputs\', \'mask\'], varargs=None, keywords=None, defaults=[\'None\'], "
+    argspec: "args=[\'self\', \'inputs\', \'training\', \'mask\'], varargs=None, keywords=None, defaults=[\'None\', \'None\'], "
   }
   member_method {
     name: "compile"
diff --git a/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt b/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt
index 2e044d78bb2cd6c0ac817218480565c785d11ddc..102eb3220334516e0051f952353920f229f4ff20 100644
--- a/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.keras.models.-sequential.pbtxt
@@ -1,11 +1,11 @@
 path: "tensorflow.keras.models.Sequential"
 tf_class {
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.models.Sequential\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.sequential.Sequential\'>"
   is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.training.Model\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Network\'>"
-  is_instance: "<class \'tensorflow.python.layers.network.GraphNetwork\'>"
-  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.topology.Layer\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.network.Network\'>"
+  is_instance: "<class \'tensorflow.python.keras._impl.keras.engine.base_layer.Layer\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
@@ -153,7 +153,7 @@ tf_class {
   }
   member_method {
     name: "call"
-    argspec: "args=[\'self\', \'inputs\', \'mask\'], varargs=None, keywords=None, defaults=[\'None\'], "
+    argspec: "args=[\'self\', \'inputs\'], varargs=None, keywords=kwargs, defaults=None"
   }
   member_method {
     name: "compile"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt
index de81206bc8b25046cd48c79ff8f154041c0e0cb0..1c4f550d7f05b8be33326cb39d7a5f3bf663f5e6 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling1-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt
index 72d5496464210efd9e423996dfb274dd9564f761..d2db0952693f2989e6a9e8748a254eb4db483206 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling2-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt
index 595e77ff9f8b64b6606fb075f3edf2281b4c3c1f..34d9a9df281c09a2e2030daf74a2ceb8066085bb 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-average-pooling3-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.AveragePooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt
index 0c4aa2ff2612269727026141574726ad6df5cdbd..21ad0efecf88c42a3a679910ddfe095585a7933a 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-batch-normalization.pbtxt
@@ -2,6 +2,7 @@ path: "tensorflow.layers.BatchNormalization"
 tf_class {
   is_instance: "<class \'tensorflow.python.layers.normalization.BatchNormalization\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt
index 5f576d0189309442dc4cea3d3617ab3144420165..ed38747c7671a267bb640ecb96a4c5fcc46c5edf 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv1-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv1D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt
index 675a7c76e569d3163ecd2c547841b4c36078b21d..ff453c6059477c20528fc768d93c65d208cdfc4a 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d-transpose.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt
index eaabbf6aab172aea5c51f8071076890bb6b5bcf7..5583bd22dce18b0a0593b73bde509818b63b3f29 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv2-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv2D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt
index 838e070d79d2d7cfbd631f1a5e9960412cfdae5a..63f0c32a7c8f7e530c76c64fa619102bc12f9ad9 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d-transpose.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv3D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt
index 4bd8cfc1a48cd839e2ffa54d0d0ca863060406d8..b77726252ccca30a7c6555fb569eb65b69e34998 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-conv3-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional.Conv3D\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt
index 57eccb03ffeb90652b019b5ce8a519797e4a3a3d..92db9f6dcd2f77c4253eb77df4a26fb632b2a766 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-dense.pbtxt
@@ -2,6 +2,7 @@ path: "tensorflow.layers.Dense"
 tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Dense\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt
index a1ec00eeeaa98a6199e29b187b0760ddc92db09d..80fa846a24c9162d8521bdb4f098b9cd8e34aedb 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-dropout.pbtxt
@@ -2,6 +2,7 @@ path: "tensorflow.layers.Dropout"
 tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Dropout\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt
index a06943d51a52f1951056136445b0d5786d801b5b..f63213b3dde40aa54b165c1c269c26fd2cd9e3b4 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-flatten.pbtxt
@@ -2,6 +2,7 @@ path: "tensorflow.layers.Flatten"
 tf_class {
   is_instance: "<class \'tensorflow.python.layers.core.Flatten\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt
index 24fda0c87ed0aeabd0fd4a16bb2efab444f8cd8a..4e45b2d513bb72bb47433d72c310d6a34fbc0c01 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-layer.pbtxt
@@ -1,6 +1,7 @@
 path: "tensorflow.layers.Layer"
 tf_class {
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt
index 4c3d00e0e1ddfe95c56f9ebc7c5d609c79dd44d4..19ec33fce775caa634e71e2295ac945a6f70ade9 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling1-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling1D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt
index f7e2017b0c9438130f1cfb2431eb73ca4d3103c5..76180c333a21c592a3b53bb445df9b12d3596552 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling2-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling2D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt
index 84780926a38ff811a5ab35fadfac690a6dbbbbe2..ded75c8ff09efc6746ddd2284f53d2c021cc473c 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-max-pooling3-d.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.pooling.MaxPooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.pooling._Pooling3D\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt
index 05799ecfc9fdb9ff44620a67dcdbdc4426fddced..3dbfa5453f8e0ebb02429df9c4cbdf98de6b8ced 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv1-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._SeparableConv\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt
index c2aeb35c4648bcce22ca73c838a85803a6b9cedf..ab171df1d1650e19836018f3316e6919f6d36def 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.-separable-conv2-d.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.layers.convolutional._SeparableConv\'>"
   is_instance: "<class \'tensorflow.python.layers.convolutional._Conv\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.layers.pbtxt b/tensorflow/tools/api/golden/tensorflow.layers.pbtxt
index 59134f84891ad5518dcb5331ce04475482c8b59e..df74c32e1f10cc7540ef105adef6be681e93d089 100644
--- a/tensorflow/tools/api/golden/tensorflow.layers.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.layers.pbtxt
@@ -76,10 +76,6 @@ tf_module {
     name: "SeparableConv2D"
     mtype: "<type \'type\'>"
   }
-  member_method {
-    name: "Input"
-    argspec: "args=[\'shape\', \'batch_size\', \'name\', \'dtype\', \'sparse\', \'tensor\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \"<dtype: \'float32\'>\", \'False\', \'None\'], "
-  }
   member_method {
     name: "average_pooling1d"
     argspec: "args=[\'inputs\', \'pool_size\', \'strides\', \'padding\', \'data_format\', \'name\'], varargs=None, keywords=None, defaults=[\'valid\', \'channels_last\', \'None\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt
index 44536787f09fc98bba8a4eb0bc562427cfe48b8b..9c71a24d0500e2091e0ae94cc4dd7ed6b788a54f 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-l-s-t-m-cell.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.LayerRNNCell\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt
index 768565d3cacbd1313ee5a64c9b15f9ab70683772..9e19f96b7452616956fb7fd3ca62d8f4b25a2122 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-basic-r-n-n-cell.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.LayerRNNCell\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt
index 0d253e5dd233d6d2b6ad0070a463c283a8769dab..7540aa62861895a7c41840476d4edb79785a77a9 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-device-wrapper.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.DeviceWrapper\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt
index 97edf245f6fbed393a6fb8dbf1e83649e9ac4b4e..fc1ff386690f9c7acb11d4cc0770e394f78350ad 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-dropout-wrapper.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.DropoutWrapper\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt
index 6ecc134d4df866ab5d59e238a8157064421579bd..751122cfff3bf9c55dd9fa264fdf2e1960940724 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-g-r-u-cell.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.LayerRNNCell\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt
index 4b3ca1578ba52f30e3405ff198fb716496a462c6..4b6313f395fd8fd4ec2af78365117620263e7a55 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-l-s-t-m-cell.pbtxt
@@ -4,6 +4,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.LayerRNNCell\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt
index 9a6c73a079884b8ab92be1c9e89b2a9f34aad851..00e8c71140596ecea237ce05a09feff1fbb49001 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-multi-r-n-n-cell.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.MultiRNNCell\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt
index 27488f8e73f20456fae911511ecd2e41a60da351..3852f90dd6c4a254e20e789bdeb7796d61cef6bc 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-r-n-n-cell.pbtxt
@@ -2,6 +2,7 @@ path: "tensorflow.nn.rnn_cell.RNNCell"
 tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt
index 3310836ed26387718115c2454300b9edfe930451..8f3f0f7506ef49014b31cd4bc04f1cb1e0d696fc 100644
--- a/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.nn.rnn_cell.-residual-wrapper.pbtxt
@@ -3,6 +3,7 @@ tf_class {
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.ResidualWrapper\'>"
   is_instance: "<class \'tensorflow.python.ops.rnn_cell_impl.RNNCell\'>"
   is_instance: "<class \'tensorflow.python.layers.base.Layer\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "activity_regularizer"
diff --git a/tensorflow/tools/api/golden/tensorflow.pbtxt b/tensorflow/tools/api/golden/tensorflow.pbtxt
index 066c4513ff5185b50bdf193f579e71e505dbd3b6..222553eb4176388d4d9455e32930e2b0d42f250c 100644
--- a/tensorflow/tools/api/golden/tensorflow.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.pbtxt
@@ -1600,6 +1600,10 @@ tf_module {
     name: "reduce_sum"
     argspec: "args=[\'input_tensor\', \'axis\', \'keepdims\', \'name\', \'reduction_indices\', \'keep_dims\'], varargs=None, keywords=None, defaults=[\'None\', \'None\', \'None\', \'None\', \'None\'], "
   }
+  member_method {
+    name: "regex_replace"
+    argspec: "args=[\'input\', \'pattern\', \'rewrite\', \'replace_global\', \'name\'], varargs=None, keywords=None, defaults=[\'True\', \'None\'], "
+  }
   member_method {
     name: "register_tensor_conversion_function"
     argspec: "args=[\'base_type\', \'conversion_func\', \'priority\'], varargs=None, keywords=None, defaults=[\'100\'], "
@@ -1988,6 +1992,10 @@ tf_module {
     name: "tile"
     argspec: "args=[\'input\', \'multiples\', \'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
   }
+  member_method {
+    name: "timestamp"
+    argspec: "args=[\'name\'], varargs=None, keywords=None, defaults=[\'None\'], "
+  }
   member_method {
     name: "to_bfloat16"
     argspec: "args=[\'x\', \'name\'], varargs=None, keywords=None, defaults=[\'ToBFloat16\'], "
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-adadelta-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-adadelta-optimizer.pbtxt
index 4eea52596abfb42bedbd5aa7061aaddc496991e1..16bfbf20d5227d6308248bebcb62f32a2df8ef41 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-adadelta-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-adadelta-optimizer.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.train.AdadeltaOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.adadelta.AdadeltaOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-adagrad-d-a-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-adagrad-d-a-optimizer.pbtxt
index 5aaaf0e20b7f77c130de856677dc7eee5d825fe9..61cde9181c2367153b7b289b41bd932482bb92fd 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-adagrad-d-a-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-adagrad-d-a-optimizer.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.train.AdagradDAOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.adagrad_da.AdagradDAOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-adagrad-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-adagrad-optimizer.pbtxt
index 7f1201879cc43d2a526ed65779ccd6a705812366..0a998c1afe4fff6e215360bc1cf8fc135754223c 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-adagrad-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-adagrad-optimizer.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.train.AdagradOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.adagrad.AdagradOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-adam-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-adam-optimizer.pbtxt
index 503c439d83379e8c36569bda55f3394eedcac8f9..cc5954152577796ee7a5a6e1cedc873647d64f7c 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-adam-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-adam-optimizer.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.train.AdamOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.adam.AdamOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-ftrl-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-ftrl-optimizer.pbtxt
index 39c071748c5200115ed07f9b234b48552546ef10..1add3a902122341a706c38b19ea6ff5882c26445 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-ftrl-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-ftrl-optimizer.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.train.FtrlOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.ftrl.FtrlOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-gradient-descent-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-gradient-descent-optimizer.pbtxt
index 6b441786ca17b98ab9aa5fd04af60372f2085921..ef5bbd6ace29abb5c73516176fcc7594a58d493a 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-gradient-descent-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-gradient-descent-optimizer.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.train.GradientDescentOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.gradient_descent.GradientDescentOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-momentum-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-momentum-optimizer.pbtxt
index 80f3963bacdcdbe01eb4c77e18dc155f1e5ea682..3d6e87f5eb44de9d6ce1bdd25a54b8df9020cc03 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-momentum-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-momentum-optimizer.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.train.MomentumOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.momentum.MomentumOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-optimizer.pbtxt
index c880ba328a6b589e0bf0da8a010d417d5fc7c98a..e73861ff7cb2d90d8efac72cdd7de3b27395f29e 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-optimizer.pbtxt
@@ -1,7 +1,7 @@
 path: "tensorflow.train.Optimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-proximal-adagrad-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-proximal-adagrad-optimizer.pbtxt
index 6acdf35f786dc531a642f4280489bb9df3625b27..301b35b199c87890a0aef4139eb06253592ce0c4 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-proximal-adagrad-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-proximal-adagrad-optimizer.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.train.ProximalAdagradOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.proximal_adagrad.ProximalAdagradOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-proximal-gradient-descent-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-proximal-gradient-descent-optimizer.pbtxt
index 00b1e309e360f524283c746bb8499d970a62bbe7..8815befa936a85522011111a4a6270d22cbc25ae 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-proximal-gradient-descent-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-proximal-gradient-descent-optimizer.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.train.ProximalGradientDescentOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.proximal_gradient_descent.ProximalGradientDescentOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-r-m-s-prop-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-r-m-s-prop-optimizer.pbtxt
index 05dc391cab99f982ef488eaf7231a9cd683b72b0..e9819683ba5ec1bcacb3cdbcb2d787e866a77b6f 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-r-m-s-prop-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-r-m-s-prop-optimizer.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.train.RMSPropOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.rmsprop.RMSPropOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/golden/tensorflow.train.-sync-replicas-optimizer.pbtxt b/tensorflow/tools/api/golden/tensorflow.train.-sync-replicas-optimizer.pbtxt
index 4be28192610a735d89b49c7e332e3135dce23eb8..3db96aff876b88b80b647570cf68b1ebc0b2da3b 100644
--- a/tensorflow/tools/api/golden/tensorflow.train.-sync-replicas-optimizer.pbtxt
+++ b/tensorflow/tools/api/golden/tensorflow.train.-sync-replicas-optimizer.pbtxt
@@ -2,7 +2,7 @@ path: "tensorflow.train.SyncReplicasOptimizer"
 tf_class {
   is_instance: "<class \'tensorflow.python.training.sync_replicas_optimizer.SyncReplicasOptimizer\'>"
   is_instance: "<class \'tensorflow.python.training.optimizer.Optimizer\'>"
-  is_instance: "<class \'tensorflow.python.training.checkpointable.Checkpointable\'>"
+  is_instance: "<class \'tensorflow.python.training.checkpointable.CheckpointableBase\'>"
   is_instance: "<type \'object\'>"
   member {
     name: "GATE_GRAPH"
diff --git a/tensorflow/tools/api/tests/api_compatibility_test.py b/tensorflow/tools/api/tests/api_compatibility_test.py
index c1e09cc531ed8e8995e3e73b87e96b72fba6c038..2a784973e1098bb1f67eb1b002b7b006f69670ff 100644
--- a/tensorflow/tools/api/tests/api_compatibility_test.py
+++ b/tensorflow/tools/api/tests/api_compatibility_test.py
@@ -165,7 +165,7 @@ class ApiCompatibilityTest(test.TestCase):
       logging.error('%d differences found between API and golden.', diff_count)
       messages = verbose_diffs if verbose else diffs
       for i in range(diff_count):
-        logging.error('Issue %d\t: %s', i + 1, messages[i])
+        print('Issue %d\t: %s' % (i + 1, messages[i]), file=sys.stderr)
 
       if update_goldens:
         # Write files if requested.
diff --git a/tensorflow/tools/ci_build/Dockerfile.rbe.gpu b/tensorflow/tools/ci_build/Dockerfile.rbe.gpu
new file mode 100644
index 0000000000000000000000000000000000000000..24ff4765a619701cd614414d2b06f7fa4ce7d8c0
--- /dev/null
+++ b/tensorflow/tools/ci_build/Dockerfile.rbe.gpu
@@ -0,0 +1,26 @@
+FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
+
+LABEL maintainer="Nick Lopez <ngiraldo@google.com>"
+
+# In the Ubuntu 16.04 images, cuDNN is installed in system paths. Copy the
+# header and libraries to /usr/local/cuda.
+RUN cp -P /usr/include/cudnn.h /usr/local/cuda/include
+RUN cp -P /usr/lib/x86_64-linux-gnu/libcudnn* /usr/local/cuda/lib64
+
+# Copy and run the install scripts.
+COPY install/*.sh /install/
+ARG DEBIAN_FRONTEND=noninteractive
+RUN /install/install_bootstrap_deb_packages.sh
+RUN add-apt-repository -y ppa:openjdk-r/ppa && \
+    add-apt-repository -y ppa:george-edison55/cmake-3.x
+RUN /install/install_deb_packages.sh
+RUN /install/install_pip_packages.sh
+RUN /install/install_golang.sh
+
+# Install clang from pre-built package
+RUN cd /tmp && \
+    wget https://storage.googleapis.com/clang-builds-stable/clang-ubuntu16_04/clang_r323528.tar.gz && \
+    echo "26752d9f5785df07193fac8316ba5d5ba3bec36d970c29a1577360848818ac74  clang_r323528.tar.gz" | sha256sum -c && \
+    tar -C /usr/local -xf clang_r323528.tar.gz && \
+    rm clang_r323528.tar.gz
+
diff --git a/tensorflow/tools/ci_build/builds/test_tutorials.sh b/tensorflow/tools/ci_build/builds/test_tutorials.sh
index 67e5af556405a5c659000a07a79a6bd9a1d1e542..db335f14ca4f88ade7a540ffab7ed9de67f1248e 100755
--- a/tensorflow/tools/ci_build/builds/test_tutorials.sh
+++ b/tensorflow/tools/ci_build/builds/test_tutorials.sh
@@ -277,17 +277,6 @@ test_ptb_word_lm() {
   fi
 }
 
-
-# -----------------------------------------------------------
-# translate_test
-test_translate_test() {
-  LOG_FILE=$1
-
-  run_in_directory "${TEST_DIR}" "${LOG_FILE}" \
-    "${TF_MODELS_DIR}/tutorials/rnn/translate/translate.py" --self_test=True
-}
-
-
 # Run the tutorial tests
 test_runner "tutorial test-on-install" \
     "${TUT_TESTS}" "${TF_BUILD_TUT_TEST_BLACKLIST}" "${LOGS_DIR}"
diff --git a/tensorflow/tools/ci_build/builds/with_the_same_user b/tensorflow/tools/ci_build/builds/with_the_same_user
index 5817716c8dec37dfdfd50defb4b20b1deafced70..d4bf546d401d058bd205a70c147615c8efc4f4ba 100755
--- a/tensorflow/tools/ci_build/builds/with_the_same_user
+++ b/tensorflow/tools/ci_build/builds/with_the_same_user
@@ -36,8 +36,13 @@ else
   rm /this_is_writable_file_system
 fi
 
+if [ -n "${CI_BUILD_USER_FORCE_BADNAME}" ]; then
+  ADDUSER_OPTS="--force-badname"
+fi
+
 getent group "${CI_BUILD_GID}" || addgroup --gid "${CI_BUILD_GID}" "${CI_BUILD_GROUP}"
-getent passwd "${CI_BUILD_UID}" || adduser --gid "${CI_BUILD_GID}" --uid "${CI_BUILD_UID}" \
+getent passwd "${CI_BUILD_UID}" || adduser ${ADDUSER_OPTS} \
+    --gid "${CI_BUILD_GID}" --uid "${CI_BUILD_UID}" \
     --gecos "${CI_BUILD_USER} (generated by with_the_same_user script)" \
     --disabled-password --home "${CI_BUILD_HOME}" --quiet "${CI_BUILD_USER}"
 usermod -a -G sudo "${CI_BUILD_USER}"
diff --git a/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh b/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh
index cfeaebdbf57c01fef7cd81dae76217429336d0ff..d0816c92b7308a1079579e605ee9af491a0533fb 100755
--- a/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh
+++ b/tensorflow/tools/ci_build/gpu_build/parallel_gpu_execute.sh
@@ -54,3 +54,6 @@ for i in `seq 0 $((TF_GPU_COUNT-1))`; do
   fi
 done
 
+echo "Cannot find a free GPU to run the test $* on, exiting with failure..."
+exit 1
+
diff --git a/tensorflow/tools/ci_build/install/install_bazel.sh b/tensorflow/tools/ci_build/install/install_bazel.sh
index 1df6a84d7c6f86abfb965063625ac43a3f1a57fb..3e27a94cf2bf3110ac181d6ef5a57366be17255f 100755
--- a/tensorflow/tools/ci_build/install/install_bazel.sh
+++ b/tensorflow/tools/ci_build/install/install_bazel.sh
@@ -15,7 +15,7 @@
 # ==============================================================================
 
 # Select bazel version.
-BAZEL_VERSION="0.10.0"
+BAZEL_VERSION="0.11.0"
 
 set +e
 local_bazel_ver=$(bazel version 2>&1 | grep -i label | awk '{print $3}')
diff --git a/tensorflow/tools/docker/Dockerfile.devel b/tensorflow/tools/docker/Dockerfile.devel
index d16761c3675942838fd2be0ea6e0b7463a3bf249..22c73c3fe13f2cb763295fa25b43e2f82c0e8962 100644
--- a/tensorflow/tools/docker/Dockerfile.devel
+++ b/tensorflow/tools/docker/Dockerfile.devel
@@ -57,7 +57,7 @@ RUN echo "startup --batch" >>/etc/bazel.bazelrc
 RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
     >>/etc/bazel.bazelrc
 # Install the most recent bazel release.
-ENV BAZEL_VERSION 0.8.0
+ENV BAZEL_VERSION 0.11.0
 WORKDIR /
 RUN mkdir /bazel && \
     cd /bazel && \
diff --git a/tensorflow/tools/docker/Dockerfile.devel-gpu b/tensorflow/tools/docker/Dockerfile.devel-gpu
index 4ef37881bc91aaa58bab031c69b4a96c2a9d8ec1..69ba340f9201266fd2c2f86571e83f6acdcda950 100644
--- a/tensorflow/tools/docker/Dockerfile.devel-gpu
+++ b/tensorflow/tools/docker/Dockerfile.devel-gpu
@@ -66,7 +66,7 @@ RUN echo "startup --batch" >>/etc/bazel.bazelrc
 RUN echo "build --spawn_strategy=standalone --genrule_strategy=standalone" \
     >>/etc/bazel.bazelrc
 # Install the most recent bazel release.
-ENV BAZEL_VERSION 0.8.0
+ENV BAZEL_VERSION 0.11.0
 WORKDIR /
 RUN mkdir /bazel && \
     cd /bazel && \
diff --git a/tensorflow/tools/graph_transforms/BUILD b/tensorflow/tools/graph_transforms/BUILD
index ad3668fa02e102607c9a03ac312451a147affdda..4fe4fc3b137ddf453c9194424a0c4dc31c5a12c3 100644
--- a/tensorflow/tools/graph_transforms/BUILD
+++ b/tensorflow/tools/graph_transforms/BUILD
@@ -134,8 +134,8 @@ cc_library(
         "//tensorflow/core:tensorflow",
         "//tensorflow/contrib/rnn:gru_ops_op_lib",
         "//tensorflow/contrib/rnn:lstm_ops_op_lib",
+        "//tensorflow/core/kernels:quantization_utils",
     ] + if_not_windows([
-        "//tensorflow/core/kernels:quantized_ops",
         "//tensorflow/core/kernels:remote_fused_graph_rewriter_transform",
         "//tensorflow/core/kernels/hexagon:hexagon_rewriter_transform",
     ]),
diff --git a/tensorflow/tools/integration_tests/gcs_smoke_test/BUILD.bazel b/tensorflow/tools/integration_tests/gcs_smoke_test/BUILD.bazel
new file mode 100755
index 0000000000000000000000000000000000000000..0acc139df975fe58ad436837cedd711c54752598
--- /dev/null
+++ b/tensorflow/tools/integration_tests/gcs_smoke_test/BUILD.bazel
@@ -0,0 +1,67 @@
+package(default_visibility = ["//visibility:public"])
+
+load("@rbe_integration_test//skylark:integration_tests.bzl", "sut_component", "integration_test")
+load("@rbe_integration_test//skylark:toolchains.bzl", "toolchain_container_images")
+
+sut_component(
+    name = "gcs",
+    docker_image = toolchain_container_images()["tensorflow"],
+    setups = [{
+        "program": "setup.sh",
+        "args": [
+            "gs://tensorflow-test-bucket/tf-gcs-test",
+        ],
+        "output_properties": ["gcs_path"],
+        "timeout_seconds": 100,
+    }],
+    teardowns = [{
+        "program": "teardown.sh",
+        "args": ["{gcs_path}"],
+        "timeout_seconds": 100,
+    }],
+)
+
+py_binary(
+    name = "gcs_smoke",
+    srcs = ["gcs_smoke.py"],
+)
+
+sh_binary(
+    name = "test_wrapper",
+    srcs = ["test_wrapper.sh"],
+    data = [
+        "gcs_smoke",
+    ],
+)
+
+integration_test(
+    name = "gcs_smoke_test",
+    sut_deps = {
+        ":gcs": "gcs",
+    },
+    tags = [
+        "manual",
+        "notap",
+    ],
+    test = {
+        "program": ":test_wrapper",
+        "args": [
+            "--gcs_bucket_url={gcs#gcs_path}",
+            "--num_examples=20",
+        ],
+        "timeout_seconds": 250,
+    },
+    test_docker_image = toolchain_container_images()["tensorflow"],
+    test_type = "MultiMachine",
+)
+
+filegroup(
+    name = "all_files",
+    srcs = glob(
+        ["**/*"],
+        exclude = [
+            "**/METADATA",
+            "**/OWNERS",
+        ],
+    ),
+)
diff --git a/tensorflow/tools/integration_tests/gcs_smoke_test/gcs_smoke.py b/tensorflow/tools/integration_tests/gcs_smoke_test/gcs_smoke.py
new file mode 100755
index 0000000000000000000000000000000000000000..8438c2156cb09b4d8c9442d9a5f4de67e59272f2
--- /dev/null
+++ b/tensorflow/tools/integration_tests/gcs_smoke_test/gcs_smoke.py
@@ -0,0 +1,253 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Smoke test for reading records from GCS to TensorFlow."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import sys
+import time
+
+import numpy as np
+import tensorflow as tf
+from tensorflow.core.example import example_pb2
+from tensorflow.python.lib.io import file_io
+
+flags = tf.app.flags
+flags.DEFINE_string("gcs_bucket_url", "",
+                    "The URL to the GCS bucket in which the temporary "
+                    "tfrecord file is to be written and read, e.g., "
+                    "gs://my-gcs-bucket/test-directory")
+flags.DEFINE_integer("num_examples", 10, "Number of examples to generate")
+
+FLAGS = flags.FLAGS
+
+
+def create_examples(num_examples, input_mean):
+  """Create Example protos containing data."""
+  ids = np.arange(num_examples).reshape([num_examples, 1])
+  inputs = np.random.randn(num_examples, 1) + input_mean
+  target = inputs - input_mean
+  examples = []
+  for row in range(num_examples):
+    ex = example_pb2.Example()
+    ex.features.feature["id"].bytes_list.value.append(str(ids[row, 0]))
+    ex.features.feature["target"].float_list.value.append(target[row, 0])
+    ex.features.feature["inputs"].float_list.value.append(inputs[row, 0])
+    examples.append(ex)
+  return examples
+
+
+def create_dir_test():
+  """Verifies file_io directory handling methods."""
+
+  # Test directory creation.
+  starttime_ms = int(round(time.time() * 1000))
+  dir_name = "%s/tf_gcs_test_%s" % (FLAGS.gcs_bucket_url, starttime_ms)
+  print("Creating dir %s" % dir_name)
+  file_io.create_dir(dir_name)
+  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
+  print("Created directory in: %d milliseconds" % elapsed_ms)
+
+  # Check that the directory exists.
+  dir_exists = file_io.is_directory(dir_name)
+  assert dir_exists
+  print("%s directory exists: %s" % (dir_name, dir_exists))
+
+  # Test recursive directory creation.
+  starttime_ms = int(round(time.time() * 1000))
+  recursive_dir_name = "%s/%s/%s" % (dir_name,
+                                     "nested_dir1",
+                                     "nested_dir2")
+  print("Creating recursive dir %s" % recursive_dir_name)
+  file_io.recursive_create_dir(recursive_dir_name)
+  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
+  print("Created directory recursively in: %d milliseconds" % elapsed_ms)
+
+  # Check that the directory exists.
+  recursive_dir_exists = file_io.is_directory(recursive_dir_name)
+  assert recursive_dir_exists
+  print("%s directory exists: %s" % (recursive_dir_name, recursive_dir_exists))
+
+  # Create some contents in the just created directory and list the contents.
+  num_files = 10
+  files_to_create = ["file_%d.txt" % n for n in range(num_files)]
+  for file_num in files_to_create:
+    file_name = "%s/%s" % (dir_name, file_num)
+    print("Creating file %s." % file_name)
+    file_io.write_string_to_file(file_name, "test file.")
+
+  print("Listing directory %s." % dir_name)
+  starttime_ms = int(round(time.time() * 1000))
+  directory_contents = file_io.list_directory(dir_name)
+  print(directory_contents)
+  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
+  print("Listed directory %s in %s milliseconds" % (dir_name, elapsed_ms))
+  assert set(directory_contents) == set(files_to_create + ["nested_dir1/"])
+
+  # Test directory renaming.
+  dir_to_rename = "%s/old_dir" % dir_name
+  new_dir_name = "%s/new_dir" % dir_name
+  file_io.create_dir(dir_to_rename)
+  assert file_io.is_directory(dir_to_rename)
+  assert not file_io.is_directory(new_dir_name)
+
+  starttime_ms = int(round(time.time() * 1000))
+  print("Will try renaming directory %s to %s" % (dir_to_rename, new_dir_name))
+  file_io.rename(dir_to_rename, new_dir_name)
+  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
+  print("Renamed directory %s to %s in %s milliseconds" % (
+      dir_to_rename, new_dir_name, elapsed_ms))
+  assert not file_io.is_directory(dir_to_rename)
+  assert file_io.is_directory(new_dir_name)
+
+  # Test recursive directory deletion.
+  print("Deleting directory recursively %s." % dir_name)
+  starttime_ms = int(round(time.time() * 1000))
+  file_io.delete_recursively(dir_name)
+  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
+  dir_exists = file_io.is_directory(dir_name)
+  assert not dir_exists
+  print("Deleted directory recursively %s in %s milliseconds" % (
+      dir_name, elapsed_ms))
+
+
+def create_object_test():
+  """Verifies file_io's object manipulation methods."""
+  starttime_ms = int(round(time.time() * 1000))
+  dir_name = "%s/tf_gcs_test_%s" % (FLAGS.gcs_bucket_url, starttime_ms)
+  print("Creating dir %s." % dir_name)
+  file_io.create_dir(dir_name)
+
+  num_files = 5
+  # Create files of 2 different patterns in this directory.
+  files_pattern_1 = ["%s/test_file_%d.txt" % (dir_name, n)
+                     for n in range(num_files)]
+  files_pattern_2 = ["%s/testfile%d.txt" % (dir_name, n)
+                     for n in range(num_files)]
+
+  starttime_ms = int(round(time.time() * 1000))
+  files_to_create = files_pattern_1 + files_pattern_2
+  for file_name in files_to_create:
+    print("Creating file %s." % file_name)
+    file_io.write_string_to_file(file_name, "test file creation.")
+  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
+  print("Created %d files in %s milliseconds" %
+        (len(files_to_create), elapsed_ms))
+
+  # Listing files of pattern1.
+  list_files_pattern = "%s/test_file*.txt" % dir_name
+  print("Getting files matching pattern %s." % list_files_pattern)
+  starttime_ms = int(round(time.time() * 1000))
+  files_list = file_io.get_matching_files(list_files_pattern)
+  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
+  print("Listed files in %s milliseconds" % elapsed_ms)
+  print(files_list)
+  assert set(files_list) == set(files_pattern_1)
+
+  # Listing files of pattern2.
+  list_files_pattern = "%s/testfile*.txt" % dir_name
+  print("Getting files matching pattern %s." % list_files_pattern)
+  starttime_ms = int(round(time.time() * 1000))
+  files_list = file_io.get_matching_files(list_files_pattern)
+  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
+  print("Listed files in %s milliseconds" % elapsed_ms)
+  print(files_list)
+  assert set(files_list) == set(files_pattern_2)
+
+  # Test renaming file.
+  file_to_rename = "%s/oldname.txt" % dir_name
+  file_new_name = "%s/newname.txt" % dir_name
+  file_io.write_string_to_file(file_to_rename, "test file.")
+  assert file_io.file_exists(file_to_rename)
+  assert not file_io.file_exists(file_new_name)
+
+  print("Will try renaming file %s to %s" % (file_to_rename, file_new_name))
+  starttime_ms = int(round(time.time() * 1000))
+  file_io.rename(file_to_rename, file_new_name)
+  elapsed_ms = int(round(time.time() * 1000)) - starttime_ms
+  print("File %s renamed to %s in %s milliseconds" % (
+      file_to_rename, file_new_name, elapsed_ms))
+  assert not file_io.file_exists(file_to_rename)
+  assert file_io.file_exists(file_new_name)
+
+  # Delete directory.
+  print("Deleting directory %s." % dir_name)
+  file_io.delete_recursively(dir_name)
+
+
+def main(argv):
+  del argv  # Unused.
+  # Sanity check on the GCS bucket URL.
+  if not FLAGS.gcs_bucket_url or not FLAGS.gcs_bucket_url.startswith("gs://"):
+    print("ERROR: Invalid GCS bucket URL: \"%s\"" % FLAGS.gcs_bucket_url)
+    sys.exit(1)
+
+  # Verify that writing to the records file in GCS works.
+  print("\n=== Testing writing and reading of GCS record file... ===")
+  example_data = create_examples(FLAGS.num_examples, 5)
+  with tf.python_io.TFRecordWriter(FLAGS.gcs_bucket_url) as hf:
+    for e in example_data:
+      hf.write(e.SerializeToString())
+
+    print("Data written to: %s" % FLAGS.gcs_bucket_url)
+
+  # Verify that reading from the tfrecord file works and that
+  # tf_record_iterator works.
+  record_iter = tf.python_io.tf_record_iterator(FLAGS.gcs_bucket_url)
+  read_count = 0
+  for _ in record_iter:
+    read_count += 1
+  print("Read %d records using tf_record_iterator" % read_count)
+
+  if read_count != FLAGS.num_examples:
+    print("FAIL: The number of records read from tf_record_iterator (%d) "
+          "differs from the expected number (%d)" % (read_count,
+                                                     FLAGS.num_examples))
+    sys.exit(1)
+
+  # Verify that running the read op in a session works.
+  print("\n=== Testing TFRecordReader.read op in a session... ===")
+  with tf.Graph().as_default() as _:
+    filename_queue = tf.train.string_input_producer([FLAGS.gcs_bucket_url],
+                                                    num_epochs=1)
+    reader = tf.TFRecordReader()
+    _, serialized_example = reader.read(filename_queue)
+
+    with tf.Session() as sess:
+      sess.run(tf.global_variables_initializer())
+      sess.run(tf.local_variables_initializer())
+      tf.train.start_queue_runners()
+      index = 0
+      for _ in range(FLAGS.num_examples):
+        print("Read record: %d" % index)
+        sess.run(serialized_example)
+        index += 1
+
+      # Reading one more record should trigger an exception.
+      try:
+        sess.run(serialized_example)
+        print("FAIL: Failed to catch the expected OutOfRangeError while "
+              "reading one more record than is available")
+        sys.exit(1)
+      except tf.errors.OutOfRangeError:
+        print("Successfully caught the expected OutOfRangeError while "
+              "reading one more record than is available")
+
+  create_dir_test()
+  create_object_test()
+
+if __name__ == "__main__":
+  tf.app.run(main)
diff --git a/tensorflow/tools/integration_tests/gcs_smoke_test/setup.sh b/tensorflow/tools/integration_tests/gcs_smoke_test/setup.sh
new file mode 100755
index 0000000000000000000000000000000000000000..6553ba5e3093c26d3c95f40216cd3922a1fb9e4e
--- /dev/null
+++ b/tensorflow/tools/integration_tests/gcs_smoke_test/setup.sh
@@ -0,0 +1,20 @@
+#!/bin/bash
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+GCS_NUMBER=$(cat /dev/urandom | tr -dc 'A-F0-9' | fold -w 8 | head -n 1)
+GCS_PATH="$1"/"$GCS_NUMBER".tfrecord
+
+echo "gcs_path=$GCS_PATH" > "$_SETUP_OUTPUT"
+touch "$_SETUP_DONE"
diff --git a/tensorflow/tools/integration_tests/gcs_smoke_test/teardown.sh b/tensorflow/tools/integration_tests/gcs_smoke_test/teardown.sh
new file mode 100755
index 0000000000000000000000000000000000000000..852486d1677ec597fe56111ffb0e470c333c1cd7
--- /dev/null
+++ b/tensorflow/tools/integration_tests/gcs_smoke_test/teardown.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+GSUTIL_BIN="/var/gcloud/google-cloud-sdk/bin/gsutil"
+
+echo "Got teardown argument $1"
+
+if "${GSUTIL_BIN}" rm "$1"
+then
+  echo "Cleaned up new tfrecord file in GCS: '$1'"
+else
+  echo "FAIL: Unable to clean up new tfrecord file in GCS: '$1'"
+  exit 1
+fi
diff --git a/tensorflow/tools/integration_tests/gcs_smoke_test/test_wrapper.sh b/tensorflow/tools/integration_tests/gcs_smoke_test/test_wrapper.sh
new file mode 100755
index 0000000000000000000000000000000000000000..ef29dee3462c21d6318a6fb7e7e658961f0d88dd
--- /dev/null
+++ b/tensorflow/tools/integration_tests/gcs_smoke_test/test_wrapper.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+# This is a python2 only test.
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+# Test Tensorflow package installation.
+/usr/local/bin/pip install --user tf-nightly
+
+# Test Tensorflow interaction with GCS.
+python tensorflow/tools/integration_tests/gcs_smoke_test/gcs_smoke.py "$@"
diff --git a/tensorflow/tools/lib_package/BUILD b/tensorflow/tools/lib_package/BUILD
index 614457e8996491a60d4a7df213180117bce321ad..3fbdb5cacd1fd0039deaae5ac330b6c2ca006a68 100644
--- a/tensorflow/tools/lib_package/BUILD
+++ b/tensorflow/tools/lib_package/BUILD
@@ -27,6 +27,7 @@ pkg_tar(
         ":cheaders",
         ":clib",
         ":clicenses",
+        ":eager_cheaders",
     ],
 )
 
@@ -57,7 +58,6 @@ pkg_tar(
     name = "cheaders",
     files = [
         "//tensorflow/c:headers",
-        "//tensorflow/c/eager:headers",
     ],
     package_dir = "include/tensorflow/c",
     # Mark as "manual" till
@@ -68,6 +68,20 @@ pkg_tar(
     tags = ["manual"],
 )
 
+pkg_tar(
+    name = "eager_cheaders",
+    files = [
+        "//tensorflow/c/eager:headers",
+    ],
+    package_dir = "include/tensorflow/c/eager",
+    # Mark as "manual" till
+    # https://github.com/bazelbuild/bazel/issues/2352
+    # and https://github.com/bazelbuild/bazel/issues/1580
+    # are resolved, otherwise these rules break when built
+    # with Python 3.
+    tags = ["manual"],
+)
+
 pkg_tar(
     name = "clib",
     files = ["//tensorflow:libtensorflow.so"],
diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 96bd2d5326acb27d94e785dcda7f1a4e4356bfe0..1af246f9dc3b4f0998d35726c3ea08c639af04c9 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -5,6 +5,7 @@ load("//third_party/tensorrt:tensorrt_configure.bzl", "tensorrt_configure")
 load("//third_party/mkl:build_defs.bzl", "mkl_repository")
 load("//third_party/git:git_configure.bzl", "git_configure")
 load("//third_party/py:python_configure.bzl", "python_configure")
+
 load("//third_party/sycl:sycl_configure.bzl", "sycl_configure")
 load("//third_party/toolchains/clang6:repo.bzl", "clang6_configure")
 load("//third_party/toolchains/cpus/arm:arm_compiler_configure.bzl", "arm_compiler_configure")
@@ -126,6 +127,7 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
       sha256 = "0cadb31a35b514bf2dfd6b5d38205da94ef326ec6908fc3fd7c269948467214f",
       strip_prefix = "eigen-eigen-2355b229ea4c",
       build_file = str(Label("//third_party:eigen.BUILD")),
+      patch_file = str(Label("//third_party:eigen_fix_cuda_compilation.patch")),
   )
 
   tf_http_archive(
@@ -179,11 +181,11 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "gemmlowp",
       urls = [
-          "https://mirror.bazel.build/github.com/google/gemmlowp/archive/d4d1e29a62192d8defdc057b913ef36ca582ac98.zip",
-          "https://github.com/google/gemmlowp/archive/d4d1e29a62192d8defdc057b913ef36ca582ac98.zip",
+          "https://mirror.bazel.build/github.com/google/gemmlowp/archive/7c7c744640ddc3d0af18fb245b4d23228813a71b.zip",
+          "https://github.com/google/gemmlowp/archive/7c7c744640ddc3d0af18fb245b4d23228813a71b.zip",
       ],
-      sha256 = "e2bee7afd3c43028f23dd0d7f85ddd8b21aaf79c572b658e56164ef502b2b9c7",
-      strip_prefix = "gemmlowp-d4d1e29a62192d8defdc057b913ef36ca582ac98",
+      sha256 = "b852cc90259a7357c8a323f108f2cec6e85979fc3b18b5590b99e0130044b2cf",
+      strip_prefix = "gemmlowp-7c7c744640ddc3d0af18fb245b4d23228813a71b",
   )
 
   tf_http_archive(
@@ -473,11 +475,11 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "llvm",
       urls = [
-          "https://mirror.bazel.build/github.com/llvm-mirror/llvm/archive/11b0e47b5b79bab22d27b6b2952b1f7582848063.tar.gz",
-          "https://github.com/llvm-mirror/llvm/archive/11b0e47b5b79bab22d27b6b2952b1f7582848063.tar.gz",
+          "https://mirror.bazel.build/github.com/llvm-mirror/llvm/archive/193aea3782308c66a7a12f1c37520a1b4ff1dbd8.tar.gz",
+          "https://github.com/llvm-mirror/llvm/archive/193aea3782308c66a7a12f1c37520a1b4ff1dbd8.tar.gz",
       ],
-      sha256 = "b870b6f5df94c4c0cf7c6957046fca354c37d7641e838e905279a7509b0705e9",
-      strip_prefix = "llvm-11b0e47b5b79bab22d27b6b2952b1f7582848063",
+      sha256 = "2eda56deafb8da85bc23aa52fa1fb8c39da6a58c865e5216d0a0787bd09a09ed",
+      strip_prefix = "llvm-193aea3782308c66a7a12f1c37520a1b4ff1dbd8",
       build_file = str(Label("//third_party/llvm:llvm.BUILD")),
   )
 
@@ -664,15 +666,12 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "cub_archive",
       urls = [
-          "https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.7.4.zip",
-          "https://github.com/NVlabs/cub/archive/1.7.4.zip",
+          "https://mirror.bazel.build/github.com/NVlabs/cub/archive/1.8.0.zip",
+          "https://github.com/NVlabs/cub/archive/1.8.0.zip",
       ],
-      sha256 = "20a1a39fd97e5da7f40f5f2e7fd73fd2ea59f9dc4bb8a6c5f228aa543e727e31",
-      strip_prefix = "cub-1.7.4",
+      sha256 = "6bfa06ab52a650ae7ee6963143a0bbc667d6504822cbd9670369b598f18c58c3",
+      strip_prefix = "cub-1.8.0",
       build_file = str(Label("//third_party:cub.BUILD")),
-      # TODO: remove the patch when upstream fix is accepted and released.
-      #       PR with a fix: https://github.com/NVlabs/cub/pull/125
-      patch_file = str(Label("//third_party/cub:fix_compilation_in_clang.patch")),
   )
 
   tf_http_archive(
@@ -690,13 +689,23 @@ def tf_workspace(path_prefix="", tf_repo_name=""):
   tf_http_archive(
       name = "bazel_toolchains",
       urls = [
-          "https://mirror.bazel.build/github.com/bazelbuild/bazel-toolchains/archive/f3b09700fae5d7b6e659d7cefe0dcc6e8498504c.tar.gz",
-          "https://github.com/bazelbuild/bazel-toolchains/archive/f3b09700fae5d7b6e659d7cefe0dcc6e8498504c.tar.gz",
+          "https://mirror.bazel.build/github.com/bazelbuild/bazel-toolchains/archive/44200e0c026d86c53470d107b3697a3e46469c43.tar.gz",
+          "https://github.com/bazelbuild/bazel-toolchains/archive/44200e0c026d86c53470d107b3697a3e46469c43.tar.gz",
       ],
-      sha256 = "ed829b5eea8af1f405f4cc3d6ecfc3b1365bb7843171036030a31b5127002311",
-      strip_prefix = "bazel-toolchains-f3b09700fae5d7b6e659d7cefe0dcc6e8498504c",
+      strip_prefix = "bazel-toolchains-44200e0c026d86c53470d107b3697a3e46469c43",
+      sha256 = "699b55a6916c687f4b7dc092dbbf5f64672cde0dc965f79717735ec4e5416556",
   )
 
+  tf_http_archive(
+      name = "rbe_integration_test",
+      urls = [
+          "https://mirror.bazel.build/github.com/google/rbe-integration-test/archive/78a6194c7dda200b9522cf07707e3bc695804d1e.tar.gz",
+          "https://github.com/google/rbe-integration-test/archive/78a6194c7dda200b9522cf07707e3bc695804d1e.tar.gz",
+      ],
+      sha256 = "66d93b3919a165d486c31f5290d312abe9fda2685242f812c110653c124e1db4",
+      strip_prefix = "rbe-integration-test-78a6194c7dda200b9522cf07707e3bc695804d1e",
+  )
+
   tf_http_archive(
       name = "arm_neon_2_x86_sse",
       sha256 = "c8d90aa4357f8079d427e87a6f4c493da1fa4140aee926c05902d7ec1533d9a5",
diff --git a/third_party/cub/BUILD b/third_party/cub/BUILD
deleted file mode 100644
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000
diff --git a/third_party/cub/fix_compilation_in_clang.patch b/third_party/cub/fix_compilation_in_clang.patch
deleted file mode 100644
index 384e674f2012b2b3ea59c5c1bd205873baa8cf18..0000000000000000000000000000000000000000
--- a/third_party/cub/fix_compilation_in_clang.patch
+++ /dev/null
@@ -1,23 +0,0 @@
-From 565b77f7c82048871a4d5e3e506dc663d53cd469 Mon Sep 17 00:00:00 2001
-From: Ilya Biryukov <ibiryukov@google.com>
-Date: Fri, 26 Jan 2018 18:46:06 +0100
-Subject: [PATCH] Added missing 'template' keyword.
-
-To unbreak compilation with clang.
----
- cub/device/dispatch/dispatch_radix_sort.cuh | 2 +-
- 1 file changed, 1 insertion(+), 1 deletion(-)
-
-diff --git a/cub/device/dispatch/dispatch_radix_sort.cuh b/cub/device/dispatch/dispatch_radix_sort.cuh
-index 7fbc621f..f622e212 100644
---- a/cub/device/dispatch/dispatch_radix_sort.cuh
-+++ b/cub/device/dispatch/dispatch_radix_sort.cuh
-@@ -104,7 +104,7 @@ __global__ void DeviceRadixSortUpsweepKernel(
-     CTA_SYNC();
- 
-     // Write out digit counts (striped)
--    upsweep.ExtractCounts<IS_DESCENDING>(d_spine, gridDim.x, blockIdx.x);
-+    upsweep.template ExtractCounts<IS_DESCENDING>(d_spine, gridDim.x, blockIdx.x);
- }
- 
- 
diff --git a/third_party/eigen_fix_cuda_compilation.patch b/third_party/eigen_fix_cuda_compilation.patch
new file mode 100644
index 0000000000000000000000000000000000000000..b921a7c31d5c96c79cd3033b13c60a8f7e63ba75
--- /dev/null
+++ b/third_party/eigen_fix_cuda_compilation.patch
@@ -0,0 +1,38 @@
+diff --git a/Eigen/src/Core/ProductEvaluators.h b/Eigen/src/Core/ProductEvaluators.h
+--- a/Eigen/src/Core/ProductEvaluators.h
++++ b/Eigen/src/Core/ProductEvaluators.h
+@@ -137,7 +137,7 @@ struct Assignment<DstXprType, Product<Lh
+   typename enable_if<(Options==DefaultProduct || Options==AliasFreeProduct)>::type>
+ {
+   typedef Product<Lhs,Rhs,Options> SrcXprType;
+-  static EIGEN_STRONG_INLINE
++  static EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE
+   void run(DstXprType &dst, const SrcXprType &src, const internal::assign_op<Scalar,Scalar> &)
+   {
+     Index dstRows = src.rows();
+@@ -390,7 +390,7 @@ struct generic_product_impl<Lhs,Rhs,Dens
+   typedef typename Product<Lhs,Rhs>::Scalar Scalar;
+   
+   template<typename Dst>
+-  static EIGEN_STRONG_INLINE void evalTo(Dst& dst, const Lhs& lhs, const Rhs& rhs)
++  static EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void evalTo(Dst& dst, const Lhs& lhs, const Rhs& rhs)
+   {
+     // Same as: dst.noalias() = lhs.lazyProduct(rhs);
+     // but easier on the compiler side
+@@ -398,14 +398,14 @@ struct generic_product_impl<Lhs,Rhs,Dens
+   }
+   
+   template<typename Dst>
+-  static EIGEN_STRONG_INLINE void addTo(Dst& dst, const Lhs& lhs, const Rhs& rhs)
++  static EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void addTo(Dst& dst, const Lhs& lhs, const Rhs& rhs)
+   {
+     // dst.noalias() += lhs.lazyProduct(rhs);
+     call_assignment_no_alias(dst, lhs.lazyProduct(rhs), internal::add_assign_op<typename Dst::Scalar,Scalar>());
+   }
+   
+   template<typename Dst>
+-  static EIGEN_STRONG_INLINE void subTo(Dst& dst, const Lhs& lhs, const Rhs& rhs)
++  static EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE void subTo(Dst& dst, const Lhs& lhs, const Rhs& rhs)
+   {
+     // dst.noalias() -= lhs.lazyProduct(rhs);
+     call_assignment_no_alias(dst, lhs.lazyProduct(rhs), internal::sub_assign_op<typename Dst::Scalar,Scalar>());
diff --git a/third_party/gpus/crosstool/CROSSTOOL_clang.tpl b/third_party/gpus/crosstool/CROSSTOOL_clang.tpl
index e4363d604577de09241d635b6990c9dd6429efe0..2f09473ee2ddf9a38ca0c7aa11094690607b532f 100644
--- a/third_party/gpus/crosstool/CROSSTOOL_clang.tpl
+++ b/third_party/gpus/crosstool/CROSSTOOL_clang.tpl
@@ -49,6 +49,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-lstdc++"
       }
@@ -75,6 +76,7 @@ toolchain {
     name: "alwayslink"
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       action: "c++-link-executable"
       flag_group {
         flag: "-Wl,-no-as-needed"
@@ -116,6 +118,7 @@ toolchain {
     }
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-Wl,-z,relro,-z,now"
       }
@@ -161,6 +164,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         # Stamp the binary with a unique identifier.
         flag: "-Wl,--build-id=md5"
@@ -176,6 +180,7 @@ toolchain {
       action: "c++-compile"
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag:"-no-canonical-prefixes"
       }
@@ -199,6 +204,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-B/usr/bin/"
       }
@@ -246,6 +252,7 @@ toolchain {
     }
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       action: "c++-link-executable"
       flag_group {
         flag: "-Wl,--gc-sections"
diff --git a/third_party/gpus/cuda/remote.BUILD.tpl b/third_party/gpus/cuda/remote.BUILD.tpl
index d88d512b90c352e6a301ed6efe8266d8dd6bf744..f774def5e6cec25e4920ecce0076340a31c70386 100644
--- a/third_party/gpus/cuda/remote.BUILD.tpl
+++ b/third_party/gpus/cuda/remote.BUILD.tpl
@@ -41,65 +41,65 @@ config_setting(
 
 alias(
     name = "cuda_headers",
-    actual = "%{remote_cuda_repo}cuda:cuda_headers",
+    actual = "%{remote_cuda_repo}/cuda:cuda_headers",
 )
 
 alias(
     name = "cudart_static",
-    actual = "%{remote_cuda_repo}cuda:cudart_static",
+    actual = "%{remote_cuda_repo}/cuda:cudart_static",
 )
 
 alias(
     name = "cuda_driver",
-    actual = "%{remote_cuda_repo}cuda:cuda_driver",
+    actual = "%{remote_cuda_repo}/cuda:cuda_driver",
 )
 
 alias(
     name = "cudart",
-    actual = "%{remote_cuda_repo}cuda:cudart",
+    actual = "%{remote_cuda_repo}/cuda:cudart",
 )
 
 alias(
     name = "cublas",
-    actual = "%{remote_cuda_repo}cuda:cublas",
+    actual = "%{remote_cuda_repo}/cuda:cublas",
 )
 
 alias(
     name = "cusolver",
-    actual = "%{remote_cuda_repo}cuda:cusolver",
+    actual = "%{remote_cuda_repo}/cuda:cusolver",
 )
 
 alias(
     name = "cudnn",
-    actual = "%{remote_cuda_repo}cuda:cudnn",
+    actual = "%{remote_cuda_repo}/cuda:cudnn",
 )
 
 alias(
     name = "cufft",
-    actual = "%{remote_cuda_repo}cuda:cufft",
+    actual = "%{remote_cuda_repo}/cuda:cufft",
 )
 
 alias(
     name = "curand",
-    actual = "%{remote_cuda_repo}cuda:curand",
+    actual = "%{remote_cuda_repo}/cuda:curand",
 )
 
 alias(
     name = "cuda",
-    actual = "%{remote_cuda_repo}cuda:cuda",
+    actual = "%{remote_cuda_repo}/cuda:cuda",
 )
 
 alias(
     name = "cupti_headers",
-    actual = "%{remote_cuda_repo}cuda:cupti_headers",
+    actual = "%{remote_cuda_repo}/cuda:cupti_headers",
 )
 
 alias(
     name = "cupti_dsos",
-    actual = "%{remote_cuda_repo}cuda:cupti_dsos",
+    actual = "%{remote_cuda_repo}/cuda:cupti_dsos",
 )
 
 alias(
     name = "libdevice_root",
-    actual = "%{remote_cuda_repo}cuda:libdevice_root",
+    actual = "%{remote_cuda_repo}/cuda:libdevice_root",
 )
diff --git a/third_party/toolchains/gpus/crosstool/BUILD b/third_party/toolchains/gpus/crosstool/BUILD
index a8c6b0f0291363f3a7576a70e78b3428fb984957..1f9065007ca884a46bfa391d1ee8a8f0333da235 100644
--- a/third_party/toolchains/gpus/crosstool/BUILD
+++ b/third_party/toolchains/gpus/crosstool/BUILD
@@ -50,3 +50,8 @@ filegroup(
     name = "empty",
     srcs = [],
 )
+
+filegroup(
+    name = "crosstool_wrapper_driver_is_not_gcc",
+    srcs = ["clang/bin/crosstool_wrapper_driver_is_not_gcc"],
+)
diff --git a/third_party/toolchains/gpus/crosstool/CROSSTOOL b/third_party/toolchains/gpus/crosstool/CROSSTOOL
index a47e0c7cd74edcea777d76854c2d7e97d69897fa..d6ee7e38c414dd59b76c7b2b4c95c55831bb30a8 100644
--- a/third_party/toolchains/gpus/crosstool/CROSSTOOL
+++ b/third_party/toolchains/gpus/crosstool/CROSSTOOL
@@ -53,6 +53,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-lstdc++"
       }
@@ -79,6 +80,7 @@ toolchain {
     name: "alwayslink"
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       action: "c++-link-executable"
       flag_group {
         flag: "-Wl,-no-as-needed"
@@ -120,6 +122,7 @@ toolchain {
     }
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-Wl,-z,relro,-z,now"
       }
@@ -141,8 +144,8 @@ toolchain {
       flag_group {
         # All warnings are enabled. Maybe enable -Werror as well?
         flag: "-Wall"
-        # TODO(ngiraldo): Some parts of the codebase set -Werror and hit this 
-        # warning, so switch it off for now.
+        # Some parts of the codebase set -Werror and hit this warning, so
+        # switch it off for now.
         flag: "-Wno-invalid-partial-specialization"
       }
     }
@@ -165,6 +168,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         # Stamp the binary with a unique identifier.
         flag: "-Wl,--build-id=md5"
@@ -180,6 +184,7 @@ toolchain {
       action: "c++-compile"
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag:"-no-canonical-prefixes"
       }
@@ -203,6 +208,7 @@ toolchain {
     flag_set {
       action: "c++-link-executable"
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       flag_group {
         flag: "-B/usr/bin/"
       }
@@ -250,6 +256,7 @@ toolchain {
     }
     flag_set {
       action: "c++-link-dynamic-library"
+      action: "c++-link-nodeps-dynamic-library"
       action: "c++-link-executable"
       flag_group {
         flag: "-Wl,--gc-sections"
@@ -296,7 +303,7 @@ toolchain {
   cxx_builtin_include_directory: "/usr/include/x86_64-linux-gnu/c++/5.4.0"
   cxx_builtin_include_directory: "/usr/include/c++/5.4.0/backward"
   cxx_builtin_include_directory: "/usr/local/include"
-  cxx_builtin_include_directory: "/usr/local/lib/clang/6.0.0/include"
+  cxx_builtin_include_directory: "/usr/local/lib/clang/7.0.0/include"
   cxx_builtin_include_directory: "/usr/include/x86_64-linux-gnu"
   cxx_builtin_include_directory: "/usr/include"
 }
diff --git a/third_party/toolchains/gpus/cuda/BUILD b/third_party/toolchains/gpus/cuda/BUILD
index 39136de99c901d6d6a9dafefe3163972511ec122..cfc6930851dbdca5486322bedc839dad9ed8673d 100644
--- a/third_party/toolchains/gpus/cuda/BUILD
+++ b/third_party/toolchains/gpus/cuda/BUILD
@@ -51,6 +51,7 @@ cc_library(
     includes = [
         ".",
         "cuda/include",
+        "cuda/include/crt",
     ],
     visibility = ["//visibility:public"],
 )
@@ -84,8 +85,8 @@ cc_library(
 
 cc_library(
     name = "cudart",
-    srcs = ["cuda/lib/libcudart.so.8.0"],
-    data = ["cuda/lib/libcudart.so.8.0"],
+    srcs = ["cuda/lib/libcudart.so.9.0"],
+    data = ["cuda/lib/libcudart.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -96,8 +97,8 @@ cc_library(
 
 cc_library(
     name = "cublas",
-    srcs = ["cuda/lib/libcublas.so.8.0"],
-    data = ["cuda/lib/libcublas.so.8.0"],
+    srcs = ["cuda/lib/libcublas.so.9.0"],
+    data = ["cuda/lib/libcublas.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -108,8 +109,8 @@ cc_library(
 
 cc_library(
     name = "cusolver",
-    srcs = ["cuda/lib/libcusolver.so.8.0"],
-    data = ["cuda/lib/libcusolver.so.8.0"],
+    srcs = ["cuda/lib/libcusolver.so.9.0"],
+    data = ["cuda/lib/libcusolver.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -121,8 +122,8 @@ cc_library(
 
 cc_library(
     name = "cudnn",
-    srcs = ["cuda/lib/libcudnn.so.6"],
-    data = ["cuda/lib/libcudnn.so.6"],
+    srcs = ["cuda/lib/libcudnn.so.7"],
+    data = ["cuda/lib/libcudnn.so.7"],
     includes = [
         ".",
         "cuda/include",
@@ -133,8 +134,8 @@ cc_library(
 
 cc_library(
     name = "cufft",
-    srcs = ["cuda/lib/libcufft.so.8.0"],
-    data = ["cuda/lib/libcufft.so.8.0"],
+    srcs = ["cuda/lib/libcufft.so.9.0"],
+    data = ["cuda/lib/libcufft.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -145,8 +146,8 @@ cc_library(
 
 cc_library(
     name = "curand",
-    srcs = ["cuda/lib/libcurand.so.8.0"],
-    data = ["cuda/lib/libcurand.so.8.0"],
+    srcs = ["cuda/lib/libcurand.so.9.0"],
+    data = ["cuda/lib/libcurand.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -183,7 +184,7 @@ cc_library(
 
 cc_library(
     name = "cupti_dsos",
-    data = ["cuda/lib/libcupti.so.8.0"],
+    data = ["cuda/lib/libcupti.so.9.0"],
     includes = [
         ".",
         "cuda/include",
@@ -200,1063 +201,990 @@ cc_library(
 genrule(
     name = "cuda-include",
     outs = [
-        "cuda/include/math_functions.hpp",
-        "cuda/include/cufft.h",
-        "cuda/include/nvgraph.h",
-        "cuda/include/curand_normal.h",
-        "cuda/include/curand_uniform.h",
-        "cuda/include/nppi_data_exchange_and_initialization.h",
-        "cuda/include/cuda_gl_interop.h",
-        "cuda/include/nppi_compression_functions.h",
-        "cuda/include/npp.h",
+        "cuda/include/CL/cl.h",
+        "cuda/include/CL/cl.hpp",
+        "cuda/include/CL/cl_egl.h",
+        "cuda/include/CL/cl_ext.h",
+        "cuda/include/CL/cl_gl.h",
+        "cuda/include/CL/cl_gl_ext.h",
+        "cuda/include/CL/cl_platform.h",
+        "cuda/include/CL/opencl.h",
+        "cuda/include/builtin_types.h",
+        "cuda/include/channel_descriptor.h",
+        "cuda/include/common_functions.h",
+        "cuda/include/cooperative_groups.h",
+        "cuda/include/cooperative_groups_helpers.h",
+        "cuda/include/crt/common_functions.h",
+        "cuda/include/crt/device_double_functions.h",
+        "cuda/include/crt/device_double_functions.hpp",
+        "cuda/include/crt/device_functions.h",
+        "cuda/include/crt/device_functions.hpp",
+        "cuda/include/crt/func_macro.h",
+        "cuda/include/crt/host_config.h",
+        "cuda/include/crt/host_defines.h",
+        "cuda/include/crt/host_runtime.h",
+        "cuda/include/crt/math_functions.h",
+        "cuda/include/crt/math_functions.hpp",
+        "cuda/include/crt/mma.h",
+        "cuda/include/crt/mma.hpp",
+        "cuda/include/crt/nvfunctional",
+        "cuda/include/crt/sm_70_rt.h",
+        "cuda/include/crt/sm_70_rt.hpp",
+        "cuda/include/crt/storage_class.h",
+        "cuda/include/cuComplex.h",
+        "cuda/include/cublas.h",
+        "cuda/include/cublasXt.h",
+        "cuda/include/cublas_api.h",
+        "cuda/include/cublas_v2.h",
         "cuda/include/cuda.h",
-        "cuda/include/nppi_statistics_functions.h",
-        "cuda/include/vector_functions.hpp",
-        "cuda/include/sm_32_intrinsics.hpp",
-        "cuda/include/sm_32_intrinsics.h",
-        "cuda/include/curand_discrete.h",
+        "cuda/include/cudaEGL.h",
+        "cuda/include/cudaGL.h",
+        "cuda/include/cudaProfiler.h",
+        "cuda/include/cudaVDPAU.h",
+        "cuda/include/cuda_device_runtime_api.h",
+        "cuda/include/cuda_fp16.h",
+        "cuda/include/cuda_fp16.hpp",
+        "cuda/include/cuda_gl_interop.h",
+        "cuda/include/cuda_occupancy.h",
+        "cuda/include/cuda_profiler_api.h",
         "cuda/include/cuda_runtime.h",
+        "cuda/include/cuda_runtime_api.h",
+        "cuda/include/cuda_surface_types.h",
+        "cuda/include/cuda_texture_types.h",
+        "cuda/include/cuda_vdpau_interop.h",
+        "cuda/include/cudalibxt.h",
+        "cuda/include/cudnn.h",
+        "cuda/include/cufft.h",
         "cuda/include/cufftXt.h",
-        "cuda/include/sm_61_intrinsics.h",
-        "cuda/include/texture_fetch_functions.h",
+        "cuda/include/cufftw.h",
+        "cuda/include/curand.h",
+        "cuda/include/curand_discrete.h",
+        "cuda/include/curand_discrete2.h",
+        "cuda/include/curand_globals.h",
+        "cuda/include/curand_kernel.h",
+        "cuda/include/curand_lognormal.h",
         "cuda/include/curand_mrg32k3a.h",
-        "cuda/include/host_defines.h",
-        "cuda/include/common_functions.h",
-        "cuda/include/nppi_support_functions.h",
-        "cuda/include/nppi_linear_transforms.h",
-        "cuda/include/device_double_functions.hpp",
-        "cuda/include/math_constants.h",
-        "cuda/include/nvToolsExtSync.h",
-        "cuda/include/npps_initialization.h",
+        "cuda/include/curand_mtgp32.h",
+        "cuda/include/curand_mtgp32_host.h",
+        "cuda/include/curand_mtgp32_kernel.h",
+        "cuda/include/curand_mtgp32dc_p_11213.h",
+        "cuda/include/curand_normal.h",
+        "cuda/include/curand_normal_static.h",
+        "cuda/include/curand_philox4x32_x.h",
+        "cuda/include/curand_poisson.h",
+        "cuda/include/curand_precalc.h",
+        "cuda/include/curand_uniform.h",
+        "cuda/include/cusolverDn.h",
+        "cuda/include/cusolverRf.h",
+        "cuda/include/cusolverSp.h",
         "cuda/include/cusolverSp_LOWLEVEL_PREVIEW.h",
-        "cuda/include/texture_indirect_functions.hpp",
-        "cuda/include/cudaProfiler.h",
-        "cuda/include/npps_filtering_functions.h",
+        "cuda/include/cusolver_common.h",
+        "cuda/include/cusparse.h",
         "cuda/include/cusparse_v2.h",
-        "cuda/include/nppi.h",
-        "cuda/include/surface_indirect_functions.h",
-        "cuda/include/sm_30_intrinsics.h",
+        "cuda/include/device_atomic_functions.h",
+        "cuda/include/device_atomic_functions.hpp",
         "cuda/include/device_double_functions.h",
-        "cuda/include/sm_35_intrinsics.h",
-        "cuda/include/cusolverSp.h",
-        "cuda/include/library_types.h",
-        "cuda/include/surface_indirect_functions.hpp",
-        "cuda/include/cudalibxt.h",
-        "cuda/include/channel_descriptor.h",
+        "cuda/include/device_double_functions.hpp",
+        "cuda/include/device_functions.h",
+        "cuda/include/device_functions.hpp",
         "cuda/include/device_functions_decls.h",
-        "cuda/include/curand_kernel.h",
-        "cuda/include/curand_mtgp32_host.h",
-        "cuda/include/nvToolsExtCuda.h",
-        "cuda/include/nvToolsExt.h",
-        "cuda/include/cuComplex.h",
-        "cuda/include/sm_32_atomic_functions.h",
-        "cuda/include/texture_indirect_functions.h",
-        "cuda/include/sm_32_atomic_functions.hpp",
-        "cuda/include/sm_20_intrinsics.hpp",
         "cuda/include/device_launch_parameters.h",
-        "cuda/include/curand_mtgp32.h",
-        "cuda/include/texture_fetch_functions.hpp",
-        "cuda/include/cuda_occupancy.h",
-        "cuda/include/CL/opencl.h",
-        "cuda/include/CL/cl_platform.h",
-        "cuda/include/CL/cl_egl.h",
-        "cuda/include/CL/cl_gl.h",
-        "cuda/include/CL/cl.h",
-        "cuda/include/CL/cl_gl_ext.h",
-        "cuda/include/CL/cl_ext.h",
-        "cuda/include/CL/cl.hpp",
+        "cuda/include/device_types.h",
+        "cuda/include/driver_functions.h",
+        "cuda/include/driver_types.h",
+        "cuda/include/dynlink_cuda.h",
+        "cuda/include/dynlink_cuda_cuda.h",
+        "cuda/include/dynlink_cuviddec.h",
+        "cuda/include/dynlink_nvcuvid.h",
+        "cuda/include/fatBinaryCtl.h",
+        "cuda/include/fatbinary.h",
         "cuda/include/host_config.h",
-        "cuda/include/cuda_surface_types.h",
+        "cuda/include/host_defines.h",
+        "cuda/include/library_types.h",
+        "cuda/include/math_constants.h",
         "cuda/include/math_functions.h",
+        "cuda/include/math_functions.hpp",
+        "cuda/include/math_functions_dbl_ptx3.h",
+        "cuda/include/math_functions_dbl_ptx3.hpp",
+        "cuda/include/mma.h",
+        "cuda/include/npp.h",
+        "cuda/include/nppcore.h",
+        "cuda/include/nppdefs.h",
+        "cuda/include/nppi.h",
+        "cuda/include/nppi_arithmetic_and_logical_operations.h",
+        "cuda/include/nppi_color_conversion.h",
+        "cuda/include/nppi_compression_functions.h",
+        "cuda/include/nppi_computer_vision.h",
+        "cuda/include/nppi_data_exchange_and_initialization.h",
+        "cuda/include/nppi_filtering_functions.h",
+        "cuda/include/nppi_geometry_transforms.h",
+        "cuda/include/nppi_linear_transforms.h",
+        "cuda/include/nppi_morphological_operations.h",
+        "cuda/include/nppi_statistics_functions.h",
+        "cuda/include/nppi_support_functions.h",
+        "cuda/include/nppi_threshold_and_compare_operations.h",
+        "cuda/include/npps.h",
+        "cuda/include/npps_arithmetic_and_logical_operations.h",
+        "cuda/include/npps_conversion_functions.h",
+        "cuda/include/npps_filtering_functions.h",
+        "cuda/include/npps_initialization.h",
+        "cuda/include/npps_statistics_functions.h",
+        "cuda/include/npps_support_functions.h",
+        "cuda/include/nppversion.h",
+        "cuda/include/nvToolsExt.h",
+        "cuda/include/nvToolsExtCuda.h",
+        "cuda/include/nvToolsExtCudaRt.h",
         "cuda/include/nvToolsExtMeta.h",
+        "cuda/include/nvToolsExtSync.h",
+        "cuda/include/nvblas.h",
+        "cuda/include/nvfunctional",
+        "cuda/include/nvgraph.h",
+        "cuda/include/nvml.h",
+        "cuda/include/nvrtc.h",
+        "cuda/include/sm_20_atomic_functions.h",
         "cuda/include/sm_20_atomic_functions.hpp",
-        "cuda/include/device_functions.h",
-        "cuda/include/device_types.h",
-        "cuda/include/npps_conversion_functions.h",
-        "cuda/include/curand_precalc.h",
-        "cuda/include/cusolverRf.h",
+        "cuda/include/sm_20_intrinsics.h",
+        "cuda/include/sm_20_intrinsics.hpp",
+        "cuda/include/sm_30_intrinsics.h",
+        "cuda/include/sm_30_intrinsics.hpp",
+        "cuda/include/sm_32_atomic_functions.h",
+        "cuda/include/sm_32_atomic_functions.hpp",
+        "cuda/include/sm_32_intrinsics.h",
+        "cuda/include/sm_32_intrinsics.hpp",
+        "cuda/include/sm_35_atomic_functions.h",
+        "cuda/include/sm_35_intrinsics.h",
+        "cuda/include/sm_60_atomic_functions.h",
         "cuda/include/sm_60_atomic_functions.hpp",
-        "cuda/include/cuviddec.h",
-        "cuda/include/curand_discrete2.h",
-        "cuda/include/device_functions.hpp",
-        "cuda/include/thrust/transform_scan.h",
-        "cuda/include/thrust/system_error.h",
-        "cuda/include/thrust/device_malloc.h",
-        "cuda/include/thrust/partition.h",
-        "cuda/include/thrust/unique.h",
-        "cuda/include/thrust/device_delete.h",
-        "cuda/include/thrust/execution_policy.h",
+        "cuda/include/sm_61_intrinsics.h",
+        "cuda/include/sm_61_intrinsics.hpp",
+        "cuda/include/sobol_direction_vectors.h",
+        "cuda/include/surface_functions.h",
+        "cuda/include/surface_functions.hpp",
+        "cuda/include/surface_indirect_functions.h",
+        "cuda/include/surface_indirect_functions.hpp",
+        "cuda/include/surface_types.h",
+        "cuda/include/texture_fetch_functions.h",
+        "cuda/include/texture_fetch_functions.hpp",
+        "cuda/include/texture_indirect_functions.h",
+        "cuda/include/texture_indirect_functions.hpp",
+        "cuda/include/texture_types.h",
         "cuda/include/thrust/adjacent_difference.h",
-        "cuda/include/thrust/sequence.h",
-        "cuda/include/thrust/merge.h",
-        "cuda/include/thrust/device_new.h",
-        "cuda/include/thrust/transform_reduce.h",
-        "cuda/include/thrust/device_vector.h",
-        "cuda/include/thrust/gather.h",
-        "cuda/include/thrust/sort.h",
-        "cuda/include/thrust/scan.h",
-        "cuda/include/thrust/detail/temporary_array.h",
-        "cuda/include/thrust/detail/util/align.h",
-        "cuda/include/thrust/detail/util/blocking.h",
-        "cuda/include/thrust/detail/transform.inl",
-        "cuda/include/thrust/detail/device_vector.inl",
+        "cuda/include/thrust/advance.h",
+        "cuda/include/thrust/binary_search.h",
+        "cuda/include/thrust/complex.h",
+        "cuda/include/thrust/copy.h",
+        "cuda/include/thrust/count.h",
+        "cuda/include/thrust/detail/adjacent_difference.inl",
+        "cuda/include/thrust/detail/advance.inl",
+        "cuda/include/thrust/detail/allocator/allocator_traits.h",
+        "cuda/include/thrust/detail/allocator/allocator_traits.inl",
+        "cuda/include/thrust/detail/allocator/copy_construct_range.h",
+        "cuda/include/thrust/detail/allocator/copy_construct_range.inl",
+        "cuda/include/thrust/detail/allocator/default_construct_range.h",
+        "cuda/include/thrust/detail/allocator/default_construct_range.inl",
+        "cuda/include/thrust/detail/allocator/destroy_range.h",
+        "cuda/include/thrust/detail/allocator/destroy_range.inl",
+        "cuda/include/thrust/detail/allocator/fill_construct_range.h",
+        "cuda/include/thrust/detail/allocator/fill_construct_range.inl",
+        "cuda/include/thrust/detail/allocator/malloc_allocator.h",
+        "cuda/include/thrust/detail/allocator/malloc_allocator.inl",
+        "cuda/include/thrust/detail/allocator/no_throw_allocator.h",
+        "cuda/include/thrust/detail/allocator/tagged_allocator.h",
+        "cuda/include/thrust/detail/allocator/tagged_allocator.inl",
+        "cuda/include/thrust/detail/allocator/temporary_allocator.h",
+        "cuda/include/thrust/detail/allocator/temporary_allocator.inl",
         "cuda/include/thrust/detail/binary_search.inl",
-        "cuda/include/thrust/detail/overlapped_copy.h",
-        "cuda/include/thrust/detail/vector_base.inl",
-        "cuda/include/thrust/detail/device_reference.inl",
-        "cuda/include/thrust/detail/functional/actor.h",
-        "cuda/include/thrust/detail/functional/value.h",
-        "cuda/include/thrust/detail/functional/operators.h",
-        "cuda/include/thrust/detail/functional/operators/logical_operators.h",
-        "cuda/include/thrust/detail/functional/operators/relational_operators.h",
-        "cuda/include/thrust/detail/functional/operators/assignment_operator.h",
-        "cuda/include/thrust/detail/functional/operators/bitwise_operators.h",
-        "cuda/include/thrust/detail/functional/operators/operator_adaptors.h",
-        "cuda/include/thrust/detail/functional/operators/arithmetic_operators.h",
-        "cuda/include/thrust/detail/functional/operators/compound_assignment_operators.h",
-        "cuda/include/thrust/detail/functional/argument.h",
-        "cuda/include/thrust/detail/functional/placeholder.h",
-        "cuda/include/thrust/detail/functional/actor.inl",
-        "cuda/include/thrust/detail/functional/composite.h",
-        "cuda/include/thrust/detail/static_map.h",
-        "cuda/include/thrust/detail/type_traits/has_nested_type.h",
-        "cuda/include/thrust/detail/type_traits/is_call_possible.h",
-        "cuda/include/thrust/detail/type_traits/function_traits.h",
-        "cuda/include/thrust/detail/type_traits/pointer_traits.h",
-        "cuda/include/thrust/detail/type_traits/has_member_function.h",
-        "cuda/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h",
-        "cuda/include/thrust/detail/type_traits/minimum_type.h",
-        "cuda/include/thrust/detail/type_traits/has_trivial_assign.h",
-        "cuda/include/thrust/detail/type_traits/is_metafunction_defined.h",
-        "cuda/include/thrust/detail/type_traits/iterator/is_discard_iterator.h",
-        "cuda/include/thrust/detail/type_traits/iterator/is_output_iterator.h",
-        "cuda/include/thrust/detail/type_traits/result_of_adaptable_function.h",
-        "cuda/include/thrust/detail/reference.h",
-        "cuda/include/thrust/detail/inner_product.inl",
-        "cuda/include/thrust/detail/use_default.h",
-        "cuda/include/thrust/detail/sequence.inl",
-        "cuda/include/thrust/detail/sort.inl",
-        "cuda/include/thrust/detail/equal.inl",
-        "cuda/include/thrust/detail/execution_policy.h",
-        "cuda/include/thrust/detail/integer_traits.h",
-        "cuda/include/thrust/detail/type_traits.h",
-        "cuda/include/thrust/detail/reverse.inl",
-        "cuda/include/thrust/detail/tabulate.inl",
-        "cuda/include/thrust/detail/unique.inl",
-        "cuda/include/thrust/detail/scatter.inl",
-        "cuda/include/thrust/detail/set_operations.inl",
-        "cuda/include/thrust/detail/device_malloc.inl",
-        "cuda/include/thrust/detail/copy_if.inl",
-        "cuda/include/thrust/detail/fill.inl",
-        "cuda/include/thrust/detail/temporary_array.inl",
-        "cuda/include/thrust/detail/transform_scan.inl",
-        "cuda/include/thrust/detail/minmax.h",
-        "cuda/include/thrust/detail/swap.inl",
-        "cuda/include/thrust/detail/pointer.inl",
-        "cuda/include/thrust/detail/transform_reduce.inl",
-        "cuda/include/thrust/detail/config.h",
-        "cuda/include/thrust/detail/distance.inl",
-        "cuda/include/thrust/detail/pair.inl",
-        "cuda/include/thrust/detail/allocator/temporary_allocator.h",
-        "cuda/include/thrust/detail/allocator/tagged_allocator.h",
-        "cuda/include/thrust/detail/allocator/destroy_range.inl",
-        "cuda/include/thrust/detail/allocator/destroy_range.h",
-        "cuda/include/thrust/detail/allocator/no_throw_allocator.h",
-        "cuda/include/thrust/detail/allocator/default_construct_range.inl",
-        "cuda/include/thrust/detail/allocator/fill_construct_range.inl",
-        "cuda/include/thrust/detail/allocator/tagged_allocator.inl",
-        "cuda/include/thrust/detail/allocator/malloc_allocator.h",
-        "cuda/include/thrust/detail/allocator/allocator_traits.h",
-        "cuda/include/thrust/detail/allocator/copy_construct_range.h",
-        "cuda/include/thrust/detail/allocator/allocator_traits.inl",
-        "cuda/include/thrust/detail/allocator/default_construct_range.h",
-        "cuda/include/thrust/detail/allocator/copy_construct_range.inl",
-        "cuda/include/thrust/detail/allocator/malloc_allocator.inl",
-        "cuda/include/thrust/detail/allocator/temporary_allocator.inl",
-        "cuda/include/thrust/detail/allocator/fill_construct_range.h",
-        "cuda/include/thrust/detail/temporary_buffer.h",
-        "cuda/include/thrust/detail/reduce.inl",
-        "cuda/include/thrust/detail/device_new.inl",
-        "cuda/include/thrust/detail/pointer.h",
-        "cuda/include/thrust/detail/for_each.inl",
-        "cuda/include/thrust/detail/generate.inl",
-        "cuda/include/thrust/detail/dispatch/is_trivial_copy.h",
-        "cuda/include/thrust/detail/adjacent_difference.inl",
-        "cuda/include/thrust/detail/tuple_meta_transform.h",
-        "cuda/include/thrust/detail/functional.inl",
-        "cuda/include/thrust/detail/remove.inl",
-        "cuda/include/thrust/detail/tuple_transform.h",
-        "cuda/include/thrust/detail/merge.inl",
-        "cuda/include/thrust/detail/extrema.inl",
-        "cuda/include/thrust/detail/trivial_sequence.h",
-        "cuda/include/thrust/detail/vector_base.h",
-        "cuda/include/thrust/detail/count.inl",
-        "cuda/include/thrust/detail/uninitialized_copy.inl",
-        "cuda/include/thrust/detail/function.h",
-        "cuda/include/thrust/detail/swap_ranges.inl",
-        "cuda/include/thrust/detail/device_delete.inl",
-        "cuda/include/thrust/detail/static_assert.h",
-        "cuda/include/thrust/detail/logical.inl",
-        "cuda/include/thrust/detail/seq.h",
-        "cuda/include/thrust/detail/mpl/math.h",
-        "cuda/include/thrust/detail/mismatch.inl",
-        "cuda/include/thrust/detail/internal_functional.h",
-        "cuda/include/thrust/detail/get_iterator_value.h",
-        "cuda/include/thrust/detail/copy.inl",
-        "cuda/include/thrust/detail/copy.h",
+        "cuda/include/thrust/detail/complex/arithmetic.h",
+        "cuda/include/thrust/detail/complex/c99math.h",
+        "cuda/include/thrust/detail/complex/catrig.h",
         "cuda/include/thrust/detail/complex/catrigf.h",
-        "cuda/include/thrust/detail/complex/cpowf.h",
-        "cuda/include/thrust/detail/complex/csqrtf.h",
+        "cuda/include/thrust/detail/complex/ccosh.h",
         "cuda/include/thrust/detail/complex/ccoshf.h",
-        "cuda/include/thrust/detail/complex/csinhf.h",
+        "cuda/include/thrust/detail/complex/cexp.h",
+        "cuda/include/thrust/detail/complex/cexpf.h",
+        "cuda/include/thrust/detail/complex/clog.h",
         "cuda/include/thrust/detail/complex/clogf.h",
-        "cuda/include/thrust/detail/complex/ccosh.h",
-        "cuda/include/thrust/detail/complex/arithmetic.h",
-        "cuda/include/thrust/detail/complex/csqrt.h",
-        "cuda/include/thrust/detail/complex/cpow.h",
         "cuda/include/thrust/detail/complex/complex.inl",
-        "cuda/include/thrust/detail/complex/math_private.h",
-        "cuda/include/thrust/detail/complex/c99math.h",
+        "cuda/include/thrust/detail/complex/cpow.h",
+        "cuda/include/thrust/detail/complex/cpowf.h",
         "cuda/include/thrust/detail/complex/cproj.h",
-        "cuda/include/thrust/detail/complex/catrig.h",
-        "cuda/include/thrust/detail/complex/ctanhf.h",
-        "cuda/include/thrust/detail/complex/cexpf.h",
         "cuda/include/thrust/detail/complex/csinh.h",
-        "cuda/include/thrust/detail/complex/stream.h",
+        "cuda/include/thrust/detail/complex/csinhf.h",
+        "cuda/include/thrust/detail/complex/csqrt.h",
+        "cuda/include/thrust/detail/complex/csqrtf.h",
         "cuda/include/thrust/detail/complex/ctanh.h",
-        "cuda/include/thrust/detail/complex/cexp.h",
-        "cuda/include/thrust/detail/complex/clog.h",
-        "cuda/include/thrust/detail/range/head_flags.h",
-        "cuda/include/thrust/detail/range/tail_flags.h",
-        "cuda/include/thrust/detail/execute_with_allocator.h",
-        "cuda/include/thrust/detail/integer_math.h",
-        "cuda/include/thrust/detail/swap.h",
-        "cuda/include/thrust/detail/uninitialized_fill.inl",
-        "cuda/include/thrust/detail/scan.inl",
-        "cuda/include/thrust/detail/gather.inl",
-        "cuda/include/thrust/detail/reference_forward_declaration.h",
-        "cuda/include/thrust/detail/numeric_traits.h",
-        "cuda/include/thrust/detail/reference.inl",
-        "cuda/include/thrust/detail/cstdint.h",
-        "cuda/include/thrust/detail/device_free.inl",
-        "cuda/include/thrust/detail/copy_if.h",
-        "cuda/include/thrust/detail/partition.inl",
-        "cuda/include/thrust/detail/find.inl",
-        "cuda/include/thrust/detail/config/forceinline.h",
-        "cuda/include/thrust/detail/config/debug.h",
-        "cuda/include/thrust/detail/config/config.h",
-        "cuda/include/thrust/detail/config/host_device.h",
-        "cuda/include/thrust/detail/config/host_system.h",
+        "cuda/include/thrust/detail/complex/ctanhf.h",
+        "cuda/include/thrust/detail/complex/math_private.h",
+        "cuda/include/thrust/detail/complex/stream.h",
+        "cuda/include/thrust/detail/config.h",
         "cuda/include/thrust/detail/config/compiler.h",
-        "cuda/include/thrust/detail/config/device_system.h",
         "cuda/include/thrust/detail/config/compiler_fence.h",
+        "cuda/include/thrust/detail/config/config.h",
+        "cuda/include/thrust/detail/config/debug.h",
+        "cuda/include/thrust/detail/config/device_system.h",
         "cuda/include/thrust/detail/config/exec_check_disable.h",
-        "cuda/include/thrust/detail/config/simple_defines.h",
+        "cuda/include/thrust/detail/config/forceinline.h",
         "cuda/include/thrust/detail/config/global_workarounds.h",
-        "cuda/include/thrust/detail/replace.inl",
+        "cuda/include/thrust/detail/config/host_device.h",
+        "cuda/include/thrust/detail/config/host_system.h",
+        "cuda/include/thrust/detail/config/simple_defines.h",
+        "cuda/include/thrust/detail/contiguous_storage.h",
+        "cuda/include/thrust/detail/contiguous_storage.inl",
+        "cuda/include/thrust/detail/copy.h",
+        "cuda/include/thrust/detail/copy.inl",
+        "cuda/include/thrust/detail/copy_if.h",
+        "cuda/include/thrust/detail/copy_if.inl",
+        "cuda/include/thrust/detail/count.inl",
+        "cuda/include/thrust/detail/cstdint.h",
+        "cuda/include/thrust/detail/device_delete.inl",
+        "cuda/include/thrust/detail/device_free.inl",
+        "cuda/include/thrust/detail/device_malloc.inl",
+        "cuda/include/thrust/detail/device_new.inl",
         "cuda/include/thrust/detail/device_ptr.inl",
-        "cuda/include/thrust/detail/tuple.inl",
-        "cuda/include/thrust/detail/malloc_and_free.h",
+        "cuda/include/thrust/detail/device_reference.inl",
+        "cuda/include/thrust/detail/device_vector.inl",
+        "cuda/include/thrust/detail/dispatch/is_trivial_copy.h",
+        "cuda/include/thrust/detail/distance.inl",
+        "cuda/include/thrust/detail/equal.inl",
+        "cuda/include/thrust/detail/execute_with_allocator.h",
+        "cuda/include/thrust/detail/execution_policy.h",
+        "cuda/include/thrust/detail/extrema.inl",
+        "cuda/include/thrust/detail/fill.inl",
+        "cuda/include/thrust/detail/find.inl",
+        "cuda/include/thrust/detail/for_each.inl",
+        "cuda/include/thrust/detail/function.h",
+        "cuda/include/thrust/detail/functional.inl",
+        "cuda/include/thrust/detail/functional/actor.h",
+        "cuda/include/thrust/detail/functional/actor.inl",
+        "cuda/include/thrust/detail/functional/argument.h",
+        "cuda/include/thrust/detail/functional/composite.h",
+        "cuda/include/thrust/detail/functional/operators.h",
+        "cuda/include/thrust/detail/functional/operators/arithmetic_operators.h",
+        "cuda/include/thrust/detail/functional/operators/assignment_operator.h",
+        "cuda/include/thrust/detail/functional/operators/bitwise_operators.h",
+        "cuda/include/thrust/detail/functional/operators/compound_assignment_operators.h",
+        "cuda/include/thrust/detail/functional/operators/logical_operators.h",
+        "cuda/include/thrust/detail/functional/operators/operator_adaptors.h",
+        "cuda/include/thrust/detail/functional/operators/relational_operators.h",
+        "cuda/include/thrust/detail/functional/placeholder.h",
+        "cuda/include/thrust/detail/functional/value.h",
+        "cuda/include/thrust/detail/gather.inl",
+        "cuda/include/thrust/detail/generate.inl",
+        "cuda/include/thrust/detail/get_iterator_value.h",
         "cuda/include/thrust/detail/host_vector.inl",
+        "cuda/include/thrust/detail/inner_product.inl",
+        "cuda/include/thrust/detail/integer_math.h",
+        "cuda/include/thrust/detail/integer_traits.h",
+        "cuda/include/thrust/detail/internal_functional.h",
+        "cuda/include/thrust/detail/logical.inl",
+        "cuda/include/thrust/detail/malloc_and_free.h",
+        "cuda/include/thrust/detail/merge.inl",
+        "cuda/include/thrust/detail/minmax.h",
+        "cuda/include/thrust/detail/mismatch.inl",
+        "cuda/include/thrust/detail/mpl/math.h",
+        "cuda/include/thrust/detail/numeric_traits.h",
+        "cuda/include/thrust/detail/overlapped_copy.h",
+        "cuda/include/thrust/detail/pair.inl",
+        "cuda/include/thrust/detail/partition.inl",
+        "cuda/include/thrust/detail/pointer.h",
+        "cuda/include/thrust/detail/pointer.inl",
+        "cuda/include/thrust/detail/range/head_flags.h",
+        "cuda/include/thrust/detail/range/tail_flags.h",
         "cuda/include/thrust/detail/raw_pointer_cast.h",
-        "cuda/include/thrust/detail/advance.inl",
-        "cuda/include/thrust/detail/contiguous_storage.h",
         "cuda/include/thrust/detail/raw_reference_cast.h",
-        "cuda/include/thrust/detail/contiguous_storage.inl",
-        "cuda/include/thrust/reverse.h",
-        "cuda/include/thrust/device_malloc_allocator.h",
-        "cuda/include/thrust/scatter.h",
-        "cuda/include/thrust/pair.h",
-        "cuda/include/thrust/advance.h",
-        "cuda/include/thrust/find.h",
-        "cuda/include/thrust/device_ptr.h",
-        "cuda/include/thrust/generate.h",
-        "cuda/include/thrust/uninitialized_fill.h",
-        "cuda/include/thrust/system/system_error.h",
-        "cuda/include/thrust/system/detail/bad_alloc.h",
-        "cuda/include/thrust/system/detail/adl/transform_scan.h",
-        "cuda/include/thrust/system/detail/adl/unique_by_key.h",
-        "cuda/include/thrust/system/detail/adl/partition.h",
-        "cuda/include/thrust/system/detail/adl/unique.h",
-        "cuda/include/thrust/system/detail/adl/adjacent_difference.h",
-        "cuda/include/thrust/system/detail/adl/sequence.h",
-        "cuda/include/thrust/system/detail/adl/merge.h",
-        "cuda/include/thrust/system/detail/adl/transform_reduce.h",
-        "cuda/include/thrust/system/detail/adl/gather.h",
-        "cuda/include/thrust/system/detail/adl/sort.h",
-        "cuda/include/thrust/system/detail/adl/scan.h",
-        "cuda/include/thrust/system/detail/adl/temporary_buffer.h",
-        "cuda/include/thrust/system/detail/adl/scan_by_key.h",
-        "cuda/include/thrust/system/detail/adl/reverse.h",
-        "cuda/include/thrust/system/detail/adl/assign_value.h",
-        "cuda/include/thrust/system/detail/adl/scatter.h",
-        "cuda/include/thrust/system/detail/adl/find.h",
-        "cuda/include/thrust/system/detail/adl/generate.h",
-        "cuda/include/thrust/system/detail/adl/uninitialized_fill.h",
-        "cuda/include/thrust/system/detail/adl/remove.h",
-        "cuda/include/thrust/system/detail/adl/tabulate.h",
-        "cuda/include/thrust/system/detail/adl/for_each.h",
-        "cuda/include/thrust/system/detail/adl/reduce_by_key.h",
-        "cuda/include/thrust/system/detail/adl/reduce.h",
-        "cuda/include/thrust/system/detail/adl/equal.h",
-        "cuda/include/thrust/system/detail/adl/copy.h",
-        "cuda/include/thrust/system/detail/adl/swap_ranges.h",
-        "cuda/include/thrust/system/detail/adl/uninitialized_copy.h",
-        "cuda/include/thrust/system/detail/adl/binary_search.h",
-        "cuda/include/thrust/system/detail/adl/set_operations.h",
-        "cuda/include/thrust/system/detail/adl/mismatch.h",
-        "cuda/include/thrust/system/detail/adl/extrema.h",
-        "cuda/include/thrust/system/detail/adl/count.h",
-        "cuda/include/thrust/system/detail/adl/replace.h",
+        "cuda/include/thrust/detail/reduce.inl",
+        "cuda/include/thrust/detail/reference.h",
+        "cuda/include/thrust/detail/reference.inl",
+        "cuda/include/thrust/detail/reference_forward_declaration.h",
+        "cuda/include/thrust/detail/remove.inl",
+        "cuda/include/thrust/detail/replace.inl",
+        "cuda/include/thrust/detail/reverse.inl",
+        "cuda/include/thrust/detail/scan.inl",
+        "cuda/include/thrust/detail/scatter.inl",
+        "cuda/include/thrust/detail/seq.h",
+        "cuda/include/thrust/detail/sequence.inl",
+        "cuda/include/thrust/detail/set_operations.inl",
+        "cuda/include/thrust/detail/sort.inl",
+        "cuda/include/thrust/detail/static_assert.h",
+        "cuda/include/thrust/detail/static_map.h",
+        "cuda/include/thrust/detail/swap.h",
+        "cuda/include/thrust/detail/swap.inl",
+        "cuda/include/thrust/detail/swap_ranges.inl",
+        "cuda/include/thrust/detail/tabulate.inl",
+        "cuda/include/thrust/detail/temporary_array.h",
+        "cuda/include/thrust/detail/temporary_array.inl",
+        "cuda/include/thrust/detail/temporary_buffer.h",
+        "cuda/include/thrust/detail/transform.inl",
+        "cuda/include/thrust/detail/transform_reduce.inl",
+        "cuda/include/thrust/detail/transform_scan.inl",
+        "cuda/include/thrust/detail/trivial_sequence.h",
+        "cuda/include/thrust/detail/tuple.inl",
+        "cuda/include/thrust/detail/tuple_meta_transform.h",
+        "cuda/include/thrust/detail/tuple_transform.h",
+        "cuda/include/thrust/detail/type_traits.h",
+        "cuda/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h",
+        "cuda/include/thrust/detail/type_traits/function_traits.h",
+        "cuda/include/thrust/detail/type_traits/has_member_function.h",
+        "cuda/include/thrust/detail/type_traits/has_nested_type.h",
+        "cuda/include/thrust/detail/type_traits/has_trivial_assign.h",
+        "cuda/include/thrust/detail/type_traits/is_call_possible.h",
+        "cuda/include/thrust/detail/type_traits/is_metafunction_defined.h",
+        "cuda/include/thrust/detail/type_traits/iterator/is_discard_iterator.h",
+        "cuda/include/thrust/detail/type_traits/iterator/is_output_iterator.h",
+        "cuda/include/thrust/detail/type_traits/minimum_type.h",
+        "cuda/include/thrust/detail/type_traits/pointer_traits.h",
+        "cuda/include/thrust/detail/type_traits/result_of_adaptable_function.h",
+        "cuda/include/thrust/detail/uninitialized_copy.inl",
+        "cuda/include/thrust/detail/uninitialized_fill.inl",
+        "cuda/include/thrust/detail/unique.inl",
+        "cuda/include/thrust/detail/use_default.h",
+        "cuda/include/thrust/detail/util/align.h",
+        "cuda/include/thrust/detail/util/blocking.h",
+        "cuda/include/thrust/detail/vector_base.h",
+        "cuda/include/thrust/detail/vector_base.inl",
+        "cuda/include/thrust/device_allocator.h",
+        "cuda/include/thrust/device_delete.h",
+        "cuda/include/thrust/device_free.h",
+        "cuda/include/thrust/device_malloc.h",
+        "cuda/include/thrust/device_malloc_allocator.h",
+        "cuda/include/thrust/device_new.h",
+        "cuda/include/thrust/device_new_allocator.h",
+        "cuda/include/thrust/device_ptr.h",
+        "cuda/include/thrust/device_reference.h",
+        "cuda/include/thrust/device_vector.h",
+        "cuda/include/thrust/distance.h",
+        "cuda/include/thrust/equal.h",
+        "cuda/include/thrust/execution_policy.h",
+        "cuda/include/thrust/extrema.h",
+        "cuda/include/thrust/fill.h",
+        "cuda/include/thrust/find.h",
+        "cuda/include/thrust/for_each.h",
+        "cuda/include/thrust/functional.h",
+        "cuda/include/thrust/gather.h",
+        "cuda/include/thrust/generate.h",
+        "cuda/include/thrust/host_vector.h",
+        "cuda/include/thrust/inner_product.h",
+        "cuda/include/thrust/iterator/constant_iterator.h",
+        "cuda/include/thrust/iterator/counting_iterator.h",
+        "cuda/include/thrust/iterator/detail/any_assign.h",
+        "cuda/include/thrust/iterator/detail/any_system_tag.h",
+        "cuda/include/thrust/iterator/detail/constant_iterator_base.h",
+        "cuda/include/thrust/iterator/detail/counting_iterator.inl",
+        "cuda/include/thrust/iterator/detail/device_system_tag.h",
+        "cuda/include/thrust/iterator/detail/discard_iterator_base.h",
+        "cuda/include/thrust/iterator/detail/distance_from_result.h",
+        "cuda/include/thrust/iterator/detail/host_system_tag.h",
+        "cuda/include/thrust/iterator/detail/is_iterator_category.h",
+        "cuda/include/thrust/iterator/detail/is_trivial_iterator.h",
+        "cuda/include/thrust/iterator/detail/iterator_adaptor_base.h",
+        "cuda/include/thrust/iterator/detail/iterator_category_to_system.h",
+        "cuda/include/thrust/iterator/detail/iterator_category_to_traversal.h",
+        "cuda/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h",
+        "cuda/include/thrust/iterator/detail/iterator_facade_category.h",
+        "cuda/include/thrust/iterator/detail/iterator_traits.inl",
+        "cuda/include/thrust/iterator/detail/iterator_traversal_tags.h",
+        "cuda/include/thrust/iterator/detail/join_iterator.h",
+        "cuda/include/thrust/iterator/detail/minimum_category.h",
+        "cuda/include/thrust/iterator/detail/minimum_system.h",
+        "cuda/include/thrust/iterator/detail/normal_iterator.h",
+        "cuda/include/thrust/iterator/detail/permutation_iterator_base.h",
+        "cuda/include/thrust/iterator/detail/retag.h",
+        "cuda/include/thrust/iterator/detail/reverse_iterator.inl",
+        "cuda/include/thrust/iterator/detail/reverse_iterator_base.h",
+        "cuda/include/thrust/iterator/detail/tagged_iterator.h",
+        "cuda/include/thrust/iterator/detail/transform_iterator.inl",
+        "cuda/include/thrust/iterator/detail/transform_output_iterator.inl",
+        "cuda/include/thrust/iterator/detail/tuple_of_iterator_references.h",
+        "cuda/include/thrust/iterator/detail/universal_categories.h",
+        "cuda/include/thrust/iterator/detail/zip_iterator.inl",
+        "cuda/include/thrust/iterator/detail/zip_iterator_base.h",
+        "cuda/include/thrust/iterator/discard_iterator.h",
+        "cuda/include/thrust/iterator/iterator_adaptor.h",
+        "cuda/include/thrust/iterator/iterator_categories.h",
+        "cuda/include/thrust/iterator/iterator_facade.h",
+        "cuda/include/thrust/iterator/iterator_traits.h",
+        "cuda/include/thrust/iterator/permutation_iterator.h",
+        "cuda/include/thrust/iterator/retag.h",
+        "cuda/include/thrust/iterator/reverse_iterator.h",
+        "cuda/include/thrust/iterator/transform_iterator.h",
+        "cuda/include/thrust/iterator/transform_output_iterator.h",
+        "cuda/include/thrust/iterator/zip_iterator.h",
+        "cuda/include/thrust/logical.h",
+        "cuda/include/thrust/memory.h",
+        "cuda/include/thrust/merge.h",
+        "cuda/include/thrust/mismatch.h",
+        "cuda/include/thrust/pair.h",
+        "cuda/include/thrust/partition.h",
+        "cuda/include/thrust/random.h",
+        "cuda/include/thrust/random/detail/discard_block_engine.inl",
+        "cuda/include/thrust/random/detail/linear_congruential_engine.inl",
+        "cuda/include/thrust/random/detail/linear_congruential_engine_discard.h",
+        "cuda/include/thrust/random/detail/linear_feedback_shift_engine.inl",
+        "cuda/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h",
+        "cuda/include/thrust/random/detail/mod.h",
+        "cuda/include/thrust/random/detail/normal_distribution.inl",
+        "cuda/include/thrust/random/detail/normal_distribution_base.h",
+        "cuda/include/thrust/random/detail/random_core_access.h",
+        "cuda/include/thrust/random/detail/subtract_with_carry_engine.inl",
+        "cuda/include/thrust/random/detail/uniform_int_distribution.inl",
+        "cuda/include/thrust/random/detail/uniform_real_distribution.inl",
+        "cuda/include/thrust/random/detail/xor_combine_engine.inl",
+        "cuda/include/thrust/random/detail/xor_combine_engine_max.h",
+        "cuda/include/thrust/random/discard_block_engine.h",
+        "cuda/include/thrust/random/linear_congruential_engine.h",
+        "cuda/include/thrust/random/linear_feedback_shift_engine.h",
+        "cuda/include/thrust/random/normal_distribution.h",
+        "cuda/include/thrust/random/subtract_with_carry_engine.h",
+        "cuda/include/thrust/random/uniform_int_distribution.h",
+        "cuda/include/thrust/random/uniform_real_distribution.h",
+        "cuda/include/thrust/random/xor_combine_engine.h",
+        "cuda/include/thrust/reduce.h",
+        "cuda/include/thrust/remove.h",
+        "cuda/include/thrust/replace.h",
+        "cuda/include/thrust/reverse.h",
+        "cuda/include/thrust/scan.h",
+        "cuda/include/thrust/scatter.h",
+        "cuda/include/thrust/sequence.h",
+        "cuda/include/thrust/set_operations.h",
+        "cuda/include/thrust/sort.h",
+        "cuda/include/thrust/swap.h",
+        "cuda/include/thrust/system/cpp/detail/adjacent_difference.h",
+        "cuda/include/thrust/system/cpp/detail/assign_value.h",
+        "cuda/include/thrust/system/cpp/detail/binary_search.h",
+        "cuda/include/thrust/system/cpp/detail/copy.h",
+        "cuda/include/thrust/system/cpp/detail/copy_if.h",
+        "cuda/include/thrust/system/cpp/detail/count.h",
+        "cuda/include/thrust/system/cpp/detail/equal.h",
+        "cuda/include/thrust/system/cpp/detail/execution_policy.h",
+        "cuda/include/thrust/system/cpp/detail/extrema.h",
+        "cuda/include/thrust/system/cpp/detail/fill.h",
+        "cuda/include/thrust/system/cpp/detail/find.h",
+        "cuda/include/thrust/system/cpp/detail/for_each.h",
+        "cuda/include/thrust/system/cpp/detail/gather.h",
+        "cuda/include/thrust/system/cpp/detail/generate.h",
+        "cuda/include/thrust/system/cpp/detail/get_value.h",
+        "cuda/include/thrust/system/cpp/detail/inner_product.h",
+        "cuda/include/thrust/system/cpp/detail/iter_swap.h",
+        "cuda/include/thrust/system/cpp/detail/logical.h",
+        "cuda/include/thrust/system/cpp/detail/malloc_and_free.h",
+        "cuda/include/thrust/system/cpp/detail/memory.inl",
+        "cuda/include/thrust/system/cpp/detail/merge.h",
+        "cuda/include/thrust/system/cpp/detail/mismatch.h",
+        "cuda/include/thrust/system/cpp/detail/par.h",
+        "cuda/include/thrust/system/cpp/detail/partition.h",
+        "cuda/include/thrust/system/cpp/detail/reduce.h",
+        "cuda/include/thrust/system/cpp/detail/reduce_by_key.h",
+        "cuda/include/thrust/system/cpp/detail/remove.h",
+        "cuda/include/thrust/system/cpp/detail/replace.h",
+        "cuda/include/thrust/system/cpp/detail/reverse.h",
+        "cuda/include/thrust/system/cpp/detail/scan.h",
+        "cuda/include/thrust/system/cpp/detail/scan_by_key.h",
+        "cuda/include/thrust/system/cpp/detail/scatter.h",
+        "cuda/include/thrust/system/cpp/detail/sequence.h",
+        "cuda/include/thrust/system/cpp/detail/set_operations.h",
+        "cuda/include/thrust/system/cpp/detail/sort.h",
+        "cuda/include/thrust/system/cpp/detail/swap_ranges.h",
+        "cuda/include/thrust/system/cpp/detail/tabulate.h",
+        "cuda/include/thrust/system/cpp/detail/temporary_buffer.h",
+        "cuda/include/thrust/system/cpp/detail/transform.h",
+        "cuda/include/thrust/system/cpp/detail/transform_reduce.h",
+        "cuda/include/thrust/system/cpp/detail/transform_scan.h",
+        "cuda/include/thrust/system/cpp/detail/uninitialized_copy.h",
+        "cuda/include/thrust/system/cpp/detail/uninitialized_fill.h",
+        "cuda/include/thrust/system/cpp/detail/unique.h",
+        "cuda/include/thrust/system/cpp/detail/unique_by_key.h",
+        "cuda/include/thrust/system/cpp/detail/vector.inl",
+        "cuda/include/thrust/system/cpp/execution_policy.h",
+        "cuda/include/thrust/system/cpp/memory.h",
+        "cuda/include/thrust/system/cpp/vector.h",
+        "cuda/include/thrust/system/cuda/config.h",
+        "cuda/include/thrust/system/cuda/detail/adjacent_difference.h",
+        "cuda/include/thrust/system/cuda/detail/assign_value.h",
+        "cuda/include/thrust/system/cuda/detail/binary_search.h",
+        "cuda/include/thrust/system/cuda/detail/copy.h",
+        "cuda/include/thrust/system/cuda/detail/copy_if.h",
+        "cuda/include/thrust/system/cuda/detail/core/agent_launcher.h",
+        "cuda/include/thrust/system/cuda/detail/core/alignment.h",
+        "cuda/include/thrust/system/cuda/detail/core/triple_chevron_launch.h",
+        "cuda/include/thrust/system/cuda/detail/core/util.h",
+        "cuda/include/thrust/system/cuda/detail/count.h",
+        "cuda/include/thrust/system/cuda/detail/cross_system.h",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_histogram.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_downsweep.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_upsweep.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_reduce_by_key.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_rle.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_segment_fixup.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_select_if.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_csrt.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_orig.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_row_based.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/agent/single_pass_scan_operators.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_adjacent_difference.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_load.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_shuffle.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/block_store.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans2.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans3.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/cub.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_partition.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_segmented_radix_sort.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_segmented_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_select.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/device_spmv.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_histogram.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_radix_sort.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce_by_key.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_rle.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_select_if.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_csrt.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_orig.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_row_based.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/host/mutex.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/discard_output_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_search.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_allocator.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_arch.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_debug.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_device.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_macro.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_namespace.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_ptx.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/util_type.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh",
+        "cuda/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh",
+        "cuda/include/thrust/system/cuda/detail/equal.h",
+        "cuda/include/thrust/system/cuda/detail/error.inl",
+        "cuda/include/thrust/system/cuda/detail/execution_policy.h",
+        "cuda/include/thrust/system/cuda/detail/extrema.h",
+        "cuda/include/thrust/system/cuda/detail/fill.h",
+        "cuda/include/thrust/system/cuda/detail/find.h",
+        "cuda/include/thrust/system/cuda/detail/for_each.h",
+        "cuda/include/thrust/system/cuda/detail/gather.h",
+        "cuda/include/thrust/system/cuda/detail/generate.h",
+        "cuda/include/thrust/system/cuda/detail/get_value.h",
+        "cuda/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h",
+        "cuda/include/thrust/system/cuda/detail/guarded_driver_types.h",
+        "cuda/include/thrust/system/cuda/detail/inner_product.h",
+        "cuda/include/thrust/system/cuda/detail/internal/copy_cross_system.h",
+        "cuda/include/thrust/system/cuda/detail/internal/copy_device_to_device.h",
+        "cuda/include/thrust/system/cuda/detail/iter_swap.h",
+        "cuda/include/thrust/system/cuda/detail/logical.h",
+        "cuda/include/thrust/system/cuda/detail/malloc_and_free.h",
+        "cuda/include/thrust/system/cuda/detail/memory.inl",
+        "cuda/include/thrust/system/cuda/detail/memory_buffer.h",
+        "cuda/include/thrust/system/cuda/detail/merge.h",
+        "cuda/include/thrust/system/cuda/detail/mismatch.h",
+        "cuda/include/thrust/system/cuda/detail/par.h",
+        "cuda/include/thrust/system/cuda/detail/par_to_seq.h",
+        "cuda/include/thrust/system/cuda/detail/parallel_for.h",
+        "cuda/include/thrust/system/cuda/detail/partition.h",
+        "cuda/include/thrust/system/cuda/detail/reduce.h",
+        "cuda/include/thrust/system/cuda/detail/reduce_by_key.h",
+        "cuda/include/thrust/system/cuda/detail/remove.h",
+        "cuda/include/thrust/system/cuda/detail/replace.h",
+        "cuda/include/thrust/system/cuda/detail/reverse.h",
+        "cuda/include/thrust/system/cuda/detail/scan.h",
+        "cuda/include/thrust/system/cuda/detail/scan_by_key.h",
+        "cuda/include/thrust/system/cuda/detail/scatter.h",
+        "cuda/include/thrust/system/cuda/detail/sequence.h",
+        "cuda/include/thrust/system/cuda/detail/set_operations.h",
+        "cuda/include/thrust/system/cuda/detail/sort.h",
+        "cuda/include/thrust/system/cuda/detail/swap_ranges.h",
+        "cuda/include/thrust/system/cuda/detail/tabulate.h",
+        "cuda/include/thrust/system/cuda/detail/temporary_buffer.h",
+        "cuda/include/thrust/system/cuda/detail/terminate.h",
+        "cuda/include/thrust/system/cuda/detail/transform.h",
+        "cuda/include/thrust/system/cuda/detail/transform_reduce.h",
+        "cuda/include/thrust/system/cuda/detail/transform_scan.h",
+        "cuda/include/thrust/system/cuda/detail/uninitialized_copy.h",
+        "cuda/include/thrust/system/cuda/detail/uninitialized_fill.h",
+        "cuda/include/thrust/system/cuda/detail/unique.h",
+        "cuda/include/thrust/system/cuda/detail/unique_by_key.h",
+        "cuda/include/thrust/system/cuda/detail/util.h",
+        "cuda/include/thrust/system/cuda/detail/vector.inl",
+        "cuda/include/thrust/system/cuda/error.h",
+        "cuda/include/thrust/system/cuda/execution_policy.h",
+        "cuda/include/thrust/system/cuda/experimental/pinned_allocator.h",
+        "cuda/include/thrust/system/cuda/memory.h",
+        "cuda/include/thrust/system/cuda/vector.h",
+        "cuda/include/thrust/system/detail/adl/adjacent_difference.h",
+        "cuda/include/thrust/system/detail/adl/assign_value.h",
+        "cuda/include/thrust/system/detail/adl/binary_search.h",
+        "cuda/include/thrust/system/detail/adl/copy.h",
+        "cuda/include/thrust/system/detail/adl/copy_if.h",
+        "cuda/include/thrust/system/detail/adl/count.h",
+        "cuda/include/thrust/system/detail/adl/equal.h",
+        "cuda/include/thrust/system/detail/adl/extrema.h",
+        "cuda/include/thrust/system/detail/adl/fill.h",
+        "cuda/include/thrust/system/detail/adl/find.h",
+        "cuda/include/thrust/system/detail/adl/for_each.h",
+        "cuda/include/thrust/system/detail/adl/gather.h",
+        "cuda/include/thrust/system/detail/adl/generate.h",
         "cuda/include/thrust/system/detail/adl/get_value.h",
         "cuda/include/thrust/system/detail/adl/inner_product.h",
-        "cuda/include/thrust/system/detail/adl/copy_if.h",
-        "cuda/include/thrust/system/detail/adl/logical.h",
         "cuda/include/thrust/system/detail/adl/iter_swap.h",
+        "cuda/include/thrust/system/detail/adl/logical.h",
         "cuda/include/thrust/system/detail/adl/malloc_and_free.h",
-        "cuda/include/thrust/system/detail/adl/fill.h",
+        "cuda/include/thrust/system/detail/adl/merge.h",
+        "cuda/include/thrust/system/detail/adl/mismatch.h",
+        "cuda/include/thrust/system/detail/adl/partition.h",
+        "cuda/include/thrust/system/detail/adl/reduce.h",
+        "cuda/include/thrust/system/detail/adl/reduce_by_key.h",
+        "cuda/include/thrust/system/detail/adl/remove.h",
+        "cuda/include/thrust/system/detail/adl/replace.h",
+        "cuda/include/thrust/system/detail/adl/reverse.h",
+        "cuda/include/thrust/system/detail/adl/scan.h",
+        "cuda/include/thrust/system/detail/adl/scan_by_key.h",
+        "cuda/include/thrust/system/detail/adl/scatter.h",
+        "cuda/include/thrust/system/detail/adl/sequence.h",
+        "cuda/include/thrust/system/detail/adl/set_operations.h",
+        "cuda/include/thrust/system/detail/adl/sort.h",
+        "cuda/include/thrust/system/detail/adl/swap_ranges.h",
+        "cuda/include/thrust/system/detail/adl/tabulate.h",
+        "cuda/include/thrust/system/detail/adl/temporary_buffer.h",
         "cuda/include/thrust/system/detail/adl/transform.h",
+        "cuda/include/thrust/system/detail/adl/transform_reduce.h",
+        "cuda/include/thrust/system/detail/adl/transform_scan.h",
+        "cuda/include/thrust/system/detail/adl/uninitialized_copy.h",
+        "cuda/include/thrust/system/detail/adl/uninitialized_fill.h",
+        "cuda/include/thrust/system/detail/adl/unique.h",
+        "cuda/include/thrust/system/detail/adl/unique_by_key.h",
+        "cuda/include/thrust/system/detail/bad_alloc.h",
         "cuda/include/thrust/system/detail/errno.h",
         "cuda/include/thrust/system/detail/error_category.inl",
-        "cuda/include/thrust/system/detail/sequential/transform_scan.h",
-        "cuda/include/thrust/system/detail/sequential/unique_by_key.h",
-        "cuda/include/thrust/system/detail/sequential/stable_primitive_sort.h",
-        "cuda/include/thrust/system/detail/sequential/stable_primitive_sort.inl",
-        "cuda/include/thrust/system/detail/sequential/stable_merge_sort.h",
-        "cuda/include/thrust/system/detail/sequential/sort.inl",
-        "cuda/include/thrust/system/detail/sequential/partition.h",
-        "cuda/include/thrust/system/detail/sequential/unique.h",
-        "cuda/include/thrust/system/detail/sequential/execution_policy.h",
-        "cuda/include/thrust/system/detail/sequential/adjacent_difference.h",
-        "cuda/include/thrust/system/detail/sequential/sequence.h",
-        "cuda/include/thrust/system/detail/sequential/merge.h",
-        "cuda/include/thrust/system/detail/sequential/transform_reduce.h",
-        "cuda/include/thrust/system/detail/sequential/gather.h",
-        "cuda/include/thrust/system/detail/sequential/sort.h",
-        "cuda/include/thrust/system/detail/sequential/copy_backward.h",
-        "cuda/include/thrust/system/detail/sequential/stable_radix_sort.inl",
-        "cuda/include/thrust/system/detail/sequential/scan.h",
-        "cuda/include/thrust/system/detail/sequential/temporary_buffer.h",
-        "cuda/include/thrust/system/detail/sequential/scan_by_key.h",
-        "cuda/include/thrust/system/detail/sequential/reverse.h",
-        "cuda/include/thrust/system/detail/sequential/assign_value.h",
-        "cuda/include/thrust/system/detail/sequential/scatter.h",
-        "cuda/include/thrust/system/detail/sequential/find.h",
-        "cuda/include/thrust/system/detail/sequential/stable_merge_sort.inl",
-        "cuda/include/thrust/system/detail/sequential/merge.inl",
-        "cuda/include/thrust/system/detail/sequential/generate.h",
-        "cuda/include/thrust/system/detail/sequential/uninitialized_fill.h",
-        "cuda/include/thrust/system/detail/sequential/general_copy.h",
-        "cuda/include/thrust/system/detail/sequential/insertion_sort.h",
-        "cuda/include/thrust/system/detail/sequential/remove.h",
-        "cuda/include/thrust/system/detail/sequential/tabulate.h",
-        "cuda/include/thrust/system/detail/sequential/for_each.h",
-        "cuda/include/thrust/system/detail/sequential/reduce_by_key.h",
-        "cuda/include/thrust/system/detail/sequential/reduce.h",
-        "cuda/include/thrust/system/detail/sequential/equal.h",
-        "cuda/include/thrust/system/detail/sequential/stable_radix_sort.h",
-        "cuda/include/thrust/system/detail/sequential/copy.inl",
-        "cuda/include/thrust/system/detail/sequential/copy.h",
-        "cuda/include/thrust/system/detail/sequential/swap_ranges.h",
-        "cuda/include/thrust/system/detail/sequential/uninitialized_copy.h",
-        "cuda/include/thrust/system/detail/sequential/binary_search.h",
-        "cuda/include/thrust/system/detail/sequential/set_operations.h",
-        "cuda/include/thrust/system/detail/sequential/mismatch.h",
-        "cuda/include/thrust/system/detail/sequential/extrema.h",
-        "cuda/include/thrust/system/detail/sequential/count.h",
-        "cuda/include/thrust/system/detail/sequential/trivial_copy.h",
-        "cuda/include/thrust/system/detail/sequential/replace.h",
-        "cuda/include/thrust/system/detail/sequential/get_value.h",
-        "cuda/include/thrust/system/detail/sequential/inner_product.h",
-        "cuda/include/thrust/system/detail/sequential/copy_if.h",
-        "cuda/include/thrust/system/detail/sequential/logical.h",
-        "cuda/include/thrust/system/detail/sequential/iter_swap.h",
-        "cuda/include/thrust/system/detail/sequential/malloc_and_free.h",
-        "cuda/include/thrust/system/detail/sequential/fill.h",
-        "cuda/include/thrust/system/detail/sequential/transform.h",
-        "cuda/include/thrust/system/detail/error_condition.inl",
-        "cuda/include/thrust/system/detail/internal/decompose.h",
         "cuda/include/thrust/system/detail/error_code.inl",
-        "cuda/include/thrust/system/detail/generic/transform_scan.h",
-        "cuda/include/thrust/system/detail/generic/memory.inl",
-        "cuda/include/thrust/system/detail/generic/transform.inl",
-        "cuda/include/thrust/system/detail/generic/binary_search.inl",
-        "cuda/include/thrust/system/detail/generic/scan_by_key.inl",
-        "cuda/include/thrust/system/detail/generic/unique_by_key.h",
-        "cuda/include/thrust/system/detail/generic/inner_product.inl",
-        "cuda/include/thrust/system/detail/generic/select_system.h",
-        "cuda/include/thrust/system/detail/generic/sequence.inl",
-        "cuda/include/thrust/system/detail/generic/sort.inl",
-        "cuda/include/thrust/system/detail/generic/equal.inl",
-        "cuda/include/thrust/system/detail/generic/partition.h",
-        "cuda/include/thrust/system/detail/generic/unique.h",
+        "cuda/include/thrust/system/detail/error_condition.inl",
         "cuda/include/thrust/system/detail/generic/adjacent_difference.h",
-        "cuda/include/thrust/system/detail/generic/tag.h",
-        "cuda/include/thrust/system/detail/generic/unique_by_key.inl",
-        "cuda/include/thrust/system/detail/generic/sequence.h",
-        "cuda/include/thrust/system/detail/generic/type_traits.h",
-        "cuda/include/thrust/system/detail/generic/merge.h",
-        "cuda/include/thrust/system/detail/generic/reverse.inl",
-        "cuda/include/thrust/system/detail/generic/tabulate.inl",
-        "cuda/include/thrust/system/detail/generic/unique.inl",
-        "cuda/include/thrust/system/detail/generic/scatter.inl",
-        "cuda/include/thrust/system/detail/generic/set_operations.inl",
-        "cuda/include/thrust/system/detail/generic/copy_if.inl",
-        "cuda/include/thrust/system/detail/generic/transform_reduce.h",
-        "cuda/include/thrust/system/detail/generic/transform_scan.inl",
-        "cuda/include/thrust/system/detail/generic/gather.h",
-        "cuda/include/thrust/system/detail/generic/reduce_by_key.inl",
-        "cuda/include/thrust/system/detail/generic/transform_reduce.inl",
-        "cuda/include/thrust/system/detail/generic/sort.h",
-        "cuda/include/thrust/system/detail/generic/distance.inl",
-        "cuda/include/thrust/system/detail/generic/scan.h",
-        "cuda/include/thrust/system/detail/generic/temporary_buffer.h",
-        "cuda/include/thrust/system/detail/generic/reduce.inl",
-        "cuda/include/thrust/system/detail/generic/scan_by_key.h",
-        "cuda/include/thrust/system/detail/generic/reverse.h",
-        "cuda/include/thrust/system/detail/generic/temporary_buffer.inl",
-        "cuda/include/thrust/system/detail/generic/scatter.h",
-        "cuda/include/thrust/system/detail/generic/generate.inl",
         "cuda/include/thrust/system/detail/generic/adjacent_difference.inl",
-        "cuda/include/thrust/system/detail/generic/remove.inl",
         "cuda/include/thrust/system/detail/generic/advance.h",
-        "cuda/include/thrust/system/detail/generic/find.h",
-        "cuda/include/thrust/system/detail/generic/merge.inl",
-        "cuda/include/thrust/system/detail/generic/scalar/binary_search.inl",
-        "cuda/include/thrust/system/detail/generic/scalar/binary_search.h",
-        "cuda/include/thrust/system/detail/generic/extrema.inl",
-        "cuda/include/thrust/system/detail/generic/generate.h",
-        "cuda/include/thrust/system/detail/generic/uninitialized_fill.h",
+        "cuda/include/thrust/system/detail/generic/advance.inl",
+        "cuda/include/thrust/system/detail/generic/binary_search.h",
+        "cuda/include/thrust/system/detail/generic/binary_search.inl",
+        "cuda/include/thrust/system/detail/generic/copy.h",
+        "cuda/include/thrust/system/detail/generic/copy.inl",
+        "cuda/include/thrust/system/detail/generic/copy_if.h",
+        "cuda/include/thrust/system/detail/generic/copy_if.inl",
+        "cuda/include/thrust/system/detail/generic/count.h",
         "cuda/include/thrust/system/detail/generic/count.inl",
-        "cuda/include/thrust/system/detail/generic/remove.h",
-        "cuda/include/thrust/system/detail/generic/uninitialized_copy.inl",
-        "cuda/include/thrust/system/detail/generic/tabulate.h",
-        "cuda/include/thrust/system/detail/generic/for_each.h",
         "cuda/include/thrust/system/detail/generic/distance.h",
-        "cuda/include/thrust/system/detail/generic/swap_ranges.inl",
-        "cuda/include/thrust/system/detail/generic/reduce_by_key.h",
-        "cuda/include/thrust/system/detail/generic/reduce.h",
+        "cuda/include/thrust/system/detail/generic/distance.inl",
         "cuda/include/thrust/system/detail/generic/equal.h",
-        "cuda/include/thrust/system/detail/generic/mismatch.inl",
-        "cuda/include/thrust/system/detail/generic/copy.inl",
-        "cuda/include/thrust/system/detail/generic/copy.h",
-        "cuda/include/thrust/system/detail/generic/swap_ranges.h",
-        "cuda/include/thrust/system/detail/generic/uninitialized_copy.h",
-        "cuda/include/thrust/system/detail/generic/binary_search.h",
-        "cuda/include/thrust/system/detail/generic/set_operations.h",
-        "cuda/include/thrust/system/detail/generic/uninitialized_fill.inl",
-        "cuda/include/thrust/system/detail/generic/mismatch.h",
-        "cuda/include/thrust/system/detail/generic/scan.inl",
-        "cuda/include/thrust/system/detail/generic/gather.inl",
+        "cuda/include/thrust/system/detail/generic/equal.inl",
         "cuda/include/thrust/system/detail/generic/extrema.h",
-        "cuda/include/thrust/system/detail/generic/count.h",
-        "cuda/include/thrust/system/detail/generic/replace.h",
+        "cuda/include/thrust/system/detail/generic/extrema.inl",
+        "cuda/include/thrust/system/detail/generic/fill.h",
+        "cuda/include/thrust/system/detail/generic/find.h",
+        "cuda/include/thrust/system/detail/generic/find.inl",
+        "cuda/include/thrust/system/detail/generic/for_each.h",
+        "cuda/include/thrust/system/detail/generic/gather.h",
+        "cuda/include/thrust/system/detail/generic/gather.inl",
+        "cuda/include/thrust/system/detail/generic/generate.h",
+        "cuda/include/thrust/system/detail/generic/generate.inl",
         "cuda/include/thrust/system/detail/generic/inner_product.h",
-        "cuda/include/thrust/system/detail/generic/copy_if.h",
+        "cuda/include/thrust/system/detail/generic/inner_product.inl",
         "cuda/include/thrust/system/detail/generic/logical.h",
-        "cuda/include/thrust/system/detail/generic/partition.inl",
         "cuda/include/thrust/system/detail/generic/memory.h",
-        "cuda/include/thrust/system/detail/generic/find.inl",
+        "cuda/include/thrust/system/detail/generic/memory.inl",
+        "cuda/include/thrust/system/detail/generic/merge.h",
+        "cuda/include/thrust/system/detail/generic/merge.inl",
+        "cuda/include/thrust/system/detail/generic/mismatch.h",
+        "cuda/include/thrust/system/detail/generic/mismatch.inl",
+        "cuda/include/thrust/system/detail/generic/partition.h",
+        "cuda/include/thrust/system/detail/generic/partition.inl",
+        "cuda/include/thrust/system/detail/generic/reduce.h",
+        "cuda/include/thrust/system/detail/generic/reduce.inl",
+        "cuda/include/thrust/system/detail/generic/reduce_by_key.h",
+        "cuda/include/thrust/system/detail/generic/reduce_by_key.inl",
+        "cuda/include/thrust/system/detail/generic/remove.h",
+        "cuda/include/thrust/system/detail/generic/remove.inl",
+        "cuda/include/thrust/system/detail/generic/replace.h",
         "cuda/include/thrust/system/detail/generic/replace.inl",
-        "cuda/include/thrust/system/detail/generic/advance.inl",
-        "cuda/include/thrust/system/detail/generic/fill.h",
+        "cuda/include/thrust/system/detail/generic/reverse.h",
+        "cuda/include/thrust/system/detail/generic/reverse.inl",
+        "cuda/include/thrust/system/detail/generic/scalar/binary_search.h",
+        "cuda/include/thrust/system/detail/generic/scalar/binary_search.inl",
+        "cuda/include/thrust/system/detail/generic/scan.h",
+        "cuda/include/thrust/system/detail/generic/scan.inl",
+        "cuda/include/thrust/system/detail/generic/scan_by_key.h",
+        "cuda/include/thrust/system/detail/generic/scan_by_key.inl",
+        "cuda/include/thrust/system/detail/generic/scatter.h",
+        "cuda/include/thrust/system/detail/generic/scatter.inl",
+        "cuda/include/thrust/system/detail/generic/select_system.h",
+        "cuda/include/thrust/system/detail/generic/sequence.h",
+        "cuda/include/thrust/system/detail/generic/sequence.inl",
+        "cuda/include/thrust/system/detail/generic/set_operations.h",
+        "cuda/include/thrust/system/detail/generic/set_operations.inl",
+        "cuda/include/thrust/system/detail/generic/sort.h",
+        "cuda/include/thrust/system/detail/generic/sort.inl",
+        "cuda/include/thrust/system/detail/generic/swap_ranges.h",
+        "cuda/include/thrust/system/detail/generic/swap_ranges.inl",
+        "cuda/include/thrust/system/detail/generic/tabulate.h",
+        "cuda/include/thrust/system/detail/generic/tabulate.inl",
+        "cuda/include/thrust/system/detail/generic/tag.h",
+        "cuda/include/thrust/system/detail/generic/temporary_buffer.h",
+        "cuda/include/thrust/system/detail/generic/temporary_buffer.inl",
         "cuda/include/thrust/system/detail/generic/transform.h",
+        "cuda/include/thrust/system/detail/generic/transform.inl",
+        "cuda/include/thrust/system/detail/generic/transform_reduce.h",
+        "cuda/include/thrust/system/detail/generic/transform_reduce.inl",
+        "cuda/include/thrust/system/detail/generic/transform_scan.h",
+        "cuda/include/thrust/system/detail/generic/transform_scan.inl",
+        "cuda/include/thrust/system/detail/generic/type_traits.h",
+        "cuda/include/thrust/system/detail/generic/uninitialized_copy.h",
+        "cuda/include/thrust/system/detail/generic/uninitialized_copy.inl",
+        "cuda/include/thrust/system/detail/generic/uninitialized_fill.h",
+        "cuda/include/thrust/system/detail/generic/uninitialized_fill.inl",
+        "cuda/include/thrust/system/detail/generic/unique.h",
+        "cuda/include/thrust/system/detail/generic/unique.inl",
+        "cuda/include/thrust/system/detail/generic/unique_by_key.h",
+        "cuda/include/thrust/system/detail/generic/unique_by_key.inl",
+        "cuda/include/thrust/system/detail/internal/decompose.h",
+        "cuda/include/thrust/system/detail/sequential/adjacent_difference.h",
+        "cuda/include/thrust/system/detail/sequential/assign_value.h",
+        "cuda/include/thrust/system/detail/sequential/binary_search.h",
+        "cuda/include/thrust/system/detail/sequential/copy.h",
+        "cuda/include/thrust/system/detail/sequential/copy.inl",
+        "cuda/include/thrust/system/detail/sequential/copy_backward.h",
+        "cuda/include/thrust/system/detail/sequential/copy_if.h",
+        "cuda/include/thrust/system/detail/sequential/count.h",
+        "cuda/include/thrust/system/detail/sequential/equal.h",
+        "cuda/include/thrust/system/detail/sequential/execution_policy.h",
+        "cuda/include/thrust/system/detail/sequential/extrema.h",
+        "cuda/include/thrust/system/detail/sequential/fill.h",
+        "cuda/include/thrust/system/detail/sequential/find.h",
+        "cuda/include/thrust/system/detail/sequential/for_each.h",
+        "cuda/include/thrust/system/detail/sequential/gather.h",
+        "cuda/include/thrust/system/detail/sequential/general_copy.h",
+        "cuda/include/thrust/system/detail/sequential/generate.h",
+        "cuda/include/thrust/system/detail/sequential/get_value.h",
+        "cuda/include/thrust/system/detail/sequential/inner_product.h",
+        "cuda/include/thrust/system/detail/sequential/insertion_sort.h",
+        "cuda/include/thrust/system/detail/sequential/iter_swap.h",
+        "cuda/include/thrust/system/detail/sequential/logical.h",
+        "cuda/include/thrust/system/detail/sequential/malloc_and_free.h",
+        "cuda/include/thrust/system/detail/sequential/merge.h",
+        "cuda/include/thrust/system/detail/sequential/merge.inl",
+        "cuda/include/thrust/system/detail/sequential/mismatch.h",
+        "cuda/include/thrust/system/detail/sequential/partition.h",
+        "cuda/include/thrust/system/detail/sequential/reduce.h",
+        "cuda/include/thrust/system/detail/sequential/reduce_by_key.h",
+        "cuda/include/thrust/system/detail/sequential/remove.h",
+        "cuda/include/thrust/system/detail/sequential/replace.h",
+        "cuda/include/thrust/system/detail/sequential/reverse.h",
+        "cuda/include/thrust/system/detail/sequential/scan.h",
+        "cuda/include/thrust/system/detail/sequential/scan_by_key.h",
+        "cuda/include/thrust/system/detail/sequential/scatter.h",
+        "cuda/include/thrust/system/detail/sequential/sequence.h",
+        "cuda/include/thrust/system/detail/sequential/set_operations.h",
+        "cuda/include/thrust/system/detail/sequential/sort.h",
+        "cuda/include/thrust/system/detail/sequential/sort.inl",
+        "cuda/include/thrust/system/detail/sequential/stable_merge_sort.h",
+        "cuda/include/thrust/system/detail/sequential/stable_merge_sort.inl",
+        "cuda/include/thrust/system/detail/sequential/stable_primitive_sort.h",
+        "cuda/include/thrust/system/detail/sequential/stable_primitive_sort.inl",
+        "cuda/include/thrust/system/detail/sequential/stable_radix_sort.h",
+        "cuda/include/thrust/system/detail/sequential/stable_radix_sort.inl",
+        "cuda/include/thrust/system/detail/sequential/swap_ranges.h",
+        "cuda/include/thrust/system/detail/sequential/tabulate.h",
+        "cuda/include/thrust/system/detail/sequential/temporary_buffer.h",
+        "cuda/include/thrust/system/detail/sequential/transform.h",
+        "cuda/include/thrust/system/detail/sequential/transform_reduce.h",
+        "cuda/include/thrust/system/detail/sequential/transform_scan.h",
+        "cuda/include/thrust/system/detail/sequential/trivial_copy.h",
+        "cuda/include/thrust/system/detail/sequential/uninitialized_copy.h",
+        "cuda/include/thrust/system/detail/sequential/uninitialized_fill.h",
+        "cuda/include/thrust/system/detail/sequential/unique.h",
+        "cuda/include/thrust/system/detail/sequential/unique_by_key.h",
         "cuda/include/thrust/system/detail/system_error.inl",
-        "cuda/include/thrust/system/omp/execution_policy.h",
-        "cuda/include/thrust/system/omp/vector.h",
-        "cuda/include/thrust/system/omp/detail/transform_scan.h",
-        "cuda/include/thrust/system/omp/detail/memory.inl",
-        "cuda/include/thrust/system/omp/detail/reduce_intervals.inl",
-        "cuda/include/thrust/system/omp/detail/unique_by_key.h",
-        "cuda/include/thrust/system/omp/detail/sort.inl",
-        "cuda/include/thrust/system/omp/detail/partition.h",
-        "cuda/include/thrust/system/omp/detail/unique.h",
-        "cuda/include/thrust/system/omp/detail/execution_policy.h",
+        "cuda/include/thrust/system/error_code.h",
         "cuda/include/thrust/system/omp/detail/adjacent_difference.h",
-        "cuda/include/thrust/system/omp/detail/unique_by_key.inl",
-        "cuda/include/thrust/system/omp/detail/sequence.h",
-        "cuda/include/thrust/system/omp/detail/merge.h",
-        "cuda/include/thrust/system/omp/detail/unique.inl",
+        "cuda/include/thrust/system/omp/detail/assign_value.h",
+        "cuda/include/thrust/system/omp/detail/binary_search.h",
+        "cuda/include/thrust/system/omp/detail/copy.h",
+        "cuda/include/thrust/system/omp/detail/copy.inl",
+        "cuda/include/thrust/system/omp/detail/copy_if.h",
         "cuda/include/thrust/system/omp/detail/copy_if.inl",
-        "cuda/include/thrust/system/omp/detail/transform_reduce.h",
-        "cuda/include/thrust/system/omp/detail/gather.h",
-        "cuda/include/thrust/system/omp/detail/reduce_by_key.inl",
-        "cuda/include/thrust/system/omp/detail/sort.h",
-        "cuda/include/thrust/system/omp/detail/scan.h",
-        "cuda/include/thrust/system/omp/detail/temporary_buffer.h",
+        "cuda/include/thrust/system/omp/detail/count.h",
         "cuda/include/thrust/system/omp/detail/default_decomposition.h",
-        "cuda/include/thrust/system/omp/detail/reduce.inl",
-        "cuda/include/thrust/system/omp/detail/scan_by_key.h",
-        "cuda/include/thrust/system/omp/detail/reverse.h",
-        "cuda/include/thrust/system/omp/detail/assign_value.h",
-        "cuda/include/thrust/system/omp/detail/scatter.h",
-        "cuda/include/thrust/system/omp/detail/for_each.inl",
         "cuda/include/thrust/system/omp/detail/default_decomposition.inl",
-        "cuda/include/thrust/system/omp/detail/remove.inl",
-        "cuda/include/thrust/system/omp/detail/vector.inl",
-        "cuda/include/thrust/system/omp/detail/find.h",
-        "cuda/include/thrust/system/omp/detail/generate.h",
-        "cuda/include/thrust/system/omp/detail/uninitialized_fill.h",
-        "cuda/include/thrust/system/omp/detail/remove.h",
-        "cuda/include/thrust/system/omp/detail/tabulate.h",
-        "cuda/include/thrust/system/omp/detail/for_each.h",
-        "cuda/include/thrust/system/omp/detail/reduce_by_key.h",
-        "cuda/include/thrust/system/omp/detail/reduce.h",
         "cuda/include/thrust/system/omp/detail/equal.h",
-        "cuda/include/thrust/system/omp/detail/copy.inl",
-        "cuda/include/thrust/system/omp/detail/copy.h",
-        "cuda/include/thrust/system/omp/detail/swap_ranges.h",
-        "cuda/include/thrust/system/omp/detail/uninitialized_copy.h",
-        "cuda/include/thrust/system/omp/detail/binary_search.h",
-        "cuda/include/thrust/system/omp/detail/set_operations.h",
-        "cuda/include/thrust/system/omp/detail/mismatch.h",
+        "cuda/include/thrust/system/omp/detail/execution_policy.h",
         "cuda/include/thrust/system/omp/detail/extrema.h",
-        "cuda/include/thrust/system/omp/detail/count.h",
-        "cuda/include/thrust/system/omp/detail/replace.h",
+        "cuda/include/thrust/system/omp/detail/fill.h",
+        "cuda/include/thrust/system/omp/detail/find.h",
+        "cuda/include/thrust/system/omp/detail/for_each.h",
+        "cuda/include/thrust/system/omp/detail/for_each.inl",
+        "cuda/include/thrust/system/omp/detail/gather.h",
+        "cuda/include/thrust/system/omp/detail/generate.h",
         "cuda/include/thrust/system/omp/detail/get_value.h",
         "cuda/include/thrust/system/omp/detail/inner_product.h",
-        "cuda/include/thrust/system/omp/detail/copy_if.h",
-        "cuda/include/thrust/system/omp/detail/logical.h",
-        "cuda/include/thrust/system/omp/detail/partition.inl",
         "cuda/include/thrust/system/omp/detail/iter_swap.h",
+        "cuda/include/thrust/system/omp/detail/logical.h",
+        "cuda/include/thrust/system/omp/detail/malloc_and_free.h",
+        "cuda/include/thrust/system/omp/detail/memory.inl",
+        "cuda/include/thrust/system/omp/detail/merge.h",
+        "cuda/include/thrust/system/omp/detail/mismatch.h",
         "cuda/include/thrust/system/omp/detail/par.h",
+        "cuda/include/thrust/system/omp/detail/partition.h",
+        "cuda/include/thrust/system/omp/detail/partition.inl",
+        "cuda/include/thrust/system/omp/detail/reduce.h",
+        "cuda/include/thrust/system/omp/detail/reduce.inl",
+        "cuda/include/thrust/system/omp/detail/reduce_by_key.h",
+        "cuda/include/thrust/system/omp/detail/reduce_by_key.inl",
         "cuda/include/thrust/system/omp/detail/reduce_intervals.h",
-        "cuda/include/thrust/system/omp/detail/malloc_and_free.h",
-        "cuda/include/thrust/system/omp/detail/fill.h",
+        "cuda/include/thrust/system/omp/detail/reduce_intervals.inl",
+        "cuda/include/thrust/system/omp/detail/remove.h",
+        "cuda/include/thrust/system/omp/detail/remove.inl",
+        "cuda/include/thrust/system/omp/detail/replace.h",
+        "cuda/include/thrust/system/omp/detail/reverse.h",
+        "cuda/include/thrust/system/omp/detail/scan.h",
+        "cuda/include/thrust/system/omp/detail/scan_by_key.h",
+        "cuda/include/thrust/system/omp/detail/scatter.h",
+        "cuda/include/thrust/system/omp/detail/sequence.h",
+        "cuda/include/thrust/system/omp/detail/set_operations.h",
+        "cuda/include/thrust/system/omp/detail/sort.h",
+        "cuda/include/thrust/system/omp/detail/sort.inl",
+        "cuda/include/thrust/system/omp/detail/swap_ranges.h",
+        "cuda/include/thrust/system/omp/detail/tabulate.h",
+        "cuda/include/thrust/system/omp/detail/temporary_buffer.h",
         "cuda/include/thrust/system/omp/detail/transform.h",
-        "cuda/include/thrust/system/omp/memory.h",
-        "cuda/include/thrust/system/tbb/execution_policy.h",
-        "cuda/include/thrust/system/tbb/vector.h",
-        "cuda/include/thrust/system/tbb/detail/transform_scan.h",
-        "cuda/include/thrust/system/tbb/detail/memory.inl",
-        "cuda/include/thrust/system/tbb/detail/unique_by_key.h",
-        "cuda/include/thrust/system/tbb/detail/sort.inl",
-        "cuda/include/thrust/system/tbb/detail/partition.h",
-        "cuda/include/thrust/system/tbb/detail/unique.h",
-        "cuda/include/thrust/system/tbb/detail/execution_policy.h",
+        "cuda/include/thrust/system/omp/detail/transform_reduce.h",
+        "cuda/include/thrust/system/omp/detail/transform_scan.h",
+        "cuda/include/thrust/system/omp/detail/uninitialized_copy.h",
+        "cuda/include/thrust/system/omp/detail/uninitialized_fill.h",
+        "cuda/include/thrust/system/omp/detail/unique.h",
+        "cuda/include/thrust/system/omp/detail/unique.inl",
+        "cuda/include/thrust/system/omp/detail/unique_by_key.h",
+        "cuda/include/thrust/system/omp/detail/unique_by_key.inl",
+        "cuda/include/thrust/system/omp/detail/vector.inl",
+        "cuda/include/thrust/system/omp/execution_policy.h",
+        "cuda/include/thrust/system/omp/memory.h",
+        "cuda/include/thrust/system/omp/vector.h",
+        "cuda/include/thrust/system/system_error.h",
         "cuda/include/thrust/system/tbb/detail/adjacent_difference.h",
-        "cuda/include/thrust/system/tbb/detail/unique_by_key.inl",
-        "cuda/include/thrust/system/tbb/detail/sequence.h",
-        "cuda/include/thrust/system/tbb/detail/merge.h",
-        "cuda/include/thrust/system/tbb/detail/unique.inl",
-        "cuda/include/thrust/system/tbb/detail/copy_if.inl",
-        "cuda/include/thrust/system/tbb/detail/transform_reduce.h",
-        "cuda/include/thrust/system/tbb/detail/gather.h",
-        "cuda/include/thrust/system/tbb/detail/reduce_by_key.inl",
-        "cuda/include/thrust/system/tbb/detail/sort.h",
-        "cuda/include/thrust/system/tbb/detail/scan.h",
-        "cuda/include/thrust/system/tbb/detail/temporary_buffer.h",
-        "cuda/include/thrust/system/tbb/detail/reduce.inl",
-        "cuda/include/thrust/system/tbb/detail/scan_by_key.h",
-        "cuda/include/thrust/system/tbb/detail/reverse.h",
         "cuda/include/thrust/system/tbb/detail/assign_value.h",
-        "cuda/include/thrust/system/tbb/detail/scatter.h",
-        "cuda/include/thrust/system/tbb/detail/for_each.inl",
-        "cuda/include/thrust/system/tbb/detail/remove.inl",
-        "cuda/include/thrust/system/tbb/detail/vector.inl",
-        "cuda/include/thrust/system/tbb/detail/find.h",
-        "cuda/include/thrust/system/tbb/detail/merge.inl",
-        "cuda/include/thrust/system/tbb/detail/generate.h",
-        "cuda/include/thrust/system/tbb/detail/uninitialized_fill.h",
-        "cuda/include/thrust/system/tbb/detail/remove.h",
-        "cuda/include/thrust/system/tbb/detail/tabulate.h",
-        "cuda/include/thrust/system/tbb/detail/for_each.h",
-        "cuda/include/thrust/system/tbb/detail/reduce_by_key.h",
-        "cuda/include/thrust/system/tbb/detail/reduce.h",
-        "cuda/include/thrust/system/tbb/detail/equal.h",
-        "cuda/include/thrust/system/tbb/detail/copy.inl",
-        "cuda/include/thrust/system/tbb/detail/copy.h",
-        "cuda/include/thrust/system/tbb/detail/swap_ranges.h",
-        "cuda/include/thrust/system/tbb/detail/uninitialized_copy.h",
         "cuda/include/thrust/system/tbb/detail/binary_search.h",
-        "cuda/include/thrust/system/tbb/detail/set_operations.h",
-        "cuda/include/thrust/system/tbb/detail/mismatch.h",
-        "cuda/include/thrust/system/tbb/detail/scan.inl",
-        "cuda/include/thrust/system/tbb/detail/extrema.h",
+        "cuda/include/thrust/system/tbb/detail/copy.h",
+        "cuda/include/thrust/system/tbb/detail/copy.inl",
+        "cuda/include/thrust/system/tbb/detail/copy_if.h",
+        "cuda/include/thrust/system/tbb/detail/copy_if.inl",
         "cuda/include/thrust/system/tbb/detail/count.h",
-        "cuda/include/thrust/system/tbb/detail/replace.h",
+        "cuda/include/thrust/system/tbb/detail/equal.h",
+        "cuda/include/thrust/system/tbb/detail/execution_policy.h",
+        "cuda/include/thrust/system/tbb/detail/extrema.h",
+        "cuda/include/thrust/system/tbb/detail/fill.h",
+        "cuda/include/thrust/system/tbb/detail/find.h",
+        "cuda/include/thrust/system/tbb/detail/for_each.h",
+        "cuda/include/thrust/system/tbb/detail/for_each.inl",
+        "cuda/include/thrust/system/tbb/detail/gather.h",
+        "cuda/include/thrust/system/tbb/detail/generate.h",
         "cuda/include/thrust/system/tbb/detail/get_value.h",
         "cuda/include/thrust/system/tbb/detail/inner_product.h",
-        "cuda/include/thrust/system/tbb/detail/copy_if.h",
-        "cuda/include/thrust/system/tbb/detail/logical.h",
-        "cuda/include/thrust/system/tbb/detail/partition.inl",
         "cuda/include/thrust/system/tbb/detail/iter_swap.h",
+        "cuda/include/thrust/system/tbb/detail/logical.h",
+        "cuda/include/thrust/system/tbb/detail/malloc_and_free.h",
+        "cuda/include/thrust/system/tbb/detail/memory.inl",
+        "cuda/include/thrust/system/tbb/detail/merge.h",
+        "cuda/include/thrust/system/tbb/detail/merge.inl",
+        "cuda/include/thrust/system/tbb/detail/mismatch.h",
         "cuda/include/thrust/system/tbb/detail/par.h",
+        "cuda/include/thrust/system/tbb/detail/partition.h",
+        "cuda/include/thrust/system/tbb/detail/partition.inl",
+        "cuda/include/thrust/system/tbb/detail/reduce.h",
+        "cuda/include/thrust/system/tbb/detail/reduce.inl",
+        "cuda/include/thrust/system/tbb/detail/reduce_by_key.h",
+        "cuda/include/thrust/system/tbb/detail/reduce_by_key.inl",
         "cuda/include/thrust/system/tbb/detail/reduce_intervals.h",
-        "cuda/include/thrust/system/tbb/detail/malloc_and_free.h",
-        "cuda/include/thrust/system/tbb/detail/fill.h",
+        "cuda/include/thrust/system/tbb/detail/remove.h",
+        "cuda/include/thrust/system/tbb/detail/remove.inl",
+        "cuda/include/thrust/system/tbb/detail/replace.h",
+        "cuda/include/thrust/system/tbb/detail/reverse.h",
+        "cuda/include/thrust/system/tbb/detail/scan.h",
+        "cuda/include/thrust/system/tbb/detail/scan.inl",
+        "cuda/include/thrust/system/tbb/detail/scan_by_key.h",
+        "cuda/include/thrust/system/tbb/detail/scatter.h",
+        "cuda/include/thrust/system/tbb/detail/sequence.h",
+        "cuda/include/thrust/system/tbb/detail/set_operations.h",
+        "cuda/include/thrust/system/tbb/detail/sort.h",
+        "cuda/include/thrust/system/tbb/detail/sort.inl",
+        "cuda/include/thrust/system/tbb/detail/swap_ranges.h",
+        "cuda/include/thrust/system/tbb/detail/tabulate.h",
+        "cuda/include/thrust/system/tbb/detail/temporary_buffer.h",
         "cuda/include/thrust/system/tbb/detail/transform.h",
-        "cuda/include/thrust/system/tbb/memory.h",
-        "cuda/include/thrust/system/error_code.h",
-        "cuda/include/thrust/system/cpp/execution_policy.h",
-        "cuda/include/thrust/system/cpp/vector.h",
-        "cuda/include/thrust/system/cpp/detail/transform_scan.h",
-        "cuda/include/thrust/system/cpp/detail/memory.inl",
-        "cuda/include/thrust/system/cpp/detail/unique_by_key.h",
-        "cuda/include/thrust/system/cpp/detail/partition.h",
-        "cuda/include/thrust/system/cpp/detail/unique.h",
-        "cuda/include/thrust/system/cpp/detail/execution_policy.h",
-        "cuda/include/thrust/system/cpp/detail/adjacent_difference.h",
-        "cuda/include/thrust/system/cpp/detail/sequence.h",
-        "cuda/include/thrust/system/cpp/detail/merge.h",
-        "cuda/include/thrust/system/cpp/detail/transform_reduce.h",
-        "cuda/include/thrust/system/cpp/detail/gather.h",
-        "cuda/include/thrust/system/cpp/detail/sort.h",
-        "cuda/include/thrust/system/cpp/detail/scan.h",
-        "cuda/include/thrust/system/cpp/detail/temporary_buffer.h",
-        "cuda/include/thrust/system/cpp/detail/scan_by_key.h",
-        "cuda/include/thrust/system/cpp/detail/reverse.h",
-        "cuda/include/thrust/system/cpp/detail/assign_value.h",
-        "cuda/include/thrust/system/cpp/detail/scatter.h",
-        "cuda/include/thrust/system/cpp/detail/vector.inl",
-        "cuda/include/thrust/system/cpp/detail/find.h",
-        "cuda/include/thrust/system/cpp/detail/generate.h",
-        "cuda/include/thrust/system/cpp/detail/uninitialized_fill.h",
-        "cuda/include/thrust/system/cpp/detail/remove.h",
-        "cuda/include/thrust/system/cpp/detail/tabulate.h",
-        "cuda/include/thrust/system/cpp/detail/for_each.h",
-        "cuda/include/thrust/system/cpp/detail/reduce_by_key.h",
-        "cuda/include/thrust/system/cpp/detail/reduce.h",
-        "cuda/include/thrust/system/cpp/detail/equal.h",
-        "cuda/include/thrust/system/cpp/detail/copy.h",
-        "cuda/include/thrust/system/cpp/detail/swap_ranges.h",
-        "cuda/include/thrust/system/cpp/detail/uninitialized_copy.h",
-        "cuda/include/thrust/system/cpp/detail/binary_search.h",
-        "cuda/include/thrust/system/cpp/detail/set_operations.h",
-        "cuda/include/thrust/system/cpp/detail/mismatch.h",
-        "cuda/include/thrust/system/cpp/detail/extrema.h",
-        "cuda/include/thrust/system/cpp/detail/count.h",
-        "cuda/include/thrust/system/cpp/detail/replace.h",
-        "cuda/include/thrust/system/cpp/detail/get_value.h",
-        "cuda/include/thrust/system/cpp/detail/inner_product.h",
-        "cuda/include/thrust/system/cpp/detail/copy_if.h",
-        "cuda/include/thrust/system/cpp/detail/logical.h",
-        "cuda/include/thrust/system/cpp/detail/iter_swap.h",
-        "cuda/include/thrust/system/cpp/detail/par.h",
-        "cuda/include/thrust/system/cpp/detail/malloc_and_free.h",
-        "cuda/include/thrust/system/cpp/detail/fill.h",
-        "cuda/include/thrust/system/cpp/detail/transform.h",
-        "cuda/include/thrust/system/cpp/memory.h",
-        "cuda/include/thrust/system/cuda/execution_policy.h",
-        "cuda/include/thrust/system/cuda/vector.h",
-        "cuda/include/thrust/system/cuda/error.h",
-        "cuda/include/thrust/system/cuda/detail/copy_device_to_device.h",
-        "cuda/include/thrust/system/cuda/detail/transform_scan.h",
-        "cuda/include/thrust/system/cuda/detail/memory.inl",
-        "cuda/include/thrust/system/cuda/detail/cub/util_allocator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_device.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_partition.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_rle_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_histogram_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_by_key_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_scan_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_select_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_radix_sort_dispatch.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_scan.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_select.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_histo.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_scan.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_downsweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_upsweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_satomic.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_sort.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_gatomic.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_select.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_scan_prefix_operators.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce_by_key.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_macro.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_namespace.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_upsweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_histogram_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_rle_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_select_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_satomic_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_sort_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_gatomic_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_downsweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_by_key_sweep.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_prefix_operators.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_type.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/host/spinlock.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_ptx.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_debug.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/cub.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_scan.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_load.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_shift.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_store.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh",
-        "cuda/include/thrust/system/cuda/detail/cub/util_arch.cuh",
-        "cuda/include/thrust/system/cuda/detail/reduce_intervals.inl",
-        "cuda/include/thrust/system/cuda/detail/copy_cross_system.inl",
-        "cuda/include/thrust/system/cuda/detail/unique_by_key.h",
-        "cuda/include/thrust/system/cuda/detail/bulk.h",
-        "cuda/include/thrust/system/cuda/detail/sort.inl",
-        "cuda/include/thrust/system/cuda/detail/partition.h",
-        "cuda/include/thrust/system/cuda/detail/unique.h",
-        "cuda/include/thrust/system/cuda/detail/execution_policy.h",
-        "cuda/include/thrust/system/cuda/detail/cuda_launch_config.h",
-        "cuda/include/thrust/system/cuda/detail/cub.h",
-        "cuda/include/thrust/system/cuda/detail/adjacent_difference.h",
-        "cuda/include/thrust/system/cuda/detail/sequence.h",
-        "cuda/include/thrust/system/cuda/detail/merge.h",
-        "cuda/include/thrust/system/cuda/detail/set_symmetric_difference.inl",
-        "cuda/include/thrust/system/cuda/detail/copy_if.inl",
-        "cuda/include/thrust/system/cuda/detail/transform_reduce.h",
-        "cuda/include/thrust/system/cuda/detail/error.inl",
-        "cuda/include/thrust/system/cuda/detail/gather.h",
-        "cuda/include/thrust/system/cuda/detail/reduce_by_key.inl",
-        "cuda/include/thrust/system/cuda/detail/sort.h",
-        "cuda/include/thrust/system/cuda/detail/synchronize.h",
-        "cuda/include/thrust/system/cuda/detail/scan.h",
-        "cuda/include/thrust/system/cuda/detail/temporary_indirect_permutation.h",
-        "cuda/include/thrust/system/cuda/detail/extern_shared_ptr.h",
-        "cuda/include/thrust/system/cuda/detail/detail/set_operation.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/balanced_path.h",
-        "cuda/include/thrust/system/cuda/detail/detail/virtualized_smem_closure.h",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_primitive_sort.h",
-        "cuda/include/thrust/system/cuda/detail/detail/set_operation.h",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_primitive_sort.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_merge_sort.h",
-        "cuda/include/thrust/system/cuda/detail/detail/launch_closure.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/merge.h",
-        "cuda/include/thrust/system/cuda/detail/detail/alignment.h",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_radix_sort.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_sort_each.h",
-        "cuda/include/thrust/system/cuda/detail/detail/launch_calculator.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_merge_sort.inl",
-        "cuda/include/thrust/system/cuda/detail/detail/launch_closure.h",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_radix_sort.h",
-        "cuda/include/thrust/system/cuda/detail/detail/uninitialized.h",
-        "cuda/include/thrust/system/cuda/detail/detail/cached_temporary_allocator.h",
-        "cuda/include/thrust/system/cuda/detail/detail/launch_calculator.h",
-        "cuda/include/thrust/system/cuda/detail/detail/stable_sort_each.inl",
-        "cuda/include/thrust/system/cuda/detail/temporary_buffer.h",
-        "cuda/include/thrust/system/cuda/detail/default_decomposition.h",
-        "cuda/include/thrust/system/cuda/detail/reduce.inl",
-        "cuda/include/thrust/system/cuda/detail/scan_by_key.h",
-        "cuda/include/thrust/system/cuda/detail/reverse.h",
-        "cuda/include/thrust/system/cuda/detail/assign_value.h",
-        "cuda/include/thrust/system/cuda/detail/scatter.h",
-        "cuda/include/thrust/system/cuda/detail/reduce_intervals.hpp",
-        "cuda/include/thrust/system/cuda/detail/for_each.inl",
-        "cuda/include/thrust/system/cuda/detail/default_decomposition.inl",
-        "cuda/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h",
-        "cuda/include/thrust/system/cuda/detail/adjacent_difference.inl",
-        "cuda/include/thrust/system/cuda/detail/vector.inl",
-        "cuda/include/thrust/system/cuda/detail/throw_on_error.h",
-        "cuda/include/thrust/system/cuda/detail/find.h",
-        "cuda/include/thrust/system/cuda/detail/terminate.h",
-        "cuda/include/thrust/system/cuda/detail/merge.inl",
-        "cuda/include/thrust/system/cuda/detail/trivial_copy.inl",
-        "cuda/include/thrust/system/cuda/detail/generate.h",
-        "cuda/include/thrust/system/cuda/detail/execute_on_stream.h",
-        "cuda/include/thrust/system/cuda/detail/uninitialized_fill.h",
-        "cuda/include/thrust/system/cuda/detail/remove.h",
-        "cuda/include/thrust/system/cuda/detail/tabulate.h",
-        "cuda/include/thrust/system/cuda/detail/for_each.h",
-        "cuda/include/thrust/system/cuda/detail/reduce_by_key.h",
-        "cuda/include/thrust/system/cuda/detail/decomposition.h",
-        "cuda/include/thrust/system/cuda/detail/reduce.h",
-        "cuda/include/thrust/system/cuda/detail/equal.h",
-        "cuda/include/thrust/system/cuda/detail/runtime_introspection.h",
-        "cuda/include/thrust/system/cuda/detail/copy.inl",
-        "cuda/include/thrust/system/cuda/detail/copy.h",
-        "cuda/include/thrust/system/cuda/detail/swap_ranges.h",
-        "cuda/include/thrust/system/cuda/detail/uninitialized_copy.h",
-        "cuda/include/thrust/system/cuda/detail/binary_search.h",
-        "cuda/include/thrust/system/cuda/detail/runtime_introspection.inl",
-        "cuda/include/thrust/system/cuda/detail/set_operations.h",
-        "cuda/include/thrust/system/cuda/detail/mismatch.h",
-        "cuda/include/thrust/system/cuda/detail/scan.inl",
-        "cuda/include/thrust/system/cuda/detail/synchronize.inl",
-        "cuda/include/thrust/system/cuda/detail/extrema.h",
-        "cuda/include/thrust/system/cuda/detail/set_union.inl",
-        "cuda/include/thrust/system/cuda/detail/set_intersection.inl",
-        "cuda/include/thrust/system/cuda/detail/count.h",
-        "cuda/include/thrust/system/cuda/detail/trivial_copy.h",
-        "cuda/include/thrust/system/cuda/detail/copy_device_to_device.inl",
-        "cuda/include/thrust/system/cuda/detail/replace.h",
-        "cuda/include/thrust/system/cuda/detail/bulk/malloc.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/config.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/closure.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/tail_flags.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/terminate.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/alignment.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/guarded_cuda_runtime_api.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/choose_sizes.inl",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/tuple_meta_transform.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_task.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/head_flags.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/synchronize.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/throw_on_error.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/parameter_ptr.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launcher.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/triple_chevron_launcher.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.inl",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launch_config.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/async.inl",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/tuple_transform.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/pointer_traits.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/apply_from_tuple.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/detail/is_contiguous_iterator.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/iterator.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/choose_sizes.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/copy.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/merge.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/accumulate.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/scan.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/detail/stable_merge_sort.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/gather.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/sort.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/reduce.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/scatter.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/adjacent_difference.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/reduce_by_key.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/algorithm/for_each.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/bulk.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/execution_policy.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/iterator/strided_iterator.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/uninitialized.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/async.hpp",
-        "cuda/include/thrust/system/cuda/detail/bulk/future.hpp",
-        "cuda/include/thrust/system/cuda/detail/guarded_driver_types.h",
-        "cuda/include/thrust/system/cuda/detail/get_value.h",
-        "cuda/include/thrust/system/cuda/detail/inner_product.h",
-        "cuda/include/thrust/system/cuda/detail/copy_if.h",
-        "cuda/include/thrust/system/cuda/detail/logical.h",
-        "cuda/include/thrust/system/cuda/detail/iter_swap.h",
-        "cuda/include/thrust/system/cuda/detail/block/merge.h",
-        "cuda/include/thrust/system/cuda/detail/block/inclusive_scan.h",
-        "cuda/include/thrust/system/cuda/detail/block/merge.inl",
-        "cuda/include/thrust/system/cuda/detail/block/merging_sort.h",
-        "cuda/include/thrust/system/cuda/detail/block/exclusive_scan.h",
-        "cuda/include/thrust/system/cuda/detail/block/reduce.h",
-        "cuda/include/thrust/system/cuda/detail/block/copy.h",
-        "cuda/include/thrust/system/cuda/detail/block/odd_even_sort.h",
-        "cuda/include/thrust/system/cuda/detail/par.h",
-        "cuda/include/thrust/system/cuda/detail/copy_cross_system.h",
-        "cuda/include/thrust/system/cuda/detail/reduce_intervals.h",
-        "cuda/include/thrust/system/cuda/detail/malloc_and_free.h",
-        "cuda/include/thrust/system/cuda/detail/fill.h",
-        "cuda/include/thrust/system/cuda/detail/set_difference.inl",
-        "cuda/include/thrust/system/cuda/detail/transform.h",
-        "cuda/include/thrust/system/cuda/experimental/pinned_allocator.h",
-        "cuda/include/thrust/system/cuda/memory.h",
-        "cuda/include/thrust/remove.h",
+        "cuda/include/thrust/system/tbb/detail/transform_reduce.h",
+        "cuda/include/thrust/system/tbb/detail/transform_scan.h",
+        "cuda/include/thrust/system/tbb/detail/uninitialized_copy.h",
+        "cuda/include/thrust/system/tbb/detail/uninitialized_fill.h",
+        "cuda/include/thrust/system/tbb/detail/unique.h",
+        "cuda/include/thrust/system/tbb/detail/unique.inl",
+        "cuda/include/thrust/system/tbb/detail/unique_by_key.h",
+        "cuda/include/thrust/system/tbb/detail/unique_by_key.inl",
+        "cuda/include/thrust/system/tbb/detail/vector.inl",
+        "cuda/include/thrust/system/tbb/execution_policy.h",
+        "cuda/include/thrust/system/tbb/memory.h",
+        "cuda/include/thrust/system/tbb/vector.h",
+        "cuda/include/thrust/system_error.h",
         "cuda/include/thrust/tabulate.h",
-        "cuda/include/thrust/for_each.h",
-        "cuda/include/thrust/distance.h",
-        "cuda/include/thrust/reduce.h",
-        "cuda/include/thrust/equal.h",
-        "cuda/include/thrust/complex.h",
-        "cuda/include/thrust/device_allocator.h",
-        "cuda/include/thrust/copy.h",
+        "cuda/include/thrust/transform.h",
+        "cuda/include/thrust/transform_reduce.h",
+        "cuda/include/thrust/transform_scan.h",
+        "cuda/include/thrust/tuple.h",
         "cuda/include/thrust/uninitialized_copy.h",
-        "cuda/include/thrust/device_reference.h",
-        "cuda/include/thrust/binary_search.h",
-        "cuda/include/thrust/set_operations.h",
-        "cuda/include/thrust/swap.h",
-        "cuda/include/thrust/mismatch.h",
-        "cuda/include/thrust/extrema.h",
-        "cuda/include/thrust/count.h",
-        "cuda/include/thrust/device_free.h",
-        "cuda/include/thrust/random/discard_block_engine.h",
-        "cuda/include/thrust/random/normal_distribution.h",
-        "cuda/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h",
-        "cuda/include/thrust/random/detail/subtract_with_carry_engine.inl",
-        "cuda/include/thrust/random/detail/xor_combine_engine_max.h",
-        "cuda/include/thrust/random/detail/linear_congruential_engine_discard.h",
-        "cuda/include/thrust/random/detail/uniform_int_distribution.inl",
-        "cuda/include/thrust/random/detail/discard_block_engine.inl",
-        "cuda/include/thrust/random/detail/uniform_real_distribution.inl",
-        "cuda/include/thrust/random/detail/random_core_access.h",
-        "cuda/include/thrust/random/detail/mod.h",
-        "cuda/include/thrust/random/detail/linear_feedback_shift_engine.inl",
-        "cuda/include/thrust/random/detail/linear_congruential_engine.inl",
-        "cuda/include/thrust/random/detail/xor_combine_engine.inl",
-        "cuda/include/thrust/random/detail/normal_distribution.inl",
-        "cuda/include/thrust/random/detail/normal_distribution_base.h",
-        "cuda/include/thrust/random/uniform_int_distribution.h",
-        "cuda/include/thrust/random/linear_feedback_shift_engine.h",
-        "cuda/include/thrust/random/xor_combine_engine.h",
-        "cuda/include/thrust/random/subtract_with_carry_engine.h",
-        "cuda/include/thrust/random/linear_congruential_engine.h",
-        "cuda/include/thrust/random/uniform_real_distribution.h",
-        "cuda/include/thrust/functional.h",
-        "cuda/include/thrust/replace.h",
-        "cuda/include/thrust/device_new_allocator.h",
-        "cuda/include/thrust/host_vector.h",
+        "cuda/include/thrust/uninitialized_fill.h",
+        "cuda/include/thrust/unique.h",
         "cuda/include/thrust/version.h",
-        "cuda/include/thrust/inner_product.h",
-        "cuda/include/thrust/iterator/iterator_traits.h",
-        "cuda/include/thrust/iterator/discard_iterator.h",
-        "cuda/include/thrust/iterator/retag.h",
-        "cuda/include/thrust/iterator/permutation_iterator.h",
-        "cuda/include/thrust/iterator/transform_iterator.h",
-        "cuda/include/thrust/iterator/detail/reverse_iterator.inl",
-        "cuda/include/thrust/iterator/detail/zip_iterator.inl",
-        "cuda/include/thrust/iterator/detail/counting_iterator.inl",
-        "cuda/include/thrust/iterator/detail/distance_from_result.h",
-        "cuda/include/thrust/iterator/detail/host_system_tag.h",
-        "cuda/include/thrust/iterator/detail/iterator_traversal_tags.h",
-        "cuda/include/thrust/iterator/detail/retag.h",
-        "cuda/include/thrust/iterator/detail/tagged_iterator.h",
-        "cuda/include/thrust/iterator/detail/iterator_traits.inl",
-        "cuda/include/thrust/iterator/detail/minimum_category.h",
-        "cuda/include/thrust/iterator/detail/discard_iterator_base.h",
-        "cuda/include/thrust/iterator/detail/iterator_category_to_traversal.h",
-        "cuda/include/thrust/iterator/detail/zip_iterator_base.h",
-        "cuda/include/thrust/iterator/detail/normal_iterator.h",
-        "cuda/include/thrust/iterator/detail/join_iterator.h",
-        "cuda/include/thrust/iterator/detail/device_system_tag.h",
-        "cuda/include/thrust/iterator/detail/universal_categories.h",
-        "cuda/include/thrust/iterator/detail/reverse_iterator_base.h",
-        "cuda/include/thrust/iterator/detail/minimum_system.h",
-        "cuda/include/thrust/iterator/detail/tuple_of_iterator_references.h",
-        "cuda/include/thrust/iterator/detail/is_iterator_category.h",
-        "cuda/include/thrust/iterator/detail/permutation_iterator_base.h",
-        "cuda/include/thrust/iterator/detail/any_assign.h",
-        "cuda/include/thrust/iterator/detail/any_system_tag.h",
-        "cuda/include/thrust/iterator/detail/is_trivial_iterator.h",
-        "cuda/include/thrust/iterator/detail/iterator_category_to_system.h",
-        "cuda/include/thrust/iterator/detail/iterator_adaptor_base.h",
-        "cuda/include/thrust/iterator/detail/constant_iterator_base.h",
-        "cuda/include/thrust/iterator/detail/transform_iterator.inl",
-        "cuda/include/thrust/iterator/detail/iterator_facade_category.h",
-        "cuda/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h",
-        "cuda/include/thrust/iterator/constant_iterator.h",
-        "cuda/include/thrust/iterator/counting_iterator.h",
-        "cuda/include/thrust/iterator/iterator_adaptor.h",
-        "cuda/include/thrust/iterator/iterator_facade.h",
-        "cuda/include/thrust/iterator/iterator_categories.h",
-        "cuda/include/thrust/iterator/reverse_iterator.h",
-        "cuda/include/thrust/iterator/zip_iterator.h",
-        "cuda/include/thrust/logical.h",
-        "cuda/include/thrust/tuple.h",
-        "cuda/include/thrust/memory.h",
-        "cuda/include/thrust/random.h",
-        "cuda/include/thrust/fill.h",
-        "cuda/include/thrust/transform.h",
-        "cuda/include/texture_types.h",
-        "cuda/include/nppversion.h",
-        "cuda/include/cuda_texture_types.h",
-        "cuda/include/fatbinary.h",
-        "cuda/include/cublasXt.h",
-        "cuda/include/cuda_fp16.h",
         "cuda/include/vector_functions.h",
-        "cuda/include/cusparse.h",
-        "cuda/include/nppi_filtering_functions.h",
-        "cuda/include/nppi_morphological_operations.h",
-        "cuda/include/sobol_direction_vectors.h",
-        "cuda/include/nvblas.h",
-        "cuda/include/curand_mtgp32dc_p_11213.h",
-        "cuda/include/nvcuvid.h",
-        "cuda/include/cuda_runtime_api.h",
-        "cuda/include/curand_mtgp32_kernel.h",
-        "cuda/include/cublas_v2.h",
-        "cuda/include/builtin_types.h",
-        "cuda/include/nppi_geometry_transforms.h",
-        "cuda/include/npps_support_functions.h",
-        "cuda/include/cufftw.h",
-        "cuda/include/cuda_device_runtime_api.h",
-        "cuda/include/sm_30_intrinsics.hpp",
+        "cuda/include/vector_functions.hpp",
         "cuda/include/vector_types.h",
-        "cuda/include/sm_35_atomic_functions.h",
-        "cuda/include/sm_20_intrinsics.h",
-        "cuda/include/driver_types.h",
-        "cuda/include/nvToolsExtCudaRt.h",
-        "cuda/include/curand_globals.h",
-        "cuda/include/device_atomic_functions.h",
-        "cuda/include/surface_types.h",
-        "cuda/include/nvrtc.h",
-        "cuda/include/nppdefs.h",
-        "cuda/include/sm_60_atomic_functions.h",
-        "cuda/include/driver_functions.h",
-        "cuda/include/cusolver_common.h",
-        "cuda/include/cublas.h",
-        "cuda/include/curand_lognormal.h",
-        "cuda/include/device_atomic_functions.hpp",
-        "cuda/include/crt/device_runtime.h",
-        "cuda/include/crt/storage_class.h",
-        "cuda/include/crt/func_macro.h",
-        "cuda/include/crt/host_runtime.h",
-        "cuda/include/nppi_arithmetic_and_logical_operations.h",
-        "cuda/include/npps_arithmetic_and_logical_operations.h",
-        "cuda/include/nppi_computer_vision.h",
-        "cuda/include/surface_functions.hpp",
-        "cuda/include/surface_functions.h",
-        "cuda/include/curand_normal_static.h",
-        "cuda/include/curand.h",
-        "cuda/include/math_functions_dbl_ptx3.h",
-        "cuda/include/curand_philox4x32_x.h",
-        "cuda/include/nppi_threshold_and_compare_operations.h",
-        "cuda/include/nvml.h",
-        "cuda/include/npps.h",
-        "cuda/include/cuda_vdpau_interop.h",
-        "cuda/include/sm_61_intrinsics.hpp",
-        "cuda/include/cublas_api.h",
-        "cuda/include/nppi_color_conversion.h",
-        "cuda/include/math_functions_dbl_ptx3.hpp",
-        "cuda/include/nppcore.h",
-        "cuda/include/cudaGL.h",
-        "cuda/include/fatBinaryCtl.h",
-        "cuda/include/npps_statistics_functions.h",
-        "cuda/include/cudaVDPAU.h",
-        "cuda/include/curand_poisson.h",
-        "cuda/include/cusolverDn.h",
-        "cuda/include/cuda_profiler_api.h",
-        "cuda/include/sm_20_atomic_functions.h",
-        "cuda/include/nvfunctional",
     ],
     cmd = """
-if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-8.0/include/math_functions.hpp" "$(@D)/cuda/include/math_functions.hpp" && cp "/usr/local/cuda-8.0/include/cufft.h" "$(@D)/cuda/include/cufft.h" && cp "/usr/local/cuda-8.0/include/nvgraph.h" "$(@D)/cuda/include/nvgraph.h" && cp "/usr/local/cuda-8.0/include/curand_normal.h" "$(@D)/cuda/include/curand_normal.h" && cp "/usr/local/cuda-8.0/include/curand_uniform.h" "$(@D)/cuda/include/curand_uniform.h" && cp "/usr/local/cuda-8.0/include/nppi_data_exchange_and_initialization.h" "$(@D)/cuda/include/nppi_data_exchange_and_initialization.h" && cp "/usr/local/cuda-8.0/include/cuda_gl_interop.h" "$(@D)/cuda/include/cuda_gl_interop.h" && cp "/usr/local/cuda-8.0/include/nppi_compression_functions.h" "$(@D)/cuda/include/nppi_compression_functions.h" && cp "/usr/local/cuda-8.0/include/npp.h" "$(@D)/cuda/include/npp.h" && cp "/usr/local/cuda-8.0/include/cuda.h" "$(@D)/cuda/include/cuda.h" && cp "/usr/local/cuda-8.0/include/nppi_statistics_functions.h" "$(@D)/cuda/include/nppi_statistics_functions.h" && cp "/usr/local/cuda-8.0/include/vector_functions.hpp" "$(@D)/cuda/include/vector_functions.hpp" && cp "/usr/local/cuda-8.0/include/sm_32_intrinsics.hpp" "$(@D)/cuda/include/sm_32_intrinsics.hpp" && cp "/usr/local/cuda-8.0/include/sm_32_intrinsics.h" "$(@D)/cuda/include/sm_32_intrinsics.h" && cp "/usr/local/cuda-8.0/include/curand_discrete.h" "$(@D)/cuda/include/curand_discrete.h" && cp "/usr/local/cuda-8.0/include/cuda_runtime.h" "$(@D)/cuda/include/cuda_runtime.h" && cp "/usr/local/cuda-8.0/include/cufftXt.h" "$(@D)/cuda/include/cufftXt.h" && cp "/usr/local/cuda-8.0/include/sm_61_intrinsics.h" "$(@D)/cuda/include/sm_61_intrinsics.h" && cp "/usr/local/cuda-8.0/include/texture_fetch_functions.h" 
"$(@D)/cuda/include/texture_fetch_functions.h" && cp "/usr/local/cuda-8.0/include/curand_mrg32k3a.h" "$(@D)/cuda/include/curand_mrg32k3a.h" && cp "/usr/local/cuda-8.0/include/host_defines.h" "$(@D)/cuda/include/host_defines.h" && cp "/usr/local/cuda-8.0/include/common_functions.h" "$(@D)/cuda/include/common_functions.h" && cp "/usr/local/cuda-8.0/include/nppi_support_functions.h" "$(@D)/cuda/include/nppi_support_functions.h" && cp "/usr/local/cuda-8.0/include/nppi_linear_transforms.h" "$(@D)/cuda/include/nppi_linear_transforms.h" && cp "/usr/local/cuda-8.0/include/device_double_functions.hpp" "$(@D)/cuda/include/device_double_functions.hpp" && cp "/usr/local/cuda-8.0/include/math_constants.h" "$(@D)/cuda/include/math_constants.h" && cp "/usr/local/cuda-8.0/include/nvToolsExtSync.h" "$(@D)/cuda/include/nvToolsExtSync.h" && cp "/usr/local/cuda-8.0/include/npps_initialization.h" "$(@D)/cuda/include/npps_initialization.h" && cp "/usr/local/cuda-8.0/include/cusolverSp_LOWLEVEL_PREVIEW.h" "$(@D)/cuda/include/cusolverSp_LOWLEVEL_PREVIEW.h" && cp "/usr/local/cuda-8.0/include/texture_indirect_functions.hpp" "$(@D)/cuda/include/texture_indirect_functions.hpp" && cp "/usr/local/cuda-8.0/include/cudaProfiler.h" "$(@D)/cuda/include/cudaProfiler.h" && cp "/usr/local/cuda-8.0/include/npps_filtering_functions.h" "$(@D)/cuda/include/npps_filtering_functions.h" && cp "/usr/local/cuda-8.0/include/cusparse_v2.h" "$(@D)/cuda/include/cusparse_v2.h" && cp "/usr/local/cuda-8.0/include/nppi.h" "$(@D)/cuda/include/nppi.h" && cp "/usr/local/cuda-8.0/include/surface_indirect_functions.h" "$(@D)/cuda/include/surface_indirect_functions.h" && cp "/usr/local/cuda-8.0/include/sm_30_intrinsics.h" "$(@D)/cuda/include/sm_30_intrinsics.h" && cp "/usr/local/cuda-8.0/include/device_double_functions.h" "$(@D)/cuda/include/device_double_functions.h" && cp "/usr/local/cuda-8.0/include/sm_35_intrinsics.h" "$(@D)/cuda/include/sm_35_intrinsics.h" && cp "/usr/local/cuda-8.0/include/cusolverSp.h" 
"$(@D)/cuda/include/cusolverSp.h" && cp "/usr/local/cuda-8.0/include/library_types.h" "$(@D)/cuda/include/library_types.h" && cp "/usr/local/cuda-8.0/include/surface_indirect_functions.hpp" "$(@D)/cuda/include/surface_indirect_functions.hpp" && cp "/usr/local/cuda-8.0/include/cudalibxt.h" "$(@D)/cuda/include/cudalibxt.h" && cp "/usr/local/cuda-8.0/include/channel_descriptor.h" "$(@D)/cuda/include/channel_descriptor.h" && cp "/usr/local/cuda-8.0/include/device_functions_decls.h" "$(@D)/cuda/include/device_functions_decls.h" && cp "/usr/local/cuda-8.0/include/curand_kernel.h" "$(@D)/cuda/include/curand_kernel.h" && cp "/usr/local/cuda-8.0/include/curand_mtgp32_host.h" "$(@D)/cuda/include/curand_mtgp32_host.h" && cp "/usr/local/cuda-8.0/include/nvToolsExtCuda.h" "$(@D)/cuda/include/nvToolsExtCuda.h" && cp "/usr/local/cuda-8.0/include/nvToolsExt.h" "$(@D)/cuda/include/nvToolsExt.h" && cp "/usr/local/cuda-8.0/include/cuComplex.h" "$(@D)/cuda/include/cuComplex.h" && cp "/usr/local/cuda-8.0/include/sm_32_atomic_functions.h" "$(@D)/cuda/include/sm_32_atomic_functions.h" && cp "/usr/local/cuda-8.0/include/texture_indirect_functions.h" "$(@D)/cuda/include/texture_indirect_functions.h" && cp "/usr/local/cuda-8.0/include/sm_32_atomic_functions.hpp" "$(@D)/cuda/include/sm_32_atomic_functions.hpp" && cp "/usr/local/cuda-8.0/include/sm_20_intrinsics.hpp" "$(@D)/cuda/include/sm_20_intrinsics.hpp" && cp "/usr/local/cuda-8.0/include/device_launch_parameters.h" "$(@D)/cuda/include/device_launch_parameters.h" && cp "/usr/local/cuda-8.0/include/curand_mtgp32.h" "$(@D)/cuda/include/curand_mtgp32.h" && cp "/usr/local/cuda-8.0/include/texture_fetch_functions.hpp" "$(@D)/cuda/include/texture_fetch_functions.hpp" && cp "/usr/local/cuda-8.0/include/cuda_occupancy.h" "$(@D)/cuda/include/cuda_occupancy.h" && cp "/usr/local/cuda-8.0/include/CL/opencl.h" "$(@D)/cuda/include/CL/opencl.h" && cp "/usr/local/cuda-8.0/include/CL/cl_platform.h" "$(@D)/cuda/include/CL/cl_platform.h" && cp 
"/usr/local/cuda-8.0/include/CL/cl_egl.h" "$(@D)/cuda/include/CL/cl_egl.h" && cp "/usr/local/cuda-8.0/include/CL/cl_gl.h" "$(@D)/cuda/include/CL/cl_gl.h" && cp "/usr/local/cuda-8.0/include/CL/cl.h" "$(@D)/cuda/include/CL/cl.h" && cp "/usr/local/cuda-8.0/include/CL/cl_gl_ext.h" "$(@D)/cuda/include/CL/cl_gl_ext.h" && cp "/usr/local/cuda-8.0/include/CL/cl_ext.h" "$(@D)/cuda/include/CL/cl_ext.h" && cp "/usr/local/cuda-8.0/include/CL/cl.hpp" "$(@D)/cuda/include/CL/cl.hpp" && cp "/usr/local/cuda-8.0/include/host_config.h" "$(@D)/cuda/include/host_config.h" && cp "/usr/local/cuda-8.0/include/cuda_surface_types.h" "$(@D)/cuda/include/cuda_surface_types.h" && cp "/usr/local/cuda-8.0/include/math_functions.h" "$(@D)/cuda/include/math_functions.h" && cp "/usr/local/cuda-8.0/include/nvToolsExtMeta.h" "$(@D)/cuda/include/nvToolsExtMeta.h" && cp "/usr/local/cuda-8.0/include/sm_20_atomic_functions.hpp" "$(@D)/cuda/include/sm_20_atomic_functions.hpp" && cp "/usr/local/cuda-8.0/include/device_functions.h" "$(@D)/cuda/include/device_functions.h" && cp "/usr/local/cuda-8.0/include/device_types.h" "$(@D)/cuda/include/device_types.h" && cp "/usr/local/cuda-8.0/include/npps_conversion_functions.h" "$(@D)/cuda/include/npps_conversion_functions.h" && cp "/usr/local/cuda-8.0/include/curand_precalc.h" "$(@D)/cuda/include/curand_precalc.h" && cp "/usr/local/cuda-8.0/include/cusolverRf.h" "$(@D)/cuda/include/cusolverRf.h" && cp "/usr/local/cuda-8.0/include/sm_60_atomic_functions.hpp" "$(@D)/cuda/include/sm_60_atomic_functions.hpp" && cp "/usr/local/cuda-8.0/include/cuviddec.h" "$(@D)/cuda/include/cuviddec.h" && cp "/usr/local/cuda-8.0/include/curand_discrete2.h" "$(@D)/cuda/include/curand_discrete2.h" && cp "/usr/local/cuda-8.0/include/device_functions.hpp" "$(@D)/cuda/include/device_functions.hpp" && cp "/usr/local/cuda-8.0/include/thrust/transform_scan.h" "$(@D)/cuda/include/thrust/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system_error.h" 
"$(@D)/cuda/include/thrust/system_error.h" && cp "/usr/local/cuda-8.0/include/thrust/device_malloc.h" "$(@D)/cuda/include/thrust/device_malloc.h" && cp "/usr/local/cuda-8.0/include/thrust/partition.h" "$(@D)/cuda/include/thrust/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/unique.h" "$(@D)/cuda/include/thrust/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/device_delete.h" "$(@D)/cuda/include/thrust/device_delete.h" && cp "/usr/local/cuda-8.0/include/thrust/execution_policy.h" "$(@D)/cuda/include/thrust/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/adjacent_difference.h" "$(@D)/cuda/include/thrust/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/sequence.h" "$(@D)/cuda/include/thrust/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/merge.h" "$(@D)/cuda/include/thrust/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/device_new.h" "$(@D)/cuda/include/thrust/device_new.h" && cp "/usr/local/cuda-8.0/include/thrust/transform_reduce.h" "$(@D)/cuda/include/thrust/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/device_vector.h" "$(@D)/cuda/include/thrust/device_vector.h" && cp "/usr/local/cuda-8.0/include/thrust/gather.h" "$(@D)/cuda/include/thrust/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/sort.h" "$(@D)/cuda/include/thrust/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/scan.h" "$(@D)/cuda/include/thrust/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/temporary_array.h" "$(@D)/cuda/include/thrust/detail/temporary_array.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/util/align.h" "$(@D)/cuda/include/thrust/detail/util/align.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/util/blocking.h" "$(@D)/cuda/include/thrust/detail/util/blocking.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/transform.inl" "$(@D)/cuda/include/thrust/detail/transform.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_vector.inl" "$(@D)/cuda/include/thrust/detail/device_vector.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/binary_search.inl" "$(@D)/cuda/include/thrust/detail/binary_search.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/overlapped_copy.h" "$(@D)/cuda/include/thrust/detail/overlapped_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/vector_base.inl" "$(@D)/cuda/include/thrust/detail/vector_base.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_reference.inl" "$(@D)/cuda/include/thrust/detail/device_reference.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/actor.h" "$(@D)/cuda/include/thrust/detail/functional/actor.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/value.h" "$(@D)/cuda/include/thrust/detail/functional/value.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/logical_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/logical_operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/relational_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/relational_operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/assignment_operator.h" "$(@D)/cuda/include/thrust/detail/functional/operators/assignment_operator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/bitwise_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/bitwise_operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/operator_adaptors.h" "$(@D)/cuda/include/thrust/detail/functional/operators/operator_adaptors.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/arithmetic_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/arithmetic_operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/operators/compound_assignment_operators.h" 
"$(@D)/cuda/include/thrust/detail/functional/operators/compound_assignment_operators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/argument.h" "$(@D)/cuda/include/thrust/detail/functional/argument.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/placeholder.h" "$(@D)/cuda/include/thrust/detail/functional/placeholder.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/actor.inl" "$(@D)/cuda/include/thrust/detail/functional/actor.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional/composite.h" "$(@D)/cuda/include/thrust/detail/functional/composite.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/static_map.h" "$(@D)/cuda/include/thrust/detail/static_map.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/has_nested_type.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_nested_type.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/is_call_possible.h" "$(@D)/cuda/include/thrust/detail/type_traits/is_call_possible.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/function_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits/function_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/pointer_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits/pointer_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/has_member_function.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_member_function.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h" "$(@D)/cuda/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/minimum_type.h" "$(@D)/cuda/include/thrust/detail/type_traits/minimum_type.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/has_trivial_assign.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_trivial_assign.h" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/type_traits/is_metafunction_defined.h" "$(@D)/cuda/include/thrust/detail/type_traits/is_metafunction_defined.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/iterator/is_discard_iterator.h" "$(@D)/cuda/include/thrust/detail/type_traits/iterator/is_discard_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/iterator/is_output_iterator.h" "$(@D)/cuda/include/thrust/detail/type_traits/iterator/is_output_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits/result_of_adaptable_function.h" "$(@D)/cuda/include/thrust/detail/type_traits/result_of_adaptable_function.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/reference.h" "$(@D)/cuda/include/thrust/detail/reference.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/inner_product.inl" "$(@D)/cuda/include/thrust/detail/inner_product.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/use_default.h" "$(@D)/cuda/include/thrust/detail/use_default.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/sequence.inl" "$(@D)/cuda/include/thrust/detail/sequence.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/sort.inl" "$(@D)/cuda/include/thrust/detail/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/equal.inl" "$(@D)/cuda/include/thrust/detail/equal.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/execution_policy.h" "$(@D)/cuda/include/thrust/detail/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/integer_traits.h" "$(@D)/cuda/include/thrust/detail/integer_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/type_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/reverse.inl" "$(@D)/cuda/include/thrust/detail/reverse.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/tabulate.inl" "$(@D)/cuda/include/thrust/detail/tabulate.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/unique.inl" 
"$(@D)/cuda/include/thrust/detail/unique.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/scatter.inl" "$(@D)/cuda/include/thrust/detail/scatter.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/set_operations.inl" "$(@D)/cuda/include/thrust/detail/set_operations.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_malloc.inl" "$(@D)/cuda/include/thrust/detail/device_malloc.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/copy_if.inl" "$(@D)/cuda/include/thrust/detail/copy_if.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/fill.inl" "$(@D)/cuda/include/thrust/detail/fill.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/temporary_array.inl" "$(@D)/cuda/include/thrust/detail/temporary_array.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/transform_scan.inl" "$(@D)/cuda/include/thrust/detail/transform_scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/minmax.h" "$(@D)/cuda/include/thrust/detail/minmax.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/swap.inl" "$(@D)/cuda/include/thrust/detail/swap.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/pointer.inl" "$(@D)/cuda/include/thrust/detail/pointer.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/transform_reduce.inl" "$(@D)/cuda/include/thrust/detail/transform_reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/config.h" "$(@D)/cuda/include/thrust/detail/config.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/distance.inl" "$(@D)/cuda/include/thrust/detail/distance.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/pair.inl" "$(@D)/cuda/include/thrust/detail/pair.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/temporary_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/temporary_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/tagged_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/tagged_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/destroy_range.inl" 
"$(@D)/cuda/include/thrust/detail/allocator/destroy_range.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/destroy_range.h" "$(@D)/cuda/include/thrust/detail/allocator/destroy_range.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/no_throw_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/no_throw_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/default_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/default_construct_range.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/fill_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/fill_construct_range.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/tagged_allocator.inl" "$(@D)/cuda/include/thrust/detail/allocator/tagged_allocator.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/malloc_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/malloc_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/allocator_traits.h" "$(@D)/cuda/include/thrust/detail/allocator/allocator_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/copy_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/copy_construct_range.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/allocator_traits.inl" "$(@D)/cuda/include/thrust/detail/allocator/allocator_traits.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/default_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/default_construct_range.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/copy_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/copy_construct_range.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/malloc_allocator.inl" "$(@D)/cuda/include/thrust/detail/allocator/malloc_allocator.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/temporary_allocator.inl" 
"$(@D)/cuda/include/thrust/detail/allocator/temporary_allocator.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/allocator/fill_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/fill_construct_range.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/detail/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/reduce.inl" "$(@D)/cuda/include/thrust/detail/reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_new.inl" "$(@D)/cuda/include/thrust/detail/device_new.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/pointer.h" "$(@D)/cuda/include/thrust/detail/pointer.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/for_each.inl" "$(@D)/cuda/include/thrust/detail/for_each.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/generate.inl" "$(@D)/cuda/include/thrust/detail/generate.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/dispatch/is_trivial_copy.h" "$(@D)/cuda/include/thrust/detail/dispatch/is_trivial_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/adjacent_difference.inl" "$(@D)/cuda/include/thrust/detail/adjacent_difference.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/tuple_meta_transform.h" "$(@D)/cuda/include/thrust/detail/tuple_meta_transform.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/functional.inl" "$(@D)/cuda/include/thrust/detail/functional.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/remove.inl" "$(@D)/cuda/include/thrust/detail/remove.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/tuple_transform.h" "$(@D)/cuda/include/thrust/detail/tuple_transform.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/merge.inl" "$(@D)/cuda/include/thrust/detail/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/extrema.inl" "$(@D)/cuda/include/thrust/detail/extrema.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/trivial_sequence.h" "$(@D)/cuda/include/thrust/detail/trivial_sequence.h" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/vector_base.h" "$(@D)/cuda/include/thrust/detail/vector_base.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/count.inl" "$(@D)/cuda/include/thrust/detail/count.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/uninitialized_copy.inl" "$(@D)/cuda/include/thrust/detail/uninitialized_copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/function.h" "$(@D)/cuda/include/thrust/detail/function.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/swap_ranges.inl" "$(@D)/cuda/include/thrust/detail/swap_ranges.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_delete.inl" "$(@D)/cuda/include/thrust/detail/device_delete.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/static_assert.h" "$(@D)/cuda/include/thrust/detail/static_assert.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/logical.inl" "$(@D)/cuda/include/thrust/detail/logical.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/seq.h" "$(@D)/cuda/include/thrust/detail/seq.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/mpl/math.h" "$(@D)/cuda/include/thrust/detail/mpl/math.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/mismatch.inl" "$(@D)/cuda/include/thrust/detail/mismatch.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/internal_functional.h" "$(@D)/cuda/include/thrust/detail/internal_functional.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/get_iterator_value.h" "$(@D)/cuda/include/thrust/detail/get_iterator_value.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/copy.inl" "$(@D)/cuda/include/thrust/detail/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/copy.h" "$(@D)/cuda/include/thrust/detail/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/catrigf.h" "$(@D)/cuda/include/thrust/detail/complex/catrigf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/cpowf.h" "$(@D)/cuda/include/thrust/detail/complex/cpowf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/csqrtf.h" 
"$(@D)/cuda/include/thrust/detail/complex/csqrtf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/ccoshf.h" "$(@D)/cuda/include/thrust/detail/complex/ccoshf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/csinhf.h" "$(@D)/cuda/include/thrust/detail/complex/csinhf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/clogf.h" "$(@D)/cuda/include/thrust/detail/complex/clogf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/ccosh.h" "$(@D)/cuda/include/thrust/detail/complex/ccosh.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/arithmetic.h" "$(@D)/cuda/include/thrust/detail/complex/arithmetic.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/csqrt.h" "$(@D)/cuda/include/thrust/detail/complex/csqrt.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/cpow.h" "$(@D)/cuda/include/thrust/detail/complex/cpow.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/complex.inl" "$(@D)/cuda/include/thrust/detail/complex/complex.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/math_private.h" "$(@D)/cuda/include/thrust/detail/complex/math_private.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/c99math.h" "$(@D)/cuda/include/thrust/detail/complex/c99math.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/cproj.h" "$(@D)/cuda/include/thrust/detail/complex/cproj.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/catrig.h" "$(@D)/cuda/include/thrust/detail/complex/catrig.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/ctanhf.h" "$(@D)/cuda/include/thrust/detail/complex/ctanhf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/cexpf.h" "$(@D)/cuda/include/thrust/detail/complex/cexpf.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/csinh.h" "$(@D)/cuda/include/thrust/detail/complex/csinh.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/stream.h" "$(@D)/cuda/include/thrust/detail/complex/stream.h" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/complex/ctanh.h" "$(@D)/cuda/include/thrust/detail/complex/ctanh.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/cexp.h" "$(@D)/cuda/include/thrust/detail/complex/cexp.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/complex/clog.h" "$(@D)/cuda/include/thrust/detail/complex/clog.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/range/head_flags.h" "$(@D)/cuda/include/thrust/detail/range/head_flags.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/range/tail_flags.h" "$(@D)/cuda/include/thrust/detail/range/tail_flags.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/execute_with_allocator.h" "$(@D)/cuda/include/thrust/detail/execute_with_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/integer_math.h" "$(@D)/cuda/include/thrust/detail/integer_math.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/swap.h" "$(@D)/cuda/include/thrust/detail/swap.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/uninitialized_fill.inl" "$(@D)/cuda/include/thrust/detail/uninitialized_fill.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/scan.inl" "$(@D)/cuda/include/thrust/detail/scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/gather.inl" "$(@D)/cuda/include/thrust/detail/gather.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/reference_forward_declaration.h" "$(@D)/cuda/include/thrust/detail/reference_forward_declaration.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/numeric_traits.h" "$(@D)/cuda/include/thrust/detail/numeric_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/reference.inl" "$(@D)/cuda/include/thrust/detail/reference.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/cstdint.h" "$(@D)/cuda/include/thrust/detail/cstdint.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_free.inl" "$(@D)/cuda/include/thrust/detail/device_free.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/copy_if.h" "$(@D)/cuda/include/thrust/detail/copy_if.h" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/partition.inl" "$(@D)/cuda/include/thrust/detail/partition.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/find.inl" "$(@D)/cuda/include/thrust/detail/find.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/forceinline.h" "$(@D)/cuda/include/thrust/detail/config/forceinline.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/debug.h" "$(@D)/cuda/include/thrust/detail/config/debug.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/config.h" "$(@D)/cuda/include/thrust/detail/config/config.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/host_device.h" "$(@D)/cuda/include/thrust/detail/config/host_device.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/host_system.h" "$(@D)/cuda/include/thrust/detail/config/host_system.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/compiler.h" "$(@D)/cuda/include/thrust/detail/config/compiler.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/device_system.h" "$(@D)/cuda/include/thrust/detail/config/device_system.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/compiler_fence.h" "$(@D)/cuda/include/thrust/detail/config/compiler_fence.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/exec_check_disable.h" "$(@D)/cuda/include/thrust/detail/config/exec_check_disable.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/simple_defines.h" "$(@D)/cuda/include/thrust/detail/config/simple_defines.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/config/global_workarounds.h" "$(@D)/cuda/include/thrust/detail/config/global_workarounds.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/replace.inl" "$(@D)/cuda/include/thrust/detail/replace.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/device_ptr.inl" "$(@D)/cuda/include/thrust/detail/device_ptr.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/tuple.inl" "$(@D)/cuda/include/thrust/detail/tuple.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/detail/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/host_vector.inl" "$(@D)/cuda/include/thrust/detail/host_vector.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/raw_pointer_cast.h" "$(@D)/cuda/include/thrust/detail/raw_pointer_cast.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/advance.inl" "$(@D)/cuda/include/thrust/detail/advance.inl" && cp "/usr/local/cuda-8.0/include/thrust/detail/contiguous_storage.h" "$(@D)/cuda/include/thrust/detail/contiguous_storage.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/raw_reference_cast.h" "$(@D)/cuda/include/thrust/detail/raw_reference_cast.h" && cp "/usr/local/cuda-8.0/include/thrust/detail/contiguous_storage.inl" "$(@D)/cuda/include/thrust/detail/contiguous_storage.inl" && cp "/usr/local/cuda-8.0/include/thrust/reverse.h" "$(@D)/cuda/include/thrust/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/device_malloc_allocator.h" "$(@D)/cuda/include/thrust/device_malloc_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/scatter.h" "$(@D)/cuda/include/thrust/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/pair.h" "$(@D)/cuda/include/thrust/pair.h" && cp "/usr/local/cuda-8.0/include/thrust/advance.h" "$(@D)/cuda/include/thrust/advance.h" && cp "/usr/local/cuda-8.0/include/thrust/find.h" "$(@D)/cuda/include/thrust/find.h" && cp "/usr/local/cuda-8.0/include/thrust/device_ptr.h" "$(@D)/cuda/include/thrust/device_ptr.h" && cp "/usr/local/cuda-8.0/include/thrust/generate.h" "$(@D)/cuda/include/thrust/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/uninitialized_fill.h" "$(@D)/cuda/include/thrust/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/system_error.h" "$(@D)/cuda/include/thrust/system/system_error.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/bad_alloc.h" "$(@D)/cuda/include/thrust/system/detail/bad_alloc.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/adl/transform_scan.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/partition.h" "$(@D)/cuda/include/thrust/system/detail/adl/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/unique.h" "$(@D)/cuda/include/thrust/system/detail/adl/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/detail/adl/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/sequence.h" "$(@D)/cuda/include/thrust/system/detail/adl/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/merge.h" "$(@D)/cuda/include/thrust/system/detail/adl/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/gather.h" "$(@D)/cuda/include/thrust/system/detail/adl/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/sort.h" "$(@D)/cuda/include/thrust/system/detail/adl/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/scan.h" "$(@D)/cuda/include/thrust/system/detail/adl/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/adl/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/reverse.h" "$(@D)/cuda/include/thrust/system/detail/adl/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/assign_value.h" "$(@D)/cuda/include/thrust/system/detail/adl/assign_value.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/adl/scatter.h" "$(@D)/cuda/include/thrust/system/detail/adl/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/find.h" "$(@D)/cuda/include/thrust/system/detail/adl/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/generate.h" "$(@D)/cuda/include/thrust/system/detail/adl/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/detail/adl/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/remove.h" "$(@D)/cuda/include/thrust/system/detail/adl/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/adl/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/for_each.h" "$(@D)/cuda/include/thrust/system/detail/adl/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/reduce.h" "$(@D)/cuda/include/thrust/system/detail/adl/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/equal.h" "$(@D)/cuda/include/thrust/system/detail/adl/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/copy.h" "$(@D)/cuda/include/thrust/system/detail/adl/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/adl/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/detail/adl/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/adl/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/adl/set_operations.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/adl/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/adl/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/extrema.h" "$(@D)/cuda/include/thrust/system/detail/adl/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/count.h" "$(@D)/cuda/include/thrust/system/detail/adl/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/replace.h" "$(@D)/cuda/include/thrust/system/detail/adl/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/get_value.h" "$(@D)/cuda/include/thrust/system/detail/adl/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/adl/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/adl/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/logical.h" "$(@D)/cuda/include/thrust/system/detail/adl/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/iter_swap.h" "$(@D)/cuda/include/thrust/system/detail/adl/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/detail/adl/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/fill.h" "$(@D)/cuda/include/thrust/system/detail/adl/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/adl/transform.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/errno.h" "$(@D)/cuda/include/thrust/system/detail/errno.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/error_category.inl" "$(@D)/cuda/include/thrust/system/detail/error_category.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/transform_scan.h" "$(@D)/cuda/include/thrust/system/detail/sequential/transform_scan.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/sequential/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_primitive_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_primitive_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_primitive_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_primitive_sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_merge_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_merge_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/partition.h" "$(@D)/cuda/include/thrust/system/detail/sequential/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/unique.h" "$(@D)/cuda/include/thrust/system/detail/sequential/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/execution_policy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/detail/sequential/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/sequence.h" "$(@D)/cuda/include/thrust/system/detail/sequential/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/merge.h" "$(@D)/cuda/include/thrust/system/detail/sequential/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/sequential/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/gather.h" "$(@D)/cuda/include/thrust/system/detail/sequential/gather.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/sequential/sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/copy_backward.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy_backward.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_radix_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_radix_sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/scan.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/sequential/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/reverse.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/assign_value.h" "$(@D)/cuda/include/thrust/system/detail/sequential/assign_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/scatter.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/find.h" "$(@D)/cuda/include/thrust/system/detail/sequential/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_merge_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_merge_sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/merge.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/generate.h" "$(@D)/cuda/include/thrust/system/detail/sequential/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/uninitialized_fill.h" 
"$(@D)/cuda/include/thrust/system/detail/sequential/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/general_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/general_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/insertion_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/insertion_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/remove.h" "$(@D)/cuda/include/thrust/system/detail/sequential/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/sequential/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/for_each.h" "$(@D)/cuda/include/thrust/system/detail/sequential/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/reduce.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/equal.h" "$(@D)/cuda/include/thrust/system/detail/sequential/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/stable_radix_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_radix_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/copy.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/sequential/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/uninitialized_copy.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/sequential/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/sequential/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/sequential/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/sequential/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/extrema.h" "$(@D)/cuda/include/thrust/system/detail/sequential/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/count.h" "$(@D)/cuda/include/thrust/system/detail/sequential/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/trivial_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/trivial_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/replace.h" "$(@D)/cuda/include/thrust/system/detail/sequential/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/get_value.h" "$(@D)/cuda/include/thrust/system/detail/sequential/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/sequential/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/logical.h" "$(@D)/cuda/include/thrust/system/detail/sequential/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/iter_swap.h" "$(@D)/cuda/include/thrust/system/detail/sequential/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/detail/sequential/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/fill.h" 
"$(@D)/cuda/include/thrust/system/detail/sequential/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/sequential/transform.h" "$(@D)/cuda/include/thrust/system/detail/sequential/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/error_condition.inl" "$(@D)/cuda/include/thrust/system/detail/error_condition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/internal/decompose.h" "$(@D)/cuda/include/thrust/system/detail/internal/decompose.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/error_code.inl" "$(@D)/cuda/include/thrust/system/detail/error_code.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform_scan.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/memory.inl" "$(@D)/cuda/include/thrust/system/detail/generic/memory.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/binary_search.inl" "$(@D)/cuda/include/thrust/system/detail/generic/binary_search.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scan_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scan_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/inner_product.inl" "$(@D)/cuda/include/thrust/system/detail/generic/inner_product.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/select_system.h" "$(@D)/cuda/include/thrust/system/detail/generic/select_system.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/sequence.inl" "$(@D)/cuda/include/thrust/system/detail/generic/sequence.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/generic/sort.inl" "$(@D)/cuda/include/thrust/system/detail/generic/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/equal.inl" "$(@D)/cuda/include/thrust/system/detail/generic/equal.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/partition.h" "$(@D)/cuda/include/thrust/system/detail/generic/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/unique.h" "$(@D)/cuda/include/thrust/system/detail/generic/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/detail/generic/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/tag.h" "$(@D)/cuda/include/thrust/system/detail/generic/tag.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/unique_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/sequence.h" "$(@D)/cuda/include/thrust/system/detail/generic/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/type_traits.h" "$(@D)/cuda/include/thrust/system/detail/generic/type_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/merge.h" "$(@D)/cuda/include/thrust/system/detail/generic/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reverse.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reverse.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/tabulate.inl" "$(@D)/cuda/include/thrust/system/detail/generic/tabulate.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/unique.inl" "$(@D)/cuda/include/thrust/system/detail/generic/unique.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scatter.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scatter.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/system/detail/generic/set_operations.inl" "$(@D)/cuda/include/thrust/system/detail/generic/set_operations.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/copy_if.inl" "$(@D)/cuda/include/thrust/system/detail/generic/copy_if.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform_scan.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform_scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/gather.h" "$(@D)/cuda/include/thrust/system/detail/generic/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reduce_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform_reduce.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform_reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/sort.h" "$(@D)/cuda/include/thrust/system/detail/generic/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/distance.inl" "$(@D)/cuda/include/thrust/system/detail/generic/distance.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scan.h" "$(@D)/cuda/include/thrust/system/detail/generic/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/generic/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reduce.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reverse.h" "$(@D)/cuda/include/thrust/system/detail/generic/reverse.h" 
&& cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/temporary_buffer.inl" "$(@D)/cuda/include/thrust/system/detail/generic/temporary_buffer.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scatter.h" "$(@D)/cuda/include/thrust/system/detail/generic/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/generate.inl" "$(@D)/cuda/include/thrust/system/detail/generic/generate.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/adjacent_difference.inl" "$(@D)/cuda/include/thrust/system/detail/generic/adjacent_difference.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/remove.inl" "$(@D)/cuda/include/thrust/system/detail/generic/remove.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/advance.h" "$(@D)/cuda/include/thrust/system/detail/generic/advance.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/find.h" "$(@D)/cuda/include/thrust/system/detail/generic/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/merge.inl" "$(@D)/cuda/include/thrust/system/detail/generic/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scalar/binary_search.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scalar/binary_search.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scalar/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/generic/scalar/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/extrema.inl" "$(@D)/cuda/include/thrust/system/detail/generic/extrema.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/generate.h" "$(@D)/cuda/include/thrust/system/detail/generic/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/count.inl" 
"$(@D)/cuda/include/thrust/system/detail/generic/count.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/remove.h" "$(@D)/cuda/include/thrust/system/detail/generic/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/uninitialized_copy.inl" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/generic/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/for_each.h" "$(@D)/cuda/include/thrust/system/detail/generic/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/distance.h" "$(@D)/cuda/include/thrust/system/detail/generic/distance.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/swap_ranges.inl" "$(@D)/cuda/include/thrust/system/detail/generic/swap_ranges.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/reduce.h" "$(@D)/cuda/include/thrust/system/detail/generic/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/equal.h" "$(@D)/cuda/include/thrust/system/detail/generic/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/mismatch.inl" "$(@D)/cuda/include/thrust/system/detail/generic/mismatch.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/copy.inl" "$(@D)/cuda/include/thrust/system/detail/generic/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/copy.h" "$(@D)/cuda/include/thrust/system/detail/generic/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/generic/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/uninitialized_copy.h" 
"$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/generic/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/generic/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/uninitialized_fill.inl" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_fill.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/generic/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/scan.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/gather.inl" "$(@D)/cuda/include/thrust/system/detail/generic/gather.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/extrema.h" "$(@D)/cuda/include/thrust/system/detail/generic/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/count.h" "$(@D)/cuda/include/thrust/system/detail/generic/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/replace.h" "$(@D)/cuda/include/thrust/system/detail/generic/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/generic/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/generic/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/logical.h" "$(@D)/cuda/include/thrust/system/detail/generic/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/partition.inl" "$(@D)/cuda/include/thrust/system/detail/generic/partition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/memory.h" 
"$(@D)/cuda/include/thrust/system/detail/generic/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/find.inl" "$(@D)/cuda/include/thrust/system/detail/generic/find.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/replace.inl" "$(@D)/cuda/include/thrust/system/detail/generic/replace.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/advance.inl" "$(@D)/cuda/include/thrust/system/detail/generic/advance.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/fill.h" "$(@D)/cuda/include/thrust/system/detail/generic/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/generic/transform.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/detail/system_error.inl" "$(@D)/cuda/include/thrust/system/detail/system_error.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/execution_policy.h" "$(@D)/cuda/include/thrust/system/omp/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/vector.h" "$(@D)/cuda/include/thrust/system/omp/vector.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/omp/detail/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/memory.inl" "$(@D)/cuda/include/thrust/system/omp/detail/memory.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce_intervals.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_intervals.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/sort.inl" "$(@D)/cuda/include/thrust/system/omp/detail/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/partition.h" "$(@D)/cuda/include/thrust/system/omp/detail/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/unique.h" 
"$(@D)/cuda/include/thrust/system/omp/detail/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/omp/detail/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/omp/detail/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/omp/detail/unique_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/sequence.h" "$(@D)/cuda/include/thrust/system/omp/detail/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/merge.h" "$(@D)/cuda/include/thrust/system/omp/detail/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/unique.inl" "$(@D)/cuda/include/thrust/system/omp/detail/unique.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/copy_if.inl" "$(@D)/cuda/include/thrust/system/omp/detail/copy_if.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/omp/detail/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/gather.h" "$(@D)/cuda/include/thrust/system/omp/detail/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/sort.h" "$(@D)/cuda/include/thrust/system/omp/detail/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/scan.h" "$(@D)/cuda/include/thrust/system/omp/detail/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/omp/detail/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/default_decomposition.h" "$(@D)/cuda/include/thrust/system/omp/detail/default_decomposition.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/reverse.h" "$(@D)/cuda/include/thrust/system/omp/detail/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/omp/detail/assign_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/scatter.h" "$(@D)/cuda/include/thrust/system/omp/detail/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/for_each.inl" "$(@D)/cuda/include/thrust/system/omp/detail/for_each.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/default_decomposition.inl" "$(@D)/cuda/include/thrust/system/omp/detail/default_decomposition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/remove.inl" "$(@D)/cuda/include/thrust/system/omp/detail/remove.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/vector.inl" "$(@D)/cuda/include/thrust/system/omp/detail/vector.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/find.h" "$(@D)/cuda/include/thrust/system/omp/detail/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/generate.h" "$(@D)/cuda/include/thrust/system/omp/detail/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/omp/detail/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/remove.h" "$(@D)/cuda/include/thrust/system/omp/detail/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/omp/detail/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/for_each.h" "$(@D)/cuda/include/thrust/system/omp/detail/for_each.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/equal.h" "$(@D)/cuda/include/thrust/system/omp/detail/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/copy.inl" "$(@D)/cuda/include/thrust/system/omp/detail/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/copy.h" "$(@D)/cuda/include/thrust/system/omp/detail/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/omp/detail/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/omp/detail/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/omp/detail/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/omp/detail/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/omp/detail/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/extrema.h" "$(@D)/cuda/include/thrust/system/omp/detail/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/count.h" "$(@D)/cuda/include/thrust/system/omp/detail/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/replace.h" "$(@D)/cuda/include/thrust/system/omp/detail/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/get_value.h" "$(@D)/cuda/include/thrust/system/omp/detail/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/omp/detail/inner_product.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/omp/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/omp/detail/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/logical.h" "$(@D)/cuda/include/thrust/system/omp/detail/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/partition.inl" "$(@D)/cuda/include/thrust/system/omp/detail/partition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/omp/detail/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/par.h" "$(@D)/cuda/include/thrust/system/omp/detail/par.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/reduce_intervals.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_intervals.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/omp/detail/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/fill.h" "$(@D)/cuda/include/thrust/system/omp/detail/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/detail/transform.h" "$(@D)/cuda/include/thrust/system/omp/detail/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/omp/memory.h" "$(@D)/cuda/include/thrust/system/omp/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/execution_policy.h" "$(@D)/cuda/include/thrust/system/tbb/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/vector.h" "$(@D)/cuda/include/thrust/system/tbb/vector.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/memory.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/memory.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/unique_by_key.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/tbb/detail/sort.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/partition.h" "$(@D)/cuda/include/thrust/system/tbb/detail/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/unique.h" "$(@D)/cuda/include/thrust/system/tbb/detail/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/tbb/detail/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/unique_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/sequence.h" "$(@D)/cuda/include/thrust/system/tbb/detail/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/merge.h" "$(@D)/cuda/include/thrust/system/tbb/detail/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/unique.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/unique.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/copy_if.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/copy_if.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/gather.h" "$(@D)/cuda/include/thrust/system/tbb/detail/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/sort.h" "$(@D)/cuda/include/thrust/system/tbb/detail/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/scan.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scan.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/tbb/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/tbb/detail/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reduce.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reverse.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/tbb/detail/assign_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/scatter.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/for_each.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/for_each.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/remove.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/remove.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/vector.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/vector.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/find.h" "$(@D)/cuda/include/thrust/system/tbb/detail/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/merge.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/generate.h" "$(@D)/cuda/include/thrust/system/tbb/detail/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/tbb/detail/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/remove.h" "$(@D)/cuda/include/thrust/system/tbb/detail/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/tbb/detail/tabulate.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/tbb/detail/for_each.h" "$(@D)/cuda/include/thrust/system/tbb/detail/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reduce.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/equal.h" "$(@D)/cuda/include/thrust/system/tbb/detail/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/copy.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/copy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/tbb/detail/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/tbb/detail/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/tbb/detail/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/tbb/detail/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/scan.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/extrema.h" "$(@D)/cuda/include/thrust/system/tbb/detail/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/count.h" "$(@D)/cuda/include/thrust/system/tbb/detail/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/replace.h" "$(@D)/cuda/include/thrust/system/tbb/detail/replace.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/tbb/detail/get_value.h" "$(@D)/cuda/include/thrust/system/tbb/detail/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/tbb/detail/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/tbb/detail/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/logical.h" "$(@D)/cuda/include/thrust/system/tbb/detail/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/partition.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/partition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/tbb/detail/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/par.h" "$(@D)/cuda/include/thrust/system/tbb/detail/par.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/reduce_intervals.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce_intervals.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/tbb/detail/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/fill.h" "$(@D)/cuda/include/thrust/system/tbb/detail/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/detail/transform.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/tbb/memory.h" "$(@D)/cuda/include/thrust/system/tbb/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/system/error_code.h" "$(@D)/cuda/include/thrust/system/error_code.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/execution_policy.h" "$(@D)/cuda/include/thrust/system/cpp/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/vector.h" "$(@D)/cuda/include/thrust/system/cpp/vector.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/transform_scan.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/memory.inl" "$(@D)/cuda/include/thrust/system/cpp/detail/memory.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/partition.h" "$(@D)/cuda/include/thrust/system/cpp/detail/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/unique.h" "$(@D)/cuda/include/thrust/system/cpp/detail/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/cpp/detail/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/cpp/detail/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/sequence.h" "$(@D)/cuda/include/thrust/system/cpp/detail/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/merge.h" "$(@D)/cuda/include/thrust/system/cpp/detail/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/cpp/detail/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/gather.h" "$(@D)/cuda/include/thrust/system/cpp/detail/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/sort.h" "$(@D)/cuda/include/thrust/system/cpp/detail/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/scan.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/cpp/detail/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/reverse.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/cpp/detail/assign_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/scatter.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/vector.inl" "$(@D)/cuda/include/thrust/system/cpp/detail/vector.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/find.h" "$(@D)/cuda/include/thrust/system/cpp/detail/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/generate.h" "$(@D)/cuda/include/thrust/system/cpp/detail/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/cpp/detail/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/remove.h" "$(@D)/cuda/include/thrust/system/cpp/detail/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/cpp/detail/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/for_each.h" "$(@D)/cuda/include/thrust/system/cpp/detail/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/reduce.h" "$(@D)/cuda/include/thrust/system/cpp/detail/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/equal.h" "$(@D)/cuda/include/thrust/system/cpp/detail/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/copy.h" "$(@D)/cuda/include/thrust/system/cpp/detail/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/cpp/detail/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/uninitialized_copy.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/cpp/detail/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/cpp/detail/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/cpp/detail/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/extrema.h" "$(@D)/cuda/include/thrust/system/cpp/detail/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/count.h" "$(@D)/cuda/include/thrust/system/cpp/detail/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/replace.h" "$(@D)/cuda/include/thrust/system/cpp/detail/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/get_value.h" "$(@D)/cuda/include/thrust/system/cpp/detail/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/cpp/detail/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/cpp/detail/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/logical.h" "$(@D)/cuda/include/thrust/system/cpp/detail/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/cpp/detail/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/par.h" "$(@D)/cuda/include/thrust/system/cpp/detail/par.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/cpp/detail/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/fill.h" "$(@D)/cuda/include/thrust/system/cpp/detail/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/detail/transform.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cpp/memory.h" "$(@D)/cuda/include/thrust/system/cpp/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/execution_policy.h" "$(@D)/cuda/include/thrust/system/cuda/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/vector.h" "$(@D)/cuda/include/thrust/system/cuda/vector.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/error.h" "$(@D)/cuda/include/thrust/system/cuda/error.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_device_to_device.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_device_to_device.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/memory.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/memory.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_allocator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_allocator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_device.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_device.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_partition.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_partition.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_rle_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_rle_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_histogram_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_histogram_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_by_key_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_by_key_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_scan_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_scan_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_select_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_select_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_reduce_dispatch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/dispatch/device_radix_sort_dispatch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/device_radix_sort_dispatch.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_scan.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_select.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_select.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_histo.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_histo.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_scan.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_downsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_downsweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_upsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_radix_sort_upsweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_satomic.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_satomic.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_sort.cuh" 
"$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_sort.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_gatomic.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/specializations/block_range_histo_gatomic.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_select.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_select.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_scan_prefix_operators.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_scan_prefix_operators.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce_by_key.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_range/block_range_reduce_by_key.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_macro.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_macro.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_namespace.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_namespace.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_upsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_upsweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_histogram_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_histogram_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_rle_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_rle_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_select_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_select_sweep.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_satomic_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_satomic_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_sort_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_sort_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_gatomic_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/specializations/block_histogram_gatomic_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_downsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_radix_sort_downsweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_by_key_sweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_reduce_by_key_sweep.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_prefix_operators.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block_sweep/block_scan_prefix_operators.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_type.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_type.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/host/spinlock.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/host/spinlock.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_ptx.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_ptx.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_debug.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_debug.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/cub.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/cub.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_scan.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_load.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_load.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_shift.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_shift.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_store.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_store.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub/util_arch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_arch.cuh" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce_intervals.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_intervals.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_cross_system.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_cross_system.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/unique_by_key.h" 
"$(@D)/cuda/include/thrust/system/cuda/detail/unique_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk.h" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/sort.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/partition.h" "$(@D)/cuda/include/thrust/system/cuda/detail/partition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/unique.h" "$(@D)/cuda/include/thrust/system/cuda/detail/unique.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/execution_policy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cuda_launch_config.h" "$(@D)/cuda/include/thrust/system/cuda/detail/cuda_launch_config.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/cub.h" "$(@D)/cuda/include/thrust/system/cuda/detail/cub.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/cuda/detail/adjacent_difference.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/sequence.h" "$(@D)/cuda/include/thrust/system/cuda/detail/sequence.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/merge.h" "$(@D)/cuda/include/thrust/system/cuda/detail/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/set_symmetric_difference.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/set_symmetric_difference.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_if.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_if.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform_reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/error.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/error.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/gather.h" "$(@D)/cuda/include/thrust/system/cuda/detail/gather.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_by_key.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/synchronize.h" "$(@D)/cuda/include/thrust/system/cuda/detail/synchronize.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/temporary_indirect_permutation.h" "$(@D)/cuda/include/thrust/system/cuda/detail/temporary_indirect_permutation.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/extern_shared_ptr.h" "$(@D)/cuda/include/thrust/system/cuda/detail/extern_shared_ptr.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/set_operation.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/set_operation.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/balanced_path.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/balanced_path.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/virtualized_smem_closure.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/virtualized_smem_closure.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_primitive_sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_primitive_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/set_operation.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/set_operation.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_primitive_sort.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_primitive_sort.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_merge_sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_merge_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/launch_closure.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/launch_closure.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/merge.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/alignment.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/alignment.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_radix_sort.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_radix_sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_sort_each.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_sort_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/launch_calculator.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/launch_calculator.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_merge_sort.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_merge_sort.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/launch_closure.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/launch_closure.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_radix_sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_radix_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/uninitialized.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/uninitialized.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/cached_temporary_allocator.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/cached_temporary_allocator.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/launch_calculator.h" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/launch_calculator.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/detail/stable_sort_each.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/detail/stable_sort_each.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/cuda/detail/temporary_buffer.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/default_decomposition.h" "$(@D)/cuda/include/thrust/system/cuda/detail/default_decomposition.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scan_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reverse.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reverse.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/cuda/detail/assign_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/scatter.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scatter.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce_intervals.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_intervals.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/for_each.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/for_each.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/default_decomposition.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/default_decomposition.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h" "$(@D)/cuda/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/adjacent_difference.inl" 
"$(@D)/cuda/include/thrust/system/cuda/detail/adjacent_difference.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/vector.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/vector.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/throw_on_error.h" "$(@D)/cuda/include/thrust/system/cuda/detail/throw_on_error.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/find.h" "$(@D)/cuda/include/thrust/system/cuda/detail/find.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/terminate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/terminate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/merge.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/trivial_copy.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/trivial_copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/generate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/generate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/execute_on_stream.h" "$(@D)/cuda/include/thrust/system/cuda/detail/execute_on_stream.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/cuda/detail/uninitialized_fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/remove.h" "$(@D)/cuda/include/thrust/system/cuda/detail/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/tabulate.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/for_each.h" "$(@D)/cuda/include/thrust/system/cuda/detail/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_by_key.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/decomposition.h" "$(@D)/cuda/include/thrust/system/cuda/detail/decomposition.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/equal.h" "$(@D)/cuda/include/thrust/system/cuda/detail/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/runtime_introspection.h" "$(@D)/cuda/include/thrust/system/cuda/detail/runtime_introspection.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/copy.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/cuda/detail/swap_ranges.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/cuda/detail/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/runtime_introspection.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/runtime_introspection.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/cuda/detail/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/cuda/detail/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/scan.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/scan.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/synchronize.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/synchronize.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/extrema.h" "$(@D)/cuda/include/thrust/system/cuda/detail/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/set_union.inl" 
"$(@D)/cuda/include/thrust/system/cuda/detail/set_union.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/set_intersection.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/set_intersection.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/count.h" "$(@D)/cuda/include/thrust/system/cuda/detail/count.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/trivial_copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/trivial_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_device_to_device.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_device_to_device.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/replace.h" "$(@D)/cuda/include/thrust/system/cuda/detail/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/malloc.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/malloc.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/config.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/config.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/closure.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/closure.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/tail_flags.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/tail_flags.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/terminate.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/terminate.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/alignment.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/alignment.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/guarded_cuda_runtime_api.hpp" 
"$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/guarded_cuda_runtime_api.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/choose_sizes.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/choose_sizes.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/tuple_meta_transform.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/tuple_meta_transform.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_task.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_task.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/head_flags.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/head_flags.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/synchronize.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/synchronize.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/throw_on_error.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/throw_on_error.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/parameter_ptr.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/parameter_ptr.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launcher.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launcher.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/triple_chevron_launcher.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/triple_chevron_launcher.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.inl" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launch_config.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/cuda_launch_config.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/cuda_launcher/runtime_introspection.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/async.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/async.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/tuple_transform.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/tuple_transform.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/pointer_traits.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/pointer_traits.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/apply_from_tuple.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/apply_from_tuple.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/detail/is_contiguous_iterator.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/detail/is_contiguous_iterator.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/iterator.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/iterator.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/choose_sizes.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/choose_sizes.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/copy.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/copy.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/merge.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/merge.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/accumulate.hpp" 
"$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/accumulate.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/scan.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/scan.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/detail/stable_merge_sort.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/detail/stable_merge_sort.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/gather.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/gather.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/sort.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/sort.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/reduce.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/reduce.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/scatter.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/scatter.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/adjacent_difference.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/adjacent_difference.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/reduce_by_key.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/reduce_by_key.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/algorithm/for_each.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/algorithm/for_each.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/bulk.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/bulk.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/execution_policy.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/execution_policy.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/iterator/strided_iterator.hpp" 
"$(@D)/cuda/include/thrust/system/cuda/detail/bulk/iterator/strided_iterator.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/uninitialized.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/uninitialized.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/async.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/async.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/bulk/future.hpp" "$(@D)/cuda/include/thrust/system/cuda/detail/bulk/future.hpp" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/guarded_driver_types.h" "$(@D)/cuda/include/thrust/system/cuda/detail/guarded_driver_types.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/get_value.h" "$(@D)/cuda/include/thrust/system/cuda/detail/get_value.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/cuda/detail/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_if.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/logical.h" "$(@D)/cuda/include/thrust/system/cuda/detail/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/cuda/detail/iter_swap.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/merge.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/merge.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/inclusive_scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/inclusive_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/merge.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/block/merge.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/merging_sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/merging_sort.h" && cp 
"/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/exclusive_scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/exclusive_scan.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/reduce.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/block/odd_even_sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/block/odd_even_sort.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/par.h" "$(@D)/cuda/include/thrust/system/cuda/detail/par.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/copy_cross_system.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_cross_system.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/reduce_intervals.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_intervals.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/cuda/detail/malloc_and_free.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/fill.h" "$(@D)/cuda/include/thrust/system/cuda/detail/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/set_difference.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/set_difference.inl" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/detail/transform.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/experimental/pinned_allocator.h" "$(@D)/cuda/include/thrust/system/cuda/experimental/pinned_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/system/cuda/memory.h" "$(@D)/cuda/include/thrust/system/cuda/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/remove.h" "$(@D)/cuda/include/thrust/remove.h" && cp "/usr/local/cuda-8.0/include/thrust/tabulate.h" "$(@D)/cuda/include/thrust/tabulate.h" && 
cp "/usr/local/cuda-8.0/include/thrust/for_each.h" "$(@D)/cuda/include/thrust/for_each.h" && cp "/usr/local/cuda-8.0/include/thrust/distance.h" "$(@D)/cuda/include/thrust/distance.h" && cp "/usr/local/cuda-8.0/include/thrust/reduce.h" "$(@D)/cuda/include/thrust/reduce.h" && cp "/usr/local/cuda-8.0/include/thrust/equal.h" "$(@D)/cuda/include/thrust/equal.h" && cp "/usr/local/cuda-8.0/include/thrust/complex.h" "$(@D)/cuda/include/thrust/complex.h" && cp "/usr/local/cuda-8.0/include/thrust/device_allocator.h" "$(@D)/cuda/include/thrust/device_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/copy.h" "$(@D)/cuda/include/thrust/copy.h" && cp "/usr/local/cuda-8.0/include/thrust/uninitialized_copy.h" "$(@D)/cuda/include/thrust/uninitialized_copy.h" && cp "/usr/local/cuda-8.0/include/thrust/device_reference.h" "$(@D)/cuda/include/thrust/device_reference.h" && cp "/usr/local/cuda-8.0/include/thrust/binary_search.h" "$(@D)/cuda/include/thrust/binary_search.h" && cp "/usr/local/cuda-8.0/include/thrust/set_operations.h" "$(@D)/cuda/include/thrust/set_operations.h" && cp "/usr/local/cuda-8.0/include/thrust/swap.h" "$(@D)/cuda/include/thrust/swap.h" && cp "/usr/local/cuda-8.0/include/thrust/mismatch.h" "$(@D)/cuda/include/thrust/mismatch.h" && cp "/usr/local/cuda-8.0/include/thrust/extrema.h" "$(@D)/cuda/include/thrust/extrema.h" && cp "/usr/local/cuda-8.0/include/thrust/count.h" "$(@D)/cuda/include/thrust/count.h" && cp "/usr/local/cuda-8.0/include/thrust/device_free.h" "$(@D)/cuda/include/thrust/device_free.h" && cp "/usr/local/cuda-8.0/include/thrust/random/discard_block_engine.h" "$(@D)/cuda/include/thrust/random/discard_block_engine.h" && cp "/usr/local/cuda-8.0/include/thrust/random/normal_distribution.h" "$(@D)/cuda/include/thrust/random/normal_distribution.h" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h" "$(@D)/cuda/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h" && cp 
"/usr/local/cuda-8.0/include/thrust/random/detail/subtract_with_carry_engine.inl" "$(@D)/cuda/include/thrust/random/detail/subtract_with_carry_engine.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/xor_combine_engine_max.h" "$(@D)/cuda/include/thrust/random/detail/xor_combine_engine_max.h" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/linear_congruential_engine_discard.h" "$(@D)/cuda/include/thrust/random/detail/linear_congruential_engine_discard.h" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/uniform_int_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/uniform_int_distribution.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/discard_block_engine.inl" "$(@D)/cuda/include/thrust/random/detail/discard_block_engine.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/uniform_real_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/uniform_real_distribution.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/random_core_access.h" "$(@D)/cuda/include/thrust/random/detail/random_core_access.h" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/mod.h" "$(@D)/cuda/include/thrust/random/detail/mod.h" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/linear_feedback_shift_engine.inl" "$(@D)/cuda/include/thrust/random/detail/linear_feedback_shift_engine.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/linear_congruential_engine.inl" "$(@D)/cuda/include/thrust/random/detail/linear_congruential_engine.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/xor_combine_engine.inl" "$(@D)/cuda/include/thrust/random/detail/xor_combine_engine.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/normal_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/normal_distribution.inl" && cp "/usr/local/cuda-8.0/include/thrust/random/detail/normal_distribution_base.h" "$(@D)/cuda/include/thrust/random/detail/normal_distribution_base.h" && cp 
"/usr/local/cuda-8.0/include/thrust/random/uniform_int_distribution.h" "$(@D)/cuda/include/thrust/random/uniform_int_distribution.h" && cp "/usr/local/cuda-8.0/include/thrust/random/linear_feedback_shift_engine.h" "$(@D)/cuda/include/thrust/random/linear_feedback_shift_engine.h" && cp "/usr/local/cuda-8.0/include/thrust/random/xor_combine_engine.h" "$(@D)/cuda/include/thrust/random/xor_combine_engine.h" && cp "/usr/local/cuda-8.0/include/thrust/random/subtract_with_carry_engine.h" "$(@D)/cuda/include/thrust/random/subtract_with_carry_engine.h" && cp "/usr/local/cuda-8.0/include/thrust/random/linear_congruential_engine.h" "$(@D)/cuda/include/thrust/random/linear_congruential_engine.h" && cp "/usr/local/cuda-8.0/include/thrust/random/uniform_real_distribution.h" "$(@D)/cuda/include/thrust/random/uniform_real_distribution.h" && cp "/usr/local/cuda-8.0/include/thrust/functional.h" "$(@D)/cuda/include/thrust/functional.h" && cp "/usr/local/cuda-8.0/include/thrust/replace.h" "$(@D)/cuda/include/thrust/replace.h" && cp "/usr/local/cuda-8.0/include/thrust/device_new_allocator.h" "$(@D)/cuda/include/thrust/device_new_allocator.h" && cp "/usr/local/cuda-8.0/include/thrust/host_vector.h" "$(@D)/cuda/include/thrust/host_vector.h" && cp "/usr/local/cuda-8.0/include/thrust/version.h" "$(@D)/cuda/include/thrust/version.h" && cp "/usr/local/cuda-8.0/include/thrust/inner_product.h" "$(@D)/cuda/include/thrust/inner_product.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/iterator_traits.h" "$(@D)/cuda/include/thrust/iterator/iterator_traits.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/discard_iterator.h" "$(@D)/cuda/include/thrust/iterator/discard_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/retag.h" "$(@D)/cuda/include/thrust/iterator/retag.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/permutation_iterator.h" "$(@D)/cuda/include/thrust/iterator/permutation_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/transform_iterator.h" 
"$(@D)/cuda/include/thrust/iterator/transform_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/reverse_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/reverse_iterator.inl" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/zip_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/zip_iterator.inl" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/counting_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/counting_iterator.inl" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/distance_from_result.h" "$(@D)/cuda/include/thrust/iterator/detail/distance_from_result.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/host_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/host_system_tag.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_traversal_tags.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_traversal_tags.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/retag.h" "$(@D)/cuda/include/thrust/iterator/detail/retag.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/tagged_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/tagged_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_traits.inl" "$(@D)/cuda/include/thrust/iterator/detail/iterator_traits.inl" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/minimum_category.h" "$(@D)/cuda/include/thrust/iterator/detail/minimum_category.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/discard_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/discard_iterator_base.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_category_to_traversal.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_to_traversal.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/zip_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/zip_iterator_base.h" && cp 
"/usr/local/cuda-8.0/include/thrust/iterator/detail/normal_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/normal_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/join_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/join_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/device_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/device_system_tag.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/universal_categories.h" "$(@D)/cuda/include/thrust/iterator/detail/universal_categories.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/reverse_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/reverse_iterator_base.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/minimum_system.h" "$(@D)/cuda/include/thrust/iterator/detail/minimum_system.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/tuple_of_iterator_references.h" "$(@D)/cuda/include/thrust/iterator/detail/tuple_of_iterator_references.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/is_iterator_category.h" "$(@D)/cuda/include/thrust/iterator/detail/is_iterator_category.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/permutation_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/permutation_iterator_base.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/any_assign.h" "$(@D)/cuda/include/thrust/iterator/detail/any_assign.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/any_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/any_system_tag.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/is_trivial_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/is_trivial_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_category_to_system.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_to_system.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_adaptor_base.h" 
"$(@D)/cuda/include/thrust/iterator/detail/iterator_adaptor_base.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/constant_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/constant_iterator_base.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/transform_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/transform_iterator.inl" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_facade_category.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_facade_category.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/constant_iterator.h" "$(@D)/cuda/include/thrust/iterator/constant_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/counting_iterator.h" "$(@D)/cuda/include/thrust/iterator/counting_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/iterator_adaptor.h" "$(@D)/cuda/include/thrust/iterator/iterator_adaptor.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/iterator_facade.h" "$(@D)/cuda/include/thrust/iterator/iterator_facade.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/iterator_categories.h" "$(@D)/cuda/include/thrust/iterator/iterator_categories.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/reverse_iterator.h" "$(@D)/cuda/include/thrust/iterator/reverse_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/iterator/zip_iterator.h" "$(@D)/cuda/include/thrust/iterator/zip_iterator.h" && cp "/usr/local/cuda-8.0/include/thrust/logical.h" "$(@D)/cuda/include/thrust/logical.h" && cp "/usr/local/cuda-8.0/include/thrust/tuple.h" "$(@D)/cuda/include/thrust/tuple.h" && cp "/usr/local/cuda-8.0/include/thrust/memory.h" "$(@D)/cuda/include/thrust/memory.h" && cp "/usr/local/cuda-8.0/include/thrust/random.h" "$(@D)/cuda/include/thrust/random.h" && cp 
"/usr/local/cuda-8.0/include/thrust/fill.h" "$(@D)/cuda/include/thrust/fill.h" && cp "/usr/local/cuda-8.0/include/thrust/transform.h" "$(@D)/cuda/include/thrust/transform.h" && cp "/usr/local/cuda-8.0/include/texture_types.h" "$(@D)/cuda/include/texture_types.h" && cp "/usr/local/cuda-8.0/include/nppversion.h" "$(@D)/cuda/include/nppversion.h" && cp "/usr/local/cuda-8.0/include/cuda_texture_types.h" "$(@D)/cuda/include/cuda_texture_types.h" && cp "/usr/local/cuda-8.0/include/fatbinary.h" "$(@D)/cuda/include/fatbinary.h" && cp "/usr/local/cuda-8.0/include/cublasXt.h" "$(@D)/cuda/include/cublasXt.h" && cp "/usr/local/cuda-8.0/include/cuda_fp16.h" "$(@D)/cuda/include/cuda_fp16.h" && cp "/usr/local/cuda-8.0/include/vector_functions.h" "$(@D)/cuda/include/vector_functions.h" && cp "/usr/local/cuda-8.0/include/cusparse.h" "$(@D)/cuda/include/cusparse.h" && cp "/usr/local/cuda-8.0/include/nppi_filtering_functions.h" "$(@D)/cuda/include/nppi_filtering_functions.h" && cp "/usr/local/cuda-8.0/include/nppi_morphological_operations.h" "$(@D)/cuda/include/nppi_morphological_operations.h" && cp "/usr/local/cuda-8.0/include/sobol_direction_vectors.h" "$(@D)/cuda/include/sobol_direction_vectors.h" && cp "/usr/local/cuda-8.0/include/nvblas.h" "$(@D)/cuda/include/nvblas.h" && cp "/usr/local/cuda-8.0/include/curand_mtgp32dc_p_11213.h" "$(@D)/cuda/include/curand_mtgp32dc_p_11213.h" && cp "/usr/local/cuda-8.0/include/nvcuvid.h" "$(@D)/cuda/include/nvcuvid.h" && cp "/usr/local/cuda-8.0/include/cuda_runtime_api.h" "$(@D)/cuda/include/cuda_runtime_api.h" && cp "/usr/local/cuda-8.0/include/curand_mtgp32_kernel.h" "$(@D)/cuda/include/curand_mtgp32_kernel.h" && cp "/usr/local/cuda-8.0/include/cublas_v2.h" "$(@D)/cuda/include/cublas_v2.h" && cp "/usr/local/cuda-8.0/include/builtin_types.h" "$(@D)/cuda/include/builtin_types.h" && cp "/usr/local/cuda-8.0/include/nppi_geometry_transforms.h" "$(@D)/cuda/include/nppi_geometry_transforms.h" && cp 
"/usr/local/cuda-8.0/include/npps_support_functions.h" "$(@D)/cuda/include/npps_support_functions.h" && cp "/usr/local/cuda-8.0/include/cufftw.h" "$(@D)/cuda/include/cufftw.h" && cp "/usr/local/cuda-8.0/include/cuda_device_runtime_api.h" "$(@D)/cuda/include/cuda_device_runtime_api.h" && cp "/usr/local/cuda-8.0/include/sm_30_intrinsics.hpp" "$(@D)/cuda/include/sm_30_intrinsics.hpp" && cp "/usr/local/cuda-8.0/include/vector_types.h" "$(@D)/cuda/include/vector_types.h" && cp "/usr/local/cuda-8.0/include/sm_35_atomic_functions.h" "$(@D)/cuda/include/sm_35_atomic_functions.h" && cp "/usr/local/cuda-8.0/include/sm_20_intrinsics.h" "$(@D)/cuda/include/sm_20_intrinsics.h" && cp "/usr/local/cuda-8.0/include/driver_types.h" "$(@D)/cuda/include/driver_types.h" && cp "/usr/local/cuda-8.0/include/nvToolsExtCudaRt.h" "$(@D)/cuda/include/nvToolsExtCudaRt.h" && cp "/usr/local/cuda-8.0/include/curand_globals.h" "$(@D)/cuda/include/curand_globals.h" && cp "/usr/local/cuda-8.0/include/device_atomic_functions.h" "$(@D)/cuda/include/device_atomic_functions.h" && cp "/usr/local/cuda-8.0/include/surface_types.h" "$(@D)/cuda/include/surface_types.h" && cp "/usr/local/cuda-8.0/include/nvrtc.h" "$(@D)/cuda/include/nvrtc.h" && cp "/usr/local/cuda-8.0/include/nppdefs.h" "$(@D)/cuda/include/nppdefs.h" && cp "/usr/local/cuda-8.0/include/sm_60_atomic_functions.h" "$(@D)/cuda/include/sm_60_atomic_functions.h" && cp "/usr/local/cuda-8.0/include/driver_functions.h" "$(@D)/cuda/include/driver_functions.h" && cp "/usr/local/cuda-8.0/include/cusolver_common.h" "$(@D)/cuda/include/cusolver_common.h" && cp "/usr/local/cuda-8.0/include/cublas.h" "$(@D)/cuda/include/cublas.h" && cp "/usr/local/cuda-8.0/include/curand_lognormal.h" "$(@D)/cuda/include/curand_lognormal.h" && cp "/usr/local/cuda-8.0/include/device_atomic_functions.hpp" "$(@D)/cuda/include/device_atomic_functions.hpp" && cp "/usr/local/cuda-8.0/include/crt/device_runtime.h" "$(@D)/cuda/include/crt/device_runtime.h" && cp 
"/usr/local/cuda-8.0/include/crt/storage_class.h" "$(@D)/cuda/include/crt/storage_class.h" && cp "/usr/local/cuda-8.0/include/crt/func_macro.h" "$(@D)/cuda/include/crt/func_macro.h" && cp "/usr/local/cuda-8.0/include/crt/host_runtime.h" "$(@D)/cuda/include/crt/host_runtime.h" && cp "/usr/local/cuda-8.0/include/nppi_arithmetic_and_logical_operations.h" "$(@D)/cuda/include/nppi_arithmetic_and_logical_operations.h" && cp "/usr/local/cuda-8.0/include/npps_arithmetic_and_logical_operations.h" "$(@D)/cuda/include/npps_arithmetic_and_logical_operations.h" && cp "/usr/local/cuda-8.0/include/nppi_computer_vision.h" "$(@D)/cuda/include/nppi_computer_vision.h" && cp "/usr/local/cuda-8.0/include/surface_functions.hpp" "$(@D)/cuda/include/surface_functions.hpp" && cp "/usr/local/cuda-8.0/include/surface_functions.h" "$(@D)/cuda/include/surface_functions.h" && cp "/usr/local/cuda-8.0/include/curand_normal_static.h" "$(@D)/cuda/include/curand_normal_static.h" && cp "/usr/local/cuda-8.0/include/curand.h" "$(@D)/cuda/include/curand.h" && cp "/usr/local/cuda-8.0/include/math_functions_dbl_ptx3.h" "$(@D)/cuda/include/math_functions_dbl_ptx3.h" && cp "/usr/local/cuda-8.0/include/curand_philox4x32_x.h" "$(@D)/cuda/include/curand_philox4x32_x.h" && cp "/usr/local/cuda-8.0/include/nppi_threshold_and_compare_operations.h" "$(@D)/cuda/include/nppi_threshold_and_compare_operations.h" && cp "/usr/local/cuda-8.0/include/nvml.h" "$(@D)/cuda/include/nvml.h" && cp "/usr/local/cuda-8.0/include/npps.h" "$(@D)/cuda/include/npps.h" && cp "/usr/local/cuda-8.0/include/cuda_vdpau_interop.h" "$(@D)/cuda/include/cuda_vdpau_interop.h" && cp "/usr/local/cuda-8.0/include/sm_61_intrinsics.hpp" "$(@D)/cuda/include/sm_61_intrinsics.hpp" && cp "/usr/local/cuda-8.0/include/cublas_api.h" "$(@D)/cuda/include/cublas_api.h" && cp "/usr/local/cuda-8.0/include/nppi_color_conversion.h" "$(@D)/cuda/include/nppi_color_conversion.h" && cp "/usr/local/cuda-8.0/include/math_functions_dbl_ptx3.hpp" 
"$(@D)/cuda/include/math_functions_dbl_ptx3.hpp" && cp "/usr/local/cuda-8.0/include/nppcore.h" "$(@D)/cuda/include/nppcore.h" && cp "/usr/local/cuda-8.0/include/cudaGL.h" "$(@D)/cuda/include/cudaGL.h" && cp "/usr/local/cuda-8.0/include/fatBinaryCtl.h" "$(@D)/cuda/include/fatBinaryCtl.h" && cp "/usr/local/cuda-8.0/include/npps_statistics_functions.h" "$(@D)/cuda/include/npps_statistics_functions.h" && cp "/usr/local/cuda-8.0/include/cudaVDPAU.h" "$(@D)/cuda/include/cudaVDPAU.h" && cp "/usr/local/cuda-8.0/include/curand_poisson.h" "$(@D)/cuda/include/curand_poisson.h" && cp "/usr/local/cuda-8.0/include/cusolverDn.h" "$(@D)/cuda/include/cusolverDn.h" && cp "/usr/local/cuda-8.0/include/cuda_profiler_api.h" "$(@D)/cuda/include/cuda_profiler_api.h" && cp "/usr/local/cuda-8.0/include/sm_20_atomic_functions.h" "$(@D)/cuda/include/sm_20_atomic_functions.h" && cp "/usr/local/cuda-8.0/include/nvfunctional" "$(@D)/cuda/include/nvfunctional"
+if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-9.0/include/CL/cl.h" "$(@D)/cuda/include/CL/cl.h" && cp "/usr/local/cuda-9.0/include/CL/cl.hpp" "$(@D)/cuda/include/CL/cl.hpp" && cp "/usr/local/cuda-9.0/include/CL/cl_egl.h" "$(@D)/cuda/include/CL/cl_egl.h" && cp "/usr/local/cuda-9.0/include/CL/cl_ext.h" "$(@D)/cuda/include/CL/cl_ext.h" && cp "/usr/local/cuda-9.0/include/CL/cl_gl.h" "$(@D)/cuda/include/CL/cl_gl.h" && cp "/usr/local/cuda-9.0/include/CL/cl_gl_ext.h" "$(@D)/cuda/include/CL/cl_gl_ext.h" && cp "/usr/local/cuda-9.0/include/CL/cl_platform.h" "$(@D)/cuda/include/CL/cl_platform.h" && cp "/usr/local/cuda-9.0/include/CL/opencl.h" "$(@D)/cuda/include/CL/opencl.h" && cp "/usr/local/cuda-9.0/include/builtin_types.h" "$(@D)/cuda/include/builtin_types.h" && cp "/usr/local/cuda-9.0/include/channel_descriptor.h" "$(@D)/cuda/include/channel_descriptor.h" && cp "/usr/local/cuda-9.0/include/common_functions.h" "$(@D)/cuda/include/common_functions.h" && cp "/usr/local/cuda-9.0/include/cooperative_groups.h" "$(@D)/cuda/include/cooperative_groups.h" && cp "/usr/local/cuda-9.0/include/cooperative_groups_helpers.h" "$(@D)/cuda/include/cooperative_groups_helpers.h" && cp "/usr/local/cuda-9.0/include/crt/common_functions.h" "$(@D)/cuda/include/crt/common_functions.h" && cp "/usr/local/cuda-9.0/include/crt/device_double_functions.h" "$(@D)/cuda/include/crt/device_double_functions.h" && cp "/usr/local/cuda-9.0/include/crt/device_double_functions.hpp" "$(@D)/cuda/include/crt/device_double_functions.hpp" && cp "/usr/local/cuda-9.0/include/crt/device_functions.h" "$(@D)/cuda/include/crt/device_functions.h" && cp "/usr/local/cuda-9.0/include/crt/device_functions.hpp" "$(@D)/cuda/include/crt/device_functions.hpp" && cp "/usr/local/cuda-9.0/include/crt/func_macro.h" 
"$(@D)/cuda/include/crt/func_macro.h" && cp "/usr/local/cuda-9.0/include/crt/host_config.h" "$(@D)/cuda/include/crt/host_config.h" && cp "/usr/local/cuda-9.0/include/crt/host_defines.h" "$(@D)/cuda/include/crt/host_defines.h" && cp "/usr/local/cuda-9.0/include/crt/host_runtime.h" "$(@D)/cuda/include/crt/host_runtime.h" && cp "/usr/local/cuda-9.0/include/crt/math_functions.h" "$(@D)/cuda/include/crt/math_functions.h" && cp "/usr/local/cuda-9.0/include/crt/math_functions.hpp" "$(@D)/cuda/include/crt/math_functions.hpp" && cp "/usr/local/cuda-9.0/include/crt/mma.h" "$(@D)/cuda/include/crt/mma.h" && cp "/usr/local/cuda-9.0/include/crt/mma.hpp" "$(@D)/cuda/include/crt/mma.hpp" && cp "/usr/local/cuda-9.0/include/crt/nvfunctional" "$(@D)/cuda/include/crt/nvfunctional" && cp "/usr/local/cuda-9.0/include/crt/sm_70_rt.h" "$(@D)/cuda/include/crt/sm_70_rt.h" && cp "/usr/local/cuda-9.0/include/crt/sm_70_rt.hpp" "$(@D)/cuda/include/crt/sm_70_rt.hpp" && cp "/usr/local/cuda-9.0/include/crt/storage_class.h" "$(@D)/cuda/include/crt/storage_class.h" && cp "/usr/local/cuda-9.0/include/cuComplex.h" "$(@D)/cuda/include/cuComplex.h" && cp "/usr/local/cuda-9.0/include/cublas.h" "$(@D)/cuda/include/cublas.h" && cp "/usr/local/cuda-9.0/include/cublasXt.h" "$(@D)/cuda/include/cublasXt.h" && cp "/usr/local/cuda-9.0/include/cublas_api.h" "$(@D)/cuda/include/cublas_api.h" && cp "/usr/local/cuda-9.0/include/cublas_v2.h" "$(@D)/cuda/include/cublas_v2.h" && cp "/usr/local/cuda-9.0/include/cuda.h" "$(@D)/cuda/include/cuda.h" && cp "/usr/local/cuda-9.0/include/cudaEGL.h" "$(@D)/cuda/include/cudaEGL.h" && cp "/usr/local/cuda-9.0/include/cudaGL.h" "$(@D)/cuda/include/cudaGL.h" && cp "/usr/local/cuda-9.0/include/cudaProfiler.h" "$(@D)/cuda/include/cudaProfiler.h" && cp "/usr/local/cuda-9.0/include/cudaVDPAU.h" "$(@D)/cuda/include/cudaVDPAU.h" && cp "/usr/local/cuda-9.0/include/cuda_device_runtime_api.h" "$(@D)/cuda/include/cuda_device_runtime_api.h" && cp "/usr/local/cuda-9.0/include/cuda_fp16.h" 
"$(@D)/cuda/include/cuda_fp16.h" && cp "/usr/local/cuda-9.0/include/cuda_fp16.hpp" "$(@D)/cuda/include/cuda_fp16.hpp" && cp "/usr/local/cuda-9.0/include/cuda_gl_interop.h" "$(@D)/cuda/include/cuda_gl_interop.h" && cp "/usr/local/cuda-9.0/include/cuda_occupancy.h" "$(@D)/cuda/include/cuda_occupancy.h" && cp "/usr/local/cuda-9.0/include/cuda_profiler_api.h" "$(@D)/cuda/include/cuda_profiler_api.h" && cp "/usr/local/cuda-9.0/include/cuda_runtime.h" "$(@D)/cuda/include/cuda_runtime.h" && cp "/usr/local/cuda-9.0/include/cuda_runtime_api.h" "$(@D)/cuda/include/cuda_runtime_api.h" && cp "/usr/local/cuda-9.0/include/cuda_surface_types.h" "$(@D)/cuda/include/cuda_surface_types.h" && cp "/usr/local/cuda-9.0/include/cuda_texture_types.h" "$(@D)/cuda/include/cuda_texture_types.h" && cp "/usr/local/cuda-9.0/include/cuda_vdpau_interop.h" "$(@D)/cuda/include/cuda_vdpau_interop.h" && cp "/usr/local/cuda-9.0/include/cudalibxt.h" "$(@D)/cuda/include/cudalibxt.h" && cp "/usr/local/cuda-9.0/include/cudnn.h" "$(@D)/cuda/include/cudnn.h" && cp "/usr/local/cuda-9.0/include/cufft.h" "$(@D)/cuda/include/cufft.h" && cp "/usr/local/cuda-9.0/include/cufftXt.h" "$(@D)/cuda/include/cufftXt.h" && cp "/usr/local/cuda-9.0/include/cufftw.h" "$(@D)/cuda/include/cufftw.h" && cp "/usr/local/cuda-9.0/include/curand.h" "$(@D)/cuda/include/curand.h" && cp "/usr/local/cuda-9.0/include/curand_discrete.h" "$(@D)/cuda/include/curand_discrete.h" && cp "/usr/local/cuda-9.0/include/curand_discrete2.h" "$(@D)/cuda/include/curand_discrete2.h" && cp "/usr/local/cuda-9.0/include/curand_globals.h" "$(@D)/cuda/include/curand_globals.h" && cp "/usr/local/cuda-9.0/include/curand_kernel.h" "$(@D)/cuda/include/curand_kernel.h" && cp "/usr/local/cuda-9.0/include/curand_lognormal.h" "$(@D)/cuda/include/curand_lognormal.h" && cp "/usr/local/cuda-9.0/include/curand_mrg32k3a.h" "$(@D)/cuda/include/curand_mrg32k3a.h" && cp "/usr/local/cuda-9.0/include/curand_mtgp32.h" "$(@D)/cuda/include/curand_mtgp32.h" && cp 
"/usr/local/cuda-9.0/include/curand_mtgp32_host.h" "$(@D)/cuda/include/curand_mtgp32_host.h" && cp "/usr/local/cuda-9.0/include/curand_mtgp32_kernel.h" "$(@D)/cuda/include/curand_mtgp32_kernel.h" && cp "/usr/local/cuda-9.0/include/curand_mtgp32dc_p_11213.h" "$(@D)/cuda/include/curand_mtgp32dc_p_11213.h" && cp "/usr/local/cuda-9.0/include/curand_normal.h" "$(@D)/cuda/include/curand_normal.h" && cp "/usr/local/cuda-9.0/include/curand_normal_static.h" "$(@D)/cuda/include/curand_normal_static.h" && cp "/usr/local/cuda-9.0/include/curand_philox4x32_x.h" "$(@D)/cuda/include/curand_philox4x32_x.h" && cp "/usr/local/cuda-9.0/include/curand_poisson.h" "$(@D)/cuda/include/curand_poisson.h" && cp "/usr/local/cuda-9.0/include/curand_precalc.h" "$(@D)/cuda/include/curand_precalc.h" && cp "/usr/local/cuda-9.0/include/curand_uniform.h" "$(@D)/cuda/include/curand_uniform.h" && cp "/usr/local/cuda-9.0/include/cusolverDn.h" "$(@D)/cuda/include/cusolverDn.h" && cp "/usr/local/cuda-9.0/include/cusolverRf.h" "$(@D)/cuda/include/cusolverRf.h" && cp "/usr/local/cuda-9.0/include/cusolverSp.h" "$(@D)/cuda/include/cusolverSp.h" && cp "/usr/local/cuda-9.0/include/cusolverSp_LOWLEVEL_PREVIEW.h" "$(@D)/cuda/include/cusolverSp_LOWLEVEL_PREVIEW.h" && cp "/usr/local/cuda-9.0/include/cusolver_common.h" "$(@D)/cuda/include/cusolver_common.h" && cp "/usr/local/cuda-9.0/include/cusparse.h" "$(@D)/cuda/include/cusparse.h" && cp "/usr/local/cuda-9.0/include/cusparse_v2.h" "$(@D)/cuda/include/cusparse_v2.h" && cp "/usr/local/cuda-9.0/include/device_atomic_functions.h" "$(@D)/cuda/include/device_atomic_functions.h" && cp "/usr/local/cuda-9.0/include/device_atomic_functions.hpp" "$(@D)/cuda/include/device_atomic_functions.hpp" && cp "/usr/local/cuda-9.0/include/device_double_functions.h" "$(@D)/cuda/include/device_double_functions.h" && cp "/usr/local/cuda-9.0/include/device_double_functions.hpp" "$(@D)/cuda/include/device_double_functions.hpp" && cp "/usr/local/cuda-9.0/include/device_functions.h" 
"$(@D)/cuda/include/device_functions.h" && cp "/usr/local/cuda-9.0/include/device_functions.hpp" "$(@D)/cuda/include/device_functions.hpp" && cp "/usr/local/cuda-9.0/include/device_functions_decls.h" "$(@D)/cuda/include/device_functions_decls.h" && cp "/usr/local/cuda-9.0/include/device_launch_parameters.h" "$(@D)/cuda/include/device_launch_parameters.h" && cp "/usr/local/cuda-9.0/include/device_types.h" "$(@D)/cuda/include/device_types.h" && cp "/usr/local/cuda-9.0/include/driver_functions.h" "$(@D)/cuda/include/driver_functions.h" && cp "/usr/local/cuda-9.0/include/driver_types.h" "$(@D)/cuda/include/driver_types.h" && cp "/usr/local/cuda-9.0/include/dynlink_cuda.h" "$(@D)/cuda/include/dynlink_cuda.h" && cp "/usr/local/cuda-9.0/include/dynlink_cuda_cuda.h" "$(@D)/cuda/include/dynlink_cuda_cuda.h" && cp "/usr/local/cuda-9.0/include/dynlink_cuviddec.h" "$(@D)/cuda/include/dynlink_cuviddec.h" && cp "/usr/local/cuda-9.0/include/dynlink_nvcuvid.h" "$(@D)/cuda/include/dynlink_nvcuvid.h" && cp "/usr/local/cuda-9.0/include/fatBinaryCtl.h" "$(@D)/cuda/include/fatBinaryCtl.h" && cp "/usr/local/cuda-9.0/include/fatbinary.h" "$(@D)/cuda/include/fatbinary.h" && cp "/usr/local/cuda-9.0/include/host_config.h" "$(@D)/cuda/include/host_config.h" && cp "/usr/local/cuda-9.0/include/host_defines.h" "$(@D)/cuda/include/host_defines.h" && cp "/usr/local/cuda-9.0/include/library_types.h" "$(@D)/cuda/include/library_types.h" && cp "/usr/local/cuda-9.0/include/math_constants.h" "$(@D)/cuda/include/math_constants.h" && cp "/usr/local/cuda-9.0/include/math_functions.h" "$(@D)/cuda/include/math_functions.h" && cp "/usr/local/cuda-9.0/include/math_functions.hpp" "$(@D)/cuda/include/math_functions.hpp" && cp "/usr/local/cuda-9.0/include/math_functions_dbl_ptx3.h" "$(@D)/cuda/include/math_functions_dbl_ptx3.h" && cp "/usr/local/cuda-9.0/include/math_functions_dbl_ptx3.hpp" "$(@D)/cuda/include/math_functions_dbl_ptx3.hpp" && cp "/usr/local/cuda-9.0/include/mma.h" "$(@D)/cuda/include/mma.h" && 
cp "/usr/local/cuda-9.0/include/npp.h" "$(@D)/cuda/include/npp.h" && cp "/usr/local/cuda-9.0/include/nppcore.h" "$(@D)/cuda/include/nppcore.h" && cp "/usr/local/cuda-9.0/include/nppdefs.h" "$(@D)/cuda/include/nppdefs.h" && cp "/usr/local/cuda-9.0/include/nppi.h" "$(@D)/cuda/include/nppi.h" && cp "/usr/local/cuda-9.0/include/nppi_arithmetic_and_logical_operations.h" "$(@D)/cuda/include/nppi_arithmetic_and_logical_operations.h" && cp "/usr/local/cuda-9.0/include/nppi_color_conversion.h" "$(@D)/cuda/include/nppi_color_conversion.h" && cp "/usr/local/cuda-9.0/include/nppi_compression_functions.h" "$(@D)/cuda/include/nppi_compression_functions.h" && cp "/usr/local/cuda-9.0/include/nppi_computer_vision.h" "$(@D)/cuda/include/nppi_computer_vision.h" && cp "/usr/local/cuda-9.0/include/nppi_data_exchange_and_initialization.h" "$(@D)/cuda/include/nppi_data_exchange_and_initialization.h" && cp "/usr/local/cuda-9.0/include/nppi_filtering_functions.h" "$(@D)/cuda/include/nppi_filtering_functions.h" && cp "/usr/local/cuda-9.0/include/nppi_geometry_transforms.h" "$(@D)/cuda/include/nppi_geometry_transforms.h" && cp "/usr/local/cuda-9.0/include/nppi_linear_transforms.h" "$(@D)/cuda/include/nppi_linear_transforms.h" && cp "/usr/local/cuda-9.0/include/nppi_morphological_operations.h" "$(@D)/cuda/include/nppi_morphological_operations.h" && cp "/usr/local/cuda-9.0/include/nppi_statistics_functions.h" "$(@D)/cuda/include/nppi_statistics_functions.h" && cp "/usr/local/cuda-9.0/include/nppi_support_functions.h" "$(@D)/cuda/include/nppi_support_functions.h" && cp "/usr/local/cuda-9.0/include/nppi_threshold_and_compare_operations.h" "$(@D)/cuda/include/nppi_threshold_and_compare_operations.h" && cp "/usr/local/cuda-9.0/include/npps.h" "$(@D)/cuda/include/npps.h" && cp "/usr/local/cuda-9.0/include/npps_arithmetic_and_logical_operations.h" "$(@D)/cuda/include/npps_arithmetic_and_logical_operations.h" && cp "/usr/local/cuda-9.0/include/npps_conversion_functions.h" 
"$(@D)/cuda/include/npps_conversion_functions.h" && cp "/usr/local/cuda-9.0/include/npps_filtering_functions.h" "$(@D)/cuda/include/npps_filtering_functions.h" && cp "/usr/local/cuda-9.0/include/npps_initialization.h" "$(@D)/cuda/include/npps_initialization.h" && cp "/usr/local/cuda-9.0/include/npps_statistics_functions.h" "$(@D)/cuda/include/npps_statistics_functions.h" && cp "/usr/local/cuda-9.0/include/npps_support_functions.h" "$(@D)/cuda/include/npps_support_functions.h" && cp "/usr/local/cuda-9.0/include/nppversion.h" "$(@D)/cuda/include/nppversion.h" && cp "/usr/local/cuda-9.0/include/nvToolsExt.h" "$(@D)/cuda/include/nvToolsExt.h" && cp "/usr/local/cuda-9.0/include/nvToolsExtCuda.h" "$(@D)/cuda/include/nvToolsExtCuda.h" && cp "/usr/local/cuda-9.0/include/nvToolsExtCudaRt.h" "$(@D)/cuda/include/nvToolsExtCudaRt.h" && cp "/usr/local/cuda-9.0/include/nvToolsExtMeta.h" "$(@D)/cuda/include/nvToolsExtMeta.h" && cp "/usr/local/cuda-9.0/include/nvToolsExtSync.h" "$(@D)/cuda/include/nvToolsExtSync.h" && cp "/usr/local/cuda-9.0/include/nvblas.h" "$(@D)/cuda/include/nvblas.h" && cp "/usr/local/cuda-9.0/include/nvfunctional" "$(@D)/cuda/include/nvfunctional" && cp "/usr/local/cuda-9.0/include/nvgraph.h" "$(@D)/cuda/include/nvgraph.h" && cp "/usr/local/cuda-9.0/include/nvml.h" "$(@D)/cuda/include/nvml.h" && cp "/usr/local/cuda-9.0/include/nvrtc.h" "$(@D)/cuda/include/nvrtc.h" && cp "/usr/local/cuda-9.0/include/sm_20_atomic_functions.h" "$(@D)/cuda/include/sm_20_atomic_functions.h" && cp "/usr/local/cuda-9.0/include/sm_20_atomic_functions.hpp" "$(@D)/cuda/include/sm_20_atomic_functions.hpp" && cp "/usr/local/cuda-9.0/include/sm_20_intrinsics.h" "$(@D)/cuda/include/sm_20_intrinsics.h" && cp "/usr/local/cuda-9.0/include/sm_20_intrinsics.hpp" "$(@D)/cuda/include/sm_20_intrinsics.hpp" && cp "/usr/local/cuda-9.0/include/sm_30_intrinsics.h" "$(@D)/cuda/include/sm_30_intrinsics.h" && cp "/usr/local/cuda-9.0/include/sm_30_intrinsics.hpp" "$(@D)/cuda/include/sm_30_intrinsics.hpp" 
&& cp "/usr/local/cuda-9.0/include/sm_32_atomic_functions.h" "$(@D)/cuda/include/sm_32_atomic_functions.h" && cp "/usr/local/cuda-9.0/include/sm_32_atomic_functions.hpp" "$(@D)/cuda/include/sm_32_atomic_functions.hpp" && cp "/usr/local/cuda-9.0/include/sm_32_intrinsics.h" "$(@D)/cuda/include/sm_32_intrinsics.h" && cp "/usr/local/cuda-9.0/include/sm_32_intrinsics.hpp" "$(@D)/cuda/include/sm_32_intrinsics.hpp" && cp "/usr/local/cuda-9.0/include/sm_35_atomic_functions.h" "$(@D)/cuda/include/sm_35_atomic_functions.h" && cp "/usr/local/cuda-9.0/include/sm_35_intrinsics.h" "$(@D)/cuda/include/sm_35_intrinsics.h" && cp "/usr/local/cuda-9.0/include/sm_60_atomic_functions.h" "$(@D)/cuda/include/sm_60_atomic_functions.h" && cp "/usr/local/cuda-9.0/include/sm_60_atomic_functions.hpp" "$(@D)/cuda/include/sm_60_atomic_functions.hpp" && cp "/usr/local/cuda-9.0/include/sm_61_intrinsics.h" "$(@D)/cuda/include/sm_61_intrinsics.h" && cp "/usr/local/cuda-9.0/include/sm_61_intrinsics.hpp" "$(@D)/cuda/include/sm_61_intrinsics.hpp" && cp "/usr/local/cuda-9.0/include/sobol_direction_vectors.h" "$(@D)/cuda/include/sobol_direction_vectors.h" && cp "/usr/local/cuda-9.0/include/surface_functions.h" "$(@D)/cuda/include/surface_functions.h" && cp "/usr/local/cuda-9.0/include/surface_functions.hpp" "$(@D)/cuda/include/surface_functions.hpp" && cp "/usr/local/cuda-9.0/include/surface_indirect_functions.h" "$(@D)/cuda/include/surface_indirect_functions.h" && cp "/usr/local/cuda-9.0/include/surface_indirect_functions.hpp" "$(@D)/cuda/include/surface_indirect_functions.hpp" && cp "/usr/local/cuda-9.0/include/surface_types.h" "$(@D)/cuda/include/surface_types.h" && cp "/usr/local/cuda-9.0/include/texture_fetch_functions.h" "$(@D)/cuda/include/texture_fetch_functions.h" && cp "/usr/local/cuda-9.0/include/texture_fetch_functions.hpp" "$(@D)/cuda/include/texture_fetch_functions.hpp" && cp "/usr/local/cuda-9.0/include/texture_indirect_functions.h" "$(@D)/cuda/include/texture_indirect_functions.h" && cp 
"/usr/local/cuda-9.0/include/texture_indirect_functions.hpp" "$(@D)/cuda/include/texture_indirect_functions.hpp" && cp "/usr/local/cuda-9.0/include/texture_types.h" "$(@D)/cuda/include/texture_types.h" && cp "/usr/local/cuda-9.0/include/thrust/adjacent_difference.h" "$(@D)/cuda/include/thrust/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/advance.h" "$(@D)/cuda/include/thrust/advance.h" && cp "/usr/local/cuda-9.0/include/thrust/binary_search.h" "$(@D)/cuda/include/thrust/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/complex.h" "$(@D)/cuda/include/thrust/complex.h" && cp "/usr/local/cuda-9.0/include/thrust/copy.h" "$(@D)/cuda/include/thrust/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/count.h" "$(@D)/cuda/include/thrust/count.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/adjacent_difference.inl" "$(@D)/cuda/include/thrust/detail/adjacent_difference.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/advance.inl" "$(@D)/cuda/include/thrust/detail/advance.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/allocator_traits.h" "$(@D)/cuda/include/thrust/detail/allocator/allocator_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/allocator_traits.inl" "$(@D)/cuda/include/thrust/detail/allocator/allocator_traits.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/copy_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/copy_construct_range.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/copy_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/copy_construct_range.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/default_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/default_construct_range.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/default_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/default_construct_range.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/allocator/destroy_range.h" "$(@D)/cuda/include/thrust/detail/allocator/destroy_range.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/destroy_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/destroy_range.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/fill_construct_range.h" "$(@D)/cuda/include/thrust/detail/allocator/fill_construct_range.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/fill_construct_range.inl" "$(@D)/cuda/include/thrust/detail/allocator/fill_construct_range.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/malloc_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/malloc_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/malloc_allocator.inl" "$(@D)/cuda/include/thrust/detail/allocator/malloc_allocator.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/no_throw_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/no_throw_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/tagged_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/tagged_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/tagged_allocator.inl" "$(@D)/cuda/include/thrust/detail/allocator/tagged_allocator.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/temporary_allocator.h" "$(@D)/cuda/include/thrust/detail/allocator/temporary_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/allocator/temporary_allocator.inl" "$(@D)/cuda/include/thrust/detail/allocator/temporary_allocator.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/binary_search.inl" "$(@D)/cuda/include/thrust/detail/binary_search.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/arithmetic.h" "$(@D)/cuda/include/thrust/detail/complex/arithmetic.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/c99math.h" "$(@D)/cuda/include/thrust/detail/complex/c99math.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/complex/catrig.h" "$(@D)/cuda/include/thrust/detail/complex/catrig.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/catrigf.h" "$(@D)/cuda/include/thrust/detail/complex/catrigf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/ccosh.h" "$(@D)/cuda/include/thrust/detail/complex/ccosh.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/ccoshf.h" "$(@D)/cuda/include/thrust/detail/complex/ccoshf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/cexp.h" "$(@D)/cuda/include/thrust/detail/complex/cexp.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/cexpf.h" "$(@D)/cuda/include/thrust/detail/complex/cexpf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/clog.h" "$(@D)/cuda/include/thrust/detail/complex/clog.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/clogf.h" "$(@D)/cuda/include/thrust/detail/complex/clogf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/complex.inl" "$(@D)/cuda/include/thrust/detail/complex/complex.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/cpow.h" "$(@D)/cuda/include/thrust/detail/complex/cpow.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/cpowf.h" "$(@D)/cuda/include/thrust/detail/complex/cpowf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/cproj.h" "$(@D)/cuda/include/thrust/detail/complex/cproj.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/csinh.h" "$(@D)/cuda/include/thrust/detail/complex/csinh.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/csinhf.h" "$(@D)/cuda/include/thrust/detail/complex/csinhf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/csqrt.h" "$(@D)/cuda/include/thrust/detail/complex/csqrt.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/csqrtf.h" "$(@D)/cuda/include/thrust/detail/complex/csqrtf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/ctanh.h" "$(@D)/cuda/include/thrust/detail/complex/ctanh.h" && 
cp "/usr/local/cuda-9.0/include/thrust/detail/complex/ctanhf.h" "$(@D)/cuda/include/thrust/detail/complex/ctanhf.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/math_private.h" "$(@D)/cuda/include/thrust/detail/complex/math_private.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/complex/stream.h" "$(@D)/cuda/include/thrust/detail/complex/stream.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config.h" "$(@D)/cuda/include/thrust/detail/config.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/compiler.h" "$(@D)/cuda/include/thrust/detail/config/compiler.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/compiler_fence.h" "$(@D)/cuda/include/thrust/detail/config/compiler_fence.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/config.h" "$(@D)/cuda/include/thrust/detail/config/config.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/debug.h" "$(@D)/cuda/include/thrust/detail/config/debug.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/device_system.h" "$(@D)/cuda/include/thrust/detail/config/device_system.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/exec_check_disable.h" "$(@D)/cuda/include/thrust/detail/config/exec_check_disable.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/forceinline.h" "$(@D)/cuda/include/thrust/detail/config/forceinline.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/global_workarounds.h" "$(@D)/cuda/include/thrust/detail/config/global_workarounds.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/host_device.h" "$(@D)/cuda/include/thrust/detail/config/host_device.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/host_system.h" "$(@D)/cuda/include/thrust/detail/config/host_system.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/config/simple_defines.h" "$(@D)/cuda/include/thrust/detail/config/simple_defines.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/contiguous_storage.h" 
"$(@D)/cuda/include/thrust/detail/contiguous_storage.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/contiguous_storage.inl" "$(@D)/cuda/include/thrust/detail/contiguous_storage.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/copy.h" "$(@D)/cuda/include/thrust/detail/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/copy.inl" "$(@D)/cuda/include/thrust/detail/copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/copy_if.h" "$(@D)/cuda/include/thrust/detail/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/copy_if.inl" "$(@D)/cuda/include/thrust/detail/copy_if.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/count.inl" "$(@D)/cuda/include/thrust/detail/count.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/cstdint.h" "$(@D)/cuda/include/thrust/detail/cstdint.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_delete.inl" "$(@D)/cuda/include/thrust/detail/device_delete.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_free.inl" "$(@D)/cuda/include/thrust/detail/device_free.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_malloc.inl" "$(@D)/cuda/include/thrust/detail/device_malloc.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_new.inl" "$(@D)/cuda/include/thrust/detail/device_new.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_ptr.inl" "$(@D)/cuda/include/thrust/detail/device_ptr.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_reference.inl" "$(@D)/cuda/include/thrust/detail/device_reference.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/device_vector.inl" "$(@D)/cuda/include/thrust/detail/device_vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/dispatch/is_trivial_copy.h" "$(@D)/cuda/include/thrust/detail/dispatch/is_trivial_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/distance.inl" "$(@D)/cuda/include/thrust/detail/distance.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/equal.inl" 
"$(@D)/cuda/include/thrust/detail/equal.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/execute_with_allocator.h" "$(@D)/cuda/include/thrust/detail/execute_with_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/execution_policy.h" "$(@D)/cuda/include/thrust/detail/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/extrema.inl" "$(@D)/cuda/include/thrust/detail/extrema.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/fill.inl" "$(@D)/cuda/include/thrust/detail/fill.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/find.inl" "$(@D)/cuda/include/thrust/detail/find.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/for_each.inl" "$(@D)/cuda/include/thrust/detail/for_each.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/function.h" "$(@D)/cuda/include/thrust/detail/function.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional.inl" "$(@D)/cuda/include/thrust/detail/functional.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/actor.h" "$(@D)/cuda/include/thrust/detail/functional/actor.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/actor.inl" "$(@D)/cuda/include/thrust/detail/functional/actor.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/argument.h" "$(@D)/cuda/include/thrust/detail/functional/argument.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/composite.h" "$(@D)/cuda/include/thrust/detail/functional/composite.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/arithmetic_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/arithmetic_operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/assignment_operator.h" "$(@D)/cuda/include/thrust/detail/functional/operators/assignment_operator.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/functional/operators/bitwise_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/bitwise_operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/compound_assignment_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/compound_assignment_operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/logical_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/logical_operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/operator_adaptors.h" "$(@D)/cuda/include/thrust/detail/functional/operators/operator_adaptors.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/operators/relational_operators.h" "$(@D)/cuda/include/thrust/detail/functional/operators/relational_operators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/placeholder.h" "$(@D)/cuda/include/thrust/detail/functional/placeholder.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/functional/value.h" "$(@D)/cuda/include/thrust/detail/functional/value.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/gather.inl" "$(@D)/cuda/include/thrust/detail/gather.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/generate.inl" "$(@D)/cuda/include/thrust/detail/generate.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/get_iterator_value.h" "$(@D)/cuda/include/thrust/detail/get_iterator_value.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/host_vector.inl" "$(@D)/cuda/include/thrust/detail/host_vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/inner_product.inl" "$(@D)/cuda/include/thrust/detail/inner_product.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/integer_math.h" "$(@D)/cuda/include/thrust/detail/integer_math.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/integer_traits.h" "$(@D)/cuda/include/thrust/detail/integer_traits.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/internal_functional.h" "$(@D)/cuda/include/thrust/detail/internal_functional.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/logical.inl" "$(@D)/cuda/include/thrust/detail/logical.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/detail/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/merge.inl" "$(@D)/cuda/include/thrust/detail/merge.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/minmax.h" "$(@D)/cuda/include/thrust/detail/minmax.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/mismatch.inl" "$(@D)/cuda/include/thrust/detail/mismatch.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/mpl/math.h" "$(@D)/cuda/include/thrust/detail/mpl/math.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/numeric_traits.h" "$(@D)/cuda/include/thrust/detail/numeric_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/overlapped_copy.h" "$(@D)/cuda/include/thrust/detail/overlapped_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/pair.inl" "$(@D)/cuda/include/thrust/detail/pair.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/partition.inl" "$(@D)/cuda/include/thrust/detail/partition.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/pointer.h" "$(@D)/cuda/include/thrust/detail/pointer.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/pointer.inl" "$(@D)/cuda/include/thrust/detail/pointer.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/range/head_flags.h" "$(@D)/cuda/include/thrust/detail/range/head_flags.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/range/tail_flags.h" "$(@D)/cuda/include/thrust/detail/range/tail_flags.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/raw_pointer_cast.h" "$(@D)/cuda/include/thrust/detail/raw_pointer_cast.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/raw_reference_cast.h" "$(@D)/cuda/include/thrust/detail/raw_reference_cast.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/reduce.inl" "$(@D)/cuda/include/thrust/detail/reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/reference.h" "$(@D)/cuda/include/thrust/detail/reference.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/reference.inl" "$(@D)/cuda/include/thrust/detail/reference.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/reference_forward_declaration.h" "$(@D)/cuda/include/thrust/detail/reference_forward_declaration.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/remove.inl" "$(@D)/cuda/include/thrust/detail/remove.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/replace.inl" "$(@D)/cuda/include/thrust/detail/replace.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/reverse.inl" "$(@D)/cuda/include/thrust/detail/reverse.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/scan.inl" "$(@D)/cuda/include/thrust/detail/scan.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/scatter.inl" "$(@D)/cuda/include/thrust/detail/scatter.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/seq.h" "$(@D)/cuda/include/thrust/detail/seq.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/sequence.inl" "$(@D)/cuda/include/thrust/detail/sequence.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/set_operations.inl" "$(@D)/cuda/include/thrust/detail/set_operations.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/sort.inl" "$(@D)/cuda/include/thrust/detail/sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/static_assert.h" "$(@D)/cuda/include/thrust/detail/static_assert.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/static_map.h" "$(@D)/cuda/include/thrust/detail/static_map.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/swap.h" "$(@D)/cuda/include/thrust/detail/swap.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/swap.inl" "$(@D)/cuda/include/thrust/detail/swap.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/swap_ranges.inl" "$(@D)/cuda/include/thrust/detail/swap_ranges.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/tabulate.inl" "$(@D)/cuda/include/thrust/detail/tabulate.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/temporary_array.h" "$(@D)/cuda/include/thrust/detail/temporary_array.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/temporary_array.inl" "$(@D)/cuda/include/thrust/detail/temporary_array.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/detail/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/transform.inl" "$(@D)/cuda/include/thrust/detail/transform.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/transform_reduce.inl" "$(@D)/cuda/include/thrust/detail/transform_reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/transform_scan.inl" "$(@D)/cuda/include/thrust/detail/transform_scan.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/trivial_sequence.h" "$(@D)/cuda/include/thrust/detail/trivial_sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/tuple.inl" "$(@D)/cuda/include/thrust/detail/tuple.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/tuple_meta_transform.h" "$(@D)/cuda/include/thrust/detail/tuple_meta_transform.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/tuple_transform.h" "$(@D)/cuda/include/thrust/detail/tuple_transform.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h" "$(@D)/cuda/include/thrust/detail/type_traits/algorithm/intermediate_type_from_function_and_iterators.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/function_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits/function_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/has_member_function.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_member_function.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/type_traits/has_nested_type.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_nested_type.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/has_trivial_assign.h" "$(@D)/cuda/include/thrust/detail/type_traits/has_trivial_assign.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/is_call_possible.h" "$(@D)/cuda/include/thrust/detail/type_traits/is_call_possible.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/is_metafunction_defined.h" "$(@D)/cuda/include/thrust/detail/type_traits/is_metafunction_defined.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/iterator/is_discard_iterator.h" "$(@D)/cuda/include/thrust/detail/type_traits/iterator/is_discard_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/iterator/is_output_iterator.h" "$(@D)/cuda/include/thrust/detail/type_traits/iterator/is_output_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/minimum_type.h" "$(@D)/cuda/include/thrust/detail/type_traits/minimum_type.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/pointer_traits.h" "$(@D)/cuda/include/thrust/detail/type_traits/pointer_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/type_traits/result_of_adaptable_function.h" "$(@D)/cuda/include/thrust/detail/type_traits/result_of_adaptable_function.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/uninitialized_copy.inl" "$(@D)/cuda/include/thrust/detail/uninitialized_copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/uninitialized_fill.inl" "$(@D)/cuda/include/thrust/detail/uninitialized_fill.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/unique.inl" "$(@D)/cuda/include/thrust/detail/unique.inl" && cp "/usr/local/cuda-9.0/include/thrust/detail/use_default.h" "$(@D)/cuda/include/thrust/detail/use_default.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/util/align.h" "$(@D)/cuda/include/thrust/detail/util/align.h" && cp 
"/usr/local/cuda-9.0/include/thrust/detail/util/blocking.h" "$(@D)/cuda/include/thrust/detail/util/blocking.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/vector_base.h" "$(@D)/cuda/include/thrust/detail/vector_base.h" && cp "/usr/local/cuda-9.0/include/thrust/detail/vector_base.inl" "$(@D)/cuda/include/thrust/detail/vector_base.inl" && cp "/usr/local/cuda-9.0/include/thrust/device_allocator.h" "$(@D)/cuda/include/thrust/device_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/device_delete.h" "$(@D)/cuda/include/thrust/device_delete.h" && cp "/usr/local/cuda-9.0/include/thrust/device_free.h" "$(@D)/cuda/include/thrust/device_free.h" && cp "/usr/local/cuda-9.0/include/thrust/device_malloc.h" "$(@D)/cuda/include/thrust/device_malloc.h" && cp "/usr/local/cuda-9.0/include/thrust/device_malloc_allocator.h" "$(@D)/cuda/include/thrust/device_malloc_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/device_new.h" "$(@D)/cuda/include/thrust/device_new.h" && cp "/usr/local/cuda-9.0/include/thrust/device_new_allocator.h" "$(@D)/cuda/include/thrust/device_new_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/device_ptr.h" "$(@D)/cuda/include/thrust/device_ptr.h" && cp "/usr/local/cuda-9.0/include/thrust/device_reference.h" "$(@D)/cuda/include/thrust/device_reference.h" && cp "/usr/local/cuda-9.0/include/thrust/device_vector.h" "$(@D)/cuda/include/thrust/device_vector.h" && cp "/usr/local/cuda-9.0/include/thrust/distance.h" "$(@D)/cuda/include/thrust/distance.h" && cp "/usr/local/cuda-9.0/include/thrust/equal.h" "$(@D)/cuda/include/thrust/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/execution_policy.h" "$(@D)/cuda/include/thrust/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/extrema.h" "$(@D)/cuda/include/thrust/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/fill.h" "$(@D)/cuda/include/thrust/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/find.h" "$(@D)/cuda/include/thrust/find.h" && cp 
"/usr/local/cuda-9.0/include/thrust/for_each.h" "$(@D)/cuda/include/thrust/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/functional.h" "$(@D)/cuda/include/thrust/functional.h" && cp "/usr/local/cuda-9.0/include/thrust/gather.h" "$(@D)/cuda/include/thrust/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/generate.h" "$(@D)/cuda/include/thrust/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/host_vector.h" "$(@D)/cuda/include/thrust/host_vector.h" && cp "/usr/local/cuda-9.0/include/thrust/inner_product.h" "$(@D)/cuda/include/thrust/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/constant_iterator.h" "$(@D)/cuda/include/thrust/iterator/constant_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/counting_iterator.h" "$(@D)/cuda/include/thrust/iterator/counting_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/any_assign.h" "$(@D)/cuda/include/thrust/iterator/detail/any_assign.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/any_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/any_system_tag.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/constant_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/constant_iterator_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/counting_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/counting_iterator.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/device_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/device_system_tag.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/discard_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/discard_iterator_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/distance_from_result.h" "$(@D)/cuda/include/thrust/iterator/detail/distance_from_result.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/host_system_tag.h" "$(@D)/cuda/include/thrust/iterator/detail/host_system_tag.h" && cp 
"/usr/local/cuda-9.0/include/thrust/iterator/detail/is_iterator_category.h" "$(@D)/cuda/include/thrust/iterator/detail/is_iterator_category.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/is_trivial_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/is_trivial_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_adaptor_base.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_adaptor_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_category_to_system.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_to_system.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_category_to_traversal.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_to_traversal.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_facade_category.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_facade_category.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_traits.inl" "$(@D)/cuda/include/thrust/iterator/detail/iterator_traits.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/iterator_traversal_tags.h" "$(@D)/cuda/include/thrust/iterator/detail/iterator_traversal_tags.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/join_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/join_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/minimum_category.h" "$(@D)/cuda/include/thrust/iterator/detail/minimum_category.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/minimum_system.h" "$(@D)/cuda/include/thrust/iterator/detail/minimum_system.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/normal_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/normal_iterator.h" && cp 
"/usr/local/cuda-9.0/include/thrust/iterator/detail/permutation_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/permutation_iterator_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/retag.h" "$(@D)/cuda/include/thrust/iterator/detail/retag.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/reverse_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/reverse_iterator.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/reverse_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/reverse_iterator_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/tagged_iterator.h" "$(@D)/cuda/include/thrust/iterator/detail/tagged_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/transform_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/transform_iterator.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/transform_output_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/transform_output_iterator.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/tuple_of_iterator_references.h" "$(@D)/cuda/include/thrust/iterator/detail/tuple_of_iterator_references.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/universal_categories.h" "$(@D)/cuda/include/thrust/iterator/detail/universal_categories.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/zip_iterator.inl" "$(@D)/cuda/include/thrust/iterator/detail/zip_iterator.inl" && cp "/usr/local/cuda-9.0/include/thrust/iterator/detail/zip_iterator_base.h" "$(@D)/cuda/include/thrust/iterator/detail/zip_iterator_base.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/discard_iterator.h" "$(@D)/cuda/include/thrust/iterator/discard_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/iterator_adaptor.h" "$(@D)/cuda/include/thrust/iterator/iterator_adaptor.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/iterator_categories.h" "$(@D)/cuda/include/thrust/iterator/iterator_categories.h" 
&& cp "/usr/local/cuda-9.0/include/thrust/iterator/iterator_facade.h" "$(@D)/cuda/include/thrust/iterator/iterator_facade.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/iterator_traits.h" "$(@D)/cuda/include/thrust/iterator/iterator_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/permutation_iterator.h" "$(@D)/cuda/include/thrust/iterator/permutation_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/retag.h" "$(@D)/cuda/include/thrust/iterator/retag.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/reverse_iterator.h" "$(@D)/cuda/include/thrust/iterator/reverse_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/transform_iterator.h" "$(@D)/cuda/include/thrust/iterator/transform_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/transform_output_iterator.h" "$(@D)/cuda/include/thrust/iterator/transform_output_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/iterator/zip_iterator.h" "$(@D)/cuda/include/thrust/iterator/zip_iterator.h" && cp "/usr/local/cuda-9.0/include/thrust/logical.h" "$(@D)/cuda/include/thrust/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/memory.h" "$(@D)/cuda/include/thrust/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/merge.h" "$(@D)/cuda/include/thrust/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/mismatch.h" "$(@D)/cuda/include/thrust/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/pair.h" "$(@D)/cuda/include/thrust/pair.h" && cp "/usr/local/cuda-9.0/include/thrust/partition.h" "$(@D)/cuda/include/thrust/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/random.h" "$(@D)/cuda/include/thrust/random.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/discard_block_engine.inl" "$(@D)/cuda/include/thrust/random/detail/discard_block_engine.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/linear_congruential_engine.inl" "$(@D)/cuda/include/thrust/random/detail/linear_congruential_engine.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/random/detail/linear_congruential_engine_discard.h" "$(@D)/cuda/include/thrust/random/detail/linear_congruential_engine_discard.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/linear_feedback_shift_engine.inl" "$(@D)/cuda/include/thrust/random/detail/linear_feedback_shift_engine.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h" "$(@D)/cuda/include/thrust/random/detail/linear_feedback_shift_engine_wordmask.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/mod.h" "$(@D)/cuda/include/thrust/random/detail/mod.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/normal_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/normal_distribution.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/normal_distribution_base.h" "$(@D)/cuda/include/thrust/random/detail/normal_distribution_base.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/random_core_access.h" "$(@D)/cuda/include/thrust/random/detail/random_core_access.h" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/subtract_with_carry_engine.inl" "$(@D)/cuda/include/thrust/random/detail/subtract_with_carry_engine.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/uniform_int_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/uniform_int_distribution.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/uniform_real_distribution.inl" "$(@D)/cuda/include/thrust/random/detail/uniform_real_distribution.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/xor_combine_engine.inl" "$(@D)/cuda/include/thrust/random/detail/xor_combine_engine.inl" && cp "/usr/local/cuda-9.0/include/thrust/random/detail/xor_combine_engine_max.h" "$(@D)/cuda/include/thrust/random/detail/xor_combine_engine_max.h" && cp "/usr/local/cuda-9.0/include/thrust/random/discard_block_engine.h" "$(@D)/cuda/include/thrust/random/discard_block_engine.h" && cp 
"/usr/local/cuda-9.0/include/thrust/random/linear_congruential_engine.h" "$(@D)/cuda/include/thrust/random/linear_congruential_engine.h" && cp "/usr/local/cuda-9.0/include/thrust/random/linear_feedback_shift_engine.h" "$(@D)/cuda/include/thrust/random/linear_feedback_shift_engine.h" && cp "/usr/local/cuda-9.0/include/thrust/random/normal_distribution.h" "$(@D)/cuda/include/thrust/random/normal_distribution.h" && cp "/usr/local/cuda-9.0/include/thrust/random/subtract_with_carry_engine.h" "$(@D)/cuda/include/thrust/random/subtract_with_carry_engine.h" && cp "/usr/local/cuda-9.0/include/thrust/random/uniform_int_distribution.h" "$(@D)/cuda/include/thrust/random/uniform_int_distribution.h" && cp "/usr/local/cuda-9.0/include/thrust/random/uniform_real_distribution.h" "$(@D)/cuda/include/thrust/random/uniform_real_distribution.h" && cp "/usr/local/cuda-9.0/include/thrust/random/xor_combine_engine.h" "$(@D)/cuda/include/thrust/random/xor_combine_engine.h" && cp "/usr/local/cuda-9.0/include/thrust/reduce.h" "$(@D)/cuda/include/thrust/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/remove.h" "$(@D)/cuda/include/thrust/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/replace.h" "$(@D)/cuda/include/thrust/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/reverse.h" "$(@D)/cuda/include/thrust/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/scan.h" "$(@D)/cuda/include/thrust/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/scatter.h" "$(@D)/cuda/include/thrust/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/sequence.h" "$(@D)/cuda/include/thrust/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/set_operations.h" "$(@D)/cuda/include/thrust/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/sort.h" "$(@D)/cuda/include/thrust/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/swap.h" "$(@D)/cuda/include/thrust/swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/adjacent_difference.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/cpp/detail/assign_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/cpp/detail/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/copy.h" "$(@D)/cuda/include/thrust/system/cpp/detail/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/cpp/detail/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/count.h" "$(@D)/cuda/include/thrust/system/cpp/detail/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/equal.h" "$(@D)/cuda/include/thrust/system/cpp/detail/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/cpp/detail/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/extrema.h" "$(@D)/cuda/include/thrust/system/cpp/detail/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/fill.h" "$(@D)/cuda/include/thrust/system/cpp/detail/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/find.h" "$(@D)/cuda/include/thrust/system/cpp/detail/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/for_each.h" "$(@D)/cuda/include/thrust/system/cpp/detail/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/gather.h" "$(@D)/cuda/include/thrust/system/cpp/detail/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/generate.h" "$(@D)/cuda/include/thrust/system/cpp/detail/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/get_value.h" "$(@D)/cuda/include/thrust/system/cpp/detail/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/inner_product.h" 
"$(@D)/cuda/include/thrust/system/cpp/detail/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/cpp/detail/iter_swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/logical.h" "$(@D)/cuda/include/thrust/system/cpp/detail/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/cpp/detail/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/memory.inl" "$(@D)/cuda/include/thrust/system/cpp/detail/memory.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/merge.h" "$(@D)/cuda/include/thrust/system/cpp/detail/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/cpp/detail/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/par.h" "$(@D)/cuda/include/thrust/system/cpp/detail/par.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/partition.h" "$(@D)/cuda/include/thrust/system/cpp/detail/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/reduce.h" "$(@D)/cuda/include/thrust/system/cpp/detail/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/remove.h" "$(@D)/cuda/include/thrust/system/cpp/detail/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/replace.h" "$(@D)/cuda/include/thrust/system/cpp/detail/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/reverse.h" "$(@D)/cuda/include/thrust/system/cpp/detail/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/scan.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scan_by_key.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cpp/detail/scatter.h" "$(@D)/cuda/include/thrust/system/cpp/detail/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/sequence.h" "$(@D)/cuda/include/thrust/system/cpp/detail/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/cpp/detail/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/sort.h" "$(@D)/cuda/include/thrust/system/cpp/detail/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/cpp/detail/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/cpp/detail/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/cpp/detail/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/transform.h" "$(@D)/cuda/include/thrust/system/cpp/detail/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/cpp/detail/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/cpp/detail/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/cpp/detail/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/cpp/detail/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/unique.h" "$(@D)/cuda/include/thrust/system/cpp/detail/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/cpp/detail/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/detail/vector.inl" 
"$(@D)/cuda/include/thrust/system/cpp/detail/vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/execution_policy.h" "$(@D)/cuda/include/thrust/system/cpp/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/memory.h" "$(@D)/cuda/include/thrust/system/cpp/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cpp/vector.h" "$(@D)/cuda/include/thrust/system/cpp/vector.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/config.h" "$(@D)/cuda/include/thrust/system/cuda/config.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/cuda/detail/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/cuda/detail/assign_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/cuda/detail/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/cuda/detail/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/core/agent_launcher.h" "$(@D)/cuda/include/thrust/system/cuda/detail/core/agent_launcher.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/core/alignment.h" "$(@D)/cuda/include/thrust/system/cuda/detail/core/alignment.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/core/triple_chevron_launch.h" "$(@D)/cuda/include/thrust/system/cuda/detail/core/triple_chevron_launch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/core/util.h" "$(@D)/cuda/include/thrust/system/cuda/detail/core/util.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/count.h" "$(@D)/cuda/include/thrust/system/cuda/detail/count.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cross_system.h" "$(@D)/cuda/include/thrust/system/cuda/detail/cross_system.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_histogram.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_downsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_downsweep.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_upsweep.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_radix_sort_upsweep.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_reduce_by_key.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_reduce_by_key.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_rle.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_rle.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_segment_fixup.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_segment_fixup.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_select_if.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_select_if.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_spmv_csrt.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_csrt.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_spmv_orig.cuh" 
"$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_orig.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/agent_spmv_row_based.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/agent_spmv_row_based.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/agent/single_pass_scan_operators.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/agent/single_pass_scan_operators.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_adjacent_difference.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_adjacent_difference.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_discontinuity.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_histogram.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_load.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_load.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_radix_rank.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_radix_sort.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_raking_layout.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_reduce.cuh" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_shuffle.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_shuffle.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/block_store.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/block_store.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_atomic.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_histogram_sort.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_raking_commutative_only.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_reduce_warp_reductions.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh" 
"$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans2.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans2.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans3.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans3.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/cub.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/cub.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_histogram.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_partition.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_partition.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_run_length_encode.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_segmented_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_segmented_radix_sort.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_segmented_reduce.cuh" 
"$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_segmented_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_select.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_select.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/device_spmv.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/device_spmv.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_histogram.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_histogram.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_radix_sort.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_radix_sort.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce_by_key.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce_by_key.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_rle.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_rle.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_select_if.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_select_if.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_csrt.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_csrt.cuh" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_orig.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_orig.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_row_based.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_spmv_row_based.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_barrier.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_even_share.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_mapping.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/grid/grid_queue.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/host/mutex.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/host/mutex.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/arg_index_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/cache_modified_output_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/constant_input_iterator.cuh" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/counting_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/discard_output_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/discard_output_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/tex_obj_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/tex_ref_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/iterator/transform_input_iterator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_load.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_operators.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_search.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_search.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/thread/thread_store.cuh" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_allocator.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_allocator.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_arch.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_arch.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_debug.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_debug.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_device.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_device.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_macro.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_macro.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_namespace.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_namespace.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_ptx.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_ptx.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/util_type.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/util_type.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_shfl.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_reduce_smem.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_shfl.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/specializations/warp_scan_smem.cuh" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/warp_reduce.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh" "$(@D)/cuda/include/thrust/system/cuda/detail/cub/warp/warp_scan.cuh" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/equal.h" "$(@D)/cuda/include/thrust/system/cuda/detail/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/error.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/error.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/extrema.h" "$(@D)/cuda/include/thrust/system/cuda/detail/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/fill.h" "$(@D)/cuda/include/thrust/system/cuda/detail/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/find.h" "$(@D)/cuda/include/thrust/system/cuda/detail/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/for_each.h" "$(@D)/cuda/include/thrust/system/cuda/detail/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/gather.h" "$(@D)/cuda/include/thrust/system/cuda/detail/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/generate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/get_value.h" "$(@D)/cuda/include/thrust/system/cuda/detail/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h" "$(@D)/cuda/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/guarded_driver_types.h" "$(@D)/cuda/include/thrust/system/cuda/detail/guarded_driver_types.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/cuda/detail/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/internal/copy_cross_system.h" "$(@D)/cuda/include/thrust/system/cuda/detail/internal/copy_cross_system.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/internal/copy_device_to_device.h" "$(@D)/cuda/include/thrust/system/cuda/detail/internal/copy_device_to_device.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/cuda/detail/iter_swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/logical.h" "$(@D)/cuda/include/thrust/system/cuda/detail/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/cuda/detail/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/memory.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/memory.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/memory_buffer.h" "$(@D)/cuda/include/thrust/system/cuda/detail/memory_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/merge.h" "$(@D)/cuda/include/thrust/system/cuda/detail/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/cuda/detail/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/par.h" "$(@D)/cuda/include/thrust/system/cuda/detail/par.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/par_to_seq.h" "$(@D)/cuda/include/thrust/system/cuda/detail/par_to_seq.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/parallel_for.h" "$(@D)/cuda/include/thrust/system/cuda/detail/parallel_for.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/partition.h" "$(@D)/cuda/include/thrust/system/cuda/detail/partition.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/reduce.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/remove.h" "$(@D)/cuda/include/thrust/system/cuda/detail/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/replace.h" "$(@D)/cuda/include/thrust/system/cuda/detail/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/reverse.h" "$(@D)/cuda/include/thrust/system/cuda/detail/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/scatter.h" "$(@D)/cuda/include/thrust/system/cuda/detail/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/sequence.h" "$(@D)/cuda/include/thrust/system/cuda/detail/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/cuda/detail/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/sort.h" "$(@D)/cuda/include/thrust/system/cuda/detail/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/cuda/detail/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/cuda/detail/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/terminate.h" "$(@D)/cuda/include/thrust/system/cuda/detail/terminate.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/cuda/detail/transform.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/cuda/detail/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/cuda/detail/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/cuda/detail/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/unique.h" "$(@D)/cuda/include/thrust/system/cuda/detail/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/cuda/detail/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/util.h" "$(@D)/cuda/include/thrust/system/cuda/detail/util.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/detail/vector.inl" "$(@D)/cuda/include/thrust/system/cuda/detail/vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/error.h" "$(@D)/cuda/include/thrust/system/cuda/error.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/execution_policy.h" "$(@D)/cuda/include/thrust/system/cuda/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/experimental/pinned_allocator.h" "$(@D)/cuda/include/thrust/system/cuda/experimental/pinned_allocator.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/memory.h" "$(@D)/cuda/include/thrust/system/cuda/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/system/cuda/vector.h" "$(@D)/cuda/include/thrust/system/cuda/vector.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/adjacent_difference.h" 
"$(@D)/cuda/include/thrust/system/detail/adl/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/assign_value.h" "$(@D)/cuda/include/thrust/system/detail/adl/assign_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/adl/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/copy.h" "$(@D)/cuda/include/thrust/system/detail/adl/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/adl/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/count.h" "$(@D)/cuda/include/thrust/system/detail/adl/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/equal.h" "$(@D)/cuda/include/thrust/system/detail/adl/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/extrema.h" "$(@D)/cuda/include/thrust/system/detail/adl/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/fill.h" "$(@D)/cuda/include/thrust/system/detail/adl/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/find.h" "$(@D)/cuda/include/thrust/system/detail/adl/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/for_each.h" "$(@D)/cuda/include/thrust/system/detail/adl/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/gather.h" "$(@D)/cuda/include/thrust/system/detail/adl/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/generate.h" "$(@D)/cuda/include/thrust/system/detail/adl/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/get_value.h" "$(@D)/cuda/include/thrust/system/detail/adl/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/adl/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/iter_swap.h" "$(@D)/cuda/include/thrust/system/detail/adl/iter_swap.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/adl/logical.h" "$(@D)/cuda/include/thrust/system/detail/adl/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/detail/adl/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/merge.h" "$(@D)/cuda/include/thrust/system/detail/adl/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/adl/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/partition.h" "$(@D)/cuda/include/thrust/system/detail/adl/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/reduce.h" "$(@D)/cuda/include/thrust/system/detail/adl/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/remove.h" "$(@D)/cuda/include/thrust/system/detail/adl/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/replace.h" "$(@D)/cuda/include/thrust/system/detail/adl/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/reverse.h" "$(@D)/cuda/include/thrust/system/detail/adl/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/scan.h" "$(@D)/cuda/include/thrust/system/detail/adl/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/scatter.h" "$(@D)/cuda/include/thrust/system/detail/adl/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/sequence.h" "$(@D)/cuda/include/thrust/system/detail/adl/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/adl/set_operations.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/adl/sort.h" "$(@D)/cuda/include/thrust/system/detail/adl/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/adl/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/adl/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/adl/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/transform.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/transform_scan.h" "$(@D)/cuda/include/thrust/system/detail/adl/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/detail/adl/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/detail/adl/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/unique.h" "$(@D)/cuda/include/thrust/system/detail/adl/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/adl/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/adl/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/bad_alloc.h" "$(@D)/cuda/include/thrust/system/detail/bad_alloc.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/errno.h" "$(@D)/cuda/include/thrust/system/detail/errno.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/error_category.inl" "$(@D)/cuda/include/thrust/system/detail/error_category.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/error_code.inl" 
"$(@D)/cuda/include/thrust/system/detail/error_code.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/error_condition.inl" "$(@D)/cuda/include/thrust/system/detail/error_condition.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/detail/generic/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/adjacent_difference.inl" "$(@D)/cuda/include/thrust/system/detail/generic/adjacent_difference.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/advance.h" "$(@D)/cuda/include/thrust/system/detail/generic/advance.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/advance.inl" "$(@D)/cuda/include/thrust/system/detail/generic/advance.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/generic/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/binary_search.inl" "$(@D)/cuda/include/thrust/system/detail/generic/binary_search.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/copy.h" "$(@D)/cuda/include/thrust/system/detail/generic/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/copy.inl" "$(@D)/cuda/include/thrust/system/detail/generic/copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/generic/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/copy_if.inl" "$(@D)/cuda/include/thrust/system/detail/generic/copy_if.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/count.h" "$(@D)/cuda/include/thrust/system/detail/generic/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/count.inl" "$(@D)/cuda/include/thrust/system/detail/generic/count.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/distance.h" 
"$(@D)/cuda/include/thrust/system/detail/generic/distance.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/distance.inl" "$(@D)/cuda/include/thrust/system/detail/generic/distance.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/equal.h" "$(@D)/cuda/include/thrust/system/detail/generic/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/equal.inl" "$(@D)/cuda/include/thrust/system/detail/generic/equal.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/extrema.h" "$(@D)/cuda/include/thrust/system/detail/generic/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/extrema.inl" "$(@D)/cuda/include/thrust/system/detail/generic/extrema.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/fill.h" "$(@D)/cuda/include/thrust/system/detail/generic/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/find.h" "$(@D)/cuda/include/thrust/system/detail/generic/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/find.inl" "$(@D)/cuda/include/thrust/system/detail/generic/find.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/for_each.h" "$(@D)/cuda/include/thrust/system/detail/generic/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/gather.h" "$(@D)/cuda/include/thrust/system/detail/generic/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/gather.inl" "$(@D)/cuda/include/thrust/system/detail/generic/gather.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/generate.h" "$(@D)/cuda/include/thrust/system/detail/generic/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/generate.inl" "$(@D)/cuda/include/thrust/system/detail/generic/generate.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/generic/inner_product.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/generic/inner_product.inl" "$(@D)/cuda/include/thrust/system/detail/generic/inner_product.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/logical.h" "$(@D)/cuda/include/thrust/system/detail/generic/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/memory.h" "$(@D)/cuda/include/thrust/system/detail/generic/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/memory.inl" "$(@D)/cuda/include/thrust/system/detail/generic/memory.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/merge.h" "$(@D)/cuda/include/thrust/system/detail/generic/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/merge.inl" "$(@D)/cuda/include/thrust/system/detail/generic/merge.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/generic/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/mismatch.inl" "$(@D)/cuda/include/thrust/system/detail/generic/mismatch.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/partition.h" "$(@D)/cuda/include/thrust/system/detail/generic/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/partition.inl" "$(@D)/cuda/include/thrust/system/detail/generic/partition.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reduce.h" "$(@D)/cuda/include/thrust/system/detail/generic/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reduce.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reduce_by_key.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/generic/remove.h" "$(@D)/cuda/include/thrust/system/detail/generic/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/remove.inl" "$(@D)/cuda/include/thrust/system/detail/generic/remove.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/replace.h" "$(@D)/cuda/include/thrust/system/detail/generic/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/replace.inl" "$(@D)/cuda/include/thrust/system/detail/generic/replace.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reverse.h" "$(@D)/cuda/include/thrust/system/detail/generic/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/reverse.inl" "$(@D)/cuda/include/thrust/system/detail/generic/reverse.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scalar/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/generic/scalar/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scalar/binary_search.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scalar/binary_search.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scan.h" "$(@D)/cuda/include/thrust/system/detail/generic/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scan.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scan.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scan_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scan_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scatter.h" "$(@D)/cuda/include/thrust/system/detail/generic/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/scatter.inl" "$(@D)/cuda/include/thrust/system/detail/generic/scatter.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/generic/select_system.h" "$(@D)/cuda/include/thrust/system/detail/generic/select_system.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/sequence.h" "$(@D)/cuda/include/thrust/system/detail/generic/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/sequence.inl" "$(@D)/cuda/include/thrust/system/detail/generic/sequence.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/generic/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/set_operations.inl" "$(@D)/cuda/include/thrust/system/detail/generic/set_operations.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/sort.h" "$(@D)/cuda/include/thrust/system/detail/generic/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/sort.inl" "$(@D)/cuda/include/thrust/system/detail/generic/sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/generic/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/swap_ranges.inl" "$(@D)/cuda/include/thrust/system/detail/generic/swap_ranges.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/generic/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/tabulate.inl" "$(@D)/cuda/include/thrust/system/detail/generic/tabulate.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/tag.h" "$(@D)/cuda/include/thrust/system/detail/generic/tag.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/generic/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/temporary_buffer.inl" "$(@D)/cuda/include/thrust/system/detail/generic/temporary_buffer.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform_reduce.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform_reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform_scan.h" "$(@D)/cuda/include/thrust/system/detail/generic/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/transform_scan.inl" "$(@D)/cuda/include/thrust/system/detail/generic/transform_scan.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/type_traits.h" "$(@D)/cuda/include/thrust/system/detail/generic/type_traits.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/uninitialized_copy.inl" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/uninitialized_fill.inl" "$(@D)/cuda/include/thrust/system/detail/generic/uninitialized_fill.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/unique.h" "$(@D)/cuda/include/thrust/system/detail/generic/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/unique.inl" "$(@D)/cuda/include/thrust/system/detail/generic/unique.inl" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/generic/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/generic/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/generic/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/detail/generic/unique_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/internal/decompose.h" "$(@D)/cuda/include/thrust/system/detail/internal/decompose.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/detail/sequential/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/assign_value.h" "$(@D)/cuda/include/thrust/system/detail/sequential/assign_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/binary_search.h" "$(@D)/cuda/include/thrust/system/detail/sequential/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/copy.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/copy_backward.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy_backward.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/copy_if.h" "$(@D)/cuda/include/thrust/system/detail/sequential/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/count.h" "$(@D)/cuda/include/thrust/system/detail/sequential/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/equal.h" "$(@D)/cuda/include/thrust/system/detail/sequential/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/execution_policy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/extrema.h" 
"$(@D)/cuda/include/thrust/system/detail/sequential/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/fill.h" "$(@D)/cuda/include/thrust/system/detail/sequential/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/find.h" "$(@D)/cuda/include/thrust/system/detail/sequential/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/for_each.h" "$(@D)/cuda/include/thrust/system/detail/sequential/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/gather.h" "$(@D)/cuda/include/thrust/system/detail/sequential/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/general_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/general_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/generate.h" "$(@D)/cuda/include/thrust/system/detail/sequential/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/get_value.h" "$(@D)/cuda/include/thrust/system/detail/sequential/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/inner_product.h" "$(@D)/cuda/include/thrust/system/detail/sequential/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/insertion_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/insertion_sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/iter_swap.h" "$(@D)/cuda/include/thrust/system/detail/sequential/iter_swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/logical.h" "$(@D)/cuda/include/thrust/system/detail/sequential/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/detail/sequential/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/merge.h" "$(@D)/cuda/include/thrust/system/detail/sequential/merge.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/sequential/merge.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/merge.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/mismatch.h" "$(@D)/cuda/include/thrust/system/detail/sequential/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/partition.h" "$(@D)/cuda/include/thrust/system/detail/sequential/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/reduce.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/remove.h" "$(@D)/cuda/include/thrust/system/detail/sequential/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/replace.h" "$(@D)/cuda/include/thrust/system/detail/sequential/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/reverse.h" "$(@D)/cuda/include/thrust/system/detail/sequential/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/scan.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/scan_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/scatter.h" "$(@D)/cuda/include/thrust/system/detail/sequential/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/sequence.h" "$(@D)/cuda/include/thrust/system/detail/sequential/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/set_operations.h" "$(@D)/cuda/include/thrust/system/detail/sequential/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/sort.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/detail/sequential/sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_merge_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_merge_sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_merge_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_merge_sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_primitive_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_primitive_sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_primitive_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_primitive_sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_radix_sort.h" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_radix_sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/stable_radix_sort.inl" "$(@D)/cuda/include/thrust/system/detail/sequential/stable_radix_sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/swap_ranges.h" "$(@D)/cuda/include/thrust/system/detail/sequential/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/tabulate.h" "$(@D)/cuda/include/thrust/system/detail/sequential/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/detail/sequential/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/transform.h" "$(@D)/cuda/include/thrust/system/detail/sequential/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/transform_reduce.h" "$(@D)/cuda/include/thrust/system/detail/sequential/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/transform_scan.h" 
"$(@D)/cuda/include/thrust/system/detail/sequential/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/trivial_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/trivial_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/detail/sequential/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/detail/sequential/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/unique.h" "$(@D)/cuda/include/thrust/system/detail/sequential/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/sequential/unique_by_key.h" "$(@D)/cuda/include/thrust/system/detail/sequential/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/detail/system_error.inl" "$(@D)/cuda/include/thrust/system/detail/system_error.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/error_code.h" "$(@D)/cuda/include/thrust/system/error_code.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/omp/detail/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/omp/detail/assign_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/omp/detail/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/copy.h" "$(@D)/cuda/include/thrust/system/omp/detail/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/copy.inl" "$(@D)/cuda/include/thrust/system/omp/detail/copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/omp/detail/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/copy_if.inl" 
"$(@D)/cuda/include/thrust/system/omp/detail/copy_if.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/count.h" "$(@D)/cuda/include/thrust/system/omp/detail/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/default_decomposition.h" "$(@D)/cuda/include/thrust/system/omp/detail/default_decomposition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/default_decomposition.inl" "$(@D)/cuda/include/thrust/system/omp/detail/default_decomposition.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/equal.h" "$(@D)/cuda/include/thrust/system/omp/detail/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/omp/detail/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/extrema.h" "$(@D)/cuda/include/thrust/system/omp/detail/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/fill.h" "$(@D)/cuda/include/thrust/system/omp/detail/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/find.h" "$(@D)/cuda/include/thrust/system/omp/detail/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/for_each.h" "$(@D)/cuda/include/thrust/system/omp/detail/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/for_each.inl" "$(@D)/cuda/include/thrust/system/omp/detail/for_each.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/gather.h" "$(@D)/cuda/include/thrust/system/omp/detail/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/generate.h" "$(@D)/cuda/include/thrust/system/omp/detail/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/get_value.h" "$(@D)/cuda/include/thrust/system/omp/detail/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/omp/detail/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/iter_swap.h" 
"$(@D)/cuda/include/thrust/system/omp/detail/iter_swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/logical.h" "$(@D)/cuda/include/thrust/system/omp/detail/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/omp/detail/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/memory.inl" "$(@D)/cuda/include/thrust/system/omp/detail/memory.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/merge.h" "$(@D)/cuda/include/thrust/system/omp/detail/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/omp/detail/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/par.h" "$(@D)/cuda/include/thrust/system/omp/detail/par.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/partition.h" "$(@D)/cuda/include/thrust/system/omp/detail/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/partition.inl" "$(@D)/cuda/include/thrust/system/omp/detail/partition.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce_by_key.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce_intervals.h" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_intervals.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reduce_intervals.inl" "$(@D)/cuda/include/thrust/system/omp/detail/reduce_intervals.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/remove.h" 
"$(@D)/cuda/include/thrust/system/omp/detail/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/remove.inl" "$(@D)/cuda/include/thrust/system/omp/detail/remove.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/replace.h" "$(@D)/cuda/include/thrust/system/omp/detail/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/reverse.h" "$(@D)/cuda/include/thrust/system/omp/detail/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/scan.h" "$(@D)/cuda/include/thrust/system/omp/detail/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/scatter.h" "$(@D)/cuda/include/thrust/system/omp/detail/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/sequence.h" "$(@D)/cuda/include/thrust/system/omp/detail/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/omp/detail/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/sort.h" "$(@D)/cuda/include/thrust/system/omp/detail/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/sort.inl" "$(@D)/cuda/include/thrust/system/omp/detail/sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/omp/detail/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/tabulate.h" "$(@D)/cuda/include/thrust/system/omp/detail/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/omp/detail/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/transform.h" "$(@D)/cuda/include/thrust/system/omp/detail/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/transform_reduce.h" 
"$(@D)/cuda/include/thrust/system/omp/detail/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/omp/detail/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/omp/detail/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/omp/detail/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/unique.h" "$(@D)/cuda/include/thrust/system/omp/detail/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/unique.inl" "$(@D)/cuda/include/thrust/system/omp/detail/unique.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/omp/detail/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/omp/detail/unique_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/detail/vector.inl" "$(@D)/cuda/include/thrust/system/omp/detail/vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/execution_policy.h" "$(@D)/cuda/include/thrust/system/omp/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/memory.h" "$(@D)/cuda/include/thrust/system/omp/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/system/omp/vector.h" "$(@D)/cuda/include/thrust/system/omp/vector.h" && cp "/usr/local/cuda-9.0/include/thrust/system/system_error.h" "$(@D)/cuda/include/thrust/system/system_error.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/adjacent_difference.h" "$(@D)/cuda/include/thrust/system/tbb/detail/adjacent_difference.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/assign_value.h" "$(@D)/cuda/include/thrust/system/tbb/detail/assign_value.h" && cp 
"/usr/local/cuda-9.0/include/thrust/system/tbb/detail/binary_search.h" "$(@D)/cuda/include/thrust/system/tbb/detail/binary_search.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/copy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/copy.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/copy.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/copy_if.h" "$(@D)/cuda/include/thrust/system/tbb/detail/copy_if.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/copy_if.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/copy_if.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/count.h" "$(@D)/cuda/include/thrust/system/tbb/detail/count.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/equal.h" "$(@D)/cuda/include/thrust/system/tbb/detail/equal.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/execution_policy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/extrema.h" "$(@D)/cuda/include/thrust/system/tbb/detail/extrema.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/fill.h" "$(@D)/cuda/include/thrust/system/tbb/detail/fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/find.h" "$(@D)/cuda/include/thrust/system/tbb/detail/find.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/for_each.h" "$(@D)/cuda/include/thrust/system/tbb/detail/for_each.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/for_each.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/for_each.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/gather.h" "$(@D)/cuda/include/thrust/system/tbb/detail/gather.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/generate.h" "$(@D)/cuda/include/thrust/system/tbb/detail/generate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/get_value.h" 
"$(@D)/cuda/include/thrust/system/tbb/detail/get_value.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/inner_product.h" "$(@D)/cuda/include/thrust/system/tbb/detail/inner_product.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/iter_swap.h" "$(@D)/cuda/include/thrust/system/tbb/detail/iter_swap.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/logical.h" "$(@D)/cuda/include/thrust/system/tbb/detail/logical.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/malloc_and_free.h" "$(@D)/cuda/include/thrust/system/tbb/detail/malloc_and_free.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/memory.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/memory.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/merge.h" "$(@D)/cuda/include/thrust/system/tbb/detail/merge.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/merge.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/merge.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/mismatch.h" "$(@D)/cuda/include/thrust/system/tbb/detail/mismatch.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/par.h" "$(@D)/cuda/include/thrust/system/tbb/detail/par.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/partition.h" "$(@D)/cuda/include/thrust/system/tbb/detail/partition.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/partition.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/partition.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reduce.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reduce.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reduce_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reduce_by_key.inl" 
"$(@D)/cuda/include/thrust/system/tbb/detail/reduce_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reduce_intervals.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reduce_intervals.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/remove.h" "$(@D)/cuda/include/thrust/system/tbb/detail/remove.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/remove.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/remove.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/replace.h" "$(@D)/cuda/include/thrust/system/tbb/detail/replace.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/reverse.h" "$(@D)/cuda/include/thrust/system/tbb/detail/reverse.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/scan.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/scan.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/scan.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/scan_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scan_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/scatter.h" "$(@D)/cuda/include/thrust/system/tbb/detail/scatter.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/sequence.h" "$(@D)/cuda/include/thrust/system/tbb/detail/sequence.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/set_operations.h" "$(@D)/cuda/include/thrust/system/tbb/detail/set_operations.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/sort.h" "$(@D)/cuda/include/thrust/system/tbb/detail/sort.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/sort.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/sort.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/swap_ranges.h" "$(@D)/cuda/include/thrust/system/tbb/detail/swap_ranges.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/tabulate.h" 
"$(@D)/cuda/include/thrust/system/tbb/detail/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/temporary_buffer.h" "$(@D)/cuda/include/thrust/system/tbb/detail/temporary_buffer.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/transform.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/transform_reduce.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/transform_scan.h" "$(@D)/cuda/include/thrust/system/tbb/detail/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/uninitialized_copy.h" "$(@D)/cuda/include/thrust/system/tbb/detail/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/uninitialized_fill.h" "$(@D)/cuda/include/thrust/system/tbb/detail/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/unique.h" "$(@D)/cuda/include/thrust/system/tbb/detail/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/unique.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/unique.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/unique_by_key.h" "$(@D)/cuda/include/thrust/system/tbb/detail/unique_by_key.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/unique_by_key.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/unique_by_key.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/detail/vector.inl" "$(@D)/cuda/include/thrust/system/tbb/detail/vector.inl" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/execution_policy.h" "$(@D)/cuda/include/thrust/system/tbb/execution_policy.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/memory.h" "$(@D)/cuda/include/thrust/system/tbb/memory.h" && cp "/usr/local/cuda-9.0/include/thrust/system/tbb/vector.h" "$(@D)/cuda/include/thrust/system/tbb/vector.h" && cp "/usr/local/cuda-9.0/include/thrust/system_error.h" 
"$(@D)/cuda/include/thrust/system_error.h" && cp "/usr/local/cuda-9.0/include/thrust/tabulate.h" "$(@D)/cuda/include/thrust/tabulate.h" && cp "/usr/local/cuda-9.0/include/thrust/transform.h" "$(@D)/cuda/include/thrust/transform.h" && cp "/usr/local/cuda-9.0/include/thrust/transform_reduce.h" "$(@D)/cuda/include/thrust/transform_reduce.h" && cp "/usr/local/cuda-9.0/include/thrust/transform_scan.h" "$(@D)/cuda/include/thrust/transform_scan.h" && cp "/usr/local/cuda-9.0/include/thrust/tuple.h" "$(@D)/cuda/include/thrust/tuple.h" && cp "/usr/local/cuda-9.0/include/thrust/uninitialized_copy.h" "$(@D)/cuda/include/thrust/uninitialized_copy.h" && cp "/usr/local/cuda-9.0/include/thrust/uninitialized_fill.h" "$(@D)/cuda/include/thrust/uninitialized_fill.h" && cp "/usr/local/cuda-9.0/include/thrust/unique.h" "$(@D)/cuda/include/thrust/unique.h" && cp "/usr/local/cuda-9.0/include/thrust/version.h" "$(@D)/cuda/include/thrust/version.h" && cp "/usr/local/cuda-9.0/include/vector_functions.h" "$(@D)/cuda/include/vector_functions.h" && cp "/usr/local/cuda-9.0/include/vector_functions.hpp" "$(@D)/cuda/include/vector_functions.hpp" && cp "/usr/local/cuda-9.0/include/vector_types.h" "$(@D)/cuda/include/vector_types.h"
    """,
 )
 
@@ -1264,72 +1192,69 @@ genrule(
     name = "cuda-nvvm",
     outs = [
         "cuda/nvvm/bin/cicc",
-        "cuda/nvvm/libdevice/libdevice.compute_50.10.bc",
-        "cuda/nvvm/libdevice/libdevice.compute_30.10.bc",
-        "cuda/nvvm/libdevice/libdevice.compute_20.10.bc",
-        "cuda/nvvm/libdevice/libdevice.compute_35.10.bc",
-        "cuda/nvvm/lib64/libnvvm.so.3",
-        "cuda/nvvm/lib64/libnvvm.so",
-        "cuda/nvvm/lib64/libnvvm.so.3.1.0",
         "cuda/nvvm/include/nvvm.h",
-        "cuda/nvvm/libnvvm-samples/ptxgen/README.txt",
-        "cuda/nvvm/libnvvm-samples/ptxgen/ptxgen.c",
-        "cuda/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt",
+        "cuda/nvvm/lib64/libnvvm.so",
+        "cuda/nvvm/lib64/libnvvm.so.3",
+        "cuda/nvvm/lib64/libnvvm.so.3.2.0",
+        "cuda/nvvm/libdevice/libdevice.10.bc",
+        "cuda/nvvm/libnvvm-samples/CMakeLists.txt",
+        "cuda/nvvm/libnvvm-samples/README.txt",
         "cuda/nvvm/libnvvm-samples/build.bat",
-        "cuda/nvvm/libnvvm-samples/cuda-c-linking/README.txt",
-        "cuda/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu",
+        "cuda/nvvm/libnvvm-samples/build.sh",
+        "cuda/nvvm/libnvvm-samples/common/include/DDSWriter.h",
+        "cuda/nvvm/libnvvm-samples/common/include/drvapi_error_string.h",
         "cuda/nvvm/libnvvm-samples/cuda-c-linking/CMakeLists.txt",
+        "cuda/nvvm/libnvvm-samples/cuda-c-linking/README.txt",
         "cuda/nvvm/libnvvm-samples/cuda-c-linking/cuda-c-linking.cpp",
-        "cuda/nvvm/libnvvm-samples/README.txt",
-        "cuda/nvvm/libnvvm-samples/simple/simple.c",
-        "cuda/nvvm/libnvvm-samples/simple/simple-gpu.ll",
+        "cuda/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu",
+        "cuda/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt",
+        "cuda/nvvm/libnvvm-samples/ptxgen/README.txt",
+        "cuda/nvvm/libnvvm-samples/ptxgen/ptxgen.c",
+        "cuda/nvvm/libnvvm-samples/simple/CMakeLists.txt",
         "cuda/nvvm/libnvvm-samples/simple/README.txt",
+        "cuda/nvvm/libnvvm-samples/simple/simple-gpu.ll",
         "cuda/nvvm/libnvvm-samples/simple/simple-gpu64.ll",
-        "cuda/nvvm/libnvvm-samples/simple/CMakeLists.txt",
-        "cuda/nvvm/libnvvm-samples/common/include/DDSWriter.h",
-        "cuda/nvvm/libnvvm-samples/common/include/drvapi_error_string.h",
-        "cuda/nvvm/libnvvm-samples/build.sh",
-        "cuda/nvvm/libnvvm-samples/CMakeLists.txt",
+        "cuda/nvvm/libnvvm-samples/simple/simple.c",
     ],
     cmd = """
-if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-8.0/nvvm/bin/cicc" "$(@D)/cuda/nvvm/bin/cicc" && cp "/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_50.10.bc" "$(@D)/cuda/nvvm/libdevice/libdevice.compute_50.10.bc" && cp "/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_30.10.bc" "$(@D)/cuda/nvvm/libdevice/libdevice.compute_30.10.bc" && cp "/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_20.10.bc" "$(@D)/cuda/nvvm/libdevice/libdevice.compute_20.10.bc" && cp "/usr/local/cuda-8.0/nvvm/libdevice/libdevice.compute_35.10.bc" "$(@D)/cuda/nvvm/libdevice/libdevice.compute_35.10.bc" && cp "/usr/local/cuda-8.0/nvvm/lib64/libnvvm.so.3" "$(@D)/cuda/nvvm/lib64/libnvvm.so.3" && cp "/usr/local/cuda-8.0/nvvm/lib64/libnvvm.so" "$(@D)/cuda/nvvm/lib64/libnvvm.so" && cp "/usr/local/cuda-8.0/nvvm/lib64/libnvvm.so.3.1.0" "$(@D)/cuda/nvvm/lib64/libnvvm.so.3.1.0" && cp "/usr/local/cuda-8.0/nvvm/include/nvvm.h" "$(@D)/cuda/nvvm/include/nvvm.h" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/ptxgen/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/README.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/ptxgen/ptxgen.c" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/ptxgen.c" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/build.bat" "$(@D)/cuda/nvvm/libnvvm-samples/build.bat" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/cuda-c-linking/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/README.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/cuda-c-linking/CMakeLists.txt" 
"$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/CMakeLists.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/cuda-c-linking/cuda-c-linking.cpp" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/cuda-c-linking.cpp" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/README.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/simple/simple.c" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple.c" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/simple/simple-gpu.ll" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple-gpu.ll" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/simple/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/simple/README.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/simple/simple-gpu64.ll" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple-gpu64.ll" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/simple/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/simple/CMakeLists.txt" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/common/include/DDSWriter.h" "$(@D)/cuda/nvvm/libnvvm-samples/common/include/DDSWriter.h" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/common/include/drvapi_error_string.h" "$(@D)/cuda/nvvm/libnvvm-samples/common/include/drvapi_error_string.h" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/build.sh" "$(@D)/cuda/nvvm/libnvvm-samples/build.sh" && cp "/usr/local/cuda-8.0/nvvm/libnvvm-samples/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/CMakeLists.txt"
+if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-9.0/nvvm/bin/cicc" "$(@D)/cuda/nvvm/bin/cicc" && cp "/usr/local/cuda-9.0/nvvm/include/nvvm.h" "$(@D)/cuda/nvvm/include/nvvm.h" && cp "/usr/local/cuda-9.0/nvvm/lib64/libnvvm.so" "$(@D)/cuda/nvvm/lib64/libnvvm.so" && cp "/usr/local/cuda-9.0/nvvm/lib64/libnvvm.so.3" "$(@D)/cuda/nvvm/lib64/libnvvm.so.3" && cp "/usr/local/cuda-9.0/nvvm/lib64/libnvvm.so.3.2.0" "$(@D)/cuda/nvvm/lib64/libnvvm.so.3.2.0" && cp "/usr/local/cuda-9.0/nvvm/libdevice/libdevice.10.bc" "$(@D)/cuda/nvvm/libdevice/libdevice.10.bc" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/CMakeLists.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/README.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/build.bat" "$(@D)/cuda/nvvm/libnvvm-samples/build.bat" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/build.sh" "$(@D)/cuda/nvvm/libnvvm-samples/build.sh" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/common/include/DDSWriter.h" "$(@D)/cuda/nvvm/libnvvm-samples/common/include/DDSWriter.h" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/common/include/drvapi_error_string.h" "$(@D)/cuda/nvvm/libnvvm-samples/common/include/drvapi_error_string.h" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/cuda-c-linking/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/CMakeLists.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/cuda-c-linking/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/README.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/cuda-c-linking/cuda-c-linking.cpp" "$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/cuda-c-linking.cpp" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu" 
"$(@D)/cuda/nvvm/libnvvm-samples/cuda-c-linking/math-funcs.cu" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/CMakeLists.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/ptxgen/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/README.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/ptxgen/ptxgen.c" "$(@D)/cuda/nvvm/libnvvm-samples/ptxgen/ptxgen.c" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/simple/CMakeLists.txt" "$(@D)/cuda/nvvm/libnvvm-samples/simple/CMakeLists.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/simple/README.txt" "$(@D)/cuda/nvvm/libnvvm-samples/simple/README.txt" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/simple/simple-gpu.ll" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple-gpu.ll" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/simple/simple-gpu64.ll" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple-gpu64.ll" && cp "/usr/local/cuda-9.0/nvvm/libnvvm-samples/simple/simple.c" "$(@D)/cuda/nvvm/libnvvm-samples/simple/simple.c"
    """,
 )
 
 genrule(
     name = "cuda-extras",
     outs = [
-        "cuda/extras/CUPTI/include/cupti_result.h",
+        "cuda/extras/CUPTI/include/GL/gl.h",
+        "cuda/extras/CUPTI/include/GL/glew.h",
+        "cuda/extras/CUPTI/include/GL/glext.h",
+        "cuda/extras/CUPTI/include/GL/glu.h",
+        "cuda/extras/CUPTI/include/GL/glut.h",
+        "cuda/extras/CUPTI/include/GL/glx.h",
+        "cuda/extras/CUPTI/include/GL/glxext.h",
+        "cuda/extras/CUPTI/include/GL/wglew.h",
+        "cuda/extras/CUPTI/include/GL/wglext.h",
+        "cuda/extras/CUPTI/include/cuda_stdint.h",
+        "cuda/extras/CUPTI/include/cupti.h",
+        "cuda/extras/CUPTI/include/cupti_activity.h",
+        "cuda/extras/CUPTI/include/cupti_callbacks.h",
+        "cuda/extras/CUPTI/include/cupti_driver_cbid.h",
         "cuda/extras/CUPTI/include/cupti_events.h",
-        "cuda/extras/CUPTI/include/openacc/cupti_openacc.h",
+        "cuda/extras/CUPTI/include/cupti_metrics.h",
+        "cuda/extras/CUPTI/include/cupti_nvtx_cbid.h",
+        "cuda/extras/CUPTI/include/cupti_result.h",
+        "cuda/extras/CUPTI/include/cupti_runtime_cbid.h",
         "cuda/extras/CUPTI/include/cupti_version.h",
-        "cuda/extras/CUPTI/include/generated_cuda_gl_interop_meta.h",
+        "cuda/extras/CUPTI/include/generated_cudaGL_meta.h",
         "cuda/extras/CUPTI/include/generated_cudaVDPAU_meta.h",
-        "cuda/extras/CUPTI/include/cupti_activity.h",
-        "cuda/extras/CUPTI/include/generated_cuda_runtime_api_meta.h",
+        "cuda/extras/CUPTI/include/generated_cuda_gl_interop_meta.h",
         "cuda/extras/CUPTI/include/generated_cuda_meta.h",
-        "cuda/extras/CUPTI/include/cupti_nvtx_cbid.h",
-        "cuda/extras/CUPTI/include/cuda_stdint.h",
-        "cuda/extras/CUPTI/include/generated_cudaGL_meta.h",
+        "cuda/extras/CUPTI/include/generated_cuda_runtime_api_meta.h",
         "cuda/extras/CUPTI/include/generated_cuda_vdpau_interop_meta.h",
-        "cuda/extras/CUPTI/include/cupti_metrics.h",
-        "cuda/extras/CUPTI/include/cupti_callbacks.h",
-        "cuda/extras/CUPTI/include/cupti_runtime_cbid.h",
-        "cuda/extras/CUPTI/include/cupti.h",
-        "cuda/extras/CUPTI/include/GL/glut.h",
-        "cuda/extras/CUPTI/include/GL/glu.h",
-        "cuda/extras/CUPTI/include/GL/glxext.h",
-        "cuda/extras/CUPTI/include/GL/wglext.h",
-        "cuda/extras/CUPTI/include/GL/glx.h",
-        "cuda/extras/CUPTI/include/GL/glext.h",
-        "cuda/extras/CUPTI/include/GL/wglew.h",
-        "cuda/extras/CUPTI/include/GL/gl.h",
-        "cuda/extras/CUPTI/include/GL/glew.h",
-        "cuda/extras/CUPTI/include/cupti_driver_cbid.h",
         "cuda/extras/CUPTI/include/generated_nvtx_meta.h",
+        "cuda/extras/CUPTI/include/openacc/cupti_openacc.h",
     ],
     cmd = """
-if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_result.h" "$(@D)/cuda/extras/CUPTI/include/cupti_result.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_events.h" "$(@D)/cuda/extras/CUPTI/include/cupti_events.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/openacc/cupti_openacc.h" "$(@D)/cuda/extras/CUPTI/include/openacc/cupti_openacc.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_version.h" "$(@D)/cuda/extras/CUPTI/include/cupti_version.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cuda_gl_interop_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_gl_interop_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cudaVDPAU_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cudaVDPAU_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_activity.h" "$(@D)/cuda/extras/CUPTI/include/cupti_activity.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cuda_runtime_api_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_runtime_api_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cuda_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_nvtx_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_nvtx_cbid.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cuda_stdint.h" "$(@D)/cuda/extras/CUPTI/include/cuda_stdint.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cudaGL_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cudaGL_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_cuda_vdpau_interop_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_vdpau_interop_meta.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_metrics.h" "$(@D)/cuda/extras/CUPTI/include/cupti_metrics.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_callbacks.h" "$(@D)/cuda/extras/CUPTI/include/cupti_callbacks.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_runtime_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_runtime_cbid.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti.h" "$(@D)/cuda/extras/CUPTI/include/cupti.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glut.h" "$(@D)/cuda/extras/CUPTI/include/GL/glut.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glu.h" "$(@D)/cuda/extras/CUPTI/include/GL/glu.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glxext.h" "$(@D)/cuda/extras/CUPTI/include/GL/glxext.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/wglext.h" "$(@D)/cuda/extras/CUPTI/include/GL/wglext.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glx.h" "$(@D)/cuda/extras/CUPTI/include/GL/glx.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glext.h" "$(@D)/cuda/extras/CUPTI/include/GL/glext.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/wglew.h" "$(@D)/cuda/extras/CUPTI/include/GL/wglew.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/gl.h" "$(@D)/cuda/extras/CUPTI/include/GL/gl.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/GL/glew.h" "$(@D)/cuda/extras/CUPTI/include/GL/glew.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/cupti_driver_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_driver_cbid.h" && cp "/usr/local/cuda-8.0/extras/CUPTI/include/generated_nvtx_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_nvtx_meta.h"
+if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/gl.h" "$(@D)/cuda/extras/CUPTI/include/GL/gl.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glew.h" "$(@D)/cuda/extras/CUPTI/include/GL/glew.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glext.h" "$(@D)/cuda/extras/CUPTI/include/GL/glext.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glu.h" "$(@D)/cuda/extras/CUPTI/include/GL/glu.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glut.h" "$(@D)/cuda/extras/CUPTI/include/GL/glut.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glx.h" "$(@D)/cuda/extras/CUPTI/include/GL/glx.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/glxext.h" "$(@D)/cuda/extras/CUPTI/include/GL/glxext.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/wglew.h" "$(@D)/cuda/extras/CUPTI/include/GL/wglew.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/GL/wglext.h" "$(@D)/cuda/extras/CUPTI/include/GL/wglext.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cuda_stdint.h" "$(@D)/cuda/extras/CUPTI/include/cuda_stdint.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti.h" "$(@D)/cuda/extras/CUPTI/include/cupti.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_activity.h" "$(@D)/cuda/extras/CUPTI/include/cupti_activity.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_callbacks.h" "$(@D)/cuda/extras/CUPTI/include/cupti_callbacks.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_driver_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_driver_cbid.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_events.h" "$(@D)/cuda/extras/CUPTI/include/cupti_events.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_metrics.h" "$(@D)/cuda/extras/CUPTI/include/cupti_metrics.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_nvtx_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_nvtx_cbid.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_result.h" "$(@D)/cuda/extras/CUPTI/include/cupti_result.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_runtime_cbid.h" "$(@D)/cuda/extras/CUPTI/include/cupti_runtime_cbid.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/cupti_version.h" "$(@D)/cuda/extras/CUPTI/include/cupti_version.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cudaGL_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cudaGL_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cudaVDPAU_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cudaVDPAU_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cuda_gl_interop_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_gl_interop_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cuda_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cuda_runtime_api_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_runtime_api_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_cuda_vdpau_interop_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_cuda_vdpau_interop_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/generated_nvtx_meta.h" "$(@D)/cuda/extras/CUPTI/include/generated_nvtx_meta.h" && cp "/usr/local/cuda-9.0/extras/CUPTI/include/openacc/cupti_openacc.h" "$(@D)/cuda/extras/CUPTI/include/openacc/cupti_openacc.h"
    """,
 )
 
@@ -1337,26 +1262,21 @@ genrule(
     name = "cuda-lib",
     outs = [
         "cuda/lib/libcuda.so",
-        "cuda/lib/libcudart.so.8.0",
+        "cuda/lib/libcudart.so.9.0",
         "cuda/lib/libcudart_static.a",
-        "cuda/lib/libcublas.so.8.0",
-        "cuda/lib/libcusolver.so.8.0",
-        "cuda/lib/libcurand.so.8.0",
-        "cuda/lib/libcufft.so.8.0",
-        "cuda/lib/libcudnn.so.6",
-        "cuda/lib/libcupti.so.8.0",
+        "cuda/lib/libcublas.so.9.0",
+        "cuda/lib/libcusolver.so.9.0",
+        "cuda/lib/libcurand.so.9.0",
+        "cuda/lib/libcufft.so.9.0",
+        "cuda/lib/libcudnn.so.7",
+        "cuda/lib/libcupti.so.9.0",
     ],
     cmd = """
-if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/stubs/libcuda.so" "$(@D)/cuda/lib/libcuda.so" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.61" "$(@D)/cuda/lib/libcudart.so.8.0" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart_static.a" "$(@D)/cuda/lib/libcudart_static.a" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcublas.so.8.0.88" "$(@D)/cuda/lib/libcublas.so.8.0" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcusolver.so.8.0.61" "$(@D)/cuda/lib/libcusolver.so.8.0" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcurand.so.8.0.61" "$(@D)/cuda/lib/libcurand.so.8.0" && cp "/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcufft.so.8.0.61" "$(@D)/cuda/lib/libcufft.so.8.0" && cp "/usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21" "$(@D)/cuda/lib/libcudnn.so.6" && cp "/usr/local/cuda-8.0/extras/CUPTI/lib64/libcupti.so.8.0.61" "$(@D)/cuda/lib/libcupti.so.8.0"
+if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs/libcuda.so" "$(@D)/cuda/lib/libcuda.so" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart.so.9.0.176" "$(@D)/cuda/lib/libcudart.so.9.0" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart_static.a" "$(@D)/cuda/lib/libcudart_static.a" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcublas.so.9.0.282" "$(@D)/cuda/lib/libcublas.so.9.0" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcusolver.so.9.0.176" "$(@D)/cuda/lib/libcusolver.so.9.0" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcurand.so.9.0.176" "$(@D)/cuda/lib/libcurand.so.9.0" && cp "/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcufft.so.9.0.176" "$(@D)/cuda/lib/libcufft.so.9.0" && cp "/usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5" "$(@D)/cuda/lib/libcudnn.so.7" && cp "/usr/local/cuda-9.0/extras/CUPTI/lib64/libcupti.so.9.0.176" "$(@D)/cuda/lib/libcupti.so.9.0"
    """,
 )
 
-genrule(
+filegroup(
     name = "cudnn-include",
-    outs = [
-        "cuda/include/cudnn.h",
-    ],
-    cmd = """
-if [ -d "$(@D)/extras" ]; then rm $(@D)/extras -drf; fi && if [ -d "$(@D)/include" ]; then rm $(@D)/include -drf; fi && if [ -d "$(@D)/lib" ]; then rm $(@D)/lib -drf; fi && if [ -d "$(@D)/nvvm" ]; then rm $(@D)/nvvm -drf; fi && cp "/usr/include/cudnn.h" "$(@D)/cudnn.h"
-   """,
+    srcs = [],
 )
diff --git a/third_party/toolchains/gpus/py/BUILD b/third_party/toolchains/gpus/py/BUILD
new file mode 100644
index 0000000000000000000000000000000000000000..2d5ace93ff5054927cda61b0302af4edd8fe56c1
--- /dev/null
+++ b/third_party/toolchains/gpus/py/BUILD
@@ -0,0 +1,171 @@
+# A build file to configure python remote repository used with Bazel remote
+# execution service
+# DO NOT EDIT: automatically generated BUILD file
+
+licenses(["restricted"])
+
+package(default_visibility = ["//visibility:public"])
+
+cc_library(
+    name = "python_headers",
+    hdrs = [":python_include"],
+    data = select({
+        ":windows": [":python_import_lib"],
+        "//conditions:default": [],
+    }),
+    includes = ["python_include"],
+    linkopts = select({
+        # TODO(pcloudy): Ideally, this should just go into deps after resolving
+        # https://github.com/bazelbuild/bazel/issues/3237,
+        ":windows": ["$(locations :python_import_lib)"],
+        "//conditions:default": [],
+    }),
+)
+
+cc_library(
+    name = "numpy_headers",
+    hdrs = [":numpy_include"],
+    includes = ["numpy_include"],
+)
+
+config_setting(
+    name = "windows",
+    values = {"cpu": "x64_windows"},
+    visibility = ["//visibility:public"],
+)
+
+genrule(
+    name = "python_include",
+    outs = [
+        "python_include/Python-ast.h",
+        "python_include/Python.h",
+        "python_include/abstract.h",
+        "python_include/asdl.h",
+        "python_include/ast.h",
+        "python_include/bitset.h",
+        "python_include/boolobject.h",
+        "python_include/bufferobject.h",
+        "python_include/bytearrayobject.h",
+        "python_include/bytes_methods.h",
+        "python_include/bytesobject.h",
+        "python_include/cStringIO.h",
+        "python_include/cellobject.h",
+        "python_include/ceval.h",
+        "python_include/classobject.h",
+        "python_include/cobject.h",
+        "python_include/code.h",
+        "python_include/codecs.h",
+        "python_include/compile.h",
+        "python_include/complexobject.h",
+        "python_include/datetime.h",
+        "python_include/descrobject.h",
+        "python_include/dictobject.h",
+        "python_include/dtoa.h",
+        "python_include/enumobject.h",
+        "python_include/errcode.h",
+        "python_include/eval.h",
+        "python_include/fileobject.h",
+        "python_include/floatobject.h",
+        "python_include/frameobject.h",
+        "python_include/funcobject.h",
+        "python_include/genobject.h",
+        "python_include/graminit.h",
+        "python_include/grammar.h",
+        "python_include/import.h",
+        "python_include/intobject.h",
+        "python_include/intrcheck.h",
+        "python_include/iterobject.h",
+        "python_include/listobject.h",
+        "python_include/longintrepr.h",
+        "python_include/longobject.h",
+        "python_include/marshal.h",
+        "python_include/memoryobject.h",
+        "python_include/metagrammar.h",
+        "python_include/methodobject.h",
+        "python_include/modsupport.h",
+        "python_include/moduleobject.h",
+        "python_include/node.h",
+        "python_include/object.h",
+        "python_include/objimpl.h",
+        "python_include/opcode.h",
+        "python_include/osdefs.h",
+        "python_include/parsetok.h",
+        "python_include/patchlevel.h",
+        "python_include/pgen.h",
+        "python_include/pgenheaders.h",
+        "python_include/py_curses.h",
+        "python_include/pyarena.h",
+        "python_include/pycapsule.h",
+        "python_include/pyconfig.h",
+        "python_include/pyctype.h",
+        "python_include/pydebug.h",
+        "python_include/pyerrors.h",
+        "python_include/pyexpat.h",
+        "python_include/pyfpe.h",
+        "python_include/pygetopt.h",
+        "python_include/pymacconfig.h",
+        "python_include/pymactoolbox.h",
+        "python_include/pymath.h",
+        "python_include/pymem.h",
+        "python_include/pyport.h",
+        "python_include/pystate.h",
+        "python_include/pystrcmp.h",
+        "python_include/pystrtod.h",
+        "python_include/pythonrun.h",
+        "python_include/pythread.h",
+        "python_include/rangeobject.h",
+        "python_include/setobject.h",
+        "python_include/sliceobject.h",
+        "python_include/stringobject.h",
+        "python_include/structmember.h",
+        "python_include/structseq.h",
+        "python_include/symtable.h",
+        "python_include/sysmodule.h",
+        "python_include/timefuncs.h",
+        "python_include/token.h",
+        "python_include/traceback.h",
+        "python_include/tupleobject.h",
+        "python_include/ucnhash.h",
+        "python_include/unicodeobject.h",
+        "python_include/warnings.h",
+        "python_include/weakrefobject.h",
+    ],
+    cmd = """
+cp "/usr/include/python2.7/Python-ast.h" "$(@D)/python_include/Python-ast.h" && cp "/usr/include/python2.7/Python.h" "$(@D)/python_include/Python.h" && cp "/usr/include/python2.7/abstract.h" "$(@D)/python_include/abstract.h" && cp "/usr/include/python2.7/asdl.h" "$(@D)/python_include/asdl.h" && cp "/usr/include/python2.7/ast.h" "$(@D)/python_include/ast.h" && cp "/usr/include/python2.7/bitset.h" "$(@D)/python_include/bitset.h" && cp "/usr/include/python2.7/boolobject.h" "$(@D)/python_include/boolobject.h" && cp "/usr/include/python2.7/bufferobject.h" "$(@D)/python_include/bufferobject.h" && cp "/usr/include/python2.7/bytearrayobject.h" "$(@D)/python_include/bytearrayobject.h" && cp "/usr/include/python2.7/bytes_methods.h" "$(@D)/python_include/bytes_methods.h" && cp "/usr/include/python2.7/bytesobject.h" "$(@D)/python_include/bytesobject.h" && cp "/usr/include/python2.7/cStringIO.h" "$(@D)/python_include/cStringIO.h" && cp "/usr/include/python2.7/cellobject.h" "$(@D)/python_include/cellobject.h" && cp "/usr/include/python2.7/ceval.h" "$(@D)/python_include/ceval.h" && cp "/usr/include/python2.7/classobject.h" "$(@D)/python_include/classobject.h" && cp "/usr/include/python2.7/cobject.h" "$(@D)/python_include/cobject.h" && cp "/usr/include/python2.7/code.h" "$(@D)/python_include/code.h" && cp "/usr/include/python2.7/codecs.h" "$(@D)/python_include/codecs.h" && cp "/usr/include/python2.7/compile.h" "$(@D)/python_include/compile.h" && cp "/usr/include/python2.7/complexobject.h" "$(@D)/python_include/complexobject.h" && cp "/usr/include/python2.7/datetime.h" "$(@D)/python_include/datetime.h" && cp "/usr/include/python2.7/descrobject.h" "$(@D)/python_include/descrobject.h" && cp "/usr/include/python2.7/dictobject.h" "$(@D)/python_include/dictobject.h" && cp "/usr/include/python2.7/dtoa.h" "$(@D)/python_include/dtoa.h" && cp "/usr/include/python2.7/enumobject.h" "$(@D)/python_include/enumobject.h" && cp "/usr/include/python2.7/errcode.h" "$(@D)/python_include/errcode.h" && cp "/usr/include/python2.7/eval.h" "$(@D)/python_include/eval.h" && cp "/usr/include/python2.7/fileobject.h" "$(@D)/python_include/fileobject.h" && cp "/usr/include/python2.7/floatobject.h" "$(@D)/python_include/floatobject.h" && cp "/usr/include/python2.7/frameobject.h" "$(@D)/python_include/frameobject.h" && cp "/usr/include/python2.7/funcobject.h" "$(@D)/python_include/funcobject.h" && cp "/usr/include/python2.7/genobject.h" "$(@D)/python_include/genobject.h" && cp "/usr/include/python2.7/graminit.h" "$(@D)/python_include/graminit.h" && cp "/usr/include/python2.7/grammar.h" "$(@D)/python_include/grammar.h" && cp "/usr/include/python2.7/import.h" "$(@D)/python_include/import.h" && cp "/usr/include/python2.7/intobject.h" "$(@D)/python_include/intobject.h" && cp "/usr/include/python2.7/intrcheck.h" "$(@D)/python_include/intrcheck.h" && cp "/usr/include/python2.7/iterobject.h" "$(@D)/python_include/iterobject.h" && cp "/usr/include/python2.7/listobject.h" "$(@D)/python_include/listobject.h" && cp "/usr/include/python2.7/longintrepr.h" "$(@D)/python_include/longintrepr.h" && cp "/usr/include/python2.7/longobject.h" "$(@D)/python_include/longobject.h" && cp "/usr/include/python2.7/marshal.h" "$(@D)/python_include/marshal.h" && cp "/usr/include/python2.7/memoryobject.h" "$(@D)/python_include/memoryobject.h" && cp "/usr/include/python2.7/metagrammar.h" "$(@D)/python_include/metagrammar.h" && cp "/usr/include/python2.7/methodobject.h" "$(@D)/python_include/methodobject.h" && cp "/usr/include/python2.7/modsupport.h" "$(@D)/python_include/modsupport.h" && cp "/usr/include/python2.7/moduleobject.h" "$(@D)/python_include/moduleobject.h" && cp "/usr/include/python2.7/node.h" "$(@D)/python_include/node.h" && cp "/usr/include/python2.7/object.h" "$(@D)/python_include/object.h" && cp "/usr/include/python2.7/objimpl.h" "$(@D)/python_include/objimpl.h" && cp "/usr/include/python2.7/opcode.h" "$(@D)/python_include/opcode.h" && cp "/usr/include/python2.7/osdefs.h" "$(@D)/python_include/osdefs.h" && cp "/usr/include/python2.7/parsetok.h" "$(@D)/python_include/parsetok.h" && cp "/usr/include/python2.7/patchlevel.h" "$(@D)/python_include/patchlevel.h" && cp "/usr/include/python2.7/pgen.h" "$(@D)/python_include/pgen.h" && cp "/usr/include/python2.7/pgenheaders.h" "$(@D)/python_include/pgenheaders.h" && cp "/usr/include/python2.7/py_curses.h" "$(@D)/python_include/py_curses.h" && cp "/usr/include/python2.7/pyarena.h" "$(@D)/python_include/pyarena.h" && cp "/usr/include/python2.7/pycapsule.h" "$(@D)/python_include/pycapsule.h" && cp "/usr/include/python2.7/pyconfig.h" "$(@D)/python_include/pyconfig.h" && cp "/usr/include/python2.7/pyctype.h" "$(@D)/python_include/pyctype.h" && cp "/usr/include/python2.7/pydebug.h" "$(@D)/python_include/pydebug.h" && cp "/usr/include/python2.7/pyerrors.h" "$(@D)/python_include/pyerrors.h" && cp "/usr/include/python2.7/pyexpat.h" "$(@D)/python_include/pyexpat.h" && cp "/usr/include/python2.7/pyfpe.h" "$(@D)/python_include/pyfpe.h" && cp "/usr/include/python2.7/pygetopt.h" "$(@D)/python_include/pygetopt.h" && cp "/usr/include/python2.7/pymacconfig.h" "$(@D)/python_include/pymacconfig.h" && cp "/usr/include/python2.7/pymactoolbox.h" "$(@D)/python_include/pymactoolbox.h" && cp "/usr/include/python2.7/pymath.h" "$(@D)/python_include/pymath.h" && cp "/usr/include/python2.7/pymem.h" "$(@D)/python_include/pymem.h" && cp "/usr/include/python2.7/pyport.h" "$(@D)/python_include/pyport.h" && cp "/usr/include/python2.7/pystate.h" "$(@D)/python_include/pystate.h" && cp "/usr/include/python2.7/pystrcmp.h" "$(@D)/python_include/pystrcmp.h" && cp "/usr/include/python2.7/pystrtod.h" "$(@D)/python_include/pystrtod.h" && cp "/usr/include/python2.7/pythonrun.h" "$(@D)/python_include/pythonrun.h" && cp "/usr/include/python2.7/pythread.h" "$(@D)/python_include/pythread.h" && cp "/usr/include/python2.7/rangeobject.h" "$(@D)/python_include/rangeobject.h" && cp "/usr/include/python2.7/setobject.h" "$(@D)/python_include/setobject.h" && cp "/usr/include/python2.7/sliceobject.h" "$(@D)/python_include/sliceobject.h" && cp "/usr/include/python2.7/stringobject.h" "$(@D)/python_include/stringobject.h" && cp "/usr/include/python2.7/structmember.h" "$(@D)/python_include/structmember.h" && cp "/usr/include/python2.7/structseq.h" "$(@D)/python_include/structseq.h" && cp "/usr/include/python2.7/symtable.h" "$(@D)/python_include/symtable.h" && cp "/usr/include/python2.7/sysmodule.h" "$(@D)/python_include/sysmodule.h" && cp "/usr/include/python2.7/timefuncs.h" "$(@D)/python_include/timefuncs.h" && cp "/usr/include/python2.7/token.h" "$(@D)/python_include/token.h" && cp "/usr/include/python2.7/traceback.h" "$(@D)/python_include/traceback.h" && cp "/usr/include/python2.7/tupleobject.h" "$(@D)/python_include/tupleobject.h" && cp "/usr/include/python2.7/ucnhash.h" "$(@D)/python_include/ucnhash.h" && cp "/usr/include/python2.7/unicodeobject.h" "$(@D)/python_include/unicodeobject.h" && cp "/usr/include/python2.7/warnings.h" "$(@D)/python_include/warnings.h" && cp "/usr/include/python2.7/weakrefobject.h" "$(@D)/python_include/weakrefobject.h"
+   """,
+)
+
+genrule(
+    name = "numpy_include",
+    outs = [
+        "numpy_include/numpy/__multiarray_api.h",
+        "numpy_include/numpy/__ufunc_api.h",
+        "numpy_include/numpy/_neighborhood_iterator_imp.h",
+        "numpy_include/numpy/_numpyconfig.h",
+        "numpy_include/numpy/arrayobject.h",
+        "numpy_include/numpy/arrayscalars.h",
+        "numpy_include/numpy/halffloat.h",
+        "numpy_include/numpy/multiarray_api.txt",
+        "numpy_include/numpy/ndarrayobject.h",
+        "numpy_include/numpy/ndarraytypes.h",
+        "numpy_include/numpy/noprefix.h",
+        "numpy_include/numpy/npy_1_7_deprecated_api.h",
+        "numpy_include/numpy/npy_3kcompat.h",
+        "numpy_include/numpy/npy_common.h",
+        "numpy_include/numpy/npy_cpu.h",
+        "numpy_include/numpy/npy_endian.h",
+        "numpy_include/numpy/npy_interrupt.h",
+        "numpy_include/numpy/npy_math.h",
+        "numpy_include/numpy/npy_no_deprecated_api.h",
+        "numpy_include/numpy/npy_os.h",
+        "numpy_include/numpy/numpyconfig.h",
+        "numpy_include/numpy/old_defines.h",
+        "numpy_include/numpy/oldnumeric.h",
+        "numpy_include/numpy/ufunc_api.txt",
+        "numpy_include/numpy/ufuncobject.h",
+        "numpy_include/numpy/utils.h",
+    ],
+    cmd = """
+cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/__multiarray_api.h" "$(@D)/numpy_include/numpy/__multiarray_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/__ufunc_api.h" "$(@D)/numpy_include/numpy/__ufunc_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/_neighborhood_iterator_imp.h" "$(@D)/numpy_include/numpy/_neighborhood_iterator_imp.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/_numpyconfig.h" "$(@D)/numpy_include/numpy/_numpyconfig.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h" "$(@D)/numpy_include/numpy/arrayobject.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayscalars.h" "$(@D)/numpy_include/numpy/arrayscalars.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/halffloat.h" "$(@D)/numpy_include/numpy/halffloat.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/multiarray_api.txt" "$(@D)/numpy_include/numpy/multiarray_api.txt" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h" "$(@D)/numpy_include/numpy/ndarrayobject.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h" "$(@D)/numpy_include/numpy/ndarraytypes.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/noprefix.h" "$(@D)/numpy_include/numpy/noprefix.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_1_7_deprecated_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_3kcompat.h" "$(@D)/numpy_include/numpy/npy_3kcompat.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_common.h" "$(@D)/numpy_include/numpy/npy_common.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_cpu.h" "$(@D)/numpy_include/numpy/npy_cpu.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_endian.h" "$(@D)/numpy_include/numpy/npy_endian.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_interrupt.h" "$(@D)/numpy_include/numpy/npy_interrupt.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_math.h" "$(@D)/numpy_include/numpy/npy_math.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_no_deprecated_api.h" "$(@D)/numpy_include/numpy/npy_no_deprecated_api.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_os.h" "$(@D)/numpy_include/numpy/npy_os.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/numpyconfig.h" "$(@D)/numpy_include/numpy/numpyconfig.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/old_defines.h" "$(@D)/numpy_include/numpy/old_defines.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/oldnumeric.h" "$(@D)/numpy_include/numpy/oldnumeric.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ufunc_api.txt" "$(@D)/numpy_include/numpy/ufunc_api.txt" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ufuncobject.h" "$(@D)/numpy_include/numpy/ufuncobject.h" && cp "/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/utils.h" "$(@D)/numpy_include/numpy/utils.h"
+   """,
+)