Commit 4125012a authored Jul 10, 2018 by Brennan Saeta Committed by TensorFlower Gardener Jul 10, 2018

[tf.data / Bigtable] Parallel scan Bigtable tables

To stream data from Cloud Bigtable at high speed, it's important to use multiple connections to stream simultaneously from multiple tablet servers. This change adds two new methods to the BigTable object that set up a dataset based on the SampleKeys method and tf.contrib.data.parallel_interleave. Because the keys returned from SampleKeys are not guaranteed to be deterministic (in fact, they can change over time without any new data being added to the table), the resulting datasets are not deterministic. (To further boost performance, we enable sloppy interleaving.)
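As a rough illustration of the idea (not the TensorFlow implementation in this change), the sketch below splits a key range at sampled keys, scans each sub-range on its own thread, and emits rows in whatever order they arrive, which mimics sloppy interleaving. It uses only the Python standard library; the in-memory `table` dict and the helper names `split_ranges`, `scan_range`, and `parallel_scan` are hypothetical stand-ins for a real Bigtable table and its scan calls.

```python
import queue
import threading


def split_ranges(start, end, sample_keys):
    """Turn sampled split points into contiguous [lo, hi) sub-ranges."""
    points = sorted(k for k in set(sample_keys) if start < k < end)
    bounds = [start] + points + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def scan_range(table, lo, hi):
    """Hypothetical sequential scan: yield (key, value) for lo <= key < hi."""
    for key in sorted(table):
        if lo <= key < hi:
            yield key, table[key]


def parallel_scan(table, start, end, sample_keys):
    """Scan each sub-range on its own thread; yield rows as they arrive."""
    ranges = split_ranges(start, end, sample_keys)
    out = queue.Queue()
    done = object()  # sentinel marking the end of one sub-range

    def worker(lo, hi):
        for row in scan_range(table, lo, hi):
            out.put(row)
        out.put(done)

    for lo, hi in ranges:
        threading.Thread(target=worker, args=(lo, hi), daemon=True).start()

    remaining = len(ranges)
    while remaining:
        item = out.get()
        if item is done:
            remaining -= 1
        else:
            yield item  # order across sub-ranges is not deterministic


# Stand-in for a table: 100 rows keyed "row000" .. "row099".
table = {"row%03d" % i: i for i in range(100)}
rows = list(parallel_scan(table, "row000", "row100",
                          ["row025", "row050", "row075"]))
```

Every row is returned exactly once, but the order across sub-ranges can differ from run to run, which is the same trade-off the parallel scan methods make in exchange for throughput.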

Comparing the table.scan_* methods with the table.parallel_scan_* methods on a test workload (based on ImageNet), we see performance gains of over 15x for the parallel methods, and over 10x compared to a reasonably tuned GCS input pipeline.

PiperOrigin-RevId: 203945580

parent 70592a56
