[tf.data / Bigtable] Parallel scan Bigtable tables
To stream data from Cloud Bigtable at high speed, it is important to use multiple connections and stream simultaneously from multiple tablet servers. This change adds two new methods to the BigTable object that set up a dataset based on the SampleKeys method and tf.contrib.data.parallel_interleave.

Because the keys returned from SampleKeys are not guaranteed to be deterministic (in fact, they can change over time without any new data being added to the table), the resulting datasets are not deterministic. To further boost performance, we also enable sloppy interleaving.

When comparing the table.scan_* methods with the table.parallel_scan_* methods on a test workload (based on ImageNet), we see performance gains of over 15x for the parallel methods, and of over 10x compared to a reasonably tuned GCS input pipeline.

PiperOrigin-RevId: 203945580
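The idea described above (split the key space at sampled keys, scan each sub-range on its own connection, and emit rows in whatever order they arrive) can be sketched in plain Python. This is only an illustration under stated assumptions, not the TensorFlow implementation: the table is a hypothetical in-memory dict, and `split_key_ranges`, `scan_range`, and `sloppy_parallel_scan` are made-up names standing in for SampleKeys, a range scan, and a sloppy parallel interleave.

```python
import queue
import threading


def split_key_ranges(sample_keys, start="", end=""):
    """Turn sample keys into contiguous [lo, hi) sub-ranges.

    An empty string means "unbounded" (an assumption of this sketch).
    """
    inner = [k for k in sorted(set(sample_keys))
             if k > start and (end == "" or k < end)]
    boundaries = [start] + inner + [end]
    return list(zip(boundaries[:-1], boundaries[1:]))


def scan_range(table, lo, hi):
    """Stand-in for a sequential range scan over a hypothetical dict table."""
    for key in sorted(table):
        if key >= lo and (hi == "" or key < hi):
            yield key, table[key]


def sloppy_parallel_scan(table, sample_keys):
    """Scan every sub-range in its own thread and yield rows as they arrive.

    Row order across sub-ranges is not deterministic, which mirrors the
    "sloppy" trade-off: throughput over ordering.
    """
    ranges = split_key_ranges(sample_keys)
    out = queue.Queue()
    done = object()  # sentinel marking the end of one sub-range

    def worker(lo, hi):
        for row in scan_range(table, lo, hi):
            out.put(row)
        out.put(done)

    threads = [threading.Thread(target=worker, args=r) for r in ranges]
    for t in threads:
        t.start()

    remaining = len(threads)
    while remaining:
        item = out.get()
        if item is done:
            remaining -= 1
        else:
            yield item

    for t in threads:
        t.join()
```

Every row is still returned exactly once; only the order differs between runs, so a consumer that needs a stable order has to sort or use the sequential scan instead.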