Improve performance of several utility functions in TensorFlow
framework/types.h defines a variety of functions on DataType enums. Some of these functions are implemented by allocating arrays on the heap. Even though DataTypeVector is a typedef for InlinedVector, it only stores 4 elements inline, and many of the vectors used in types.h/types.cc contain more than 4 elements. To make matters worse, some of these functions are called quite frequently under load, so we're wasting time allocating and copying arrays.

The set of distinct DataType values is so small, however, that we can represent a set of DataType values as a bitmask, and use bit-shifts and tests instead of sequential scans of arrays.

Even the functions that do not allocate, such as DataTypeCanUseMemcpy(), are needlessly inefficient: they use control flow and indirect jumps when a simple table-based load would do, and they are not inlined. These costs consumed about 1.2% of CPU cycles under heavy load.

The surprising cost of DataTypeCanUseMemcpy() inspired this change. I made the change fully general by adding a DataTypeSet type and changing all of the utility functions in framework/types.h to use it, with the exception of DataTypeAlwaysOnHost, which uses a _REF type.

PiperOrigin-RevId: 181695458
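
For illustration, the bitmask idea described in this message might look roughly like the sketch below. This is not the code from the change itself: the enum is a trimmed, hypothetical stand-in for DataType with illustrative values, the set of memcpy-able types is made up for the example, and the names ToSetSketch, kMemcpyableSketch, and CanUseMemcpySketch are assumptions. It also assumes every enum value fits in a 32-bit mask.

```cpp
#include <cstdint>

// Hypothetical, trimmed stand-in for the DataType enum (values illustrative).
enum DataType : uint32_t {
  DT_INVALID = 0,
  DT_FLOAT = 1,
  DT_DOUBLE = 2,
  DT_INT32 = 3,
  DT_STRING = 7,
  DT_BOOL = 10,
};

// A set of DataType values stored as one bit per enum value.
class DataTypeSet {
 public:
  constexpr DataTypeSet() : mask_(0) {}
  constexpr explicit DataTypeSet(uint32_t mask) : mask_(mask) {}

  // Membership is a shift and a bit test; no allocation, no array scan.
  constexpr bool Contains(DataType dt) const {
    return static_cast<uint32_t>(dt) < 32 &&
           ((mask_ >> static_cast<uint32_t>(dt)) & 1u) != 0;
  }

  // Set union is a bitwise OR of the two masks.
  constexpr DataTypeSet operator|(DataTypeSet other) const {
    return DataTypeSet(mask_ | other.mask_);
  }

 private:
  uint32_t mask_;
};

// Builds a single-element set (hypothetical helper name).
constexpr DataTypeSet ToSetSketch(DataType dt) {
  return DataTypeSet(1u << static_cast<uint32_t>(dt));
}

// Illustrative set only; the real list of memcpy-able types is longer.
constexpr DataTypeSet kMemcpyableSketch =
    ToSetSketch(DT_FLOAT) | ToSetSketch(DT_DOUBLE) |
    ToSetSketch(DT_INT32) | ToSetSketch(DT_BOOL);

// Small enough to inline at every call site.
inline bool CanUseMemcpySketch(DataType dt) {
  return kMemcpyableSketch.Contains(dt);
}
```

Because the set is a compile-time constant, a call such as CanUseMemcpySketch(DT_FLOAT) reduces to a shift and a mask test, which is the kind of saving the message describes over scanning a heap-allocated vector.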