
Title: Communication Scheduling as a First-Class Citizen in Distributed Machine Learning Systems

Abstract: State-of-the-art machine learning systems rely on graph-based models, with the distributed training of these models being the norm in AI-powered production pipelines. The performance of these communication-heavy systems depends on the effective overlap of communication and computation. While the overlap challenge has been addressed in systems with simpler model representations, it remains an open problem in graph-based models.
In this work, we develop a communication scheduling system that achieves near-optimal overlap of communication and computation in graph-based models. Our system is implemented on top of TensorFlow and requires no changes to the model and no developer input. It improves throughput by up to 82% in inference and 20% in training, while also reducing the straggler effect by up to 2.8x. Part of our implementation has already been merged into the TensorFlow codebase; the rest is publicly available.
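As a rough illustration of the scheduling idea (this is a hypothetical sketch, not the paper's actual implementation), reordering parameter transfers so that the tensor the computation consumes first is sent first can reduce stall time on a serial link. The tensor names, unit transfer cost, and unit compute cost below are assumptions made for the example:

```python
def schedule_transfers(issued, consume_order):
    """Reorder issued transfers by the position at which the
    computation first consumes each tensor (earliest-needed first),
    instead of the arbitrary order in which requests were issued."""
    prio = {name: i for i, name in enumerate(consume_order)}
    return sorted(issued, key=lambda name: prio[name])

def idle_time(transfer_order, consume_order, xfer_cost=1.0, compute_cost=1.0):
    """Total time the computation stalls waiting for tensors, assuming
    transfers share one serial link (each taking xfer_cost) and run
    concurrently with compute ops that each need one tensor."""
    done, t = {}, 0.0
    for name in transfer_order:
        t += xfer_cost
        done[name] = t          # completion time of this transfer
    stall, clock = 0.0, 0.0
    for name in consume_order:
        if done[name] > clock:  # tensor not yet arrived: compute stalls
            stall += done[name] - clock
            clock = done[name]
        clock += compute_cost
    return stall
```

With transfers issued in the worst (reverse) order, the computation stalls for 3 time units in this toy setup; scheduling them by consumption order cuts the stall to 1 unit, illustrating why transfer ordering, not just bandwidth, determines overlap.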
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Learning (cs.LG); Performance (cs.PF)
Cite as: arXiv:1803.03288 [cs.DC]
  (or arXiv:1803.03288v1 [cs.DC] for this version)

Submission history

From: Sayed Hadi Hashemi [view email]
[v1] Thu, 8 Mar 2018 20:03:51 GMT (432kb,D)