Academic Talk
Gandiva: Introspective Cluster Scheduling for Deep Learning
Date: 2018-12-03    Editor: Research Office, College of Information Science and Engineering

Time: December 6, 2018, 14:30

Venue: Room 106, College of Information Science and Engineering, Hunan University

Title: Gandiva: Introspective Cluster Scheduling for Deep Learning (published at OSDI 2018)

Speaker: Wencong Xiao, a fifth-year Ph.D. student jointly trained by Beihang University and Microsoft Research Asia, currently interning with the Systems group at Microsoft Research Asia. His research focuses on machine learning systems and infrastructure, and he has published multiple papers at top systems conferences including OSDI, NSDI, and SOSP.

Abstract: We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific knowledge to improve latency and efficiency of training deep learning models in a GPU cluster. One key characteristic of deep learning is feedback-driven exploration, where a user often runs a set of jobs (or a multi-job) to achieve the best result for a specific mission and uses early feedback on accuracy to dynamically prioritize or kill a subset of jobs; simultaneous early feedback on the entire multi-job is critical. A second characteristic is the heterogeneity of deep learning jobs in terms of resource usage, making it hard to achieve best-fit a priori. Gandiva addresses these two challenges by exploiting a third key characteristic of deep learning: intra-job predictability, as jobs perform numerous repetitive iterations called mini-batch iterations. Gandiva exploits intra-job predictability to time-slice GPUs efficiently across multiple jobs, thereby delivering low latency. This predictability is also used for introspecting job performance and dynamically migrating jobs to better-fit GPUs, thereby improving cluster efficiency. We show via a prototype implementation and microbenchmarks that Gandiva can speed up hyper-parameter searches during deep learning by up to an order of magnitude, and achieves better utilization by transparently migrating and time-slicing jobs to achieve better job-to-resource fit. We also show that, in a real workload of jobs running in a 180-GPU cluster, Gandiva improves aggregate cluster utilization by 26%, pointing to a new way of managing large GPU clusters for deep learning.
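The core scheduling idea in the abstract — rotating multiple training jobs through a fixed pool of GPUs, suspending and resuming only at mini-batch boundaries where the cost is low — can be sketched in a few lines of Python. This is an illustrative toy model only, not Gandiva's actual implementation; the `TimeSlicingScheduler` class and its method names are invented for exposition.

```python
from collections import deque

class Gpu:
    """A single GPU that runs at most one job at a time (toy model)."""
    def __init__(self, gpu_id):
        self.gpu_id = gpu_id
        self.current = None  # job currently holding this GPU's slice

class TimeSlicingScheduler:
    """Round-robin time-slicing of training jobs across GPUs.

    Jobs are suspended and requeued only at mini-batch boundaries,
    where GPU memory pressure is lowest, so a context switch is cheap.
    Names here are illustrative, not Gandiva's real API.
    """
    def __init__(self, num_gpus):
        self.gpus = [Gpu(i) for i in range(num_gpus)]
        self.queue = deque()  # jobs waiting for their next slice

    def submit(self, job):
        self.queue.append(job)

    def schedule_round(self):
        """Rotate: suspend each running job, then hand every GPU
        the next waiting job for one time slice."""
        placements = []
        for gpu in self.gpus:
            if gpu.current is not None:
                # suspend at the mini-batch boundary and requeue
                self.queue.append(gpu.current)
                gpu.current = None
            if self.queue:
                gpu.current = self.queue.popleft()
                placements.append((gpu.gpu_id, gpu.current))
        return placements

# Three jobs share two GPUs: every job gets early feedback instead of
# the third job waiting for a full training run to finish.
sched = TimeSlicingScheduler(num_gpus=2)
for name in ["job-a", "job-b", "job-c"]:
    sched.submit(name)

round1 = sched.schedule_round()  # job-a and job-b get GPUs; job-c waits
round2 = sched.schedule_round()  # job-c rotates in, job-a keeps a slice
```

The point of the sketch is the fairness property the talk highlights: with time-slicing, all jobs in a multi-job make progress concurrently, which is what enables simultaneous early accuracy feedback across the whole set.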