2022-11-29 Paper Report: In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

Report Title: In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

Published in: SIGMOD 2022

Authors: Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang

Affiliations: ETH Zürich; State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; University of Chicago; Tsinghua University; Kwai Inc.; Microsoft Research; University of Michigan

Presenter: 许玉田

Time: 9:00 AM, November 29, 2022

Location: Room 624, Boxue Building, North Campus, Guizhou University

Abstract: Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study on existing data shuffling strategies, which reveals that all existing strategies have room for improvement: they suffer in terms of either I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate of SGD comparable to that of a full shuffle. We provide a non-trivial theoretical analysis of CorgiPile on its convergence behavior. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD while running 1.6×-12.8× faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
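To make the idea of a hierarchical shuffle concrete, the following is a minimal Python sketch of a two-level (block-then-tuple) shuffle in the spirit of what the abstract describes: shuffling the order in which on-disk blocks are read keeps I/O largely sequential, while shuffling tuples inside a small in-memory buffer restores enough randomness for SGD. The function name hierarchical_shuffle, the buffer_blocks parameter, and the in-memory list representation are illustrative assumptions for exposition only; the paper's actual implementation is integrated into PostgreSQL as physical operators.

import random
from typing import Iterator, List, Sequence

def hierarchical_shuffle(
    blocks: Sequence[Sequence],  # dataset laid out as blocks of tuples, as on disk
    buffer_blocks: int = 4,      # how many blocks fit in the in-memory shuffle buffer
    seed: int = 0,
) -> Iterator:
    """Yield tuples in a two-level shuffled order:
    1) shuffle the order in which blocks are read (each block is still read sequentially);
    2) collect a few blocks into a small buffer and shuffle the tuples inside it.
    """
    rng = random.Random(seed)
    block_order = list(range(len(blocks)))
    rng.shuffle(block_order)                 # block-level shuffle

    buffer: List = []
    filled = 0
    for idx in block_order:
        buffer.extend(blocks[idx])           # sequential read of one block
        filled += 1
        if filled == buffer_blocks:
            rng.shuffle(buffer)              # tuple-level shuffle within the buffer
            yield from buffer
            buffer.clear()
            filled = 0
    if buffer:                               # flush the last, partially filled buffer
        rng.shuffle(buffer)
        yield from buffer

# Illustrative use: stream the shuffled tuples into an ordinary SGD loop.
if __name__ == "__main__":
    data_blocks = [[(i, i % 2) for i in range(b * 100, (b + 1) * 100)]
                   for b in range(10)]
    for x, y in hierarchical_shuffle(data_blocks, buffer_blocks=3):
        pass  # apply one SGD step per tuple here

The sketch only illustrates the trade-off named in the abstract: a full shuffle would randomize every tuple at the cost of random I/O, whereas the two-level scheme trades a bounded amount of randomness (limited by the buffer size) for sequential block reads.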

