Talk title: TASTI: Semantic Indexes for Machine Learning-based Queries over Unstructured Data
Paper source: SIGMOD 2022
Authors: Daniel Kang*, John Guibas*, Peter D. Bailis, Tatsunori Hashimoto, Matei Zaharia
Affiliation: Stanford University, Stanford, USA
Speaker: Wu Shengfa (吴胜发)
Time: November 21, 2022, 2:00 PM
Venue: Room 624, Boxue Building, North Campus, Guizhou University
Abstract: Unstructured data (e.g., video or text) is now commonly queried by using computationally expensive deep neural networks or human labelers to produce structured information, e.g., object types and positions in video. To accelerate queries, many recent systems (e.g., BlazeIt, NoScope, Tahoma, SUPG) train a query-specific proxy model to approximate a large target labeler (i.e., one of these expensive neural networks or human labelers). These models return proxy scores that are then used in query processing algorithms. Unfortunately, proxy models usually have to be trained per query and require large amounts of annotations from the target labelers. In this work, we develop an index (trainable semantic index, TASTI) that simultaneously removes the need for per-query proxies and is more efficient to construct than prior indexes. TASTI accomplishes this by leveraging semantic similarity across records in a given dataset. Specifically, it produces embeddings for each record such that records with close embeddings have similar target labeler outputs. TASTI then generates high-quality proxy scores via embeddings without needing to train a per-query proxy. These scores can be used in existing proxy-based query processing algorithms (e.g., for aggregation or selection). We theoretically analyze TASTI and show that a low embedding training error guarantees downstream query accuracy for a natural class of queries. We evaluate TASTI on five video, text, and speech datasets, and three query types. We show that TASTI’s indexes can be 10× less expensive to construct than generating annotations for current proxy-based methods, and accelerate queries by up to 24×.
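Illustrative sketch (not from the paper): the abstract says that records with close embeddings have similar target labeler outputs and that proxy scores are generated from the embeddings, but it does not say how. The toy Python example below shows one way this could work, assuming that a few records are annotated by the target labeler and that every other record takes an inverse-distance-weighted average of its nearest annotated neighbors. The names `proxy_scores`, `labeled`, and `k`, the weighting scheme, and the toy data are all hypothetical and should not be read as the authors' implementation.

```python
import math


def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def proxy_scores(embeddings, labeled, k=2, eps=1e-9):
    """Estimate a score for every record from its k nearest labeled records.

    embeddings: list of embedding vectors, one per record.
    labeled: dict mapping record index -> target-labeler output (a number).
    Returns a list of proxy scores, one per record.
    """
    scores = []
    for i, emb in enumerate(embeddings):
        if i in labeled:
            # Records that were actually annotated keep their exact label.
            scores.append(float(labeled[i]))
            continue
        # Find the k labeled records closest to this one in embedding space.
        nearest = sorted(
            (distance(emb, embeddings[j]), y) for j, y in labeled.items()
        )[:k]
        # Inverse-distance weighting: closer labeled records count more.
        weights = [1.0 / (d + eps) for d, _ in nearest]
        total = sum(w * y for w, (_, y) in zip(weights, nearest))
        scores.append(total / sum(weights))
    return scores


# Toy data (assumed): 2-D embeddings in two groups; the label is an object count.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
    (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
]
labeled = {0: 0, 3: 4}  # only two records were sent to the expensive labeler
scores = proxy_scores(embeddings, labeled, k=1)
print(scores)
```

In this toy setup only two of the six records are annotated, yet records in the first group receive scores at or near 0 and records in the second group scores at or near 4. A downstream aggregation or selection query could then use these scores in place of the output of a per-query proxy model.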