2022/01/10 | I am teaching CMPT 733: Big Data Programming and CMPT 354: Database Systems I for Spring Semester 2022. |
2021/08/17 | We are thrilled to win the VLDB Best Experiments, Analysis & Benchmark Paper Award (2021) |
2021/07/26 | We are super excited to announce the release of ConnectorX 0.2 (a subproject in dataprep.ai). ConnectorX is the fastest library to load data from DB to DataFrames in Rust and Python. It can accelerate Pandas read_sql by 10x with one line of code. Since its first release, the library has been downloaded ~12,000 times. Please check out our blog post and benchmark results. |
2021/07/15 | Want to know how to debug an ML model in federated learning? Please check out our recent paper, entitled "Enabling SQL-based Training Data Debugging for Federated Learning", in VLDB 2022! |
2021/04/21 | I was invited to give two talks in the Thomson Reuters AI@TR Invited Speaker Series on our AutoML-EM and DataPrep projects. |
2021/04/08 | Congratulations to Brandon Lockhart on his successful M.Sc. thesis defense! In his dissertation he studies "Explaining Inference Queries with Bayesian Optimization". Learn more about Brandon and his research from this page. |
2021/03/18 | Congratulations to Dr. Pei Wang on her successful Ph.D. thesis defense! Her thesis is about "Automating Data Preparation with Statistical Analysis", which covers her work on AutoML-EM@ICDE 2021, ActiveDeeper Demo@VLDB 2020, Deeper@SIGMOD 2019, Uni-Detect@SIGMOD 2019, and Deeper Demo@SIGMOD 2018. |
2021/03/15 | I am thrilled to receive a CS-Can|Info-Can Outstanding Early Career Researcher Award (2020). [SFU News] |
2021/03/15 | Want to know whether we are ready to deploy learned cardinality models in production database systems? Please check out our recent paper in VLDB 2021! |
2021/03/11 | Our DataPrep.EDA paper got accepted by SIGMOD 2021! |
2021/01/26 | I am serving as a General Co-chair for VLDB 2023 @ Vancouver. |
2021/01/06 | I was invited to give a talk at Databricks to introduce DataPrep: The easiest way to prepare data in Python. |
2020/12/17 | Congratulations to Xiaoying Wang on her successful M.Sc. thesis defense! In her dissertation she studies "Are We Ready For Learned Cardinality Estimation?". |
2020/12/17 | A new paper on "Automating Entity Matching Model Development" got accepted by ICDE 2021! |
2020/11/12 | My Ph.D. students (Pei Wang and Weiyuan Wu) gave a talk at PyData Global 2020, the world's premier data science conference (video). |
2020/11/09 | We are pleased to announce the launch of DataPrep's brand new website: http://dataprep.ai! |
2020/10/01 | Received the "Distinguished PVLDB Review Board Member" award! |
2020/09/05 | Welcome Danrui Qi (Ph.D.) to the lab! Congratulations to her on receiving the prestigious Graduate Dean's Entrance Scholarship (GDES)! |
2020/09/01 | Promoted to Associate Professor (Tenure)! |
2020/05/07 | We wrote two blog posts to describe i) how to use DataPrep.EDA to accelerate EDA and ii) why DataPrep.EDA is better than Pandas-profiling. |
2020/03/20 | We are super excited to announce the release of DataPrep 0.2. DataPrep aims to become the "scikit-learn" of data preparation. Since its first release, the library has been downloaded ~4,000 times. This release contains a data connector component to facilitate web data collection and an exploratory data analysis component to enable fast data understanding. More components will be added in future releases. |
2020/03/20 | I am currently serving as an Associate Editor for VLDB 2021. |
2020/03/13 | Want to know how to debug training data for SQL-ML queries? Please check out our recent paper, entitled "Complaint-driven Training Data Debugging for Query 2.0", in SIGMOD 2020! |
2020/01/03 | Want to know how to detect data errors for machine learning applications? Please check out our recent paper, entitled "SCODED: Statistical Constraint Oriented Data Error Detection", in SIGMOD 2020! |
2019/09/22 | I am honored to be invited to serve as an Associate Editor for VLDB 2021. |
2019/09/03 | Welcome 130+ new professional master's students! I gave a welcome speech at student orientation sessions. |
2019/09/01 | Welcome new lab members: Brandon Lockhart (M.Sc.) and Yi Xie (M.Sc.)! |
2019/08/10 | Want to know how to automatically extract highlights (i.e., attractive short video clips) from massive recorded live videos? Please check out our recent paper, entitled "Towards Extracting Highlights From Recorded Live Videos: An Implicit Crowdsourcing Approach.", in ICDE 2020! |
2019/07/17 | Tianzheng Wang and I created a new website for the SFU Data Science Research Group. |
2019/06/01 | I took over the Director role of the Professional Master's Programs in Big Data and Visual Computing. |
2019/02/01 | Steven Bergner and I created the SFU Big Data Science Publication on Medium. |
2019/01/19 | Want to enrich your local database with Deep Websites (e.g., Yelp, IMDb, DBLP)? Please check out our recent paper, entitled "Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment.", in SIGMOD 2019! |
2019/01/11 | I visited the Data Analytics and Intelligence Lab (DAIL) at Alibaba Group (hosted by Dr. Bolin Ding) and gave a talk to introduce our lab's research. |
2019/01/10 | I visited the Data Management, Exploration and Mining (DMX) group at Microsoft Research (hosted by Dr. Yeye He) and gave a talk to introduce our lab's research. |
2019/01/03 | I am teaching CMPT 733: Big Data Programming and CMPT 843: Traditional vs. Modern Database Systems for Spring Semester 2019. |
2019/01/02 | Welcome new lab members: Xiaoying Wang (Master) and Lydia Zheng (Undergrad)! |
2018/11/16 | Want to fill the gap between learning with noisy labels and ground-truth labels? Please check out our recent paper, entitled "Cleaning Crowdsourced Labels Using Oracles For Statistical Classification.", in VLDB 2019! |
2018/09/08 | I am teaching CMPT 354: Database Systems for Fall Semester 2018. |
2018/09/05 | Received a Mitacs Accelerate fund ($990,000) for our proposal: "Democratizing Data Preparation for AI" (PI). |
2018/08/31 | Congratulations to Mohamad Dolatshah for successfully defending his MSc thesis. |
2018/08/03 | I visited the Product Graph Team at Amazon (hosted by Dr. Xin Luna Dong) and gave a talk to introduce our lab's research. |
2018/04/18 | I am thrilled to win the IEEE TCDE Rising Star Award for my contribution to human-in-the-loop data analytics. |
2018/02/24 | The Deeper demo paper got accepted by SIGMOD 2018. |
2018/01/31 | We are super excited to announce the release of Deeper v0.1, a data enrichment system powered by deep web (system, video, paper). |
2018/01/05 | I was invited to give a talk to introduce SFU DB/DM group at the NWDS Annual Meeting (slides). |
2018/01/03 | I am teaching CMPT 733: Big Data Programming and CMPT 843: Traditional vs. Modern Database Systems for Spring Semester 2018. |
2018/01/01 | Welcome Liang Zhao (Postdoc) to our lab. Dr. Zhao got her Ph.D. from Tsinghua University, and she is interested in data cleaning for machine learning. |
2017/12/15 | Congratulations to Ruochen Jiang, who won an SFU entrance scholarship. |
2017/11/24 | Welcome Xi Yang to our lab. Xi Yang is an undergrad from the SFU/ZJU Dual Degree Program. He is interested in interactive analytics over Big Data. |
2017/11/03 | Want to analyze Big Data interactively? Please check out our recent paper on interactive analytics, entitled "AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics.", in SIGMOD 2018! |
2017/08/26 | Our paper, entitled "Preference-driven Similarity Join", won a Best Student Paper Award at the IEEE/WIC/ACM WI 2017 conference. |
2017/07/24 | I visited the UBC database group and gave a talk entitled Speeding Up Data Science: From a Data Management Perspective. |
2017/06/07 | Received an NSERC CRD Grant for our proposal: "Entity Augmentation and Data Cleaning for Machine Learning" (PI). |
2017/05/14 | We gave a tutorial entitled "Crowdsourced Data Management: Overview and Challenges" at the ACM SIGMOD 2017 conference. The slides can be downloaded from here. |
2017/05/01 | Welcome new MSc students, Changbo Qu and Young Wu, to our lab. |
2017/01/04 | I am teaching CMPT 843: Traditional vs. Modern Database Systems and CMPT 733: Big Data Programming II for Spring Semester 2017. |
2016/11/20 | Want to improve query performance for your big data systems? Please check out our recent paper on data skipping, entitled "Skipping-oriented Partitioning for Columnar Layouts.", in VLDB 2017! |
2016/11/10 | We are happy to announce the first release of Reprowd! Reprowd facilitates the use of crowdsourcing for Data Labeling and Active Learning. The system was recently demonstrated at HCOMP 2016 and covered by the Reproducible Science |
2016/09/06 | Welcome Pei Wang, Mohamad Dolatshah, Jinglin Peng, and Mathew Teoh to our lab! Thrilled to be able to work with such a group of talented students! |
2016/09/01 | I am teaching CMPT 884: Human-in-the-loop Data Management and CMPT 732: Big Data Programming for Fall Semester 2016. |
2016/07/15 | Want to know how the ActiveClean system works? Please check out our latest paper titled "ActiveClean: Interactive Data Cleaning For Statistical Modeling" in VLDB 2016. |
2016/06/30 | Our ActiveClean system won the Best Demonstration Award at the ACM SIGMOD 2016 conference. The SIGMOD attendees were excited to see that the system helps data scientists train a more reliable machine-learning model in much less time. |
2016/06/26 | We gave a tutorial entitled "Data Cleaning: Overview and Emerging Challenges" at the ACM SIGMOD 2016 conference. The slides can be downloaded from here |
2016/05/11 | Want to extract new insights from graph data? Please check out our paper titled "Finding Gangs in War from Signed Networks" in KDD 2016! |
2016/04/15 | "Analysis of Vancouver's Housing price market", a student project from my "Big Data Programming" course, was featured in the Globe and Mail |
2016/04/07 | Received an NSERC Discovery Grant for my proposal: "Crowdsourced Data Cleaning" (PI) |
2016/04/06 | One research paper, one demo paper, and one tutorial were accepted by SIGMOD 2016! |
2016/02/16 | Received an NSERC RTI Grant for our proposal: "Computational Infrastructure for Online Big Data Analytics" (Co-PI) |
2016/02/16 | Want to know how crowdsourcing can help with data management? Please check out our latest survey on crowdsourced data management. |
2016/01/26 | I am teaching CMPT 733: Big Data Programming for Spring Semester 2016. |
2016/01/19 | I joined the School of Computing Science at Simon Fraser University as an Assistant Professor. |
2016/01/15 | Completed my postdoc journey at UC Berkeley! I cannot believe how much I learned during this period. Thanks to all the AMPLab folks! |
2015/12/04 | I gave a talk entitled My Research Journey on 'Crowdsourced Data Cleaning' at the UW database group meeting. |
2015/11/13 | I was invited to write a trip report of VLDB 2015 by the Communications of the CCF (China Computer Federation). |
2015/11/05 | An overview paper of our SampleClean project was published in the latest issue of the IEEE Data Engineering Bulletin. |
2015/10/27 | A new paper entitled "CLAMShell: Speeding up Crowds for Low-latency Data Labeling" got accepted by VLDB 2016! If you are complaining that "the crowd is so slow", you will find a solution from our paper. |
2015/06/09 | One research paper and one demo paper from our SampleClean project got accepted by VLDB 2015! |
2015/05/16 | We are happy to announce the release of SampleClean 0.1! |
2015/03/06 | I wrote an AMPLab blog post: When Data Cleaning Meets Crowdsourcing |
2015/03/05 | Our paper entitled "QASCA: A Quality-Aware Task Assignment System for Crowdsourcing Applications" got accepted by SIGMOD 2015. |
2014/11/20 | Our SampleClean system was demonstrated at AMPCamp5 [slides] [video]. The vision of SampleClean is to bring data cleaning and crowdsourcing into the BDAS stack. |
2014/11/14 | I gave a talk to introduce the SampleClean system at UCI. |
2014/10/09 | I visited the Database Groups at Brown and MIT, and gave talks on our SampleClean project. |
2014/08/29 | Since 2011, the research topic of crowdsourced query processing has been gaining increasing attention in the database community. To help people better understand the research progress of this topic, I created a spreadsheet for maintaining the recent papers published on this topic. If you want to be a contributor to the list or if you find some interesting papers missing in the list, please feel free to drop me an email. |
2014/04/16 | A new paper entitled "A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data" got accepted by SIGMOD 2014. The paper presented SampleClean, a novel framework that marries data cleaning with sampling-based approximate query processing. This framework enables us to achieve accurate query results on dirty data, at significantly reduced cleaning cost. Please visit sampleclean.org for more details. |
2014/04/16 | A new paper entitled "Towards Dependable Data Repairing with Fixing Rules" was accepted by SIGMOD 2014. We proposed Fixing Rules, a new class of cleaning rules designed for automated and dependable data repairing. The paper shows that we can perform more reliable data repairing using fixing rules than other automated repairing approaches. |
2014/01/18 | I received the China Computer Federation (CCF) Distinguished Dissertation Award for my PhD work in crowdsourcing entity resolution. [News] |
2013/10/08 | A new paper entitled "Extending String Similarity Join to Tolerant Fuzzy Token Matching" was accepted by the ACM Transactions on Database Systems (TODS). |
2013/08/01 | Started a new journey at UC Berkeley! |
2013/06/22 | Yu Jiang, Jian He, Dong Deng and I (advised by Prof. Guoliang Li and Jianhua Feng) participated in the SIGMOD 2013 Programming Contest. We were selected as one of five finalists, and presented our methods at SIGMOD 2013. |
2013/06/03 | Defended my PhD dissertation! :) |
2013/03/22 | Yu Jiang, Dong Deng and I (advised by Prof. Guoliang Li and Jianhua Feng) participated in the String Similarity Search/Join Competition in EDBT 2013. The competition consisted of four tasks; we won 1st place in three of them and 2nd place in the other. In particular, our programs ran 10-100x faster than the second best team in the two similarity-join tasks. [Results] [Paper] [News] |
2013/02/10 | Our joint paper with Brown University and UC Berkeley AMPLab was accepted by SIGMOD 2013. The paper deeply investigated the effect of transitive relations on crowdsourced joins, and presented a hybrid labeling framework that achieved a 95%+ reduction in cost and time over the state-of-the-art approach. [Paper] |
2013/02/08 | Completed a three-month internship at Qatar Computing Research Institute (QCRI). It was a lot of fun! |