Research Experience as an Undergraduate

Statement of Purpose for PhD applications

Can we design future-focused systems and algorithms to handle new challenges in the Big Data era? Guided by this overarching question, my research interests lie in database systems and algorithms for large-scale data management. Throughout my undergraduate studies, I have researched query-processing optimization in large data systems and developed new methods for analytical data management. I want to reveal more of the insights hidden in big data to aid decision-making, and with this broad goal in mind, I decided to pursue a Ph.D. degree.

Research Experience. Academic Computer Science (CS) courses equipped me with rich theoretical and practical knowledge. Beyond classes, I also sought research opportunities on campus. I joined the database group led by Professor Bo Tang at SUSTech and performed research on systems and algorithms for large-scale data. My first project targeted GPU acceleration for Online Analytical Processing (OLAP). Our team built GHive, a system designed to exploit heterogeneous CPU-GPU computing platforms. As a complete novice in research, I spent one month preparing myself by exploring the literature on GPU databases and learning CUDA C++ programming from scratch. By then, our system had been tested on the Star Schema Benchmark (SSB), and I was assigned to test it on TPC-DS, which required extending the supported data types and GPU-based operators. For a column with a primitive data type (e.g., Long or Double), our data model gColumn stores the values in a data array and uses a bitmap as an auxiliary (aux) array to mark null rows. However, this layout cannot accommodate variable-length data types such as String, so I redesigned the aux array to store each string's starting and ending positions, setting both positions to -1 when the corresponding string is null.
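The layout can be sketched as follows. This is a minimal illustrative model of the idea, not GHive's actual CUDA implementation; the class and field names here are hypothetical.

```python
# Sketch of the variable-length string layout described above: one flat
# character buffer ("data array") plus an aux array of (start, end)
# positions, with both positions set to -1 for null rows.
NULL_POS = -1

class StringColumn:
    def __init__(self, values):
        self.data = ""   # flat character buffer holding all strings
        self.aux = []    # one (start, end) pair per row
        for v in values:
            if v is None:
                self.aux.append((NULL_POS, NULL_POS))
            else:
                start = len(self.data)
                self.data += v
                self.aux.append((start, start + len(v)))

    def get(self, row):
        """Reconstruct the string at `row`, or None for a null row."""
        start, end = self.aux[row]
        if start == NULL_POS:
            return None
        return self.data[start:end]

col = StringColumn(["ship", None, "rail"])
```

Compared with the bitmap used for fixed-width types, the (start, end) pairs double as both the null indicator and the offsets needed to slice each value out of the shared buffer.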

I also implemented a GPU-based PTF operator using the template library Thrust. Our system achieved over 2x speedup over Hive for computation-intensive queries and orders-of-magnitude speedups for individual operators. This work led to a conference paper at SoCC '22 and a demo at SIGMOD '22. Integrating modern hardware into database systems introduced me to systems research, and through it I acquired core research skills: searching the literature, implementing techniques, analyzing results, and writing papers.

I find database systems an important and exciting field of study, and I would like to delve deeper into it. This summer, I interned at the University of California, Irvine. Supervised by Prof. Michael J. Carey, I ran at-scale experiments to evaluate the performance characteristics of Python User-Defined Functions (UDFs) in AsterixDB. To run the experiments, I developed a UDF using scikit-learn, prepared a distributed data generator and a query generator, and deployed the system on clusters of varying sizes. Not only did I learn how to configure and run database systems in a distributed setting, but I also came to appreciate the importance of benchmarking Big Data Management Systems (BDMSs): thorough performance studies give downstream users practical and reliable guidance.

Beyond systems research, I also have a strong foundation in data mining. I participated in the SIGMOD Programming Contest in March 2022, which required solving entity blocking on million-record datasets. The most intuitive solution that came to mind was to use regular-expression rules to extract features from each description and then generate pairs sharing the same features. I quickly implemented this idea in Python, and it worked well on the released thousand-record datasets. However, when I submitted my solution, its accuracy degraded dramatically. Instead of being frustrated, I carefully analyzed the problem and identified the principal barrier: scale. My hand-crafted rules failed to cover the features in the unreleased part of the dataset. After extensive reading and discussions with lab mates, I overcame the barrier by integrating a BERT model into our pipeline and using a Faiss index to accelerate matching. The solution achieved the highest recall on one test dataset and ranked 4th out of 55 teams in the competition. This project honed my skills in developing an algorithm end to end: designing, implementing, testing, evaluating, and refining. It also taught me the importance of algorithmic scalability, especially on enormous datasets.
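The blocking idea can be sketched as follows. In the actual solution the vectors came from a BERT model and the nearest-neighbor search used a Faiss index; here, plain vectors and brute-force cosine similarity stand in for both, and the function name is illustrative.

```python
# Embedding-based blocking sketch: embed each record, then emit candidate
# pairs (i, j) where j is among i's k nearest neighbors by cosine
# similarity, instead of requiring exact rule-extracted features to match.
import numpy as np

def top_k_candidate_pairs(embeddings, k=2):
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude trivial self-matches
    pairs = set()
    for i, row in enumerate(sims):
        for j in np.argsort(row)[-k:]:  # k most similar records to i
            pairs.add((min(i, int(j)), max(i, int(j))))
    return pairs

# Three toy "record embeddings": the first two are near-duplicates.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pairs = top_k_candidate_pairs(embeddings, k=1)
```

The brute-force similarity matrix here is quadratic in the number of records, which is exactly why an approximate index such as Faiss is needed once the datasets reach millions of records.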

After receiving extensive research training, I took the next step by leading a research project myself. Aware that edge computing is a newly emerging paradigm for large-scale data management, I wanted to initiate a project on it. After extensive brainstorming with my supervisor, we decided to take outlier detection as a data management application and explore how to integrate edge computing into it. We redefined outlier detection in a unique edge-computing setting such that (1) each edge device collects data and reports outliers locally, and (2) outliers must be detected in the global scope. We proposed a general fingerprint-based mechanism that exploits locality-sensitive hashing (LSH) to generate fingerprints for each device. Once we had settled on the big picture, I quickly surveyed existing outlier detection algorithms and implemented our framework in Java. While reproducing the algorithms, I found that the state-of-the-art solution NETS adopts a grid index, which can also generate fingerprints, so we integrated it as an alternative fingerprint-generation method. Both approaches map nearby high-dimensional points to the same bin and quickly filter out irrelevant devices during the data-sharing stage. Preliminary results have demonstrated the effectiveness and efficiency of our system, and we plan to submit this work at the beginning of next year. This was the first time I took part in a research problem's whole lifecycle, from definition to solution and every step in between, looking for optimization opportunities throughout.

Teaching Experience. Research projects are only part of my life at SUSTech. I will never forget the effort and struggle I went through as a freshman on coursework and research, which consolidated my resolve to reach out to those in need. I have been a teaching assistant for five semesters across four courses, always striving to foster an inclusive environment so that no student felt stigmatized for struggling, and learning to value and respect each individual's ingenuity. Students' feedback was so positive that I was honored as an outstanding teaching assistant and invited to give a presentation to the CS department at SUSTech.

Career Plan. Graduate study is just the first step of my plan. Following it, I intend to pursue an academic career and continue my work in the data science field. I would love to keep teaching and researching as a way of life, designing future-oriented systems and algorithms that change people's lives.

Xinying Zheng 郑鑫颖
PhD in Computer Science
