Making Sense of Data in the Wild: Data Analysis Automation at Scale

January 27, 2025

Mara Graziani, Malina Molnar, Irina Espejo Morales, Joris Cadow-Gossweiler, Teodoro Laino

Computer Science

Information Retrieval

Machine Learning

As the volume of publicly available data continues to grow, researchers face the challenge of limited diversity in benchmarking machine learning tasks. Although thousands of datasets are available in public repositories, the sheer abundance often complicates the search for suitable data, leaving many valuable datasets underexplored. The problem is compounded because, despite longstanding advocacy for improving data curation quality, current solutions remain prohibitively time-consuming and resource-intensive. In this paper, we propose a novel approach that combines intelligent agents with retrieval-augmented generation to automate data analysis, dataset curation and indexing at scale. Our system leverages multiple agents to analyze raw, unstructured data across public repositories, generating dataset reports and interactive visual indexes that can be easily explored. We demonstrate that our approach results in more detailed dataset descriptions, higher hit rates and greater diversity in dataset retrieval tasks. Additionally, we show that the dataset reports generated by our method can be leveraged by other machine learning models to improve performance on specific tasks, for example by increasing the accuracy and realism of synthetic data generation. By streamlining the process of transforming raw data into machine-learning-ready datasets, our approach enables researchers to better utilize existing data resources.

Benchmarking Data Science Agents

February 27, 2024

92% Match

Yuge Zhang, Qiyang Jiang, Xingyu Han, Nan Chen, ... , Kan Ren

Artificial Intelligence

Computation and Language

In the era of data-driven decision-making, the complexity of data analysis necessitates advanced expertise and tools of data science, presenting significant challenges even for specialists. Large Language Models (LLMs) have emerged as promising aids as data science agents, assisting humans in data analysis and processing. Yet their practical efficacy remains constrained by the varied demands of real-world applications and complicated analytical process. In this paper, we intr...

AutoDDG: Automated Dataset Description Generation using Large Language Models

February 3, 2025

91% Match

Haoxiang Zhang, Yurong Liu, Wei-Lun (Allen) Hung, ... , Juliana Freire

Databases

The proliferation of datasets across open data portals and enterprise data lakes presents an opportunity for deriving data-driven insights. However, widely-used dataset search systems rely on keyword searches over dataset metadata, including descriptions, to facilitate discovery. When these descriptions are incomplete, missing, or inconsistent with dataset contents, findability is severely hindered. In this paper, we address the problem of automatic dataset description genera...

Augmented Data Science: Towards Industrialization and Democratization of Data Science

September 12, 2019

91% Match

Huseyin Uzunalioglu, Jin Cao, Chitra Phadke, Gerald Lehmann, Ahmet Akyamac, Ran He, ... , Able Maria

Artificial Intelligence

Machine Learning

Conversion of raw data into insights and knowledge requires substantial amounts of effort from data scientists. Despite breathtaking advances in Machine Learning (ML) and Artificial Intelligence (AI), data scientists still spend the majority of their effort in understanding and then preparing the raw data for ML/AI. The effort is often manual and ad hoc, and requires some level of domain knowledge. The complexity of the effort increases dramatically when data diversity, both ...

DCA-Bench: A Benchmark for Dataset Curation Agents

June 11, 2024

91% Match

Benhao Huang, Yingzhuo Yu, Jin Huang, ... , Jiaqi Ma

Artificial Intelligence

The quality of datasets plays an increasingly crucial role in the research and development of modern artificial intelligence (AI). Despite the proliferation of open dataset platforms nowadays, data quality issues, such as insufficient documentation, inaccurate annotations, and ethical concerns, remain common in datasets widely used in AI. Furthermore, these issues are often subtle and difficult to be detected by rule-based scripts, requiring expensive manual identification an...

Data Analysis in the Era of Generative AI

September 27, 2024

91% Match

Jeevana Priya Inala, Chenglong Wang, Steven Drucker, Gonzalo Ramos, Victor Dibia, Nathalie Riche, Dave Brown, ... , Jianfeng Gao

Artificial Intelligence

Human-Computer Interaction

This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges. We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow by translating high-level user intentions into executable code, charts, and insights. We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streaml...

OpenDataLab: Empowering General Artificial Intelligence with Open Datasets

June 4, 2024

91% Match

Conghui He, Wei Li, Zhenjiang Jin, Chao Xu, ... , Dahua Lin

Digital Libraries

Artificial Intelligence

The advancement of artificial intelligence (AI) hinges on the quality and accessibility of data, yet the current fragmentation and variability of data sources hinder efficient data utilization. The dispersion of data sources and diversity of data formats often lead to inefficiencies in data retrieval and processing, significantly impeding the progress of AI research and applications. To address these challenges, this paper introduces OpenDataLab, a platform designed to bridge...

A Survey on Large Language Model-based Agents for Statistics and Data Science

December 18, 2024

91% Match

Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, ... , Jian Huang

Artificial Intelligence

Computation and Language

Machine Learning

Other Statistics

In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frame...

AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

October 3, 2024

91% Match

Patara Trirat, Wonyong Jeong, Sung Ju Hwang

Machine Learning

Artificial Intelligence

Computation and Language

Multiagent Systems

Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline, such as optimal model search and hyperparameter tuning. Existing AutoML systems often require technical expertise to set up complex tools, which is in general time-consuming and requires a large amount of human effort. Therefore, recent works have started exploiting large language models (LLM) to lessen such burden and increase the usability of AutoML frameworks via...

Metadata-based Data Exploration with Retrieval-Augmented Generation for Large Language Models

October 5, 2024

91% Match

Teruaki Hayashi, Hiroki Sakaji, ... , Randy Goebel

Information Retrieval

Developing the capacity to effectively search for requisite datasets is an urgent requirement to assist data users in identifying relevant datasets considering the very limited available metadata. For this challenge, the utilization of third-party data is emerging as a valuable source for improvement. Our research introduces a new architecture for data exploration which employs a form of Retrieval-Augmented Generation (RAG) to enhance metadata-based data discovery. The system...

Better Synthetic Data by Retrieving and Transforming Existing Datasets

April 22, 2024

91% Match

Saumya Gandhi, Ritu Gala, Vijay Viswanathan, ... , Graham Neubig

Computation and Language

Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introdu...
