# plm4ndv

# Introduction

This repository contains the implementation of **PLM4NDV**. We provide the details described in our paper, including but not limited to the train/validation/test dataset splits, the preprocessed data, the semantic embeddings, and the model training. You can obtain the results presented in our paper by following the instructions below.

> The paper is publicly available at [arXiv](https://arxiv.org/pdf/2504.00608).
>
> If you find our work useful, please cite the paper:

```
@article{xu2025plm4ndv,
  title={PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models},
  author={Xu, Xianghong and He, Xiao and Zhang, Tieying and Zhang, Lei and Shi, Rui and Chen, Jianjun},
  journal={Proceedings of the ACM on Management of Data},
  volume={3},
  number={3},
  pages={1--28},
  year={2025},
  publisher={ACM New York, NY, USA}
}
```

# Instructions

1. Set up the experimental environment in Python 3.10.

   ```bash
   pip3 install -r requirement.txt
   ```

2. Download the [TabLib](https://huggingface.co/datasets/approximatelabs/tablib-v1-sample) dataset and put the parquet files in a folder.

3. Read the parquet files, extract the primary component of each table from each file, and save the extracted content to `./data/extracted/`. The default data access method is sequential access; if you want to use random sampling instead, comment out Line 56 and use Line 57. (An illustrative sketch of the two access methods is given after these instructions.)

   ```bash
   python extract_parquet.py
   ```

4. Traverse the extracted content, filter out useless columns, and save the filtered content to `./data/traversed/`.

   ```bash
   python traverse_columns.py
   ```

5. Split the traversed content into train/test/validation sets, deduplicate the contents, and save them to `./data/splitted/`.

   ```bash
   python split_traversed.py
   ```

6. Download [sentence-t5-large](https://huggingface.co/sentence-transformers/sentence-t5-large) and set the model path. Generate the embedding of each column using the PLM and save the embeddings to `./data/embedding/`. (A sketch of this step is also given after these instructions.)

   ```bash
   python semantic_embedding.py
   ```

7. Train the model; the model parameters will be saved to `./ckpt/`. The inference code is in the same file, and running it reproduces the NDV estimation performance under sequential access reported in the paper.

   ```bash
   python train_and_test.py
   ```

> If you want to reproduce the performance of our method under random sampling of 100 rows, follow the instructions in Step 3.
>
> If you do not want to train the model from scratch, you can load our model parameters and obtain the results on the test set by commenting out Line 300 in `train_and_test.py`.
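For clarity, here is a minimal, hypothetical sketch of the two data access methods mentioned in Step 3 (sequential access vs. random sampling). It assumes a table has already been loaded as a pandas DataFrame; `access_rows` and the toy table are made up for illustration, and the actual extraction logic for TabLib parquet files lives in `extract_parquet.py` (Lines 56-57).

```python
import pandas as pd

def access_rows(table: pd.DataFrame, n: int, random_sampling: bool = False) -> pd.DataFrame:
    """Return n rows from a table via sequential access or random sampling.

    Illustrative sketch only; this is a hypothetical helper, not a
    function from this repository.
    """
    if random_sampling:
        # Random sampling: draw n rows uniformly without replacement.
        return table.sample(n=min(n, len(table)), random_state=0)
    # Sequential access (the repository default): take the first n rows.
    return table.head(n)

# Toy table standing in for one extracted TabLib table.
table = pd.DataFrame({"user_id": range(1000), "country": ["US", "CN"] * 500})
head_rows = access_rows(table, 100)                           # sequential access
sampled_rows = access_rows(table, 100, random_sampling=True)  # random sampling of 100 rows
```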
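Likewise, here is a minimal sketch of the semantic embedding step (Step 6), assuming the `sentence-transformers` package and a local copy of sentence-t5-large. The column texts below are placeholders; the real inputs are produced by the preprocessing steps above and handled in `semantic_embedding.py`.

```python
from sentence_transformers import SentenceTransformer

# Point this at your downloaded copy of sentence-t5-large (see Step 6).
MODEL_PATH = "sentence-transformers/sentence-t5-large"

model = SentenceTransformer(MODEL_PATH)

# Hypothetical column descriptions; the actual input format is defined
# by the preprocessing scripts in this repository.
column_texts = [
    "user_id: 1, 2, 3",
    "country: US, CN, DE",
]
embeddings = model.encode(column_texts)  # numpy array, one embedding per column
print(embeddings.shape)
```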