Upload dataset to Hugging Face
Published:
Upload dataset to Hugging Face
Register on Hugging Face
Create a account on Hugging Face and login to your account. https://huggingface.co/
Requirements
# create a virtual environment using conda or venv
conda create -n upload_data python=3.9
conda activate upload_data
# install required packages
conda install datasets transformers
Get the token
Go to your profile and apply for a token in the Access Tokens
section.
Choose the Write
option and copy the token. see the different options here
Login to Hugging Face in your terminal
huggingface-cli login
# paste the token
Upload the dataset
Here is an example of uploading a dataset to Hugging Face.
In this example, I will upload the cat dot image dataset. The dataset contains 25000 images of cats and dogs. And it also has a csv file that contains the labels of the images.
labes.txt:
cat.0.jpg;0
cat.1.jpg;0
cat.10.jpg;0
cat.100.jpg;0
cat.1000.jpg;0
cat.10000.jpg;0
cat.10001.jpg;0
cat.10002.jpg;0
...
The first column is the image name and the second column is the label. 0 for cat and 1 for dog.
import os
import pandas as pd
from datasets import Dataset, DatasetDict
from PIL import Image
# set your dataset path
img_dir = 'dataset/images'
label_file = 'dataset/labels.txt'
# read the labels
labels_df = pd.read_csv(label_file, sep=';', header=None, names=['image', 'label'])
# read the images
def load_image(sample):
image_path = os.path.join(img_dir, sample['image'])
with Image.open(image_path).convert('RGB') as img:
img_byte_arr = io.BytesIO()
img.save(img_byte_arr, format='JPEG') # convert the image to byte array
img_byte_arr = img_byte_arr.getvalue() # get the byte array
sample['image'] = img_byte_arr
return sample
# create Hugging face dataset
dataset = Dataset.from_pandas(labels_df)
dataset = dataset.map(load_image, num_proc=4)
# split the dataset
train_test_split = dataset.train_test_split(test_size=0.2)
dataset_dict = DatasetDict({
'train': train_test_split['train'],
'test': train_test_split['test']
})
# push the dataset to the hugging face
# Hugging Face will create a new dataset with the name `username/dataset_name`
dataset_dict.push_to_hub("username/dataset_name")
Run the above code in your terminal.
If it is successful, you will see the dataset in your Hugging Face.
Use the dataset
You can use the dataset in your code like this:
from datasets import load_dataset
# load the dataset
# The dataset will be downloaded from the Hugging Face
dataset = load_dataset("username/dataset_name", split="train")
# get the first sample
byte = dataset['train']['image'][0]
image = Image.open(io.BytesIO(byte))
plt.imshow(image)