Create Dataset
Create New Dataset¤
The demo dataset
We will use the data files in this repository as a demo. Please download the zip of this repo.
Create¤
After downloading and unzipping the demo dataset, cd
into the folder. In my case, it is
cd ~/Downloads/dataherb-python-demo-dataset-main
Run the command
dataherb create
and a few questions will pop out.
Your current working directory is /Users/itsme/Downloads/dataherb-python-demo-dataset-main
A dataherb.json file will be created right here.
Are you sure this is the correct path? [y/N]: y
This makes sure that we are working in the correct folder. In this case, we type y
to confirm.
Which Type of Remote¤
[?] Where is/will be the dataset synced to?: git
> git
s3
Dataherb supports two different types of sources, S3 and git. In this case, we choose git
as we would like to sync the dataset to a GitHub repo.
Name of the Dataset¤
[?] How would you like to name the dataset?: Dataherb Demo Dataset
This will be the name of your dataset.
ID of the Dataset¤
[?] Please specify a unique id for the dataset: git-dataherb-python-demo-dataset
ID of the dataset has to be unique in your whole flora.
Description¤
[?] What is the dataset about? This will be the description of the dataset.: This is a demo dataset to test and show dataherb.
Describe the dataset here.
URI¤
[?] What is the dataset's URI? This will be the URI of the dataset.: https://github.com/DataHerb/dataherb-python-demo-dataset-created.git
Result¤
Two things happened after this.
- A file
dataherb.json
will be created in the current folder. - The metadata for this dataset has been added to the Flora.
Content of the dataherb.json
file
The content of the file should be the following.
{
"source": "git",
"name": "Dataherb Demo Dataset",
"id": "git-dataherb-python-demo-dataset",
"description": "This is a demo dataset to test and show dataherb.",
"uri": "https://github.com/DataHerb/dataherb-python-demo-dataset-created.git",
"metadata_uri": "https://raw.githubusercontent.com/DataHerb/dataherb-python-demo-dataset-created/main/dataherb.json",
"datapackage": {
"profile": "tabular-data-package",
"resources": [
{
"path": "dataset/indeed_job_listing.csv",
"profile": "tabular-data-resource",
"name": "indeed_job_listing",
"format": "csv",
"mediatype": "text/csv",
"encoding": "windows-1252",
"schema": {
"fields": [
{
"name": "title",
"type": "string",
"format": "default"
},
{
"name": "location",
"type": "string",
"format": "default"
},
{
"name": "company",
"type": "string",
"format": "default"
},
{
"name": "description",
"type": "string",
"format": "default"
},
{
"name": "salary",
"type": "string",
"format": "default"
},
{
"name": "url",
"type": "string",
"format": "default"
},
{
"name": "published_at",
"type": "string",
"format": "default"
},
{
"name": "id",
"type": "string",
"format": "default"
}
],
"missingValues": [
""
]
}
},
{
"path": "dataset/stackoverflow_job_listing.csv",
"profile": "tabular-data-resource",
"name": "stackoverflow_job_listing",
"format": "csv",
"mediatype": "text/csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "link",
"type": "string",
"format": "default"
},
{
"name": "category",
"type": "string",
"format": "default"
},
{
"name": "title",
"type": "string",
"format": "default"
},
{
"name": "description",
"type": "string",
"format": "default"
},
{
"name": "published_at",
"type": "string",
"format": "default"
},
{
"name": "location",
"type": "string",
"format": "default"
},
{
"name": "stackoverflow_id",
"type": "integer",
"format": "default"
},
{
"name": "author",
"type": "string",
"format": "default"
},
{
"name": "location_country",
"type": "string",
"format": "default"
},
{
"name": "location_city",
"type": "string",
"format": "default"
},
{
"name": "updated_at",
"type": "string",
"format": "default"
}
],
"missingValues": [
""
]
}
}
]
}
}
Sync to Remote¤
WIP
Sync local dataset to remote is still a WIP.
After creating the dataset, run the command dataherb upload
.
Using Experimental Feature
Use dataherb upload --experimental True
to use experimental feature. For example, git repo will automatically add, commit ,and push.