Dataset

Definitions

Dataset

A set of related datafiles that exist in the same location, along with dataset metadata and helper methods.

Locality

A dataset has one of these localities:

  • Cloud-based: it exists only in the cloud

  • Local: it exists only on your local filesystem

Tip

Use a dataset if you want to:

  • Group together a set of files that naturally relate to each other, e.g. a timeseries that’s been split into multiple files

  • Add metadata for future sorting and filtering

  • Include it in a manifest with other datasets and send them to an Octue service for processing (see the sketch below)
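
For example, here’s a minimal sketch of the first and third points. The paths and names are illustrative, and it assumes that Dataset accepts labels and tags at instantiation and that Manifest accepts a mapping of names to datasets:

from octue.resources import Dataset, Manifest

# Group a timeseries that's been split into multiple files into one dataset,
# attaching metadata for later sorting and filtering (assumed constructor kwargs).
timeseries = Dataset(
    path="path/to/timeseries",
    labels={"timeseries"},
    tags={"organisation": "octue"},
)

# Include the dataset in a manifest to send to an Octue service for processing
# (assumed Manifest signature).
manifest = Manifest(datasets={"timeseries": timeseries})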

Key features

Work with local and cloud datasets

Working with a dataset is the same whether it’s local or cloud-based.

from octue.resources import Dataset

# A local dataset.
dataset = Dataset(path="path/to/dataset")

# A cloud dataset.
dataset = Dataset(path="gs://my-bucket/path/to/dataset")

Warning

By default, a dataset includes the files in all subdirectories of its path. Set recursive=False to include only the top-level files.
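
For example, to include only the files at the top level of the directory (an illustrative path, assuming recursive is set at instantiation):

dataset = Dataset(path="path/to/dataset", recursive=False)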

Upload a dataset

Back up and share your datasets for collaboration.

dataset.upload("gs://my-bucket/path/to/upload")

Download a dataset

Use a shared or public dataset or retrieve a backup.

dataset.download("path/to/download")

Easy and expandable custom metadata

Find the needle in the haystack by making your data searchable. You can set the following metadata on a dataset:

  • Name

  • Labels (a set of lowercase strings)

  • Tags (a dictionary of key-value pairs)

This metadata is stored locally in a .octue file in the same directory as the dataset and is used during Dataset instantiation. It can be accessed like this:

dataset.name
>>> "my-dataset"

dataset.labels
>>> {"processed"}

dataset.tags
>>> {"organisation": "octue", "energy": "renewable"}

You can update the metadata by setting it on the instance while inside the Dataset context manager.

with dataset:
    dataset.labels.add("updated")

You can do this outside the context manager too, but you then need to call the update_metadata method:

dataset.labels.add("updated")
dataset.update_metadata()

Get dataset and metadata hashes

Make your analysis reproducible: guarantee a dataset contains exactly the same data as before by checking its hash.

dataset.hash_value
>>> 'uvG7TA=='

Note

A dataset’s hash is a function of its datafiles’ hashes. Datafile and dataset metadata do not affect it.

You can also check that dataset metadata is the same.

dataset.metadata_hash_value
>>> 'DIgCHg=='
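
For example, here’s a sketch of a simple reproducibility guard; the expected hash values are hypothetical ones recorded from an earlier run:

# Hypothetical values recorded from a previous run.
EXPECTED_HASH = "uvG7TA=="
EXPECTED_METADATA_HASH = "DIgCHg=="

# Fail fast if the data or its metadata have changed since that run.
if dataset.hash_value != EXPECTED_HASH:
    raise ValueError("Dataset contents have changed since the recorded run.")

if dataset.metadata_hash_value != EXPECTED_METADATA_HASH:
    raise ValueError("Dataset metadata has changed since the recorded run.")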

Immutable ID

Each dataset has an immutable UUID:

dataset.id
>>> '9a1f9b26-6a48-4f2d-be80-468d3270d79c'

Check a dataset’s locality

Is this dataset local or in the cloud?

dataset.exists_locally
>>> True

dataset.exists_in_cloud
>>> False

A dataset can only return True for one of these at a time.
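
For example, a cloud-based dataset gives the opposite answers (a sketch reusing the cloud path from earlier):

cloud_dataset = Dataset(path="gs://my-bucket/path/to/dataset")

cloud_dataset.exists_in_cloud
>>> True

cloud_dataset.exists_locally
>>> False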

Filter datasets

Narrow a dataset down to just the files you want, avoiding unnecessary downloading and processing.

Datafiles in a dataset are stored in a FilterSet, meaning they can be easily filtered by any attribute of the contained datafiles, e.g. name, extension, ID, timestamp, tags, labels, or size. The filtering syntax is similar to Django’s, i.e.

# Get datafiles that have an attribute that satisfies the filter.
dataset.files.filter(<datafile_attribute>__<filter>=<value>)

# Or, if your filter is a simple equality filter:
dataset.files.filter(<datafile_attribute>=<value>)

Here’s an example:

from octue.resources import Datafile, Dataset

# Make a dataset.
dataset = Dataset(
    path="blah",
    files=[
        Datafile(path="my_file.csv", labels=["one", "a", "b", "all"]),
        Datafile(path="your_file.txt", labels=["two", "a", "b", "all"]),
        Datafile(path="another_file.csv", labels=["three", "all"]),
    ],
)

# Filter it!
dataset.files.filter(name__starts_with="my")
>>> <FilterSet({<Datafile('my_file.csv')>})>

dataset.files.filter(extension="csv")
>>> <FilterSet({<Datafile('my_file.csv')>, <Datafile('another_file.csv')>})>

dataset.files.filter(labels__contains="a")
>>> <FilterSet({<Datafile('my_file.csv')>, <Datafile('your_file.txt')>})>

You can iterate through the filtered files:

for datafile in dataset.files.filter(labels__contains="a"):
    print(datafile.name)
>>> 'my_file.csv'
    'your_file.txt'

If there’s just one result, get it via the FilterSet.one method:

dataset.files.filter(name__starts_with="my").one()
>>> <Datafile('my_file.csv')>

You can also chain filters or specify them all at the same time; these two examples produce the same result:

# Chaining multiple filters.
dataset.files.filter(extension="csv").filter(labels__contains="a")
>>> <FilterSet({<Datafile('my_file.csv')>})>

# Specifying multiple filters at once.
dataset.files.filter(extension="csv", labels__contains="a")
>>> <FilterSet({<Datafile('my_file.csv')>})>

The full list of available filters is given in the reference documentation.

Order datasets

A dataset can also be ordered by any of the attributes of its datafiles:

dataset.files.order_by("name")
>>> <FilterList([<Datafile('another_file.csv')>, <Datafile('my_file.csv')>, <Datafile('your_file.txt')>])>

The ordering can also be carried out in reverse (i.e. in descending order) by passing reverse=True to the FilterSet.order_by method.
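
For example, reusing the dataset from above (the output shown is the expected reversed ordering):

dataset.files.order_by("name", reverse=True)
>>> <FilterList([<Datafile('your_file.txt')>, <Datafile('my_file.csv')>, <Datafile('another_file.csv')>])>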