Datafile

Definitions

Datafile

A single local or cloud file, its metadata, and helper methods.

Locality

A datafile has one of these localities:

  • Cloud-based: it exists only in the cloud

  • Local: it exists only on your local filesystem

  • Cloud-based and local: it’s cloud-based but has been downloaded for low-latency reading/writing

Tip

Use a datafile to work with a file if you want to:

  • Read/write to local and cloud files in the same way

  • Include it in a dataset that can be sent to an Octue service for processing

  • Add metadata to it for future sorting and filtering

Key features

Work with local and cloud data

Working with a datafile is the same whether it’s local or cloud-based. It’s also almost identical to using python built-in open function. For example, to write to a datafile:

from octue.resources import Datafile

datafile = Datafile("path/to/file.dat")

# Or:

datafile = Datafile("gs://my-bucket/path/to/file.dat")

with datafile.open("w") as f:
    f.write("Some data")
    datafile.labels.add("processed")

All the same file modes you’d use with the python built-in open context manager are available for datafiles e.g. "r" and "a".

Automatic lazy downloading

Save time and bandwidth by only downloading when necessary.

Downloading data from cloud datafiles is automatic and lazy so you get both low-latency content read/write and quick metadata reads. This makes viewing and filtering by the metadata of cloud datasets and datafiles quick and avoids unnecessary data transfer, energy usage, and costs.

Datafile content isn’t downloaded until you:

Read more about downloading files here.

CLI command friendly

Use any command line tool on your datafiles. Datafiles are python objects, but they represent real files that can be fed to any CLI command you like

import subprocess
output = subprocess.check_output(["openfast", datafile.local_path])

Easy and expandable custom metadata

Find the needle in the haystack by making your data searchable. You can set the following metadata on a datafile:

  • Timestamp

  • Labels (a set of lowercase strings)

  • Tags (a dictionary of key-value pairs)

This metadata is stored locally in a .octue file for local datafiles or on the cloud objects for cloud datafiles and is used during Datafile instantiation. It can be accessed like this:

datafile.timestamp
>>> datetime.datetime(2022, 5, 4, 17, 57, 57, 136739)

datafile.labels
>>> {"processed"}

datafile.tags
>>> {"organisation": "octue", "energy": "renewable"}

You can update the metadata by setting it on the instance while inside the Datafile.open context manager.

with datafile.open("a"):
    datafile.labels.add("updated")

You can do this outside the context manager too, but you then need to call the update method:

datafile.labels.add("updated")
datafile.update_metadata()

Upload an existing local datafile

Back up and share your datafiles for collaboration. You can upload an existing local datafile to the cloud without using the Datafile.open context manager if you don’t need to modify its contents:

datafile.upload("gs://my-bucket/my_datafile.dat", update_metadata=True)

Get file and metadata hashes

Make your analysis reproducible: guarantee a datafile contains exactly the same data as before by checking its hash.

datafile.hash_value
>>> 'mnG7TA=='

You can also check that any metadata is the same.

datafile.metadata_hash_value
>>> 'DIgCHg=='

Immutable ID

Each datafile has an immutable UUID:

datafile.id
>>> '9a1f9b26-6a48-4f2d-be80-468d3270d79b'

Check a datafile’s locality

Is this datafile local or in the cloud?

datafile.exists_locally
>>> True

datafile.exists_in_cloud
>>> False

A cloud datafile that has been downloaded will return True for both of these properties.

Represent HDF5 files

Support fast I/O processing and storage.

Warning

If you want to represent HDF5 files with a Datafile, you must include the extra requirements provided by the hdf5 key at installation i.e.

pip install octue[hdf5]

or

poetry add octue -E hdf5

Usage examples

The Datafile class can be used functionally or as a context manager. When used as a context manager, it is analogous with the python built-in open function. On exiting the context (the with block), it closes the datafile locally and, if the datafile also exists in the cloud, updates the cloud object with any data or metadata changes.

_images/datafile_use_cases.png

Example A

Scenario: Download a cloud object, calculate Octue metadata from its contents, and add the new metadata to the cloud object

Starting point: Object in cloud with or without Octue metadata

Goal: Object in cloud with updated metadata

from octue.resources import Datafile


datafile = Datafile("gs://my-bucket/path/to/data.csv")

with datafile.open() as f:
    data = f.read()
    new_metadata = metadata_calculating_function(data)

    datafile.timestamp = new_metadata["timestamp"]
    datafile.tags = new_metadata["tags"]
    datafile.labels = new_metadata["labels"]

Example B

Scenario: Add or update Octue metadata on an existing cloud object without downloading its content

Starting point: A cloud object with or without Octue metadata

Goal: Object in cloud with updated metadata

from datetime import datetime
from octue.resources import Datafile


datafile = Datafile("gs://my-bucket/path/to/data.csv")

datafile.timestamp = datetime.now()
datafile.tags = {"manufacturer": "Vestas", "output": "1MW"}
datafile.labels = {"new"}

datafile.upload(update_metadata=True)  # Or, datafile.update_metadata()

Example C

Scenario: Read in the data and Octue metadata of an existing cloud object without intent to update it in the cloud

Starting point: A cloud object with Octue metadata

Goal: Cloud object data (contents) and metadata held locally in local variables

from octue.resources import Datafile


datafile = Datafile("gs://my-bucket/path/to/data.csv")

with datafile.open() as f:
    data = f.read()

metadata = datafile.metadata()

Example D

Scenario: Create a new cloud object from local data, adding Octue metadata

Starting point: A file-like locally (or content data in local variable) with Octue metadata stored in local variables

Goal: A new object in the cloud with data and Octue metadata

For creating new data in a new local file:

from octue.resources import Datafile


datafile = Datafile(
    "path/to/local/file.dat",
    tags={"cleaned": True, "type": "linear"},
    labels={"Vestas"}
)

with datafile.open("w") as f:
    f.write("This is some cleaned data.")

datafile.upload("gs://my-bucket/path/to/data.dat")

For existing data in an existing local file:

from octue.resources import Datafile


tags = {"cleaned": True, "type": "linear"}
labels = {"Vestas"}

datafile = Datafile(path="path/to/local/file.dat", tags=tags, labels=labels)
datafile.upload("gs://my-bucket/path/to/data.dat")