DataHub¶
Data that can be queried to power apps can come from various sources and be transformed along the way. The DataHub is a set of user interfaces to import, preview, and transform all this data.
Data sources and their datasets¶
Data can either be:
- imported from files (CSV, Excel, XML, JSON, Parquet), either uploaded by users or retrieved from an S3 bucket or an FTP server (via Peakina)
- queried from connections (to databases or online services)
Today, these two are handled in two different sets of UI (data sources for files, connectors for connections). They will eventually be merged into a single DataHub view.
Files¶
Files are either uploaded to the Laputa server or, if they are too big, to S3. They are uploaded in chunks.
Data model¶
Configurations for files are stored in the etl_config file, as items of the DATA_SOURCES array.
Each entry referencing a file represents a dataset extracted from that file.
A single file can appear multiple times in this list (e.g. this is the case for Excel files with multiple sheets).
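For illustration, such an entry might look like the following sketch (field names are indicative, not the actual schema):

```python
# Illustrative only: a DATA_SOURCES entry for one sheet of an uploaded Excel file.
# The exact field names depend on the etl_config schema of the instance.
file_data_source = {
    "domain": "sales_q1",   # name of the resulting dataset
    "type": "excel",        # file format
    "file": "sales.xlsx",   # the uploaded file
    "sheet_name": "Q1",     # one entry per sheet, so the same file appears several times
}
```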
API Routes¶
The routes concerned by this topic are:
- /data/sources/upload-params (POST)
- /data/sources (POST)
- /data/sources/<string:filename> (GET/POST/DELETE)
File upload¶
Route to retrieve upload parameters¶
A first POST request is made to /data/sources/upload-params to retrieve the key upload parameters:
- The maximum size allowed for a file upload
- The size for each chunk
- The s3_enable value, which is true if the instance uploads data to S3
Inside laputa/api/routes.py, the '/data/sources/upload-params' endpoint is bound to the DataSourcesUploadParams resource. This class simply returns the elements listed above, taken from the global parameters, following this structure:
{
    maxFileSize,              // int (in bytes)
    chunkSize,                // int (in bytes)
    uploadToS3,               // boolean
    maxNumberOfFilesUploaded  // int | None
}
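As a rough sketch (not the actual Laputa implementation), a Flask-RESTful resource returning these parameters could look like this; the constant names and values are placeholders:

```python
from flask_restful import Resource

# Placeholder values: the real ones come from the instance's global parameters.
MAX_FILE_SIZE = 500 * 1024 * 1024       # bytes
CHUNK_SIZE = 5 * 1024 * 1024            # bytes
UPLOAD_TO_S3 = True
MAX_NUMBER_OF_FILES_UPLOADED = None

class DataSourcesUploadParams(Resource):
    def post(self):
        # Return the upload parameters in the structure described above.
        return {
            "maxFileSize": MAX_FILE_SIZE,
            "chunkSize": CHUNK_SIZE,
            "uploadToS3": UPLOAD_TO_S3,
            "maxNumberOfFilesUploaded": MAX_NUMBER_OF_FILES_UPLOADED,
        }
```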
Route to upload a file¶
When the upload process starts, the frontend hits the /data/sources route with a POST request. There are two cases: a single upload, when the file is small enough not to need splitting, or a big file that is split into multiple chunks (whether or not the S3 feature flag is activated).
Note
The splitting is done by the frontend, which sends the chunks one by one (this choice was made to avoid overloading the backend instance with a process that could slow down other clients).
So, inside laputa/api/routes.py, the '/data/sources' endpoint is bound to the DataSources resource class; the relevant method for this scope is post:
def post(self):
try:
operation_id, file_path_to_add_in_etl = handle_data_source_file_writing_request(request)
...
except:
...
It calls the handle_data_source_file_writing_request function, which writes either the whole file or a single chunk of the target file:
def handle_data_source_file_writing_request(request):
...
status = write_file_by_chunk_on_s3(source_file, request.form, data_source_filename, g.small_app.id)
...
handle_data_source_file_writing_request returns the operation id and the file path to add to the ETL, depending on the status returned by the appropriate writing function. The response payload has this structure:
{
    message,  // str
    filePath  // str
}
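To make the flow concrete, here is a minimal client-side sketch of the two steps (fetch the upload parameters, then POST the chunks). The base URL and the form field names (chunkIndex, filename) are assumptions; the real contract is defined between the frontend and the backend:

```python
import requests

BASE_URL = "https://example.com/my-app"  # hypothetical instance URL

# 1. Retrieve the upload parameters.
params = requests.post(f"{BASE_URL}/data/sources/upload-params").json()
chunk_size = params["chunkSize"]

# 2. Upload the file chunk by chunk (form field names are illustrative).
filename = "sales.xlsx"
with open(filename, "rb") as f:
    index = 0
    while chunk := f.read(chunk_size):
        requests.post(
            f"{BASE_URL}/data/sources",
            files={"file": (filename, chunk)},
            data={"chunkIndex": index, "filename": filename},
        )
        index += 1
```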
Connections¶
Connections are configured using connectors.
These connections are stored in the etl_config file, as items of the DATA_PROVIDERS array.
In the naming (routes, UI, models), we often call a connector what is actually a connection. A connector would be an element of the toucan-connectors repo, like "the PostgreSQL connector". A connection would be one configured instance of this connector in an app, like "the connection to my PostgreSQL DB hosted on Amazon".
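As an illustration, a DATA_PROVIDERS entry for a PostgreSQL connection could look like the following sketch (field names are indicative, not the actual schema):

```python
# Illustrative only: one configured connection ("data provider") in etl_config.
postgres_connection = {
    "name": "my_postgres_db",  # the name that datasets will reference
    "type": "postgres",        # which toucan-connectors connector it instantiates
    "host": "db.example.com",
    "port": 5432,
    "user": "readonly",
    # credentials are typically managed separately, e.g. via the /secrets routes
}
```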
API routes¶
Routes to manage these connections:
- /connectors/all (GET): the ConnectorsAll resource has a single GET method, which calls get_all_status and returns all existing connectors with their availability (i.e. whether all the libraries required to use the connector are installed).
- /connectors/schemas (GET): the ConnectorsSchemas resource has a single GET method, responsible for returning the JSON schemas (connector and query) of the available connectors.
- /connectors/<string:connector>/config (GET/PUT/DELETE) [deprecated]
- /connectors/<string:connector>/config/schema (GET) [deprecated]
- /connectors/<string:connector>/secrets (GET/PUT/DELETE): the associated resource class is a CRUD for the connector's secrets.
- /connectors/<string:connector>/secrets/schema (GET): the associated resource class calls get_connector_secrets_schema, which returns the secrets schema for the connector.
- /oauth/redirect (GET): the associated resource class handles the redirection step of the OAuth2 dance with its GET method.
- /connectors/new (GET): the associated resource class handles the OAuth2 dance. The GET method sends a request to a third-party API to initialize the OAuth2 dance. An example call would be GET /connectors/new?type=foo, where type is the type of connector. Semantically, this call should be a PUT, but since it is opened in a popup by the frontend, it can only be a GET.
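The pattern behind "the resource class associated to this endpoint" is Flask-RESTful's resource registration. A minimal, illustrative sketch (not the actual laputa/api/routes.py code):

```python
from flask import Flask
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

class ConnectorsAll(Resource):
    def get(self):
        # The real resource calls get_all_status(); this static payload is illustrative.
        return [{"type": "postgres", "available": True}]

# Bind the resource class to its endpoint, as laputa/api/routes.py does.
api.add_resource(ConnectorsAll, "/connectors/all")
```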
Datasets scoped to the app¶
Once a connection is configured, one can create multiple datasets from it (called queries in the UI).
These are stored in the etl_config, as items of the DATA_SOURCES array. They can be distinguished from datasets created from files by their name attribute, which references the name of the connection (found in DATA_PROVIDERS).
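For illustration, such an entry might look like the following sketch (field names are indicative); note the name attribute pointing at a connection declared in DATA_PROVIDERS:

```python
# Illustrative only: a dataset ("query" in the UI) built on a connection.
connection_data_source = {
    "domain": "sales_by_country",  # name of the resulting dataset
    "name": "my_postgres_db",      # references the connection declared in DATA_PROVIDERS
    "query": "SELECT country, SUM(amount) FROM sales GROUP BY country",
}
```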
API routes¶
- PUT /config/queries/<string:connection_name>: create a new dataset
- PUT /config/queries/<string:connector_name>/<string:query_domain>: edit an existing dataset
- DELETE /config/queries/<string:connector_name>/<string:query_domain>: delete an existing dataset
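A hedged usage sketch (the base URL and the request body shape are assumptions; the real body matches the dataset model above):

```python
import requests

BASE_URL = "https://example.com/my-app"  # hypothetical instance URL

# Create a new dataset on the "my_postgres_db" connection (body is illustrative).
requests.put(
    f"{BASE_URL}/config/queries/my_postgres_db",
    json={"domain": "sales_by_country", "query": "SELECT ..."},
)

# Delete it again.
requests.delete(f"{BASE_URL}/config/queries/my_postgres_db/sales_by_country")
```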
Datasets bound to elements of the app (stories, requesters, tiles, etc.)¶
It's also possible to define a dataset directly bound to the part of the app that will use it.
They are stored in the queries collection, and referenced in the relevant front_config object using a uid.
These datasets can only be live (queried on-demand).
These queries are defined by a pipeline and extraDomains.
The extraDomains part describes how the data is fetched from its source.
The pipeline part specifies which transformations should be applied to this data, on-the-fly.
Depending on the connector, the transformation can be done by the source, or by Toucan.
See Fetching data (Section "Over live data") for more info about their model.
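As a purely illustrative sketch of the shape (the actual model is described in Fetching data), such a query could look like:

```python
# Illustrative shape only; see "Fetching data" for the actual model.
live_query = {
    "uid": "3f2a-some-uid",  # referenced by the relevant front_config object
    "extraDomains": [
        {
            # how the data is fetched from its source (fields are illustrative)
            "connection": "my_postgres_db",
            "query": "SELECT * FROM sales",
            "domain": "raw_sales",
        }
    ],
    "pipeline": [
        # on-the-fly transformations, executed by the source or by Toucan
        {"name": "domain", "domain": "raw_sales"},
        {"name": "filter", "condition": {"column": "year", "operator": "eq", "value": 2023}},
    ],
}
```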
API routes¶
- GET /queries: list all queries
- PUT /queries: import/restore a list of queries
- POST /query: save a new query
- GET /query/<uid>: return a query
- PUT /query/<uid>: edit a query
- DELETE /query/<uid>: delete a query
- GET /query/preview: return data from an un-saved query
- GET /query/<uid>/execute: return data from a saved query
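For example, fetching the data of a saved query (base URL, uid, and response shape are assumptions):

```python
import requests

BASE_URL = "https://example.com/my-app"  # hypothetical instance URL
uid = "3f2a-some-uid"                    # uid of a saved query (placeholder)

# Return data from a saved query.
data = requests.get(f"{BASE_URL}/query/{uid}/execute").json()
```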
Materialized queries (formerly prepared datasets)¶
These are transformation pipelines, executed with pandas, over datasets that are loaded (either from files or connections).
Their definitions are stored in the queries-staging collection.
See data processing for more info.
API routes¶
The API keeps using the old routes until the frontend is reworked.
- GET /prepared-datasets/preview: start a preview operation
- GET /prepared-datasets/preview/<string:operation_id>: return the result of a preview operation
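A hedged sketch of the two-step preview flow (base URL, parameters, and response field names are assumptions):

```python
import requests

BASE_URL = "https://example.com/my-app"  # hypothetical instance URL

# Start a preview operation (any required parameters are omitted here).
operation = requests.get(f"{BASE_URL}/prepared-datasets/preview").json()
operation_id = operation["operationId"]  # field name is an assumption

# Fetch the result of that preview operation.
result = requests.get(f"{BASE_URL}/prepared-datasets/preview/{operation_id}").json()
```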