API

The best way to use Featrix is via the object API provided in Python.

Working with it is pretty easy. You can call help() on Python objects to see the docstrings on the objects.

Generally, you create a Featrix object for a specific server.

The API contains three primary objects: FeatrixEmbeddingSpace, FeatrixModel, and FeatrixDataSpace.

The FeatrixDataSpace is a way to associate data into one or more embedding spaces. Using this is required when you want to train an embedding space using multiple sources of data together.

The FeatrixEmbeddingSpace is trained on the contents of a data space, or it can be trained directly on a single table (e.g., a Pandas dataframe or a CSV file).

And finally, the FeatrixModel represents a predictive model. It is trained using an embedding space and either new data for the model specifically or some of the data that the embedding space was trained on, depending on your application needs.

You can build scalar predictions, classifications, recommendations, and more with this API. You can also cluster data or query for nearest neighbors by leveraging the embedding space. You can extend the embedding spaces, branch them, tune their training, and more.

We have designed this API to work with standard Python ecosystems: you should be able to easily connect Pandas, databases, matplotlib, sklearn, numpy, and PyTorch with the API. If you run into issues or want enhancements, drop us a note at mitch@featrix.ai or join our Slack community <https://join.slack.com/t/featrixcommunity/shared_invite/zt-25ad0tj5j-3VyaO3YdI8qI4kdr2VhUGA>!

class featrixclient.networkclient.Featrix(url='http://embedding.featrix.com:8080')
__init__(url='http://embedding.featrix.com:8080')

Create a Featrix client object.

Parameters:

url (str) –

The url of the Featrix server you are using.

The default is http://embedding.featrix.com:8080

Return type:

A Featrix object ready for embedding data.

class featrixclient.networkclient.FeatrixDataSpace(client, data_space_name=None)
__init__(client, data_space_name=None)
clearAll()

A shortcut to remove and create a dataspace. Great for ensuring reproducible results.

create(metadata: dict = None)

Creates a new data space to be used to train downstream embedding spaces.

You can load your own metadata into the passed dictionary for tracking any metadata you need.

Parameters:
  • data_space_name (str) – Arbitrary name you want to use to refer to the data space. If none, a string of an uuid will be used and returned.

  • metadata (dict) – Arbitrary metadata for version, debug capture, whatever you need. If None, a metadata dict with the creation time will be supplied for you.

Return type:

A handle confirming the name or an error if the name is already in use.

removeIfExists()

Remove the data space if it exists. Any vector spaces built from the data space are NOT deleted.

exists()

Returns True if the data space was confirmed to exist.

metadata()

Retrieve the metadata for the specified data space.

loadFileIfNeeded(path: str = None, label: str = None, df: DataFrame = None, on_bad_lines: str = 'warn', sample_percentage: float = None, sample_row_count: int = None, drop_duplicates: bool = True)

Copy a file or dataframe to the Featrix server if the file does not exist

Also associates the file with the specified data space.

The file can be associated with multiple dataspaces.

Safe to call this multiple times.

Parameters:
  • path (str) – either use this for the dataframe df but not both.

  • label (str) – this is the label for the file that will be used in this data space.

  • df (pd.DataFrame) – use either this or path, but not both.

  • on_bad_lines (str) – this is passed to pandas pd.read_csv without editing. ‘skip’ will ignore the bad lines; ‘error’ will stop loading and fail if there are bad lines. In the current software, passing ‘warn’ will not get returned to the API client (we need to fix this).

  • sample_percentage (float.) – Take a percentage of the rows at random for training the vector. The sample will be captured at the time the embedding space is trained; in other words, which part of the data is sampled will change on every training.

  • sample_row_count (int) – Take an absolute number of rows. Cannot be used with sample_percentage.

  • drop_duplicates (bool) – Ignore duplicate rows. True by default. The dropping will occur before sampling.

Notes

Files are compared with a local md5 hash and a remote md5 hash before deciding to transmit the file.

File hashes happen on the entire file; the data file is not sampled or de-duplicated prior to training a vector space. In other words, the sampling and de-duplication parameters are intended for convenience and not to save bandwidth or storage. We are open to feedback on this behavior. One implication is that samples will vary across trainings.

No partial copies are supported.

autoJoin()

Computes auto joining possibilities for the specified data space.

There are two goals with the mappings:

First, we want to map fields that uniquely identify objects to whatever degree we can.

Second, we want to map mutual information that spans data sets so that those fields get input into the same place in our input vectors to the embeddings transformation.

We infer the second set of mappings by leveraging the first set to identify linked records and then we sample and look for high conditional probabilities of fields resulting in the same field. For example, in this dataset, the joint distribution of unrelated fields often looks promising, such as when comparing “building square feet” and “street number” from different tables. But when we condition this comparison to specific linked entities, those unrelated false positives no longer hold and we latch onto the “correct” associations, such as zip code information across different source files, even if their unconditioned mutual information was “closer”.

This feature is in beta.

This can be used to confirm there are statistical relationships that are clearly present in the various files associated. This serves to diagnose and verify behavior before training the embedding space on the data.

You can also get back a CSV file of the projection to examine before creating the embedding space.

The linkage includes a list of the full columns in the result.

The columns get a hierarchical naming:

The base data columns are not changed. Additional data files get the label as a prefix with all spaces converted to underscores, and an underscore before the field name.

Parameters:

None

Return type:

Dictionary associating each file in the dataspace and the best detected linking columns.

setMappings(mappings)

Set the mappings between the base data set and another data set in the dataspace.

This overwrites any previously set mapping for the two specified data sets, but does not change the relationships between other pairs in the dataspace.

ignoreColumns(ignore_list: [<class 'str'>])

Set a list of columns to ignore. If any are set, this will overwrite the list.

The column names specified are the final projected names.

newEmbeddingSpace(metadata: dict = None, ignore_cols: [<class 'str'>] = None, detect_only: bool = False, n_epochs: int = 5, learning_rate: float = 0.01, print_debug=False)

Create a multimodal embedding space on data space.

This lets you create multimodal embeddings for assorted tabular data sources in a single trained embedding space. This essentially lets you build a foundational model on your data in such a way that you can query the entire data by using partial information that maps into as little as one of your original data sources, or you can leverage partial information spanning multiple data sources.

You can create multiple embedding spaces from a data space; you can use a subset of the data, ignore columns, or change mappings to rapidly experiment with models using this call.

The data space must already be loaded with 1 or more data source files.

This function will use auto-join (which you can try directly with EZ_DataSpaceAutoJoin) to find the linkage and corresponding overlapping mutual information between data files that have been loaded. Then a new embedding space is trained with the following columns:

Base data file: all columns (unless ignored in the ignore_cols parameter)

2nd data file: all columns, renamed to <2nd data file label> + “_” + <original_col_name>

However, the columns used for linking will not be present, as they will get their mapped names in the base data file.

To ignore a column in the 2nd data file, specify the name in the transformed format.

3rd data file: same as 2nd data file.

This trains the embedding space in the following manner:

Let’s imagine the 2nd_file_col1 and 3rd_file_col2 are the linkage to col1 in the base data set. The training space will effectively be a sparse matrix:

Parameters:
  • metadata (dict) – Your own dictionary of metadata if you want to track version information or other characteristics.

  • ignore_cols – A list of columns you want to ignore. You can get the list of columns, which may include renamed columns from merging multiple files, by calling detect_only=True

  • detect_only – If set to True, Featrix will not create the embedding space, but instead it will construct the mapping links, detect data types and encoders, identify opportunities for enrichment, and return all of this information to you.

  • n_epochs (int) – Number of epochs to train on. 5 is the default to let you try things rapidly, but usually at least 25 epochs and in some cases as many as 5000 may be needed for a high quality embedding space. You can visualize the impact of n_epochs and learning_rate with EZ_PlotLoss().

  • learning_rate (float) – Learning rate. Default is 0.01. 0.001 can be useful in some situations.

  • capture_training_debug (bool) – Pass in true to capture debug dumps every epoch.

  • encoders_override (dict) – Override encoders for specific columns.

Return type:

FeatrixEmbeddingSpace object representing the new embedding space. Will raise an assertion if an error.

class featrixclient.networkclient.FeatrixEmbeddingSpace(client: Featrix, vector_space_id=None)

FeatrixEmbeddingSpace provides access to all the facilities in an embedding space.

__init__(client: Featrix, vector_space_id=None)
remove(force=False)

Removes an embedding space, if it exists. No error if it does not exist.

Does not reset the id on this object.

Parameters:

force (bool) – If set to True, will kill a training process if the embedding space is currently training.

detectEncoders(df: DataFrame = None, csv_path: str = None, on_bad_lines: str = 'skip', print_result: bool = True, print_and_return: bool = False)

Query Featrix with some data to see how Featrix will interpret the data for encoding into the vector embeddings. You can override these detections if needed.

This will also return information about enriched columns that Featrix will extract.

Parameters:
  • df (pd.DataFrame) – The dataframe with both the input and target values. This can be the same as the dataframe used to train models in embedding space.

  • csv_path (str) – Path to a CSV file to open and read.

  • on_bad_lines – For reading the CSV file. ‘skip’, ‘error’, ‘warn’.

  • print_result – If True, prints the result to the console as a nice table. Same as calling EZ_PrintDetectOnly(EZ_DetectEncoders(…).

  • print_and_return – If True, will print to the console AND return the dictionary. Default is False (which means will not return anything without setting print_result=False)

Returns:

  • A dictionary of the printed table; useful for storing or comparing results across

  • data sets or different runs. You can ignore this if you are using print_result=True.

train(df: DataFrame = None, csv_path: str = None, on_bad_lines: str = 'skip', ignore_cols=None, n_epochs=None, learning_rate=None)

Train a new embedding space on a dataframe. The dataframe should include all target columns that you might want to predict.

You do not need to clean nulls or make the data numeric; pass in strings or missing values all that you need to.

You can pass in timestamps as a string in a variety of formats (ctime, ISO 8601, etc) and Featrix will detect and extract features as appropriate.

Parameters:
  • df (pd.DataFrame) – The dataframe with both the input and target values. This can be the same as the dataframe used to train models in embedding space.

  • csv_path (str) – Path to a CSV file to open and read.

  • on_bad_lines – For reading the CSV file. ‘skip’, ‘error’, ‘warn’.

  • ignore_cols – List of columns to ignore when training. If a column is specified that is not found in the dataframe, an exception is raised.

  • n_epochs (int or None) – Number of epochs to train on. Eventually this will support ‘auto’.

  • learning_rate (float or None) – Learning rate. Eventually this will support ‘auto’.

  • sample_percentage (float, 0.0 to 0.1) – How much of the dataframe to load. This will be passed to DataFrame.sample() on the data.

  • drop_duplicates (bool) – By default, drops duplicate rows. Set to False if you want to super sample data.

  • capture_training_debug (bool) – If set to true, this will capture a dump of the embedding space on every epoch. This is useful to create animated visualizations of the convergence of the embedding space.

  • encoders_override (dict) – Override auto-detection of encoders.

Return type:

str uuid of the embedding space (which we call vector_space_id on other calls)

Notes

This call blocks until training has completed; the lower level API gives you more async control.

To combine a series and dataframe, do something like:

continueTraining(n_epochs: int, learning_rate: float = 0.01)

Continue training the embedding space.

You can query the training history and loss by checking the structure returned from EZ_VectorSpaceMetaData().

Parameters:
  • epochs (int) – Number of epochs.

  • learning_rate (float) – Learning rate. 0.01 is the default.

  • capture_training_debug (bool) – Pass in true to capture debug dumps every epoch.

metadata()

Get metadata for the specified embedding space.

Parameters:

None

Returns:

  • A dictionary of metadata. The dictionary contains – Information about the columns used to train the embedding space. The training time, batch dimensions, and other statistics. Which encoders were used for which columns and the detected probability of data types. Information about every set of training done on the embedding space, including loss per iteration and number of epochs.

  • The specific format of the dictionary may change from release to release as we improve the product.

  • Returns None if the embedding space does not exist.

columns()

Retrieve the list of columns that were embedded in the embedding space. The embedding space must have already been trained using the train() method.

If Featrix was unable to process a column, then it will not be in the list.

Parameters:

None

Return type:

A list of column names

plotLoss()

Show a matplotlib plot of the loss on training the embedding space.

plotEmbeddings(col1_name, col2_name=None, col1_range=None, col2_range=None, col1_steps=None, col2_steps=None, relative_scale=False, axis_label_precision=5, show_unknown=False)

Plot similarity plots of embeddings in the embedding space.

Parameters:
  • col1_name (str) – The first field in the embedding space to plot.

  • col2_name (str) – This can be specified to show the similarity of embeddings between two columns. If this is NOT specified, then this will produce a self-similarity plot of col1_name vs col1_name itself.

  • col1_range ((min, max) tuple) – Range of values, used if col1 is a scalar. Default is mean ± 2 * std.

  • col2_range ((min, max) tuple) – Range of values, used if col2 is a scalar. Default is mean ± 2 * std.

  • col1_steps (int) – Number of steps to sample across col1_range, if col1 is a scalar.

  • col2_steps (int) – Number of steps to sample across col2_range, if col2 is a scalar.

  • relative_scale (bool) – If true, a relative scale will be used. Defaults to False; the default scale is [-1, 1].

  • rotate_x_labels (bool) – If true, will turn the X labels to be easier to read.

  • axis_label_precision (int) – The number of decimal places to show the scalar axis labels. Set to 0 to round to integers.

  • show_unknown (bool) – Show the <UNKNOWN> token for sets. The unknown is a special value used when embedding sets, which provides a lot of power to Featrix, but it can make for some confusing visualizations.

  • show_metadata (bool) – If true, will show the time the embedding space was trained.

  • epoch_index (int) – Show a specific view at the end of the specified epoch during training.

  • animate_training (bool) – If true, will create an animated gif showing the training of the vectors.

Raises:
  • FeatrixEmbeddingSpaceNotFound – If the embedding space specified doesn’t exist.

  • FeatrixColumnNotFound – If a specified column name doesn’t exist in the embedding space.

  • FeatrixColumnNotAvailable – If a specified column is not available for distance plotting.

cluster(k: int, columns: [<class 'str'>] = None, forceNewCluster: bool = False, return_centroids=False, return_columns: [<class 'str'>] = None, n_epochs: int = 10000)

Cluster the training data.

Future versions of the API will let you cluster new data sets, with and without the training data.

This call will block until the clustering is complete; this may take a few seconds or minutes.

Parameters:
  • k (int) –

    The number of clusters. There is an art to picking k and we will add tools later to help. If you pick k to be too small, many clusters will appear to be similar.

    Note that this implementation does not split the data into equal sized k groups.

  • columns

    The column names to cluster on.

    NOTE: The current release supports just one column. This will be fixed ASAP.

    This defaults to None. When None is used, the cluster index will be built considering all columns. Often times we want to experiment with different arrangements and this lets us do that–we can verify clusters on a single column or a reduced subset without training a new embedding space.

  • return_centroids – Return the centroid coordinates of the clusters. This can be used to evaluate the cluster separation.

  • return_columns – List of columns to return in the returned data. This lets you get back just the columns you want to evaluate the clusters. If None, then all data fields will be returned.

  • n_epochs – Number of epochs to train the clustering on.

Returns:

  • A dictionary with a few values

  • { – finished: True - indicates the clustering process has finished. error: if present, it will be set to True, and indicate something has gone wrong. message: error message if error is set.

    result: The result dictionary if there was no error:
    {

    centroids: The vector centers of each cluster. id_map: A dictionary that maps the cluster offsets to the original data file. The values in the dictionary contain the file name, hash of the data, and row index. Helper functions to deal with this are coming in a future release. label_histogram: A convenience histogram of k elements that indicates the number of items in each cluster. labels: Similar to sklearn’s fit() results, this maps offsets (which are keys into id_map) to the cluster id. total_square_error: The total squared error in each cluster (contains k elements)

    }

  • }

embed(*args, **kwargs)
embedRecords(records=None, colList: [<class 'list'>] = None)

Embed new records. You can use this to test embeddings for new data that isn’t trained in the embedding space (or that is); you can pass partial records, the sky is the limit.

This does not edit the embedding space.

Parameters:
  • records

    This can be a dataframe or a list of dictionaries.

    The keys in the dictionary need to be column names from the embedding space (which you can query with columns().

  • colList – A list of keys. You can use this to pass only some of the fields in the records argument without having to manually drop or reduce the data.

newModel()

Creates a new model object using this embedding space and the Featrix client object in use by this embedding space.

Call train() on the returned new model to do useful things with it and instantiate it on the server.

class featrixclient.networkclient.FeatrixModel(embeddingSpace=None, client=None, vector_space_id=None, model_id=None)
__init__(embeddingSpace=None, client=None, vector_space_id=None, model_id=None)
train(target_column_name: str, df: DataFrame | list[DataFrame], n_epochs: int = 25, size: str = 'small')

Create a new model in a given embedding space.

Parameters:
  • target_column_name (str) – Name of the target column. Must be present in the passed DataFrame df or the passed DataFrame dictionary after mapping.

  • df (pd.DataFrame | list[pd.DataFrame]) –

    The dataframe with both the input and target values. This can be the same as the dataframe used to train the embedding space.

    For embedding spaces created from joined data in data spaces, this can be a list of data frames to train the model.

  • n_epochs (int) – Number of epochs to train on.

  • size (str) – Can be ‘small’, ‘large’ For models that run in the Featrix server, ‘small’ is a 2 hidden layer model with 50 dimensions. ‘large’ is a 6 layer model with 50 dimensions.

Return type:

str uuid of the model (which we call model_id on other calls)

Raises:

FeatrixEmbeddingSpaceNotFound – if the embedding space specified does not exist.

Notes

This call blocks until training has completed; the lower level API gives you more async control.

To combine a series and dataframe, do something like:

all_df = df.copy() # call .copy() if you do not want to change the original. all_df[target_col] = target_values

checkGuardrails(query, issues_only=False)

Checks the parameters of the query for potential inconsistencies between the query and what the embedding space has been trained on.

Warnings or errors from this do not mean running a prediction will fail, but they can indicate that the query is beyond the bounds of what has been trained and therefore results may be unexpected.

Use this for debugging and getting a feel for the embedding space shapes.

This call is designed to be interchangeable with predict().

Parameters:
  • query (dict or [dict]) – This is exactly what you would pass to predict() Either a single parameter or a list of parameters. { col1: <value> }, { col2: <value> }

  • issues_only (bool) – If True, will return only warnings and errors and no informative messages.

predict(query)

Predict a probability distribution on a given model in a embedding space.

Query can be a list of dictionaries or a dictionary for a single query.

Parameters:
  • query (dict or [dict]) – Either a single parameter or a list of parameters. { col1: <value> }, { col2: <value> }

  • check_guardrails (bool) – If True, will run checkGuardrails() first, and print out any errors or warnings to the console.

Return type:

A dictionary of values of the model’s target_column and the probability of each of those values occurring.

predictOnDataFrame(df: DataFrame, target_column: str = None, include_probabilities: bool = False, check_accuracy: bool = False, print_info: bool = False)

Given a dataframe, treat the rows as queries and run the given model to provide a prediction on the target specified when creating the model.

Parameters:
  • target_column (str) – The target to remove from the dataframe, if it is present. None will default to the target column of the model.

  • df (pd.DataFrame) – The dataframe to run our queries on.

  • include_probabilities (bool, default False) – If True, the result will be a list of dictionaries of probabilities of values. This works like sklearn’s predict_proba() on classifiers, though our return value is not an ndarray. If check_accuracy is set to true, this will just ensure that the highest probability is right; we do not (yet) support checking an ordered list of probabilities or other nice things like that.

  • check_accuracy (bool, default False) – If True, will compare the result value from the model with the target values from the passed dataframe.

  • print_info (bool, default False) – If True, will print out some stats as queries are batched and processed.

Return type:

A list of predictions in the symbols of the original target.

Notes

In this version of the API, queries are for categorical values only.