Quick-start guide
This page is a quick tutorial to help you get ATM up and running for the first
time. We'll use a featurized dataset for a binary classification problem,
already saved in atm/data/test/pollution_1.csv. This is one of the datasets
available on openml.org. More information about the data can be found here.
Our goal is to predict mortality using the metrics associated with air pollution. Below we show a snapshot of the CSV file. The dataset has 15 features, all numeric, and a binary label column called "class".
PREC JANT JULT OVR65 POPN EDUC HOUS DENS NONW WWDRK POOR HC NOX SO@ HUMID class
35 23 72 11.1 3.14 11 78.8 4281 3.5 50.7 14.4 8 10 39 57 1
44 29 74 10.4 3.21 9.8 81.6 4260 0.8 39.4 12.4 6 6 33 54 1
47 45 79 6.5 3.41 11.1 77.5 3125 27.1 50.2 20.6 18 8 24 56 1
43 35 77 7.6 3.44 9.6 84.6 6441 24.4 43.7 14.3 43 38 206 55 1
53 45 80 7.7 3.45 10.2 66.8 3325 38.5 43.1 25.5 30 32 72 54 1
43 30 74 10.9 3.23 12.1 83.9 4679 3.5 49.2 11.3 21 32 62 56 0
45 30 73 9.3 3.29 10.6 86 2140 5.3 40.4 10.5 6 4 4 56 0
.. .. .. ... .... .... ... .... .. .... .... .. .. .. .. .
.. .. .. ... .... .... ... .... .. .... .... .. .. .. .. .
.. .. .. ... .... .... ... .... .. .... .... .. .. .. .. .
37 31 75 8 3.26 11.9 78.4 4259 13.1 49.6 13.9 23 9 15 58 1
35 46 85 7.1 3.22 11.8 79.9 1441 14.8 51.2 16.1 1 1 1 54 0
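The shape of this file can be sketched in pandas. The two rows below are copied from the snapshot above; in practice you would read the full CSV with `pd.read_csv('atm/data/test/pollution_1.csv')`:

```python
import pandas as pd

# Two rows copied from the snapshot above, just to illustrate the layout:
# 15 numeric feature columns plus a binary "class" label.
columns = ["PREC", "JANT", "JULT", "OVR65", "POPN", "EDUC", "HOUS", "DENS",
           "NONW", "WWDRK", "POOR", "HC", "NOX", "SO@", "HUMID", "class"]
rows = [
    [35, 23, 72, 11.1, 3.14, 11, 78.8, 4281, 3.5, 50.7, 14.4, 8, 10, 39, 57, 1],
    [43, 30, 74, 10.9, 3.23, 12.1, 83.9, 4679, 3.5, 49.2, 11.3, 21, 32, 62, 56, 0],
]
df = pd.DataFrame(rows, columns=columns)

X = df.drop("class", axis=1)   # the 15 features
y = df["class"]                # the binary label
print(X.shape[1])              # 15
```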
Create a datarun
Before we can train any classifiers, we need to create a datarun. In ATM, a
datarun is a single logical machine learning task. The atm enter_data command
will set up everything you need:
(atm-env) $ atm enter_data
The first time you run it, the above command will create a ModelHub database, a
dataset, and a datarun. If you run it without any arguments, it will load
configuration from the default values defined in atm/config.py. By default,
it will create a new SQLite3 database at ./atm.db, create a new dataset instance
which refers to the data at atm/data/test/pollution_1.csv, and create a
datarun instance which points to that dataset.
The command should produce output that looks something like this:
method logreg has 6 hyperpartitions
method dt has 2 hyperpartitions
method knn has 24 hyperpartitions
Data entry complete. Summary:
Dataset ID: 1
Training data: /home/bcyphers/work/fl/atm/atm/data/test/pollution_1.csv
Test data: None
Datarun ID: 1
Hyperpartition selection strategy: uniform
Parameter tuning strategy: uniform
Budget: 100 (classifier)
The datarun you just created will train classifiers using the “logreg” (logistic regression), “dt” (decision tree), and “knn” (k nearest neighbors) methods. It is using the “uniform” strategy for both hyperpartition selection and parameter tuning, meaning it will choose parameters uniformly at random. It has a budget of 100 classifiers, meaning it will train and test 100 models before completing. More info about what is stored in the database, and what the fields of the datarun control, can be found here.
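Conceptually, the "uniform" tuning strategy draws each hyperparameter independently at random from its allowed values. A minimal sketch of that idea, assuming an illustrative hyperparameter space for "knn" (the parameter names mirror the worker output later on this page, but the ranges here are made up for the example and are not ATM's actual grid):

```python
import random

# Illustrative hyperparameter space for "knn". The names echo the worker
# output shown below; the value ranges are assumptions for this sketch.
KNN_SPACE = {
    "n_neighbors": list(range(1, 21)),
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
    "metric": ["euclidean", "manhattan", "minkowski"],
}

def choose_uniform(space):
    """Pick each hyperparameter uniformly at random, which is what the
    'uniform' tuning strategy amounts to conceptually."""
    return {name: random.choice(values) for name, values in space.items()}

params = choose_uniform(KNN_SPACE)
print(params)
```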
The most important piece of information is the datarun ID. You’ll need to reference that when you want to actually compute on the datarun.
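If you lose track of the ID, the ModelHub database can also be inspected directly. A minimal sketch using the standard-library sqlite3 module, assuming the default ./atm.db path from the step above (it only lists table names, since the exact schema belongs to ATM; note that connect() will create an empty file if the database does not exist yet):

```python
import sqlite3

# Open the ModelHub database that enter_data created (default path ./atm.db)
# and list the tables it contains.
conn = sqlite3.connect("atm.db")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)
conn.close()
```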
Execute the datarun
An ATM worker is a process that connects to a ModelHub, asks it what dataruns need to be worked on, and trains and tests classifiers until all the work is done. To run one, use the following command:
(atm-env) $ atm worker
This will start a process that builds classifiers, tests them, and saves them to the ./models/ directory. As it runs, it should print output indicating which hyperparameters are being tested, the performance of each classifier it builds, and the best overall performance so far. One round of training looks like this:
Computing on datarun 1
Selector: <class 'btb.selection.uniform.Uniform'>
Tuner: <class 'btb.tuning.uniform.Uniform'>
Chose parameters for method "knn":
_scale = True
algorithm = brute
metric = euclidean
n_neighbors = 8
weights = distance
Judgment metric (f1, cv): 0.813 +- 0.081
New best score! Previous best (classifier 24): 0.807 +- 0.284
Saving model in: models/pollution_1-62233d75.model
Saving metrics in: metrics/pollution_1-62233d75.metric
Saved classifier 63.
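The "Judgment metric (f1, cv)" line reports the mean and standard deviation of cross-validated f1 scores. A sketch of how such a number is computed, using scikit-learn on synthetic stand-in data (the hyperparameters echo the worker output above; this is not ATM's internal code):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the pollution data: 15 numeric features, binary label.
X, y = make_classification(n_samples=200, n_features=15, random_state=0)

# Hyperparameters echo the worker output above.
clf = KNeighborsClassifier(n_neighbors=8, weights="distance",
                           algorithm="brute", metric="euclidean")

# Mean +- standard deviation of cross-validated f1 scores.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print("f1 (cv): %.3f +- %.3f" % (scores.mean(), scores.std()))
```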
And that’s it! You’re executing your first datarun, traversing the vast space of hyperparameters to find the absolute best model for your problem. You can break out of the worker with Ctrl+C and restart it with the same command; it will pick up right where it left off. You can also run the command simultaneously in different terminals to parallelize the work – all workers will refer to the same ModelHub database.
Occasionally, a worker will encounter an error in the process of building and testing a classifier. Don't worry: when this happens, the worker will print error data to the terminal, log the error in the database, and move on to the next classifier.
When all 100 classifiers in your budget have been built, the datarun is finished! All workers will exit gracefully.
Classifier budget has run out!
Datarun 1 has ended.
No dataruns found. Exiting.
You can then load the best classifier from the datarun and use it to make predictions on new datapoints.
>>> from atm.database import Database
>>> db = Database(dialect='sqlite', database='atm.db')
>>> model = db.load_model(classifier_id=110)
>>> import pandas as pd
>>> data = pd.read_csv('atm/data/test/pollution_1.csv')
>>> model.predict(data.drop('class', axis=1).iloc[0:1])