How to Load Data in Python with Scikit-Learn
Before you can build machine learning models, you need to load your data into memory.
In this post, you will learn how to load data for machine learning in Python using scikit-learn.
Packaged Datasets
The scikit-learn library comes packaged with datasets. These datasets are useful for getting a handle on a given machine learning algorithm or library feature before using it in your own work.
This recipe shows how to load the well-known Iris flowers dataset.
```python
# Load the packaged iris flowers dataset
# Iris flower dataset (150×4, reals, multi-class classification)
from sklearn.datasets import load_iris

iris = load_iris()
print(iris)
```
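Printing the whole object is noisy, so it can help to look at its parts. The sketch below assumes you want the feature matrix and the class labels as separate arrays; the variable names `X`, `y`, `X2`, and `y2` are illustrative choices, not part of the original recipe.

```python
from sklearn.datasets import load_iris

# load_iris() returns a dictionary-like object holding data and metadata
iris = load_iris()

X = iris.data    # feature matrix: one row per flower, one column per measurement
y = iris.target  # integer class label for each row

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
print(iris.feature_names)
print(iris.target_names)

# Alternatively, request only the two arrays and skip the metadata
X2, y2 = load_iris(return_X_y=True)
```

The `return_X_y=True` form is convenient when you only need the arrays to pass straight to a model.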
Load from CSV
It is very common to have a dataset as a CSV file on your local workstation or on a remote server.
This recipe shows how to load a CSV file from a URL, in this case the Pima Indians diabetes classification dataset.
You can learn more about the dataset here.
From the prepared X and y variables, you can train a machine learning model.
```python
# Load the Pima Indians diabetes dataset from CSV URL
import numpy as np
from urllib.request import urlopen

# URL for the Pima Indians Diabetes dataset (UCI Machine Learning Repository)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
# download the file (urllib.request.urlopen is the Python 3 equivalent of urllib.urlopen)
raw_data = urlopen(url)
# load the CSV file as a NumPy array
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# separate the data from the target attribute:
# columns 0-7 are the eight input features, column 8 is the class label
X = dataset[:, 0:8]
y = dataset[:, 8]
```
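The same approach works for a CSV file on your local workstation, and the resulting arrays can be passed directly to a scikit-learn estimator. The following is a minimal sketch, not the recipe's own method: it assumes a file with eight input columns followed by one target column, and it writes a small file of random placeholder values so that it runs without a network connection. The file name `placeholder.csv`, the synthetic values, and the choice of `LogisticRegression` are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Build a synthetic stand-in file: 8 feature columns plus a 0/1 label column.
# These are random placeholder values, not real records from any dataset.
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 8))
labels = (features[:, 0] + features[:, 1] > 0).astype(float)
placeholder = np.column_stack([features, labels])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "placeholder.csv")  # hypothetical file name
    np.savetxt(csv_path, placeholder, delimiter=",")

    # Load the local CSV from a path instead of a URL
    dataset = np.loadtxt(csv_path, delimiter=",")

# Separate the 8 input columns from the final target column
X = dataset[:, 0:8]
y = dataset[:, 8]

# Fit a simple classifier on the prepared arrays
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print(X.shape, y.shape)
print(model.score(X, y))  # accuracy on the training data itself
```

With your own data, replace the synthetic file with the path to your CSV and keep the loading, slicing, and fitting steps as they are. Note that `np.loadtxt` expects purely numeric columns and no header row unless you pass `skiprows`.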
Conclusion
In this post, you discovered that scikit-learn comes with packaged datasets, including the Iris flowers dataset. These datasets can be loaded easily and used to explore and experiment with different machine learning models.
You also saw how to load CSV data for use with scikit-learn. You learned a way of opening CSV files from the web using the urllib library and how to read that data as a NumPy array for use in scikit-learn.