Applying machine learning to agriculture: revolutionizing a thousand-year-old industry
The Climate Corporation aims to help farmers better understand their operations and make better decisions that increase crop yields sustainably. To that end, we have built a model-driven software platform called Climate FieldView that captures, visualizes, and analyzes a broad array of data for farmers, delivering new insights and personalized recommendations to maximize crop yield. FieldView can ingest grower-specific data, such as historical harvest data and operational data streamed from specialized devices, including our FieldView Drive, which is installed in tractors, combines, and other farm equipment. It also integrates public and third-party datasets, such as soil, weather, satellite, and elevation data, along with proprietary data, such as genetic data on seed hybrids from our parent company, Bayer.
With the platform, farmers can understand the effects of a particular fertilization strategy and adjust as needed. They can identify potential problem areas, such as a field affected by crop disease, early enough to act quickly. And they can determine which seeds to plant in which parts of their operation to get the most out of every stretch of land.
These recommendations aren't just good for a farmer's bottom line; they're also good for the environment. For example, our models can show farmers how to increase production while using less fertilizer.
Accelerating knowledge acquisition in agriculture
The application of machine learning to agriculture is relatively new. To drive a digital transformation in agriculture, we must experiment and learn quickly across the entire model lifecycle. With that in mind, we have developed a company-wide strategy and architecture to accelerate the full model pipeline from conception to commercialization.
This blog post focuses on accelerating model development.
Accelerating model development hinges on two factors: the ability to gain new knowledge and the ability to discover knowledge produced by others. By making it easy to generate new insights and to find and build on the insights of others, a network effect emerges that speeds up future discoveries.
First, every experiment we run, every feature we produce, and every model we build must be discoverable and usable by every researcher in the organization. Second, given the breadth and volume of data we must bring together, everything from seed genomes, soil types, and weather to management practices, drone imagery, and satellite imagery, the data and training infrastructure must be flexible and scalable while eliminating time-consuming tasks that add nothing to the knowledge acquisition cycle. Our goal is to let data scientists work as data scientists, not as infrastructure or data engineers.
We chose to combine the best aspects of Domino Data Lab, Amazon Web Services, and the open source community to solve this knowledge acceleration problem.
Domino excels at usability and discovery
At The Climate Corporation, we have taken the pre-built Domino containers and extended them to work in our environment. Everything from identity and access management (IAM) policies, to standard package installation, to connectivity to Spark on Amazon EMR is ready for scientists to use. With the click of a button they have a working environment, and Domino provides a simple extension point for researchers to customize that environment without needing to know the intricacies of Docker.
Domino also shines in reproducibility and discovery. Every run of a project is recorded and can be reproduced. Experimentation and collaboration are built into the foundation of the platform. The real knowledge acceleration comes from discovering other work on the platform: with a simple keyword search, a researcher can scan projects, files, collaborator comments, user-defined tags, past runs, and more to find relevant research or subject matter experts.
As mentioned above, our domain involves enormous amounts of data in many shapes and sizes, so we needed a data format and training platform that could handle this complexity and scale. The platform had to be simple, work with multiple frameworks, and fit our existing infrastructure. We wanted an "evolvable architecture" that would accommodate the next deep learning framework or compute platform. Choosing a model framework, or choosing between one machine and fifty, should require no extra work and be virtually seamless for the researcher. Likewise, a set of features should be reusable across different frameworks and technologies without costly format translations.
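Petastorm, which appears in the example workflow below, is one way to meet that last requirement. As a minimal sketch of the idea (not our production code, and with `dataset_url` as a hypothetical placeholder for a Parquet store written with Petastorm), the same feature store can feed both TensorFlow and PyTorch without any format translation:

```python
# Minimal sketch: one Petastorm feature store read by two frameworks.
# 'dataset_url' is a placeholder for an S3/HDFS Parquet store created with
# materialize_dataset, as shown later in this post.
from petastorm import make_reader
from petastorm.pytorch import DataLoader
from petastorm.tf_utils import make_petastorm_dataset

dataset_url = 's3://my-bucket/my-feature-store/train'  # hypothetical location

# TensorFlow: wrap a reader in a tf.data.Dataset
with make_reader(dataset_url) as reader:
    tf_dataset = make_petastorm_dataset(reader)

# PyTorch: wrap a second reader in a torch-style DataLoader
with DataLoader(make_reader(dataset_url)) as loader:
    for batch in loader:
        break  # each batch is dict-like, keyed by schema field name
```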
Why SageMaker?
- Ready-to-run, extensible training containers
If you have ever tried to build a TensorFlow environment from scratch, you know how hard it is to get the right versions of all the dependencies and drivers working together. Now we can simply choose one of the pre-built environments, or even supply our own container to add Petastorm or other libraries.
- On-demand training infrastructure
By simply changing the configuration in your API call to SageMaker, you can provision training infrastructure with different combinations of CPU, GPU, RAM, and networking capacity, giving you the flexibility to choose exactly the right mix of resources (see the sketch after this list). This capability simplifies operations and optimizes the cost of experimentation.
- Built-in hyperparameter tuning
The built-in ability to try many combinations of hyperparameters in parallel, rather than running those trials in serial, greatly speeds up model building by making experimentation more efficient. Both capabilities are sketched in the short example after this list.
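To make those two points concrete, here is a minimal sketch using the SageMaker Python SDK; the entry-point script name, IAM role, metric regex, and hyperparameter range are all hypothetical placeholders:

```python
# Minimal sketch, assuming the SageMaker Python SDK (v1.x); names such as
# 'train.py', 'MySageMakerRole', and the metric regex are placeholders.
from sagemaker.tensorflow import TensorFlow
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

# Swapping hardware is just a configuration change: the same job can run
# on five CPU machines or, say, two GPU machines.
estimator = TensorFlow(entry_point='train.py',
                       role='MySageMakerRole',
                       framework_version='1.13',
                       py_version='py3',
                       train_instance_count=5,
                       train_instance_type='ml.m5.xlarge')  # e.g. 'ml.p3.2xlarge' for GPUs

# Built-in tuning runs trials in parallel instead of in serial.
tuner = HyperparameterTuner(estimator,
                            objective_metric_name='loss',
                            objective_type='Minimize',
                            metric_definitions=[{'Name': 'loss',
                                                 'Regex': 'loss = ([0-9\\.]+)'}],
                            hyperparameter_ranges={'learning_rate': ContinuousParameter(1e-4, 1e-1)},
                            max_jobs=8,
                            max_parallel_jobs=4)
tuner.fit(inputs=None)  # inputs bypassed, as with the Petastorm example below
```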
Bringing it all together
Below are excerpts from an example workflow using Domino, Petastorm, and SageMaker with the MNIST dataset.
- Start a project in Domino and launch a workspace.
- Create the Petastorm features:
```python
import numpy as np

from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import dict_to_spark_row

# Get training and test data
if mnist_data is None:
    mnist_data = {'train': download_mnist_data(download_dir, train=True),
                  'test': download_mnist_data(download_dir, train=False)}

# The MNIST data is small enough to do everything here in Python
for dset, data in mnist_data.items():
    dset_output_url = '{}/{}'.format(output_url, dset)
    # Using row_group_size_mb=1 to avoid having just a single rowgroup in this
    # example. In a real store, the value should be similar to an HDFS block size.
    with materialize_dataset(spark, dset_output_url, MnistSchema, row_group_size_mb=1):
        # List of [(idx, image, digit), ...]
        # where image is shaped as a 28x28 numpy matrix
        idx_image_digit_list = map(lambda idx_image_digit: {
            MnistSchema.idx.name: idx_image_digit[0],
            MnistSchema.digit.name: idx_image_digit[1][1],
            MnistSchema.image.name: np.array(list(idx_image_digit[1][0].getdata()),
                                             dtype=np.uint8).reshape(28, 28)
        }, enumerate(data))

        # Convert to pyspark.sql.Row
        sql_rows = map(lambda r: dict_to_spark_row(MnistSchema, r), idx_image_digit_list)

        # Write out the result
        spark.createDataFrame(sql_rows, MnistSchema.as_spark_schema()) \
            .coalesce(parquet_files_count) \
            .write \
            .option('compression', 'none') \
            .parquet(dset_output_url)
```
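The snippet above references MnistSchema without showing its definition. For context, a Petastorm Unischema for this dataset could look like the sketch below; the field names and codecs follow Petastorm's public MNIST example, so treat this as illustrative rather than an excerpt of the exact schema used here:

```python
import numpy as np
from pyspark.sql.types import IntegerType

from petastorm.codecs import CompressedNdarrayCodec, ScalarCodec
from petastorm.unischema import Unischema, UnischemaField

# Illustrative sketch of MnistSchema: each row stores a sample index, the
# digit label, and the 28x28 image as a compressed ndarray.
MnistSchema = Unischema('MnistSchema', [
    UnischemaField('idx', np.int_, (), ScalarCodec(IntegerType()), False),
    UnischemaField('digit', np.int_, (), ScalarCodec(IntegerType()), False),
    UnischemaField('image', np.uint8, (28, 28), CompressedNdarrayCodec(), False),
])
```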
- Train on SageMaker with five machines:
```python
import sagemaker
from sagemaker.tensorflow import TensorFlow

kwargs = dict(entry_point=entry_point,
              image_name=IMAGE_NAME,
              role=IAM_ROLE,
              sagemaker_session=sagemaker.Session(boto_session=boto_session),
              train_instance_count=5,
              train_instance_type='ml.m5.xlarge',
              framework_version='1.13',
              hyperparameters={'dataset-url': DATASET_URL,
                               'training_steps': training_step_count,
                               'batch_size': batch_size,
                               'evaluation_steps': 10},
              py_version='py3',
              output_path=output_path,
              code_location=code_location,
              distributions={'parameter_server': {'enabled': True}})

mnist_estimator = TensorFlow(**kwargs)

# We're bypassing the conventional SageMaker input methods because we are
# using Petastorm. We will show this in a moment.
mnist_estimator.fit(inputs=None)
```
- The critical part of our entry_point script, where we read the Petastorm dataset and convert it into tensors:
```python
import tensorflow as tf

from petastorm.tf_utils import make_petastorm_dataset


def streaming_parser(serialized_example):
    """Parses a single tf.Example into image and label tensors."""
    # 28 x 28 is the size of an MNIST example
    image = tf.cast(tf.reshape(serialized_example.image, [28 * 28]), tf.float32)
    label = serialized_example.digit
    return {"image": image}, label


def _input_fn(reader, batch_size, num_parallel_batches):
    dataset = (make_petastorm_dataset(reader)
               # Per Petastorm docs, do not add a .repeat(num_epochs) here;
               # Petastorm will cycle indefinitely through the data given
               # num_epochs=None provided to make_reader
               .apply(tf.contrib.data.map_and_batch(streaming_parser,
                                                    batch_size=batch_size,
                                                    num_parallel_batches=num_parallel_batches)))
    return dataset
```
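For completeness, here is a rough sketch of how an entry point might connect this input function to the Petastorm store; the train() helper, the tf.estimator usage, and the derivation of dataset_url from the 'dataset-url' hyperparameter are assumptions for illustration, not an excerpt from our actual script:

```python
from petastorm import make_reader

# Hypothetical glue, assuming the entry point builds a standard
# tf.estimator.Estimator. 'dataset_url' would come from the 'dataset-url'
# hyperparameter passed to the SageMaker estimator above.
def train(estimator, dataset_url, batch_size, training_steps):
    # num_epochs=None makes the reader cycle through the data indefinitely,
    # which is why _input_fn must not add a .repeat() of its own.
    with make_reader('{}/train'.format(dataset_url), num_epochs=None) as reader:
        estimator.train(
            input_fn=lambda: _input_fn(reader, batch_size, num_parallel_batches=4),
            steps=training_steps)
```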
- The output of the run is automatically saved in the Domino workspace for reproducibility, and the model remains on Amazon S3 for later promotion to production by other parts of the infrastructure stack.
The future
By aligning our infrastructure with the model lifecycle and focusing on accelerating knowledge acquisition through the democratization of data, models, features, and experiments, we have been able to rapidly increase the number of models we deploy each year and reduce the time it takes to deploy them. Our next step is to lower the feature engineering barriers around combining the many varieties of spatial, temporal, and non-spatial data in a way that is easy for researchers to use when training models. The infrastructure is in place to iterate quickly on a model once a training set is built; we hope to further remove data engineering work from the data scientist's role, and let data scientists be exactly what their title says they are.