Apache Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale. This contrib package contains an integration between Apache Ignite and TensorFlow. The integration is based on tf.data on the TensorFlow side and the Binary Client Protocol on the Apache Ignite side. It allows Apache Ignite to be used as a data source for neural network training, inference and all other computations supported by TensorFlow. Another part of this module is an integration with a distributed file system based on Apache Ignite.
Ignite Dataset provides features that you can use in a wide range of cases. The most important and interesting features are described below.
Apache Ignite is a distributed in-memory database, caching, and processing platform that provides fast data access. It lets you avoid the limitations of hard drives and store and operate on as much data as you need in a distributed cluster. You can utilize these benefits of Apache Ignite by using Ignite Dataset. Moreover, Ignite Dataset can be used for the following use-cases:
Note that Apache Ignite is not just a step in an ETL pipeline between a database or a data warehouse and TensorFlow. Apache Ignite is a high-grade database itself. By choosing Apache Ignite and TensorFlow you get everything you need to work with operational or historical data and, at the same time, the ability to use this data for neural network training and inference.
```
$ apache-ignite-fabric/bin/ignite.sh
$ apache-ignite-fabric/bin/sqlline.sh -u "jdbc:ignite:thin://localhost:10800/"

jdbc:ignite:thin://localhost/> CREATE TABLE KITTEN_CACHE (ID LONG PRIMARY KEY, NAME VARCHAR);
jdbc:ignite:thin://localhost/> INSERT INTO KITTEN_CACHE VALUES (1, 'WARM KITTY');
jdbc:ignite:thin://localhost/> INSERT INTO KITTEN_CACHE VALUES (2, 'SOFT KITTY');
jdbc:ignite:thin://localhost/> INSERT INTO KITTEN_CACHE VALUES (3, 'LITTLE BALL OF FUR');
```
```python
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>> tf.enable_eager_execution()
>>>
>>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
>>>
>>> for element in dataset:
>>>   print(element)

{'key': 1, 'val': {'NAME': b'WARM KITTY'}}
{'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
{'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
```
Apache Ignite allows storing objects of any type and with arbitrarily nested structure. Ignite Dataset is able to work with such objects.
```python
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>> tf.enable_eager_execution()
>>>
>>> dataset = IgniteDataset(cache_name="IMAGES")
>>>
>>> for element in dataset.take(1):
>>>   print(element)

{
  'key': 'kitten.png',
  'val': {
    'metadata': {
      'file_name': b'kitten.png',
      'label': b'little ball of fur',
      'width': 800,
      'height': 600
    },
    'pixels': [0, 0, 0, 0, ..., 0]
  }
}
```
Neural network training and other computations require data transformations; with Ignite Dataset these can be done as part of the tf.data pipeline.
```python
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>> tf.enable_eager_execution()
>>>
>>> dataset = IgniteDataset(cache_name="IMAGES").map(lambda obj: obj['val']['pixels'])
>>>
>>> for element in dataset:
>>>   print(element)

[0, 0, 0, 0, ..., 0]
```
TensorFlow is a machine learning framework that natively supports distributed neural network training, inference and other computations. The main idea behind distributed neural network training is the ability to calculate gradients of the loss function (e.g., squares of the errors) on every partition of data (in terms of horizontal partitioning) and then sum them to get the loss function gradient of the whole dataset.
Using this ability we can calculate gradients on the nodes where the data is stored, reduce them, and then finally update the model parameters. This avoids data transfers between nodes and thus prevents network bottlenecks.
Apache Ignite uses horizontal partitioning to store data in a distributed cluster. When we create an Apache Ignite cache (or a table, in SQL terms), we can specify the number of partitions the data will be split across. For example, if an Apache Ignite cluster consists of 10 machines and we create a cache with 10 partitions, then every machine will maintain approximately one data partition.
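The idea of horizontal partitioning can be sketched in plain Python. Note that this hash-based assignment is only an illustration; Ignite's actual affinity function is more sophisticated and handles rebalancing and backups.

```python
# Illustrative sketch of horizontal partitioning: every cache key is mapped
# to exactly one of a fixed number of partitions by hashing. This is NOT
# Ignite's real affinity function, just a minimal model of the concept.

NUM_PARTITIONS = 10

def partition_for_key(key, num_partitions=NUM_PARTITIONS):
    """Map a cache key to one of `num_partitions` partitions."""
    return hash(key) % num_partitions

# Distribute 100 example keys across the partitions. In a 10-machine
# cluster with 10 partitions, each machine would hold roughly one partition.
partitions = {}
for key in range(100):
    partitions.setdefault(partition_for_key(key), []).append(key)

print(sorted(partitions.keys()))  # all partitions receive some keys
```

Because each key belongs to exactly one partition, a computation can be split into independent per-partition tasks, which is exactly what the distributed training scheme below exploits.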
Ignite Dataset allows combining these two aspects: distributed neural network training (using TensorFlow) and Apache Ignite partitioning. Ignite Dataset is a computation graph operation that can be performed on a remote worker. The remote worker can override Ignite Dataset parameters (such as `host`, `port` or `part`) by setting the corresponding environment variables for the worker process (such as `IGNITE_DATASET_HOST`, `IGNITE_DATASET_PORT` or `IGNITE_DATASET_PART`). Using this overriding approach, we can assign a specific partition to each worker so that one worker handles one partition and, at the same time, transparently works with a single dataset.
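The override mechanism itself can be sketched in plain Python: each worker process prefers the environment variables over the values passed to the dataset constructor. The default values below are assumptions for illustration only.

```python
import os

# Sketch of how a worker can override Ignite Dataset connection parameters
# through environment variables. Defaults here are illustrative assumptions.
def resolve_ignite_params(host="localhost", port=10800, part=-1):
    """Return (host, port, part), preferring environment variables."""
    host = os.environ.get("IGNITE_DATASET_HOST", host)
    port = int(os.environ.get("IGNITE_DATASET_PORT", port))
    part = int(os.environ.get("IGNITE_DATASET_PART", part))
    return host, port, part

# Worker 3 is told to handle partition 3 only.
os.environ["IGNITE_DATASET_PART"] = "3"
print(resolve_ignite_params())  # part is 3 regardless of the constructor value
```

With this approach, the same graph definition runs unchanged on every worker, and only the environment decides which partition each worker reads.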
```python
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>>
>>> dataset = IgniteDataset("IMAGES")
>>>
>>> # Compute gradients locally on every worker node.
>>> gradients = []
>>> for i in range(5):
>>>   with tf.device("/job:WORKER/task:%d" % i):
>>>     device_iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
>>>     device_next_obj = device_iterator.get_next()
>>>     gradient = compute_gradient(device_next_obj)
>>>     gradients.append(gradient)
>>>
>>> # Aggregate them on master node.
>>> result_gradient = tf.reduce_sum(gradients)
>>>
>>> with tf.Session("grpc://localhost:10000") as sess:
>>>   print(sess.run(result_gradient))
```
High-level TensorFlow API for distributed training is supported as well.
In addition to database functionality, Apache Ignite provides a distributed file system called IGFS. IGFS delivers functionality similar to Hadoop HDFS, but only in memory. In fact, in addition to its own APIs, IGFS implements the Hadoop FileSystem API and can be transparently plugged into Hadoop or Spark deployments. This contrib package contains an integration between IGFS and TensorFlow. The integration is based on a custom filesystem plugin on the TensorFlow side and the IGFS Native API on the Apache Ignite side. It has numerous uses, for example:
Apache Ignite allows protecting data transfer channels with SSL and authentication. Ignite Dataset supports SSL connections both with and without authentication. For more information, please refer to the Apache Ignite SSL/TLS documentation.
```python
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>> tf.enable_eager_execution()
>>>
>>> dataset = IgniteDataset(cache_name="IMAGES",
                            certfile="client.pem",
                            cert_password="password",
                            username="ignite",
                            password="ignite")
```
Ignite Dataset is fully compatible with Windows. You can use it as part of TensorFlow on your Windows workstation as well as on Linux/macOS systems.
The following examples will help you easily start working with this module.
The simplest way to try Ignite Dataset is to run a Docker container with Apache Ignite and pre-loaded MNIST data, and then interact with it using Ignite Dataset. Such a container is available on Docker Hub: dmitrievanthony/ignite-with-mnist. You need to start this container on your machine:
docker run -it -p 10800:10800 dmitrievanthony/ignite-with-mnist
After that you will be able to work with it in the following way:
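A minimal session against the container might look like the following. Note that the cache name `MNIST_CACHE` is an assumption about how the container loads the data; check the container's documentation for the actual cache name.

```python
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>> tf.enable_eager_execution()
>>>
>>> # "MNIST_CACHE" is an assumed cache name for illustration.
>>> dataset = IgniteDataset(cache_name="MNIST_CACHE")
>>>
>>> for element in dataset.take(1):
>>>   print(element)
```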
The simplest way to try IGFS with TensorFlow is to run a Docker container with Apache Ignite and IGFS enabled, and then interact with it using TensorFlow's tf.gfile. Such a container is available on Docker Hub: dmitrievanthony/ignite-with-igfs. You need to start this container on your machine:
docker run -it -p 10500:10500 dmitrievanthony/ignite-with-igfs
After that you will be able to work with it in the following way:
```python
>>> import tensorflow as tf
>>> import tensorflow.contrib.ignite.python.ops.igfs_ops
>>>
>>> with tf.gfile.Open("igfs:///hello.txt", mode='w') as w:
>>>   w.write("Hello, world!")
>>>
>>> with tf.gfile.Open("igfs:///hello.txt", mode='r') as r:
>>>   print(r.read())

Hello, world!
```
Presently, Ignite Dataset works under the assumption that all objects in the cache have the same structure (homogeneous objects) and that the cache contains at least one object. Another limitation concerns structured objects: Ignite Dataset does not support UUIDs, Maps or Object arrays that might be part of an object structure.