Google has introduced a new version of the Cloud Storage Connector for Hadoop, also known as the GCS Connector. In a recent blog post, Google announced that this release is another step toward making it easier for users to substitute Google Cloud Storage for the Hadoop Distributed File System (HDFS). The new version provides increased throughput efficiency for columnar file formats such as Parquet and ORC. It also includes isolation for Cloud Storage directory modifications, reduced latency, improved parallelization, and intelligent defaults.
The Cloud Storage Connector is an open source Java client library that runs in Hadoop JVMs. It allows open source software such as Hadoop and Spark jobs to read and write data directly to Cloud Storage instead of HDFS.
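In a self-managed Hadoop distribution, the connector is typically wired up through Hadoop's `core-site.xml`. The sketch below shows the two filesystem-implementation properties the connector's documentation describes; the bucket name in the usage example is a placeholder:

```xml
<!-- core-site.xml: minimal sketch for enabling the GCS connector. -->
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
</configuration>
```

With the connector on the classpath and configured this way, jobs can address data with `gs://` URIs (for example `hadoop fs -ls gs://my-bucket/`) instead of `hdfs://` paths.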
Benefits of Storing Data in Google Cloud Storage Over HDFS
According to Google there are certain benefits of storing data in Google Cloud Storage over HDFS, including:
- It reduces cost compared to a long-running HDFS cluster, which creates three replicas on persistent disks
- Users can grow each layer independently by separating storage from compute
- Data persists even after Hadoop clusters are terminated
- Users can share Cloud Storage buckets between ephemeral Hadoop clusters
- There is no storage administration overhead, such as managing upgrades and high availability for HDFS
The Cloud Storage Connector for Hadoop is completely open source and is supported by Google Cloud Platform (GCP). It comes pre-configured in Cloud Dataproc, GCP's Hadoop and Spark offering. It is also fully supported in other Hadoop distributions such as MapR, Cloudera, and Hortonworks, which makes it easy for users to migrate on-premises HDFS data to the cloud.
Architecture of Cloud Storage Connector
This is what the Cloud Storage Connector architecture looks like:
Google’s Cloud Storage Connector is an Apache 2.0-licensed implementation of the Hadoop Compatible File System (HCFS) interface for Cloud Storage. It has four major components.
Key Features of New Google Cloud Storage Connector
In older versions of the connector, all the data in a file was processed sequentially. The new version is designed to perform predicate pushdown, which allows the query engine to read only the parts of a file required to process the query. Twitter’s engineering team customized the open source Cloud Storage Connector to read only the data required by the query engine, and Google incorporated that work into a more generalized fadvise feature.
The Cloud Storage Connector’s new fadvise feature automatically detects whether the current big data application’s I/O access pattern is sequential or random. By default it starts with a sequential read pattern, but after detecting a backward seek or a long forward seek it switches to random mode.
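The adaptive heuristic described above can be sketched as a small state machine: start in sequential mode, and switch permanently to random mode once a backward seek or a long forward seek is observed. This is a simplified illustration of the idea, not the connector's actual code; the class name and the seek threshold are assumptions made for the example.

```java
// Sketch of an adaptive fadvise-style heuristic (illustrative only).
public class AdaptiveFadvise {
    enum Mode { SEQUENTIAL, RANDOM }

    // Forward jumps longer than this are treated as random access.
    // Illustrative value, not the connector's real default.
    static final long LONG_FORWARD_SEEK_BYTES = 8L * 1024 * 1024;

    private Mode mode = Mode.SEQUENTIAL; // start with the sequential pattern
    private long position = 0;

    /** Record a seek to newPosition and return the resulting read mode. */
    Mode seek(long newPosition) {
        long delta = newPosition - position;
        // A backward seek or a long forward seek signals random access;
        // in this sketch, the reader then stays in RANDOM mode.
        if (delta < 0 || delta > LONG_FORWARD_SEEK_BYTES) {
            mode = Mode.RANDOM;
        }
        position = newPosition;
        return mode;
    }
}
```

Short forward seeks keep the reader in sequential mode, so large prefetching reads stay efficient; a backward seek immediately flips the reader into random mode, where smaller ranged reads avoid wasted bandwidth.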
Another noticeable addition in this new version of the connector is cooperative locking, which is used to isolate directory modification operations performed through the Hadoop fs command or other HCFS API interfaces to Cloud Storage.
The image below shows what a directory move with cooperative locking looks like:
Cooperative locking is implemented using atomic lock acquisition in a lock file (_lock/all.lock). The connector automatically obtains a lock in this bucket-wide lock file before it modifies any directory.
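The core idea, atomically claiming a directory in a single bucket-wide lock record before modifying it, can be sketched as follows. This models the contents of the lock file with an in-memory map rather than an object in Cloud Storage, so it is an illustration of the locking scheme, not the connector's implementation; the class and method names are invented for the example.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of cooperative locking: a client must atomically acquire an
// entry for a directory before modifying it, and releases it afterwards.
public class CooperativeLock {
    // Stand-in for the bucket-wide lock file: maps a locked directory
    // path to the id of the client currently holding the lock.
    private final ConcurrentHashMap<String, String> lockFile =
            new ConcurrentHashMap<>();

    /** Atomically acquire the lock on a directory; true on success. */
    boolean acquire(String directory, String clientId) {
        // putIfAbsent is atomic, mirroring the atomic update of the
        // lock file: only one client can claim a directory at a time.
        return lockFile.putIfAbsent(directory, clientId) == null;
    }

    /** Release the lock once the directory operation completes. */
    void release(String directory, String clientId) {
        lockFile.remove(directory, clientId);
    }
}
```

A second client attempting to move or delete the same directory fails to acquire the lock and must wait, which is what isolates concurrent directory modifications from each other.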
In addition to the features mentioned above, the new version of the connector introduces several other performance improvements, such as:
- Directory modification parallelization
- Latency optimizations
- Concurrent glob algorithms
- Repair of implicit directories during delete and rename operations
- Read consistency of the storage
Existing users can easily upgrade to the new version of the connector by using the connectors initialization action on existing Cloud Dataproc versions. In Cloud Dataproc 2.0 it will be the standard connector.