Spark — HBase integration

Thulasitharan Govindaraj
Feb 16, 2020

Hey folks,

Thought of sharing a solution to an issue that took me a week or so to figure out.

HBase is a NoSQL database that runs on top of Hadoop. As huge amounts of data are generated every minute, some of it without a fixed schema, it can be stored in a NoSQL database and refined later for data science purposes.

Ask: read an HBase table as a DataFrame in Spark.

Spark version: 2.4.0 for the HBase integration, 2.4.5 for the ORC/CSV to Parquet conversion

HBase version: 1.4.12

I used the common approach: a catalog for the column mapping and a read with HBase as the format, pulling in the Hortonworks SHC (Spark HBase Connector) dependency, as described in the link below.

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-using-spark-query-hbase

Code:

===

spark-shell --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11,com.hortonworks:shc:1.1.1-2.1-s_2.11 --repositories https://repository.apache.org/content/repositories/releases

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Read an HBase table as a DataFrame via the SHC data source, using the given catalog for column mapping.
def withCatalog(catalog: String): DataFrame =
  spark.read
    .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()

def carCatalog = s"""{
"table":{"namespace":"default", "name":"cardata"},
"rowkey":"key",
"columns":{
"vehicle_id":{"cf":"rowkey", "col":"key", "type":"string"},
"alloy_wheels":{"cf":"hardware", "col":"alloy_wheels", "type":"string"},
"anti_Lock_break":{"cf":"hardware", "col":"anti_Lock_break", "type":"string"},
"electronic_breakforce_distribution":{"cf":"software", "col":"electronic_breakforce_distribution", "type":"string"},
"terrain_mode":{"cf":"software", "col":"terrain_mode", "type":"string"},
"traction_control":{"cf":"software", "col":"traction_control", "type":"string"},
"stability_control":{"cf":"software", "col":"stability_control", "type":"string"},
"cruize_control":{"cf":"software", "col":"cruize_control", "type":"string"},
"make":{"cf":"other", "col":"make", "type":"string"},
"model":{"cf":"other", "col":"model", "type":"string"},
"variant":{"cf":"other", "col":"variant", "type":"string"}
}
}""".stripMargin

val hbaseDf = withCatalog(carCatalog)
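
Once the load goes through, a quick sanity check on the mapped DataFrame looks something like this (the column names come straight from the catalog above):

// Inspect the schema produced by the catalog mapping and peek at a few rows.
hbaseDf.printSchema()
hbaseDf.select("vehicle_id", "make", "model").show(5, truncate = false)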

Issue:

java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
  at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:257)
  at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:80)
  at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:51)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)

I tried all the solutions on Stack Overflow: passing HBase's conf file in spark-submit, adding extra jars. Nothing worked.

The culprit is the json4s jars.

json4s jars newer than 3.2.10 will throw the above error while reading an HBase table as a DataFrame.
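
A quick way to see which json4s build Spark actually picked up (a hedged sanity check using plain JVM APIs, not one of the original steps) is to print the jar that the JsonMethods class was loaded from in the spark-shell:

// Print the location of the jar that provided org.json4s.jackson.JsonMethods.
println(org.json4s.jackson.JsonMethods.getClass.getProtectionDomain.getCodeSource.getLocation)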

Steps to fix it: take a backup of the below jars from the SPARK_HOME/jars folder.

json4s-ast_2.11-3.5.3.jar
json4s-core_2.11-3.5.3.jar
json4s-ext_2.11-3.5.3.jar
json4s-jackson_2.11-3.5.3.jar
json4s-native_2.11-3.5.3.jar

These are the already existing jars.

Remove all these jars.

Note: json4s-scalap_2.11-3.5.3.jar alone can remain.

Replace the deleted jars with the jars below.

json4s-ast_2.11-3.2.10.jar
json4s-core_2.11-3.2.10.jar
json4s-ext_2.11-3.2.10.jar
json4s-jackson_2.11-3.2.10.jar
json4s-native_2.11-3.2.10.jar

With the jars replaced, it works now.

Basically, any version of json4s other than 3.2.10 does not work when reading data from HBase; there is an integration problem between the two. I learned this from a post on the Cloudera website.

The table data is now visible.

You can now read from and write to HBase. :)
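
A write back through the same connector would look roughly like this. This is a hedged sketch based on the usual SHC write syntax (not something shown in my original session); HBaseTableCatalog.newTable is the SHC option that controls table creation with a given number of regions, and hbaseDf/carCatalog are the names defined earlier:

// Write the DataFrame back to HBase via the same SHC data source and catalog.
hbaseDf.write
  .options(Map(HBaseTableCatalog.tableCatalog -> carCatalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()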

Note: if you do this, you won't be able to read or write any DataFrame as a Parquet file. So pick a data source other than Parquet for saving this DataFrame, such as ORC, then read it from another Spark version and convert it to a Parquet or Delta table. I got around this by using Spark 2.4.5 to read the ORC output and convert it to Parquet to suit my data pipeline.
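
A minimal sketch of that workaround, assuming a hypothetical path /tmp/cardata_orc for the intermediate ORC data:

// In the patched Spark 2.4.0 shell (json4s 3.2.10 on the classpath): save as ORC, not Parquet.
hbaseDf.write.mode("overwrite").orc("/tmp/cardata_orc")

// In a separate, stock Spark 2.4.5 shell: read the ORC back and convert it to Parquet.
val orcDf = spark.read.orc("/tmp/cardata_orc")
orcDf.write.mode("overwrite").parquet("/tmp/cardata_parquet")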

Screenshots: trying to write to Parquet format, CSV format, and ORC format.

