Spark — HBase integration
Hey folks,
Thought of sharing the solution to an issue that took me about a week to figure out.
HBase is a NoSQL database that runs on top of Hadoop. Huge amounts of data are generated every minute, some of it without a schema, so it can be stored in a NoSQL database and refined later for data science purposes.
Ask: read an HBase table as a DataFrame in Spark.
Spark version: 2.4.0 for the HBase integration; 2.4.5 for the ORC/CSV-to-Parquet conversion
HBase version: 1.4.12
I used the common approach: a catalog for the column mapping and a read with HBase as the format, pulling in the Hortonworks (now Cloudera) SHC connector dependency, as described at the link below.
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-using-spark-query-hbase
Code:
===
spark-shell --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11,com.hortonworks:shc:1.1.1-2.1-s_2.11 --repositories https://repository.apache.org/content/repositories/releases
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

def withCatalog(catalog: String): DataFrame =
  spark.read
    .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
val carCatalog = """{
  "table":{"namespace":"default", "name":"cardata"},
  "rowkey":"key",
  "columns":{
    "vehicle_id":{"cf":"rowkey", "col":"key", "type":"string"},
    "alloy_wheels":{"cf":"hardware", "col":"alloy_wheels", "type":"string"},
    "anti_Lock_break":{"cf":"hardware", "col":"anti_Lock_break", "type":"string"},
    "electronic_breakforce_distribution":{"cf":"software", "col":"electronic_breakforce_distribution", "type":"string"},
    "terrain_mode":{"cf":"software", "col":"terrain_mode", "type":"string"},
    "traction_control":{"cf":"software", "col":"traction_control", "type":"string"},
    "stability_control":{"cf":"software", "col":"stability_control", "type":"string"},
    "cruize_control":{"cf":"software", "col":"cruize_control", "type":"string"},
    "make":{"cf":"other", "col":"make", "type":"string"},
    "model":{"cf":"other", "col":"model", "type":"string"},
    "variant":{"cf":"other", "col":"variant", "type":"string"}
  }
}"""
val hbaseDf = withCatalog(carCatalog)
Issue:
java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:257)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:80)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:51)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
I tried all the solutions on Stack Overflow: passing HBase's conf file in spark-submit, adding extra jars. Nothing worked.
The issue is the json4s jars. json4s versions above 3.2.10 throw the above error while reading an HBase table as a DataFrame.
Steps to fix it: first, take a backup of the following jars from the SPARK_HOME/jars folder.
json4s-ast_2.11-3.5.3.jar
json4s-core_2.11-3.5.3.jar
json4s-ext_2.11-3.5.3.jar
json4s-jackson_2.11-3.5.3.jar
json4s-native_2.11-3.5.3.jar
Remove all these jars.
Note: json4s-scalap_2.11-3.5.3.jar alone can remain.
Replace the deleted jars with the jars below.
json4s-ast_2.11-3.2.10.jar
json4s-core_2.11-3.2.10.jar
json4s-ext_2.11-3.2.10.jar
json4s-jackson_2.11-3.2.10.jar
json4s-native_2.11-3.2.10.jar
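The swap above can be sketched as shell commands. This is a minimal sketch, assuming SPARK_HOME is set and that you have already downloaded the 3.2.10 jars (the download directory here is a hypothetical example, not from the original post):

```shell
#!/bin/sh
# Back up the json4s 3.5.3 jars shipped with Spark, then remove them.
# json4s-scalap_2.11-3.5.3.jar is deliberately left in place.
mkdir -p "$HOME/json4s-backup"
for mod in ast core ext jackson native; do
  mv "$SPARK_HOME/jars/json4s-${mod}_2.11-3.5.3.jar" "$HOME/json4s-backup/"
done

# Copy in the 3.2.10 versions, e.g. fetched from Maven Central beforehand.
# /path/to/downloads is an assumed location; adjust for your setup.
for mod in ast core ext jackson native; do
  cp "/path/to/downloads/json4s-${mod}_2.11-3.2.10.jar" "$SPARK_HOME/jars/"
done
```

Restart spark-shell after the swap so the new jars are picked up.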
Basically, any version of json4s other than 3.2.10 fails when reading data from HBase this way; there is a compatibility problem between the two. I learned this from a post on the Cloudera website.
You can read/write from/to hbase now. :)
Note: once you do this, you won't be able to read or write any DataFrame as a Parquet file on that Spark install. So save this DataFrame in another format, such as ORC, then read it from another Spark version and convert it to Parquet or a Delta table. I overcame this by reading the ORC output in Spark 2.4.5 and converting it to Parquet to suit my data pipeline.
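The two-cluster workaround can be sketched like this; the HDFS paths are illustrative examples, not the ones from my pipeline:

```scala
// Step 1 — on the Spark 2.4.0 install (json4s downgraded to 3.2.10):
// read from HBase and stage as ORC, since Parquet writes fail here.
val hbaseDf = withCatalog(carCatalog)
hbaseDf.write.mode("overwrite").orc("/staging/cardata_orc")

// Step 2 — on a Spark 2.4.5 install with its stock json4s jars:
// read the staged ORC and convert it to Parquet for downstream use.
val staged = spark.read.orc("/staging/cardata_orc")
staged.write.mode("overwrite").parquet("/warehouse/cardata_parquet")
```

`mode("overwrite")` is just a convenience for re-runs; drop it if you append incrementally.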