JupyterLab and Hive Data Synchronization Process

The company's data is stored on HDFS, and model training in JupyterLab needs that data, so it has to be synchronized between the two. The process below is my personal workflow. It is specific to the company's environment and may not work elsewhere without changes.

Data synchronization from Hive to JupyterLab

View data file locations via Hive

The path to the database table can be viewed via Hive’s show create table statement.

`show create table tmp_db.my_table_name;`

The output includes the path to the table's data files in its LOCATION clause, e.g.

`'viewfs://dcfs/user/hive/warehouse/tmp_db/my_table_name'`

The actual location is: /user/hive/warehouse/tmp_db/my_table_name
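
The step from the reported location to the actual path can also be done programmatically. The following is a minimal sketch, assuming the location always has the form scheme://authority/path as in the example above. The function name hdfs_path_from_location is hypothetical and not part of Hive or Hadoop.

```python
from urllib.parse import urlparse


def hdfs_path_from_location(location: str) -> str:
    """Return the path component of a Hive LOCATION value.

    Hypothetical helper: strips the surrounding quotes, then drops the
    scheme (viewfs) and the authority (dcfs), keeping only the path.
    """
    cleaned = location.strip().strip("'\"")
    return urlparse(cleaned).path


location = "'viewfs://dcfs/user/hive/warehouse/tmp_db/my_table_name'"
print(hdfs_path_from_location(location))
```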

Synchronize data to JupyterLab via HDFS commands

Open a Terminal in JupyterLab and enter the following command (the path is the one obtained from Hive in the previous step).

`hdfs dfs -copyToLocal /user/hive/warehouse/tmp_db/my_table_name ./`
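
The same copy can be started from a notebook cell instead of the Terminal. This is only a sketch, assuming the hdfs client is on the PATH of the JupyterLab kernel. The helper names hdfs_dfs_args and copy_to_local are made up for illustration.

```python
import shutil
import subprocess
from typing import List


def hdfs_dfs_args(*args: str) -> List[str]:
    """Build the argument list for an `hdfs dfs` call (hypothetical helper)."""
    return ["hdfs", "dfs", *args]


def copy_to_local(hdfs_path: str, local_dir: str = "./") -> None:
    """Run `hdfs dfs -copyToLocal` and raise if the command fails."""
    # Assumption: the hdfs client is installed where the kernel runs.
    if shutil.which("hdfs") is None:
        raise RuntimeError("hdfs client not found on PATH")
    subprocess.run(hdfs_dfs_args("-copyToLocal", hdfs_path, local_dir), check=True)


# The same command as the Terminal example, shown as an argument list.
print(hdfs_dfs_args("-copyToLocal", "/user/hive/warehouse/tmp_db/my_table_name", "./"))
```

Calling copy_to_local('/user/hive/warehouse/tmp_db/my_table_name') would then copy the table directory into the current working directory.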

hdfs dfs command reference

common commands for viewing files

hdfs dfs -ls path  # list the files under path
hdfs dfs -ls -R path  # list files recursively (replaces the deprecated -lsr)
hdfs dfs -du path  # show the disk usage of each entry under path, in bytes

create folder

hdfs dfs -mkdir path

Note: Add -p to create missing parent directories as well. Without -p, the command fails if the directory already exists. The directory is created in HDFS, so it is not visible in the local Linux file system.

create file

hdfs dfs -touchz path

Note: This command creates an empty (zero-length) file and does not create parent directories, so it fails when the file's parent directory does not exist. It can be run again on an existing empty file, but it returns an error if the file already exists with content.

copy files and directories

hdfs dfs -cp source directory destination directory

move files and directories

hdfs dfs -mv source directory target directory

change permissions and ownership

hdfs dfs -chmod [-R] mode path
hdfs dfs -chown [-R] [owner][:[group]] path

upload files

hdfs dfs -put source folder destination folder

Similar commands:

hdfs dfs -copyFromLocal source folder target folder  # works like put
hdfs dfs -moveFromLocal source folder target folder  # deletes the local copy after upload

download files

hdfs dfs -get source folder target folder

Similar commands:

hdfs dfs -copyToLocal source folder target folder  # works like get
hdfs dfs -moveToLocal source folder target folder  # meant to delete the source after download, but Hadoop only reports "Not implemented yet"

remove files

hdfs dfs -rm target file
hdfs dfs -rm -r target directory  # recursive deletion, replaces the deprecated -rmr (use with caution)

Read synchronized files via Pandas

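import glob

import pandas as pd

# Each file in the synchronized table directory holds part of the table
file_list = sorted(glob.glob('my_table_name/*'))
df_list = []

for file in file_list:
    # Hive text files use \001 as the field delimiter and \N for NULL
    df_temp = pd.read_csv(file, sep='\001', header=None, na_values=['\\N'])
    df_list.append(df_temp)

# Combine the parts and save them as one CSV without the index column
df = pd.concat(df_list, ignore_index=True)
df.to_csv('output.csv', index=False)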

Notes:

sep is the field delimiter of the data files; Hive's default for text tables is \001 (Ctrl-A)
na_values is the marker Hive writes for NULL values in text files, \N by default
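
A table directory copied from HDFS may also contain empty marker files such as _SUCCESS, and pd.read_csv raises an error on a zero-byte file. The sketch below shows one way to skip them and to attach column names, since the data files carry no header. The function load_hive_text_dir, the file name 000000_0, and the sample rows are made up for illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def load_hive_text_dir(directory, names=None):
    """Load every non-empty data file in a synchronized table directory."""
    frames = []
    for path in sorted(glob.glob(os.path.join(directory, "*"))):
        base = os.path.basename(path)
        # Skip marker files such as _SUCCESS and anything that is empty,
        # because pd.read_csv raises EmptyDataError on a zero-byte file.
        if base.startswith("_") or not os.path.isfile(path):
            continue
        if os.path.getsize(path) == 0:
            continue
        frames.append(
            pd.read_csv(path, sep="\x01", header=None, names=names, na_values=["\\N"])
        )
    if not frames:
        raise FileNotFoundError(f"no data files found in {directory}")
    return pd.concat(frames, ignore_index=True)


# Made-up sample data standing in for a synchronized table directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "000000_0"), "w") as f:
        f.write("1\x01Cafe\n2\x01\\N\n")
    open(os.path.join(tmp, "_SUCCESS"), "w").close()
    df = load_hive_text_dir(tmp, names=["poi_id", "poi_name"])

print(df)
```

In the sample, the second row's \N is read back as a missing value, and the empty _SUCCESS file is ignored.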

Data synchronization from JupyterLab to the Hive warehouse

Write files from Jupyter to HDFS via the hdfs command

hdfs dfs -mkdir /user/hive/warehouse/tmp_db/poi_info
hdfs dfs -copyFromLocal -f poi_info.csv /user/hive/warehouse/tmp_db/poi_info/

Create a table and associate it with the data via Hive

create external table if not exists tmp_db.poi_info
(
    poi_id bigint,
    poi_name string,
    lat string,
    lon string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
LOCATION '/user/hive/warehouse/tmp_db/poi_info/';
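
The table definition above reads fields separated by \001, so the uploaded file has to use that layout. A file written with pandas' default to_csv settings (comma-separated, with a header row and an index column) would not match it. The following is a minimal sketch of writing a matching file, assuming the data is in a DataFrame whose columns follow the table definition. The helper to_hive_text and the sample rows are made up for illustration.

```python
import pandas as pd


def to_hive_text(df: pd.DataFrame) -> str:
    """Serialize a DataFrame in Hive's default text layout (hypothetical helper).

    No header or index, \\001 between fields, and \\N for missing values.
    """
    return df.to_csv(sep="\x01", header=False, index=False, na_rep="\\N")


# Made-up rows; the column order follows the table definition.
df = pd.DataFrame(
    {
        "poi_id": [1, 2],
        "poi_name": ["Cafe", None],
        "lat": ["39.9", "40.0"],
        "lon": ["116.4", "116.5"],
    }
)

text = to_hive_text(df)
print(repr(text))

# To produce the file for upload, write the same layout to disk:
# df.to_csv("poi_info.csv", sep="\x01", header=False, index=False, na_rep="\\N")
```

The file has to be written this way before it is uploaded with hdfs dfs -copyFromLocal.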