The company's data is stored on HDFS, but model training needs to consume that data, so it has to be synchronized. The following is my personal data synchronization workflow; it is specific to our company's environment and may not work elsewhere.

Data synchronization from Hive to JupyterLab

View data file locations via Hive

The path to the database table can be viewed via Hive’s show create table statement.

show create table tmp_db.my_table_name;

Running it returns, among other details, the table's data file location, e.g.

  'viewfs://dcfs/user/hive/warehouse/tmp_db/my_table_name'

The actual HDFS path is /user/hive/warehouse/tmp_db/my_table_name; the viewfs://dcfs prefix can be dropped in the hdfs dfs commands that follow.
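You can sanity-check the path before copying anything; a minimal check using the path from the example above:

hdfs dfs -ls /user/hive/warehouse/tmp_db/my_table_name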

Synchronize data to JupyterLab via HDFS commands

Open a Terminal in JupyterLab and run the following command (the path is the one found via Hive in the previous step).

hdfs dfs -copyToLocal /user/hive/warehouse/tmp_db/my_table_name ./

Detailed explanation of hdfs dfs commands

common commands for viewing files

  • hdfs dfs -ls path view a list of files
  • hdfs dfs -ls -R path view the file list recursively (the older -lsr form is deprecated)
  • hdfs dfs -du path view disk usage under path, in bytes — see the examples below
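For example, applied to the warehouse path used earlier (a sketch; output depends on your cluster):

# list the table's database directory
hdfs dfs -ls /user/hive/warehouse/tmp_db
# recursive listing (modern replacement for -lsr)
hdfs dfs -ls -R /user/hive/warehouse/tmp_db
# disk usage; add -h for human-readable units
hdfs dfs -du -h /user/hive/warehouse/tmp_db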

create folder

  • hdfs dfs -mkdir path

Note: this command fails if the parent directory does not exist or if the directory already exists; add -p to create missing parents and to tolerate an existing directory. Directories created this way live on HDFS and are not visible in the local Linux file system. An example follows.
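A minimal sketch, assuming a hypothetical table path:

# without -p this would fail if tmp_db did not already exist;
# with -p, missing parents are created and an existing directory is tolerated
hdfs dfs -mkdir -p /user/hive/warehouse/tmp_db/new_table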

create file

  • hdfs dfs -touchz path

Note: this command creates a zero-length file and does not create missing parent directories, so the parent must already exist. Re-running it on an existing empty file succeeds, but it fails rather than overwrites when the file already has content. A short sketch follows.
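A short sketch (the /tmp/touchz_demo path is hypothetical):

# create the parent first, since -touchz will not do it
hdfs dfs -mkdir -p /tmp/touchz_demo
# create a zero-length marker file
hdfs dfs -touchz /tmp/touchz_demo/_SUCCESS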

copy files and directories

  • hdfs dfs -cp source-path destination-path (example below, together with -mv)

move files and directories

  • hdfs dfs -mv source-path destination-path
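A combined sketch of -cp and -mv with hypothetical paths:

# copy a table directory to a scratch location
hdfs dfs -cp /user/hive/warehouse/tmp_db/my_table_name /tmp/my_table_backup
# rename (move) it within HDFS
hdfs dfs -mv /tmp/my_table_backup /tmp/my_table_backup_old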

Give permissions

  • hdfs dfs -chmod [-R] mode path

Note: the owner[:group] syntax belongs to a different command, hdfs dfs -chown [-R] owner[:group] path, which changes ownership rather than permission bits. Examples of both follow.
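A sketch with made-up mode, owner, and group (changing ownership usually requires superuser privileges):

# make the table directory group-writable, recursively
hdfs dfs -chmod -R 775 /user/hive/warehouse/tmp_db/my_table_name
# ownership changes use chown, not chmod
hdfs dfs -chown -R alice:hive /user/hive/warehouse/tmp_db/my_table_name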

upload files

  • hdfs dfs -put source-path destination-path (see the sketch after this list)

Similar commands:

  • hdfs dfs -copyFromLocal source-path destination-path works like put, but the source must be a local path
  • hdfs dfs -moveFromLocal source-path destination-path deletes the local copy after upload
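A minimal upload sketch (local_data.csv is a hypothetical local file):

# -f overwrites an existing destination file
hdfs dfs -put -f local_data.csv /user/hive/warehouse/tmp_db/my_table_name/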

download files

  • hdfs dfs -get source-path target-path (see the sketch after this list)

Similar commands:

  • hdfs dfs -copyToLocal source-path target-path works like get, but the destination must be a local path
  • hdfs dfs -moveToLocal source-path target-path intended to delete the source after download (note that many Hadoop releases report this subcommand as not implemented)
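A download sketch, mirroring the copyToLocal example from the first section:

# fetch the whole table directory into the current working directory
hdfs dfs -get /user/hive/warehouse/tmp_db/my_table_name ./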

remove files

  • hdfs dfs -rm target-file
  • hdfs dfs -rm -r target-path recursive deletion, use with caution (the older -rmr form is deprecated)
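For example, cleaning up the hypothetical demo paths from above (deletion may be irreversible depending on your cluster's trash settings):

# remove a single file
hdfs dfs -rm /tmp/touchz_demo/_SUCCESS
# remove a directory tree (modern replacement for -rmr)
hdfs dfs -rm -r /tmp/touchz_demo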

Read synchronized files via Pandas

import pandas as pd
import glob

# Hive stores a table as a directory of data files; read them all
file_list = glob.glob('my_table_name/*')
df_list = []

for file in file_list:
    # Hive's default text format: '\001' (Ctrl-A) field delimiter,
    # no header row, NULL written literally as \N
    df_temp = pd.read_csv(file, sep='\001', header=None, na_values=['\\N'])
    df_list.append(df_temp)

# combine all part files into one DataFrame and save as a regular CSV
df = pd.concat(df_list, ignore_index=True)
df.to_csv('output.csv')

Notes:

  • sep is the field delimiter Hive uses when writing data files ('\001', i.e. Ctrl-A, by default)
  • na_values is the notation Hive uses for NULL values in its data files ('\N') — both can be confirmed by inspecting a raw file, as shown below
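With cat -A, the '\001' delimiter shows up as ^A and NULLs appear literally as \N (file names under the table directory vary, so the glob is just a sketch):

head -n 2 my_table_name/* | cat -A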

Data synchronization from JupyterLab to the Hive warehouse

Write files from JupyterLab to HDFS via hdfs commands

hdfs dfs -mkdir /user/hive/warehouse/tmp_db/poi_info
hdfs dfs -copyFromLocal -f poi_info.csv /user/hive/warehouse/tmp_db/poi_info/
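To verify the upload before creating the table, list the target directory (paths as in the commands above):

hdfs dfs -ls /user/hive/warehouse/tmp_db/poi_info/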

Create a table and associate it with the data via Hive

create external table if not exists tmp_db.poi_info
(
    poi_id bigint,
    poi_name string,
    lat string,
    lon string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
LOCATION '/user/hive/warehouse/tmp_db/poi_info/';

Notes:

  • ROW FORMAT DELIMITED FIELDS TERMINATED BY must match the field delimiter actually used in the uploaded file. The '\001' above assumes the file was written with Hive's default delimiter (for example via to_csv(..., sep='\001', header=False, index=False) in pandas); a standard comma-separated poi_info.csv would instead need ','.