The company’s data is stored on HDFS, but model training runs in JupyterLab and needs this data, so the data has to be synchronized between the two. The following is my personal synchronization workflow. It reflects my company’s environment and may not work elsewhere without changes.
Data synchronization from Hive to JupyterLab
View data file locations via Hive
The path to a table’s data files can be viewed with Hive’s show create table statement.
Running it prints the table definition, which includes the data file path, e.g.
The actual location is: /user/hive/warehouse/tmp_db/my_table_name
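As a minimal sketch of pulling that path out of the statement's output, the snippet below parses the LOCATION clause with a regular expression. The sample DDL text, the column names, and the namenode address are assumptions made up for illustration; only the table path comes from the example above.

```python
import re
from urllib.parse import urlparse

# Hypothetical output of `show create table tmp_db.my_table_name`.
# The columns and the namenode host/port are invented for this sketch.
SAMPLE_DDL = """CREATE TABLE `tmp_db.my_table_name`(
  `id` bigint,
  `name` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
LOCATION
  'hdfs://namenode:8020/user/hive/warehouse/tmp_db/my_table_name'
"""


def extract_location(ddl):
    """Return the path from the LOCATION clause, without scheme and host."""
    match = re.search(r"LOCATION\s+'([^']+)'", ddl)
    if match is None:
        raise ValueError("no LOCATION clause found in the DDL text")
    # urlparse drops the hdfs://host:port prefix and keeps the path part.
    return urlparse(match.group(1)).path


print(extract_location(SAMPLE_DDL))
```

The returned path is the one to pass to the hdfs dfs commands in the next step.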
Synchronize data to JupyterLab via HDFS commands
Open a terminal in JupyterLab and enter the following command (the path is the one returned by Hive in the previous step).
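The same download can also be driven from a notebook cell. The sketch below assumes the hdfs client is on the PATH of the JupyterLab server. The helper names and the local ./data directory are hypothetical, and the command it builds is the standard hdfs dfs -get.

```python
import shlex
import subprocess


def build_get_command(hdfs_path, local_dir):
    """Build the `hdfs dfs -get` argument list (hypothetical helper)."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_dir]


def download_from_hdfs(hdfs_path, local_dir):
    """Copy an HDFS file or directory to the local file system."""
    # check=True raises CalledProcessError if the hdfs client reports a failure.
    subprocess.run(build_get_command(hdfs_path, local_dir), check=True)


# Show the command that would be run; ./data is an assumed local target.
cmd = build_get_command("/user/hive/warehouse/tmp_db/my_table_name", "./data")
print(shlex.join(cmd))
```

Calling download_from_hdfs with the same two arguments performs the actual copy.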
hdfs dfs command detailed explanation
common commands for viewing and creating files
- hdfs dfs -ls path View a list of files
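- hdfs dfs -ls -R path View file list recursively (the older -lsr form is deprecated)
- hdfs dfs -du path View disk usage under path, in bytes
- hdfs dfs -mkdir path Create a directory
Note: Add -p to create missing parent directories. Without -p, creating a directory that already exists fails. Directories created this way live on HDFS and are not visible in the local Linux file system.
- hdfs dfs -touchz path Create an empty file
Note: This command does not create missing parent directories, so it fails when the file’s parent directory does not exist. It can be run repeatedly on an empty file, but according to the Hadoop documentation it returns an error if the file already exists with non-zero length.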
copy files and directories
- hdfs dfs -cp source directory destination directory
move files and directories
- hdfs dfs -mv source directory target directory
- hdfs dfs -chmod [-R] mode path Change permissions
- hdfs dfs -chown [-R] [owner][:[group]] path Change owner and group
- hdfs dfs -put source folder destination folder
- hdfs dfs -copyFromLocal source folder target folder Works as put
- hdfs dfs -moveFromLocal source folder target folder delete local after upload
- hdfs dfs -get source folder target folder Download from HDFS to the local file system
- hdfs dfs -copyToLocal source folder target folder works like get
- hdfs dfs -moveToLocal source folder target folder delete source file after get (the Hadoop documentation lists this command as not implemented yet)
- hdfs dfs -rm target file
- hdfs dfs -rm -r target file Recursive deletion (use with caution; the older -rmr form is deprecated)
Read synchronized files via Pandas
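- sep is the field delimiter used in the data files stored on HDFS (Hive text tables default to \x01, i.e. Ctrl-A, unless the table specifies another delimiter)
- na_values is the marker that represents NULL values in the HDFS files (\N by default in Hive text tables)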
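A minimal sketch of the read step, assuming the table was downloaded as a directory of headerless text files and uses Hive's default delimiter and NULL marker. The function name, the column names passed in, and the rule for skipping marker files such as _SUCCESS are assumptions for illustration.

```python
import glob
import os

import pandas as pd


def read_hive_text_table(local_dir, columns, sep="\x01", null_marker="\\N"):
    """Read every data file in a downloaded table directory into one DataFrame."""
    paths = sorted(
        p
        for p in glob.glob(os.path.join(local_dir, "*"))
        # Skip subdirectories, empty files, and marker files such as _SUCCESS.
        if os.path.isfile(p)
        and os.path.getsize(p) > 0
        and not os.path.basename(p).startswith("_")
    )
    frames = [
        # Hive text files carry no header row, so column names are supplied here.
        pd.read_csv(p, sep=sep, header=None, names=columns, na_values=[null_marker])
        for p in paths
    ]
    return pd.concat(frames, ignore_index=True)
```

For example, read_hive_text_table("./data/my_table_name", ["id", "name"]) would load the table synchronized earlier, given that it has those two columns.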
Data synchronization from JupyterLab to the Hive warehouse
Write files from Jupyter to HDFS via the hdfs command
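One way this step might look from a notebook is sketched below: the DataFrame is written as a headerless delimited file, and the matching hdfs dfs -put command is built. The helper names and the choice of a comma delimiter with \N for missing values are assumptions, not necessarily the exact settings used here.

```python
import pandas as pd


def export_for_hive(df, local_path, sep=","):
    """Write a DataFrame as a headerless delimited text file for a Hive text table."""
    # No header or index: Hive would read them as data rows and columns.
    # Missing values are written as \N, Hive's default NULL marker.
    # Note: pandas quotes fields that contain the delimiter, which a plain
    # DELIMITED table does not understand, so such values need cleaning first.
    df.to_csv(local_path, sep=sep, header=False, index=False, na_rep="\\N")


def build_put_command(local_path, hdfs_dir):
    """Build the `hdfs dfs -put` argument list (hypothetical helper)."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]
```

The list returned by build_put_command can be passed to subprocess.run(..., check=True), or the equivalent command can be typed in the terminal.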
Create a table and associate it with the data via Hive
- ROW FORMAT DELIMITED FIELDS TERMINATED BY specifies the field delimiter, which must match the delimiter used in the CSV file.
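As a rough sketch, the statement could be generated from Python as shown below. It assumes an external text table whose LOCATION points at the HDFS directory holding the uploaded file; the original statement may instead use a managed table or LOAD DATA. The function name and all table, column, and path values in the usage line are hypothetical.

```python
def build_create_table_sql(table, columns, location, sep=","):
    """Build a HiveQL CREATE EXTERNAL TABLE statement for delimited text data.

    `columns` is a list of (name, hive_type) pairs. The delimiter must match
    the one used when the file was written.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{sep}'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table name, schema, and HDFS directory.
print(build_create_table_sql(
    "tmp_db.my_new_table",
    [("id", "bigint"), ("name", "string")],
    "/user/hive/warehouse/tmp_db/my_new_table",
))
```

The printed statement can then be run in the Hive client.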