Installation And Configuration Of Hadoop Definition

10/27/2017

Partitioning in Hive (Hadoop Online Tutorials)

In this post we discuss one of the most important concepts in Hive: partitioning of Hive tables.

Table partitioning means dividing table data into parts based on the values of particular columns, such as date or country, so that the input records are segregated into different files and directories by date or country.

Partitioning can be based on more than one column, which imposes a multi-dimensional structure on the directory storage. For example, in addition to partitioning log records by a date column, we can subdivide each day's records into separate country-wise files by including a country column in the partitioning. We will see more about this in the examples.

Partitions are defined at table creation time using the PARTITIONED BY clause, with a list of column definitions for partitioning.

Syntax

CREATE [EXTERNAL] TABLE tablename (colname1 datatype1, ...)
PARTITIONED BY (colnamen datatypen [COMMENT colcomment], ...);

As shown in the syntax, we can also add comments to partition columns.

Advantages

Partitioning distributes execution load horizontally. Because the data is stored in slices, a query that touches only a small part of the data responds faster than one that has to search the entire data set. For example, in a large user table partitioned by country, selecting the users of country IN scans only the one directory country=IN instead of all the directories.

Limitations

Having too many partitions in a table creates a large number of files and directories in HDFS, which is an overhead for the NameNode, since it must keep all metadata for the file system in memory.
Partitions may optimize some queries based on WHERE clauses, but may be less responsive for other important queries that use grouping clauses.

In MapReduce processing, a huge number of partitions leads to a huge number of tasks in each MapReduce job, each running in a separate JVM, which creates a lot of overhead for JVM start-up and tear-down. For small files, a separate task is used for each file. In the worst case, the overhead of JVM start-up and tear-down can exceed the actual processing time.

Example Scenarios

1. Partitioning is used in real-world log file analysis to segregate records by timestamp or date value, so that day-wise results can be seen quickly.
2. Customer or user details are partitioned by country, state or department for fast retrieval of the subset of data belonging to one category.
3. Sales records partitioned by product type, country, year and month are another commonly used scenario.

In this post we will try examples of use case 2.

Sample Use Case

Let's explore the other features of partitions with a sample use case: loading user records into Hive and running some queries.

Sample user records file for testing in this post (UserRecords):

firstname,lastname,address,country,city,state,post,phone.
Rebbecca,Didio,1. E 2. 4th St,AU,Leith,TA,7.

Observation of Input Data

The input data has the following fields or columns: First Name, Last Name, Address, Country, City, State, Postal Code, Phone Number, Alternative Phone Number, Email Id, Website URL.

Conveniently, each field is separated by a comma (,) and no field contains a comma in its value.

Our requirement is to create a Hive table partitioneduser, partitioned by country and state, and to load these input records into it.
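
Before turning to the user table, the first scenario (log records partitioned by date and then subdivided by country) can be sketched in HiveQL. This is only an illustration: the table name weblogs, its columns, the comma-delimited text format and the date value are assumptions that do not appear in the original post.

```sql
-- Hypothetical log table; all names and types are chosen for illustration.
CREATE EXTERNAL TABLE weblogs (
  ip     STRING,
  url    STRING,
  status INT
)
PARTITIONED BY (
  logdate STRING COMMENT 'day of the request',
  country STRING COMMENT 'country code of the client'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Filtering on both partition columns lets Hive read only the matching
-- logdate=.../country=... directory instead of scanning the whole table.
SELECT url, COUNT(*) AS hits
FROM weblogs
WHERE logdate = '2017-10-27' AND country = 'IN'
GROUP BY url;
```

The order of the columns in PARTITIONED BY determines the nesting of the directories, so here each logdate directory would contain one subdirectory per country.
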
Creation of Partition Table

Managed Partitioned Table

Below is the HiveQL to create the managed partitioneduser table as per the above requirements.

CREATE TABLE partitioneduser (
  firstname VARCHAR(64),
  lastname  VARCHAR(64),
  address   STRING,
  city      VARCHAR(64),
  post      STRING,
  phone1    VARCHAR(64),
  phone2    STRING,
  email     STRING,
  web       STRING
)
PARTITIONED BY (country VARCHAR(64), state VARCHAR(64))
STORED AS SEQUENCEFILE;

Note that we did not include the country and state columns in the table definition, only in the partition definition. If we include them in both places, Hive rejects the statement (error scenario 1).

We can verify the partition columns of the table with the command below.

hive> DESCRIBE FORMATTED partitioneduser;

The partition columns country and state can be used in the WHERE clause of a query and treated like regular column names, even though no such columns exist inside the input file data.

External Partitioned Tables

We can create external partitioned tables as well, just by using the EXTERNAL keyword in the CREATE statement. For external partitioned tables we do not need to specify the LOCATION clause, because we specify the location of each partition separately when adding data to the table.

Inserting Data Into Partitioned Tables

Data can be inserted into partitioned tables in two modes:

Static Partitioning
Dynamic Partitioning

Static Partitioning in Hive

In this mode, the input data should contain only the columns listed in the table definition (for example, firstname, lastname, address, city, post, phone), and not the partition columns country and state.
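
Returning briefly to the point that partition columns behave like regular columns: once partitioneduser exists and holds data, this can be checked with a sketch like the following. The filter values are only examples, and the column names are the ones listed in the text.

```sql
-- List the partitions Hive currently knows about
-- (the result is empty until data has been loaded).
SHOW PARTITIONS partitioneduser;

-- country and state are not stored in the data files, yet they can be
-- selected and filtered like ordinary columns.
SELECT firstname, lastname, city, country, state
FROM partitioneduser
WHERE country = 'US' AND state = 'CA';
```
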
If our input column layout matches the expected layout and we already have a separate input file for each combination of partition key values, such as one file for country=US and state=CA, these files can easily be loaded into the partitioned table with the syntax below.

Loading Data into Managed Partitioned Table From Local FS

Example

For example, take the 3 records below, which do not contain the partition columns, and save them into staticinput. Assume that all these records belong to country=US and state=CA.

Rebbecca,Didio,1. E 2. 4th St,Leith,7.
Stevie,Hallo,2. 22. Acoma St,Proston,4.
Mariko,Stayer,5. 34 Schoenborn St 5. Hamel,6. 21. 5,0.

This file can now be loaded into the partitioned table by specifying the country and state values at load time.

hive> LOAD DATA LOCAL INPATH '${env:HOME}/staticinput'
      INTO TABLE partitioneduser
      PARTITION (country = 'US', state = 'CA');

LOAD DATA copies the file as it is, without converting it, so the file must already be in the table's storage format (SEQUENCEFILE for the table defined above).

The load creates a separate directory under the default warehouse directory in HDFS:

/user/hive/warehouse/partitioneduser/country=US/state=CA

Similarly, we have to add the other partitions, which creates the corresponding directories in HDFS. Alternatively, we can load the entire directory into the Hive table with a single command and add partitions for each file with the ALTER command.

hive> LOAD DATA LOCAL INPATH '${env:HOME}/inputdir'
      INTO TABLE partitioneduser;

Loading Partition From Other Table

We can load or add partitions with query results from another table, as shown below.

hive> INSERT OVERWRITE TABLE partitioneduser
      PARTITION (country = 'US', state = 'AL')
      SELECT * FROM anotheruser au
      WHERE au.country = 'US' AND au.state = 'AL';

With a static PARTITION clause, the SELECT must return exactly the non-partition columns of partitioneduser. Since anotheruser exposes country and state as columns, SELECT * returns them as well, so in practice the columns to insert should be listed explicitly instead of using *.

Overwriting Existing Partition

We can overwrite an existing partition by writing OVERWRITE INTO TABLE partitioneduser in the LOAD statement instead of INTO TABLE partitioneduser.

Loading Data into External Partitioned Table From HDFS

There is an alternative for bulk loading of partitions into a Hive table. As the data is already present in HDFS and only needs to be made accessible to Hive, we just specify the location of the HDFS files for each partition. If our files are on the local FS, they can be moved to a directory in HDFS, and we can add a partition for each file in that directory with commands similar to the one below.

hive> ALTER TABLE partitioneduser ADD PARTITION (country = 'US', state = 'CA')
      LOCATION '/hive/external/tables/user/country=us/state=ca';
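
The external-table route described above might look like the following sketch. The table name partitioneduser_ext, the shortened column list, the HDFS paths under /data/users and the comma-delimited text format are all assumptions made for illustration; none of them comes from the original post.

```sql
-- Sketch only: table name, columns, paths and file format are hypothetical.
-- No LOCATION clause is given here; each partition gets its own location.
CREATE EXTERNAL TABLE partitioneduser_ext (
  firstname VARCHAR(64),
  lastname  VARCHAR(64),
  city      VARCHAR(64),
  email     STRING
)
PARTITIONED BY (country VARCHAR(64), state VARCHAR(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Register directories that already exist in HDFS as partitions.
-- No data is moved; Hive only records each location in its metastore.
ALTER TABLE partitioneduser_ext ADD IF NOT EXISTS
  PARTITION (country = 'US', state = 'CA') LOCATION '/data/users/country=us/state=ca'
  PARTITION (country = 'US', state = 'AL') LOCATION '/data/users/country=us/state=al';

-- Confirm that both partitions are now visible to Hive.
SHOW PARTITIONS partitioneduser_ext;
```

Because the table is external, dropping a partition or the table itself removes only the metadata by default and leaves the files in HDFS untouched, which is usually the reason for choosing this route when other tools also read the same data.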
