
[bigdata-008] Dumping a BSON File into Hive [step by step] (BSON to JSON)



This article walks through [bigdata-008] dumping a BSON file into Hive [step by step], with particular attention to converting BSON to JSON. Through worked examples we aim to give you a fuller picture of the topic, and along the way we also touch on 005. Hive order by, distribute by, sort by, cluster by; android – MediaMetadataRetriever.setDataSource(Native Method) causing RuntimeException: status = 0x80000000; com.fasterxml.jackson.databind.JsonMappingException: Invalid UTF-8 start byte 0xb1; and Configure High Availability Cluster in CentOS 7 (Step by Step Guide).

Contents:

1. [bigdata-008] Dumping a BSON File into Hive [step by step] (BSON to JSON)
2. 005. Hive order by, distribute by, sort by, cluster by
3. android – MediaMetadataRetriever.setDataSource(Native Method) causes RuntimeException: status = 0x80000000
4. com.fasterxml.jackson.databind.JsonMappingException: Invalid UTF-8 start byte 0xb1
5. Configure High Availability Cluster in CentOS 7 (Step by Step Guide)

[bigdata-008] Dumping a BSON File into Hive [step by step] (BSON to JSON)

(Some names have been changed; the commands below are illustrative and not directly executable.)

1. Export the MongoDB data as a BSON file, for example a.bson.
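A minimal sketch of the export, assuming a hypothetical database mydb and collection events (neither is named in the original):

mongodump --db mydb --collection events --out dump/
# the BSON file lands at dump/mydb/events.bson; copy or rename it to a.bson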

 

2. MongoDB ships with a tool called bsondump that converts a BSON file to JSON: bsondump a.bson > a.json

    Each line of a.json is one complete record in JSON format, for example:

    {"_id":{"$oid":"09f89b8bb2"},"name":"WX","pageUrl":"start","time":{"$date":"2016-09-21"},"event":"page","userId":null,"createTime":{"$date":"2016-09-21T08:47:07.271Z"}}

 

3. Read a.json line by line with Python 3, transform each record into the desired format, and write the result to a text file a.txt, which will later be loaded into Hive. Illustrative code follows; adapt the process_data function to your actual needs.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import json


f_json = open('a.json', 'r')
f_txt = open('a.txt', 'w')


def process_data(json_obj):
    return str(json_obj)


while True:
    line = f_json.readline()
    if not line:  # readline() returns '' at EOF, never None
        break
    json_obj = json.loads(line)
    processed_str = process_data(json_obj)
    f_txt.write(processed_str + '\n')
f_txt.close()
f_json.close()
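As an illustration only (the field list below is guessed from the sample record in step 2, not prescribed by the original), a process_data that flattens a record into the '|'-separated layout described in step 4 below might look like:

def process_data(json_obj):
    # Assumed field list; adjust to your schema. Missing fields become
    # empty strings, which produces the empty '|' slots seen in step 4.
    fields = ['_id', 'userId', 'pageUrl', 'time', 'event', 'createTime']

    def flatten(value):
        if value is None:
            return ''
        if isinstance(value, dict):  # unwrap {"$oid": ...} / {"$date": ...}
            return str(next(iter(value.values()), ''))
        return str(value)

    return '|'.join(flatten(json_obj.get(f)) for f in fields)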

 

4. Each line of a.txt is one record, for example:

1||start|2016-0|page|2016-09-22||||2016-09-22

Here '|' is the field separator, and some fields may have no value.

 

5. Load the data in a.txt into Hive. Copy a.txt to a directory on the Hive cluster, run hive to enter the interactive shell, and then execute the following commands in order (illustrative; adapt to your situation):

create table t_1(id int,userId string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
load data local inpath 'a.txt' into table t_1;
select count(*) from t_1;
 

 

6. Create a partitioned table. After trying various approaches, creating a new partitioned table from an existing unpartitioned table turned out to be the simplest and fastest; every alternative was more complicated. Illustrative code follows (adapt to your situation):
(Run the following three commands in the Hive shell; they create table t2 from table t1, partitioned by createTimeDate.)
set hive.exec.dynamic.partition.mode=nonstrict;
create table t2(id int,userid string,page string,time timestamp,event string,createtime timestamp,param string) partitioned by (createtimedate date);
insert overwrite table t2 partition(createtimedate) select x.id,x.userid,x.page,x.time,x.event,x.createtime,x.param,x.createtimedate from t1 x;
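To confirm the dynamic partitions were actually created, you can list them afterwards (the date literal below is taken from the sample record and is illustrative; note that on older Hive releases you may also need set hive.exec.dynamic.partition=true; before the insert):

show partitions t2;
select count(*) from t2 where createtimedate = '2016-09-21';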

 

 

7. Notes on creating partitioned tables:
    In an "insert overwrite table" statement, the partition column of the new table (createTimeDate here) is bound to the last field produced by the select; Hive does not care what that column was called in the old table. For example, with "insert overwrite table t2 partition(createtimedate) select x.id,x.pageurl,x.eventtype,x.ctimedate from xloan_wx_np x;", the partitioning works as long as the source table has a ctimedate field of type date. If the partition clause names multiple columns, they are bound to the last several fields of the imported data, in order.
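For example (a sketch with a hypothetical table t3, not from the original), with two partition columns the last two selected fields are bound to them, in order:

create table t3(id int, userid string) partitioned by (createtimedate date, event string);
insert overwrite table t3 partition(createtimedate, event)
select x.id, x.userid, x.createtimedate, x.event from t1 x;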

005. Hive order by, distribute by, sort by, cluster by


Usage notes for order by, distribute by, sort by, cluster by

-- Sort the weather records by year and temperature, ensuring that all rows
-- with the same year end up in the same reducer partition.
-- (The table name records is assumed for illustration; the original omits the from clause.)

-- one reducer (very slow on large data sets)
select year, temperature
from records
order by year asc, temperature desc
limit 100;


-- multiple reducers (fast on large data sets)
select year, temperature
from records
distribute by year
sort by year asc, temperature desc
limit 100;



order by (global sort)
order by performs a global sort over its input, so only one reducer is used (multiple reducers cannot guarantee a global order).
With a single reducer, large inputs take a long time to compute.

Under hive.mapred.mode=strict, a limit clause is mandatory for order by; the point is to bound the amount of data the single reducer receives.
For example, with limit 100 and 50 map tasks, the reducer's input is at most 100*50 rows.
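A quick illustration (again assuming the records table):

set hive.mapred.mode=strict;

-- rejected: in strict mode an order by without limit fails to compile
select year, temperature from records order by year;

-- accepted: limit bounds the data the single reducer must handle
select year, temperature from records order by year limit 100;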



distribute by (similar to bucketing)
Rows are partitioned across the reducer output files according to the distribute by column(s).


sort by (similar to in-bucket sorting)
sort by is not a global sort; it sorts rows before they enter each reducer.
So with sort by and mapred.reduce.tasks > 1, each reducer's output is sorted, but the combined output is not globally ordered.



cluster by
cluster by combines the functionality of distribute by with that of sort by.
The sort order is fixed as ascending; you cannot specify asc or desc.

Hence the common shorthand: cluster by = distribute by + sort by.
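In other words (still assuming the records table), the following two queries behave the same, distributing rows by year and sorting ascending within each reducer:

select year, temperature from records distribute by year sort by year;
select year, temperature from records cluster by year;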




Reference: http://blog.csdn.net/jojo52013145/article/details/19199595
Reference: http://blog.sina.com.cn/s/blog_9f48885501017aib.html




android – MediaMetadataRetriever.setDataSource(Native Method) causes RuntimeException: status = 0x80000000

I am trying to use android.media.MediaMetadataRetriever to read some metadata from a media file in my Android app. Here is my code:

public long getDuration(String videoFilePath, Context context) {
    File file = loadVideoFile(videoFilePath);
    if (file == null) {
        return -1;
    }

    MediaMetadataRetriever retriever = new MediaMetadataRetriever();
    file.setReadable(true, false);
    retriever.setDataSource(file.getAbsolutePath());
    return getDurationProperty(retriever);
}

When I call the setDataSource method, it throws a RuntimeException:

09-10 15:22:25.576: D/PowerManagerService(486): releaseWakeLock(419aa2a0): cpu_MIN_NUM , tag=AbsListViewScroll_5.0, flags=0x400
09-10 15:22:26.481: I/HtcModeClient(12704): handler message = 4011
09-10 15:22:26.481: E/HtcModeClient(12704): Check connection and retry 9 times.
09-10 15:22:27.681: W/dalvikvm(13569): threadid=1: thread exiting with uncaught exception (group=0x40bc92d0)
09-10 15:22:27.696: E/AndroidRuntime(13569): FATAL EXCEPTION: main
09-10 15:22:27.696: E/AndroidRuntime(13569): java.lang.RuntimeException: setDataSource Failed: status = 0x80000000
09-10 15:22:27.696: E/AndroidRuntime(13569):    at android.media.MediaMetadataRetriever.setDataSource(Native Method)
09-10 15:22:27.696: E/AndroidRuntime(13569):    at android.media.MediaMetadataRetriever.setDataSource(MediaMetadataRetriever.java:66)

The strange thing is that this only fails on an HTC One X running Android 4.2.2. The application works on other devices with other Android versions (e.g. 4.2.1).

Edit:

Wow. Maybe it is down to a wrong dependency in my Maven build:

<dependency>
    <groupId>com.google.android</groupId>
    <artifactId>android</artifactId>
    <version>4.1.1.4</version>
    <scope>provided</scope>
</dependency>

But I cannot find a dependency for Android 4.2.2. Where can I find it?

Solution:

Opening the file yourself and passing a FileDescriptor seems to work better from API 10 onwards:

FileInputStream inputStream = new FileInputStream(file.getAbsolutePath());
retriever.setDataSource(inputStream.getFD());
inputStream.close();
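Not part of the original answer, but a defensive variant of the same idea that also releases resources when setDataSource throws (extractMetadata and release are standard MediaMetadataRetriever APIs; the method and parameter names are my own) might look like:

public long getDurationSafely(File file) {
    MediaMetadataRetriever retriever = new MediaMetadataRetriever();
    FileInputStream inputStream = null;
    try {
        inputStream = new FileInputStream(file.getAbsolutePath());
        retriever.setDataSource(inputStream.getFD());
        String ms = retriever.extractMetadata(
                MediaMetadataRetriever.METADATA_KEY_DURATION);
        return ms == null ? -1 : Long.parseLong(ms);
    } catch (Exception e) {
        // setDataSource throws RuntimeException (status = 0x80000000) on failure
        return -1;
    } finally {
        retriever.release();
        if (inputStream != null) {
            try { inputStream.close(); } catch (IOException ignored) { }
        }
    }
}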

com.fasterxml.jackson.databind.JsonMappingException: Invalid UTF-8 start byte 0xb1

On Windows, a Spring Boot application failed to handle a submitted JSON payload with "com.fasterxml.jackson.databind.JsonMappingException: Invalid UTF-8 start byte 0xb1".

Solution:

Add "-Dfile.encoding=UTF-8" to the launch command, like so:

java -Dfile.encoding=UTF-8 -jar xxxx.war
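If changing the launch command is not an option, another common angle (an assumption on my part, not from the original note) is to make sure the client really sends UTF-8-encoded JSON; the payload file and endpoint below are placeholders:

curl -H "Content-Type: application/json; charset=UTF-8" \
     --data-binary @payload.json http://localhost:8080/your-endpoint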

Configure High Availability Cluster in CentOS 7 (Step by Step Guide)


Written By - admin
January 2, 2024
Topics we will cover:
Features of Highly Available Clusters
What Is Pacemaker?
Bring up Environment
Configure NTP
Install pre-requisite rpms
Configure High Availability Cluster
Configure Corosync
Verify the cluster configuration
 
 
 

In my last article I explained the different kinds of clustering and their architecture. Before you start configuring a High Availability Cluster, you should be familiar with the basic terminology of clustering. In this article I will share a step by step guide to configure a high availability cluster on CentOS Linux 7 using 3 virtual machines. These virtual machines are running on my Oracle VirtualBox installed on my Linux server.

Still installing Linux manually?
I would recommend configuring one-click installation using a Network PXE Boot Server. With a PXE server you can install Oracle Virtual Machines, KVM-based Virtual Machines, or any type of physical server without manual intervention, saving time and effort.


NOTE:
The steps to configure a High Availability Cluster on Red Hat 7 are the same as on CentOS 7. On a RHEL system you must have an active subscription to RHN, or you can configure a local offline repository from which the "yum" package manager can install the provided rpm and its dependencies.

 

Features of Highly Available Clusters

The ClusterLabs stack, incorporating Corosync and Pacemaker, defines an open-source high-availability cluster offering suitable for both small and large deployments.

  • Detection and recovery of machine and application-level failures
  • Supports practically any redundancy configuration
  • Supports both quorate and resource-driven clusters
  • Configurable strategies for dealing with quorum loss (when multiple machines fail)
  • Supports application startup/shutdown ordering, regardless of which machine(s) the applications are on
  • Supports applications that must/must-not run on the same machine
  • Supports applications which need to be active on multiple machines
  • Supports applications with multiple modes (e.g. master/slave)
 

 

What Is Pacemaker?

We will use pacemaker and corosync to configure the High Availability Cluster. Pacemaker is a cluster resource manager: the logic responsible for the life-cycle of deployed software (indirectly, perhaps, even of whole systems or their interconnections) under its control within a set of computers (a.k.a. nodes), driven by prescribed rules.

It achieves maximum availability for your cluster services (a.k.a. resources) by detecting and recovering from node- and resource-level failures by making use of the messaging and membership capabilities provided by your preferred cluster infrastructure (either Corosync or Heartbeat), and possibly by utilizing other parts of the overall cluster stack.


 

Bring up Environment

First of all, before we start to configure the High Availability Cluster, let us bring up our virtual machines with CentOS 7. I am using Oracle VirtualBox; you can also install Oracle VirtualBox in a Linux environment. Below are my VMs' configuration details:

Properties              node1               node2               node3
OS                      CentOS 7            CentOS 7            CentOS 7
vCPU                    2                   2                   2
Memory                  2GB                 2GB                 2GB
Disk                    10GB                10GB                10GB
FQDN                    node1.example.com   node2.example.com   node3.example.com
Hostname                node1               node2               node3
IP Address (Internal)   10.0.2.20           10.0.2.21           10.0.2.22
IP Address (External)   DHCP                DHCP                DHCP

 

Edit the /etc/hosts file and add the IP address, followed by an FQDN and a short cluster node name for every available cluster node network interface.

 
bash
[root@node1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.0.2.20 node1.example.com node1
10.0.2.21 node2.example.com node2
10.0.2.22 node3.example.com node3

[root@node2 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.0.2.20 node1.example.com node1
10.0.2.21 node2.example.com node2
10.0.2.22 node3.example.com node3

[root@node3 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.0.2.20 node1.example.com node1
10.0.2.21 node2.example.com node2
10.0.2.22 node3.example.com node3

To finish, check and confirm connectivity among the cluster nodes. You can do this by simply issuing a ping command to every cluster node.
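For example, from node1:

bash
[root@node1 ~]# ping -c 2 node2.example.com
[root@node1 ~]# ping -c 2 node3.example.com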

 

 

Stop and disable NetworkManager on all the nodes

bash
[root@node1 ~]# systemctl disable NetworkManager
Removed symlink /etc/systemd/system/dbus-org.freedesktop.NetworkManager.service.
Removed symlink /etc/systemd/system/multi-user.target.wants/NetworkManager.service.
NOTE:
You must remove or disable the NetworkManager service because you will want to avoid any automated configuration of network interfaces on your cluster nodes.
After removing or disabling the NetworkManager service, you must restart the networking service.
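On CentOS 7 that means restarting the legacy network service, for example:

bash
[root@node1 ~]# systemctl restart network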

 

Configure NTP

To configure a High Availability Cluster it is important that all the nodes in the cluster are connected and synced to an NTP server. Since my machines are in the IST timezone I will use the India pool of NTP servers.
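As a sketch, assuming the stock /etc/ntp.conf shipped by the ntp package (install it with yum install ntp -y if it is missing), pointing at the India pool means replacing the default server lines with:

server 0.in.pool.ntp.org iburst
server 1.in.pool.ntp.org iburst
server 2.in.pool.ntp.org iburst
server 3.in.pool.ntp.org iburst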

bash
[root@node1 ~]# systemctl start ntpd
[root@node1 ~]# systemctl enable ntpd
Created symlink from /etc/systemd/system/multi-user.target.wants/ntpd.service to /usr/lib/systemd/system/ntpd.service.

 

Install pre-requisite rpms

The high availability packages are not part of the base CentOS repo, so you will need the epel-release repo.

bash
[root@node1 ~]# yum install epel-release -y

The pcs package provides the pacemaker configuration tool; installing it pulls in pacemaker, corosync and all their dependencies. The fence-agents-all package installs all the default fencing agents available for the Red Hat cluster stack.

bash
[root@node1 ~]# yum install pcs fence-agents-all -y

Add firewall rules

bash
[root@node1 ~]# firewall-cmd --permanent --add-service=high-availability; firewall-cmd --reload
success
success
NOTE:
If you are using iptables directly, or some other firewall solution besides firewalld, simply open the following ports: TCP ports 2224, 3121, and 21064, and UDP ports 5404 and 5405.
If you run into any problems during testing, you might want to disable the firewall and SELinux entirely until you have everything working. This may create significant security issues and should not be performed on machines that will be exposed to the outside world, but may be appropriate during development and testing on a protected host.
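For instance, with plain iptables the equivalent rules might look like this (a sketch; persist them with your distribution's usual mechanism):

bash
[root@node1 ~]# iptables -A INPUT -p tcp -m multiport --dports 2224,3121,21064 -j ACCEPT
[root@node1 ~]# iptables -A INPUT -p udp -m multiport --dports 5404,5405 -j ACCEPT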

 

 

Configure High Availability Cluster

The installed packages will create a hacluster user with a disabled password. While this is fine for running pcs commands locally, the account needs a login password in order to perform such tasks as syncing the corosync configuration, or starting and stopping the cluster on other nodes.

Set the password for the hacluster user on each cluster node using the following command. Here my password is password.

bash
[root@node1 ~]# echo password | passwd --stdin hacluster
Changing password for user hacluster.
passwd: all authentication tokens updated successfully.

Start and enable the pcs daemon (pcsd) on each node:

 
bash
[root@node1 ~]# systemctl enable --now pcsd
Created symlink from /etc/systemd/system/multi-user.target.wants/pcsd.service to /usr/lib/systemd/system/pcsd.service.

 

Configure Corosync

To configure corosync we only need to work on one of the nodes; use pcs cluster auth to authenticate as the hacluster user:

bash
[root@node1 ~]# pcs cluster auth node1.example.com node2.example.com node3.example.com
Username: hacluster
Password:
node2.example.com: Authorized
node1.example.com: Authorized
node3.example.com: Authorized
NOTE:
If you face any issues at this step, check your firewalld/iptables or selinux policy

Finally, run the following commands on the first node to create the cluster and start it. Here our cluster name will be mycluster.

bash
[root@node1 ~]# pcs cluster setup --start --name mycluster node1.example.com node2.example.com node3.example.com
Destroying cluster on nodes: node1.example.com, node2.example.com, node3.example.com...
node3.example.com: Stopping Cluster (pacemaker)...
node2.example.com: Stopping Cluster (pacemaker)...
node1.example.com: Stopping Cluster (pacemaker)...
node1.example.com: Successfully destroyed cluster
node2.example.com: Successfully destroyed cluster
node3.example.com: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'node1.example.com', 'node2.example.com', 'node3.example.com'
node1.example.com: successful distribution of the file 'pacemaker_remote authkey'
node2.example.com: successful distribution of the file 'pacemaker_remote authkey'
node3.example.com: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node1.example.com: Succeeded
node2.example.com: Succeeded
node3.example.com: Succeeded

Starting cluster on nodes: node1.example.com, node2.example.com, node3.example.com...
node2.example.com: Starting Cluster...
node1.example.com: Starting Cluster...
node3.example.com: Starting Cluster...

Synchronizing pcsd certificates on nodes node1.example.com, node2.example.com, node3.example.com...
node2.example.com: Success
node1.example.com: Success
node3.example.com: Success
Restarting pcsd on the nodes in order to reload the certificates...
node1.example.com: Success
node3.example.com: Success
node2.example.com: Success

Enable the cluster services, i.e. pacemaker and corosync, so that they start automatically on boot:

bash
[root@node1 ~]# pcs cluster enable --all
node1.example.com: Cluster Enabled
node2.example.com: Cluster Enabled
node3.example.com: Cluster Enabled

Lastly, check the cluster status:

bash
[root@node1 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: node2.example.com (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
 Last updated: Sat Oct 27 08:41:52 2018
 Last change: Sat Oct 27 08:41:18 2018 by hacluster via crmd on node2.example.com
 3 nodes configured
 0 resources configured

PCSD Status:
  node3.example.com: Online
  node1.example.com: Online
  node2.example.com: Online

To check the cluster's quorum status, use the corosync-quorumtool command:

bash
[root@node1 ~]# corosync-quorumtool
Quorum information
------------------
Date:             Sat Oct 27 08:43:22 2018
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          1
Ring ID:          1/8
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
         1          1 node1.example.com (local)
         2          1 node2.example.com
         3          1 node3.example.com

To get the live status of the cluster, use crm_mon:

bash
[root@node1 ~]# crm_mon
Connection to the CIB terminated

 

 

Verify the cluster configuration

Before we make any changes, it’s a good idea to check the validity of the configuration.

bash
[root@node1 ~]#  crm_verify -L -V
   error: unpack_resources:     Resource start-up disabled since no STONITH resources have been defined
   error: unpack_resources:     Either configure some or disable STONITH with the stonith-enabled option
   error: unpack_resources:     NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid

As you can see, the tool has found some errors.

In order to guarantee the safety of your data, fencing (also called STONITH) is enabled by default. However, the tool also knows when no STONITH configuration has been supplied and reports this as a problem (since the cluster will not be able to make progress if a situation requiring node fencing arises).

We will disable this feature for now and configure it later. To disable STONITH, set the stonith-enabled cluster option to false:

WARNING:
The use of stonith-enabled=false is completely inappropriate for a production cluster. It tells the cluster to simply pretend that failed nodes are safely powered off. Some vendors will refuse to support clusters that have STONITH disabled.
bash
[root@node1 ~]# pcs property set stonith-enabled=false
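You can confirm that the property took effect (output abbreviated; exact formatting varies by pcs version):

bash
[root@node1 ~]# pcs property list
Cluster Properties:
 stonith-enabled: false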

Next, re-validate the cluster:

bash
[root@node1 ~]# crm_verify -L -V

 

That is all about configuring a High Availability Cluster on Linux. Below are some more articles on clustering which you can use to understand cluster architecture, resource groups, resource constraints, etc.

 

ALSO READ
⇒ Understanding High Availability Cluster and Architecture
⇒ Understanding resource group and constraints in a Cluster with examples
⇒ How to configure HA LVM cluster resource to share LVM in Linux
⇒ How to create cluster resource in HA Cluster (with examples)
⇒ How to set up GFS2 with clustering on Linux (RHEL / CentOS 7)

 

 

Lastly, I hope the steps in this article to configure a high availability cluster on Linux were helpful. Let me know your suggestions and feedback in the comments section.

 

This concludes today's look at [bigdata-008] dumping a BSON file into Hive [step by step] (BSON to JSON). Thank you for your attention. To learn more about 005. Hive order by, distribute by, sort by, cluster by; android – MediaMetadataRetriever.setDataSource(Native Method) causing RuntimeException: status = 0x80000000; com.fasterxml.jackson.databind.JsonMappingException: Invalid UTF-8 start byte 0xb1; or Configure High Availability Cluster in CentOS 7 (Step by Step Guide), please search this site.
