ClickHouse系列教程 8. 数据字典相关内容

clickhouse

发布日期: 2021-05-03

文章字数: 8k

阅读时长: 34 分

阅读次数:

1. 数据字典介绍

数据字典是ClickHouse提供的一种非常简单、实用的存储媒介，它以键值和属性映射的形式定义数据。
字典中的数据会主动或者被动加载到内存，并支持动态更新。
由于字典数据常驻内存的特性，所以它非常适合保存常量或经常使用的维度表数据，以避免不必要的JOIN查询。
在默认情况下，字典中的数据只能通过字典函数访问。
数据字典分为内置与扩展两种形式：
- 内置字典：ClickHouse默认自带的字典
- 扩展字典：用户通过自定义配置实现的字典
在字典表引擎的帮助下，可以将数据字典挂载到一张代理的数据表下，从而实现数据表与字典数据的JOIN查询。

2. 内置字典

ClickHouse目前只有 Yandex.Metrica一种内置字典
Yandex.Metrica 设计意图是快速存取 geo 地理数据
版权原因Yandex并没有将geo地理数据开放出来，只是提供了字典的定义机制和取数函数
如果想使用 Yandex.Metrica字典，需要遵照它的字典规范自行导入数据

2.1 内置字典配置说明

内置字典在默认的情况下是禁用状态，需要开启后才能使用。
内置字典的配置是惰性加载的，只有当字典首次被查询的时候才会触发加载动作。
填充 Yandex.Metrica 字典的 geo 地理数据由两组模型组成，可以分别理解为地区数据的主表及维度表。
开启它的方式也十分简单，只需将 config.xml 文件的主表与维表两项配置打开即可。
这两项配置是惰性加载的，只有当字典首次被查询的时候才会触发加载动作。

2.1.1 path_to_regions_hierarchy_file

path_to_regions_hierarchy_file 相当于区域数据的主表，由1个 regions_hierarchy.txt 和多个 regions_hierarchy_[name].txt 区域层次的数据文件共同组成，缺一不可。其中 [name] 表示区域标识符，与i18n类似。这些TXT文件内的数据需要使用 TabSeparated 格式定义，其数据模型的格式如下表所示。

字段名	类型	必须	说明
Region ID	UInt32	是	区域 ID
Parent Region ID	UInt32	是	上级区域 ID
Region Type	UInt8	是	区域类型： 1：continent 大陆 3：country 国家 4：federal district 联邦区 5：region 地区 6：city 城市
Population	UInt32	否	人口

2.1.2 path_to_regions_names_files

path_to_regions_names_files 相当于区域数据的维度表，记录了与区域ID对应的区域名称。维度数据使用 6 个regions_names_[name].txt文件保存，其中 [name] 表示区域标识符与regions_hierarchy_[name].txt对应，目前包括ru、en、ua、by、kz和tr。上述这些区域的数据文件必须全部定义，这是因为内置字典在初次加载时，会一次性加载上述6个区域标识的数据文件。如果缺少任何一个文件就会抛出异常并导致初始化失败。这些TXT文件内的数据同样需要使用 TabSeparated 格式定义，其数据模型的格式如下表所示。

字段名	类型	必须	说明
Region ID	UInt32	是	区域 ID
Region Name	UInt32	是	区域名称

2.2 使用内置字典

下载测试数据 https://github.com/nauu/clickhousebook/tree/master/source/dict/built-in/geo

在没有配置内置字典的时候

SELECT regionToName(toUInt32(20009));

SELECT regionToName(toUInt32(20009))

Query id: b0aba82e-d5eb-4f88-997a-8d874572a882


0 rows in set. Elapsed: 0.206 sec.

Received exception from server (version 21.4.3):
Code: 156. DB::Exception: Received from localhost:9977. DB::Exception: Embedded dictionaries were not loaded. You need to check configuration file.: While processing regionToName(toUInt32(20009)).

修改配置文件

# 编辑配置文件
vim /etc/clickhouse-server/config.xml

# 增加配置


  /var/lib/clickhouse/geo/regions_hierarchy.txt




  /var/lib/clickhouse/geo/

创建内置字典数据文件夹并上传字典数据

[root@node3 geo]# cd /var/lib/clickhouse/
[root@node3 geo]# mkdir geo
# 上传数据文件，之后查看
[root@node3 geo]# cd geo
总用量 10256
-rw-r--r-- 1 root root    3096 5月   6 15:59 regions_hierarchy_ru.txt
-rw-r--r-- 1 root root    3096 5月   6 15:59 regions_hierarchy.txt
-rw-r--r-- 1 root root    3957 5月   6 15:59 regions_names_ar.txt
-rw-r--r-- 1 root root    3957 5月   6 15:59 regions_names_by.txt
-rw-r--r-- 1 root root    3957 5月   6 15:59 regions_names_en.txt
-rw-r--r-- 1 root root    3957 5月   6 15:59 regions_names_kz.txt
-rw-r--r-- 1 root root    3957 5月   6 15:59 regions_names_ru.txt
-rw-r--r-- 1 root root    3957 5月   6 15:59 regions_names_tr.txt
-rw-r--r-- 1 root root    3957 5月   6 15:59 regions_names_ua.txt

重启 clickhouse

# 关闭服务
systemctl stop clickhouse-server.service

# 检查服务是否完全关闭
netstat -nltp | grep clickhouse

# 启动服务
systemctl start clickhouse-server.service

测试内置字典


SELECT regionToName(toUInt32(20009)) as name;

┌─name──────────────────┐
│ Buenos Aires Province │
└───────────────────────┘

3. 外部扩展字典

扩展字典与内置字典更加常用。
外部扩展字典是以插件形式注册到ClickHouse中的，由用户自行定义数据模式及数据来源。
扩展字典支持文件、数据库和其他类型三种数据来源。
扩展字典支持flat、hashed、range_hashed、cache、complex_key_hashed、complex_key_cache和ip_trie共 7 种类型。

3.1 配置文件介绍

扩展字典的配置文件由 config.xml 文件中的 dictionaries_config 配置项指定：

<!-- Configurationofexternaldictionaries.See: 
    https://clickhouse.yandex/docs/en/dicts/external_dicts/ -->
<!-- 会扫描 /etc/clickhouse-server/ 下所有以 _dictionary.xml 结尾的配置文件 -->
<dictionaries_config>
  *_dictionary.xml
</dictionaries_config>

ClickHouse 会自动识别并加载 /etc/clickhouseserver 目录下所有以 _dictionary.xml 结尾的配置文件。
ClickHouse 也能够动态感知到此目录下配置文件的各种变化，并支持不停机在线更新配置文件。
在单个字典配置文件内可以定义多个字典，其中每一个字典由一组dictionary元素定义。
在dictionary元素之下又分为5个子元素，均为必填项。
配置结构

<dictionaries>
  <dictionary>
    <name> dict_name </name>
    <layout> 在内存中的数据格式类型 </layout>
    <structure> 字典的数据结构 </structure>
    <source> 数据源配置 </source>
    <lifetime> 字典的自动更新频率 </lifetime>
  </dictionary>
</dictionaries>

在上述结构中，主要配置的含义如下。

配置	含义
name	字典的名称，用于确定字典的唯一标识，必须全局唯一，多个字典之间不允许重复。
layout	字典的类型，它决定了数据在内存中以何种结构组织和存储。目前扩展字典共拥有7种类型
structure	字典的数据结构
source	字典的数据源，它决定了字典中数据从何处加载。目前扩展字典共拥有文件、数据库和其他三类数据来源
lifetime	字典的更新时间，扩展字典支持数据在线更新

3.2 数据结构

扩展字典的数据结构由 structure 元素定义，由键值 key 和属性 attribute 两部分组成，它们分别描述字典的数据标识和字段属性。structure 的完整形式如下所示：

<!-- 在扩展字典的查询过程中都会通过这些字段来访问字典中的数据 -->

<dictionary>
  <structure>
    <!-- 单值类型 key 使用<id> ， 复合类型 key 使用<key> -->
    <id> Key属性 </id>
    <attribute> 字段属性 </attribute>
 </structure>
</dictionary>

3.2.1 key

key用于定义字典的键值，每个字典必须包含1个键值 key 字段，用于定位数据，类似数据库的表主键。
键值key分为数值型和复合型两类。

（1）数值型：数值型key由UInt64整型定义，支持 flat、hashed、range_hashed 和 cache 类型的字典（扩展字典类型会在后面介绍），它的定义方法如下所示。

<structure>
  <id>
    <name>Id</name>
  </id>
  ...
</structure>

（2）复合型：复合型 key 使用 Tuple 元组定义，可以由1到多个字段组成，类似数据库中的复合主键。它仅支持complex_key_hashed、complex_key_cache和ip_trie类型的字典。其定义方法如下所示。

<structure>
  <key>
    <attribute>
      <name>field1</name>
      <type>String</type>
    </attribute>
    <attribute>
      <name>field2</name>
      <type>UInt64</type>
    </attribute>
    ...
  </key>
  ...
</structure>

3.2.2 attribute

attribute 用于定义字典的属性字段，字典可以拥有1到多个属性字段。
attribute 元素下共有7个配置项，其中name、type和null_value为必填项。
attribute 的完整定义方法如下所示：

<structure>
  ...
  <attribute>
    <name>Name</name>
    <type>DataType</type>
    <!-- 默认值 -->
    <null_value></null_value>
    <expression>generateUUIDv4()</expression>
    <hierarchical>true</hierarchical>
    <injective>true</injective>
    <is_object_id>true</is_object_id>
  </attribute>
  ...
</structure>

配置项	是否必填	默认值	说明
name	是	-	字段名称
type	是	-	字段类型
null_value	是	-	在查询时，条件 key 没有对应元素的默认值
expression	否	无表达式	表达式，可以使用运算符或调用函数
hierarchical	否	false	是否支持层次结构
injective	否	false	是否支持集合单射优化，开启后，在后续的 group by 查询中，如果调用 dictGet函数通过 key 获取 value，则该 value 直接从group by 数据返回
is_object_id	否	false	是否开启MongoDB 优化，通过 ObjectID 对 MongoDB 文档执行查询

单射集合：有 A，B 两个集合，假如 A 集合中的每个元素都在 B 集合有一个唯一与之对应的元素，那么A 集合到 B 集合的对应关系即单射集合。

3.3 准备测试数据

企业组织数据

将用于flat、hashed、cache、complex_key_hashed和complex_key_cache字典的演示场景。

id	code	name
1	a0001	研发部
2	a0002	产品部
3	a0003	数据部

下载地址：https://github.com/nauu/clickhousebook/dict/plugin/testdata/organization.csv

销售数据

将用于range_hashed字典的演示场景。

id	start	end	price
1	20160101	20170110	100
2	20160501	20170701	200
3	20140305	20180120	300

下载地址：https://github.com/nauu/clickhousebook/dict/plugin/testdata/sales.csv

asn数据

用于演示ip_trie字典的场景。

ip	asn	country
82.118.230.0/24	AS42831	GB
148.163.0.0/17	AS53755	US
178.93.0.0/18	AS6849	UA

下载地址：https://github.com/nauu/clickhousebook/dict/plugin/testdata/asn.csv

3.4 扩展字典的类型

一个字典的类型，既决定了其数据在内存中的存储结构，也决定了该字典支持的key键类型。
扩展字典的类型使用 layout 元素定义，目前共有7种类型。
根据key键类型的不同，可以将它们划分为两类：
- 单数值key类型：使用单个数值型的id，flat、hashed、range_hashed和cache
- 复合key类型：complex_key_hashed、complex_key_cache和ip_trie
- complex_key_hashed与complex_key_cache字典在功能方面与hashed和cache并无二致，只是单纯地将数值型key替换成了复合型key而已。
flat、hashed和range_hashed依次拥有最高的性能，而cache性能最不稳定。

名称	存储结构	字典键类型	支持数据源
flat	数组	UInt64	Local File Executable File HTTP DBMS
hashed	散列	UInt64	Local File Executable File HTTP DBMS
range_hashed	散列按时间排序	UInt64和时间	Local File Executable File HTTP DBMS
complex_key_hashed	散列	复合型 key	Local File Executable File HTTP DBMS
ip_trie	层次结构	复合型 key (单个 string)	Local File Executable File HTTP DBMS
cache	固定大小数组	UInt64	Executable File HTTP Clickhouse、MySQL
complex_key_cache	固定大小数组	复合型 key	Executable File HTTP Clickhouse、MySQL

3.4.1 flat

flat 字典是所有类型中性能最高的字典类型，它只能使用UInt64数值型key。
flat 字典的数据在内存中使用数组结构保存，数组的初始大小为1024，上限为500000，这意味着它最多只能保存500000行数据。
如果在创建字典时数据量超出其上限，那么字典会创建失败。

vim /etc/clickhouse-server/flat_dictionary.xml

<!-- 通过手动创建的flat字典配置 -->
<dictionaries>
    <dictionary>
        <name>test_flat_dict</name>
        <source>
            <!-- 准备好的测试数据 -->
            <file>
                <path>/etc/clickhouse-server/dict_data/organization.csv</path>
                <format>CSV</format>
            </file>
        </source>
        <layout>
              <flat/>
        </layout>
        <!-- 与测试数据的结构对应 -->
        <structure>

            <id>
                  <name>id</name>
            </id>
            <attribute>
                <name>code</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
            <attribute>
                <name>name</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
        </structure>
        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
    </dictionary>
</dictionaries>

在上述的配置中，source数据源是 CSV 格式的文件，structure 数据结构与其对应。将配置文件复制到 ClickHouse 服务节点的 /etc/clickhouseserver 目录后，即完成了对该字典的创建过程。查验 system.dictionaries 系统表后，能够看到flat字典已经创建成功。


SELECT name,type,key.names,key.types,attribute.names,attribute.types,origin FROM system.dictionaries;


┌─name───────────┬─type─┬─key.names─┬─key.types──┬─attribute.names─┬─attribute.types─────┬─origin─────────────────────────────────────┐
│ test_flat_dict │      │ ['id']    │ ['UInt64'] │ ['code','name'] │ ['String','String'] │ /etc/clickhouse-server/flat_dictionary.xml │
└────────────────┴──────┴───────────┴────────────┴─────────────────┴─────────────────────┴────────────────────────────────────────────┘

3.4.2 hashed

hashed 字典同样只能够使用UInt64数值型key
hashed 字典的数据在内存中通过散列结构保存，
hashed 字典的数据没有存储上限的制约。

vim /etc/clickhouse-server/hashed_dictionary.xml

<!-- 通过手动创建的hashed字典配置 -->
<dictionaries>
    <dictionary>
        <name>test_hashed_dict</name>
        <source>
            <!-- 准备好的测试数据 -->
            <file>
                <path>/etc/clickhouse-server/dict_data/organization.csv</path>
                <format>CSV</format>
            </file>
        </source>
        <layout>
              <hashed/>
        </layout>
        <!-- 与测试数据的结构对应 -->
        <structure>
            <id>
                  <name>id</name>
            </id>
            <attribute>
                <name>code</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
            <attribute>
                <name>name</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
        </structure>
        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
    </dictionary>
</dictionaries>

在上述的配置中，source 数据源是 CSV 格式的文件，structure 数据结构与其对应。将配置文件复制到 ClickHouse 服务节点的 /etc/clickhouseserver 目录后，即完成了对该字典的创建过程。查验 system.dictionaries 系统表后，能够看到 hashed 字典已经创建成功。


SELECT name,type,key.names,key.types,attribute.names,attribute.types,origin FROM system.dictionaries;


┌─name─────────────┬─type─┬─key.names─┬─key.types──┬─attribute.names─┬─attribute.types─────┬─origin───────────────────────────────────────┐
│ test_hashed_dict │      │ ['id']    │ ['UInt64'] │ ['code','name'] │ ['String','String'] │ /etc/clickhouse-server/hashed_dictionary.xml │
└──────────────────┴──────┴───────────┴────────────┴─────────────────┴─────────────────────┴──────────────────────────────────────────────┘

3.4.3 range_hashed

range_hashed 字典可以看作 hashed 字典的变种。
range_hashed 字典在原有功能的基础上增加了指定时间区间的特性，数据会以散列结构存储并按照时间排序。
时间区间通过 range_min 和 range_max 元素指定，所指定的字段必须是 Date 或者 DateTime 类型。

vim /etc/clickhouse-server/range_hashed_dictionary.xml

<!-- 现在仿照hashed字典的配置，创建一个名为/etc/clickhouse-server/range_hashed_dictionary.xml的配置文件 -->
<dictionaries>
  <dictionary>
    <name>test_range_hashed_dict</name>
    <source>
      <file>
        <path>/etc/clickhouse-server/dict_data/sales.csv</path>
        <format>CSV</format>
      </file>
    </source>
    <layout>
      <range_hashed/>
    </layout>
    <structure>
      <id>
        <name>id</name>
      </id>
      <range_min>
        <name>start</name>
      </range_min>
      <range_max>
        <name>end</name>
      </range_max>
      <attribute>
        <name>price</name>
        <type>Float32</type>
        <null_value></null_value>
      </attribute>
    </structure>
    <lifetime>
      <min>300</min>
      <max>360</max>
    </lifetime>
  </dictionary>
</dictionaries>

在上述的配置中，使用了一份销售数据，数据中的 star t和 end 字段分别与 range_min 和 range_max 对应。将配置文件复制到ClickHouse服务节点的/etc/clickhouse-server目录后，即完成了对该字典的创建过程。查验system.dictionaries系统表后，能够看到range_hashed字典已经创建成功：


SELECT name,type,key.names,key.types,attribute.names,attribute.types,origin FROM system.dictionaries;


┌─name───────────────────┬─type─┬─key.names─┬─key.types──┬─attribute.names─┬─attribute.types─────┬─origin─────────────────────────────────────────────┐
│ test_range_hashed_dict │      │ ['id']    │ ['UInt64'] │ ['price']       │ ['Float32']         │ /etc/clickhouse-server/range_hashed_dictionary.xml │
└────────────────────────┴──────┴───────────┴────────────┴─────────────────┴─────────────────────┴────────────────────────────────────────────────────┘

3.4.4 cache

cache 字典只能够使用 UInt64 数值型key，它的字典数据在内存中会通过固定长度的向量数组保存。
定长的向量数组又称cells，它的数组长度由 size_in_cells 指定。
size_in_cells 的取值大小必须是2的整数倍，如若不是，则会自动向上取为2的倍数的整数。
cache字典的取数逻辑与其他字典有所不同，它并不会一次性将所有数据载入内存。
当从cache字典中获取数据的时候，它首先会在cells数组中检查该数据是否已被缓存。如果数据没有被缓存，它才会从源头加载数据并缓存到cells中。
所以cache字典是性能最不稳定的字典，因为它的性能优劣完全取决于缓存的命中率（缓存命中率=命中次数/查询次数）。
如果无法做到99%或者更高的缓存命中率，则最好不要使用此类型。

vim /etc/clickhouse-server/cache_dictionary.xml

<dictionaries>
  <dictionary>
    <name>test_cache_dict</name>
    <source>
        <!-- 本地文件需要通过executable形式 -->
        <executable>
            <command>cat /etc/clickhouse-server/dict_data/organization.csv</command>
            <format>CSV</format>
        </executable>
    </source>
    <layout>
        <cache>
            <!-- 缓存大小 -->
            <size_in_cells>10000</size_in_cells>
        </cache>
    </layout>
        <!-- 与测试数据的结构对应 -->
        <structure>
            <id>
                  <name>id</name>
            </id>
            <attribute>
                <name>code</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
            <attribute>
                <name>name</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
        </structure>
        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
  </dictionary>
</dictionaries>

在上述配置中，layout被声明为cache并将缓存大小 size_in_cells 设置为10000。关于cells的取值可以根据实际情况考虑，在内存宽裕的情况下设置成1000000000也是可行的。还有一点需要注意，如果cache字典使用本地文件作为数据源，则必须使用executable的形式设置。


SELECT name,type,key.names,key.types,attribute.names,attribute.types,origin FROM system.dictionaries;


┌─name───────────────────┬─type─┬─key.names─┬─key.types──┬─attribute.names─┬─attribute.types─────┬─origin─────────────────────────────────────────────┐
│ test_range_hashed_dict │      │ ['id']    │ ['UInt64'] │ ['price']       │ ['Float32']         │ /etc/clickhouse-server/range_hashed_dictionary.xml │
└────────────────────────┴──────┴───────────┴────────────┴─────────────────┴─────────────────────┴────────────────────────────────────────────────────┘

3.4.5 complex_key_hashed

complex_key_hashed 字典在功能方面与hashed字典完全相同，只是将单个数值型key替换成了复合型。

<dictionaries>
  <dictionary>
    <name></name>
    <!—与hashed字典配置相同,省略…>
    <layout>
      <complex_key_hashed/>
    </layout>
    <structure>
      <!—复合型key>
      <key>
        <attribute>
          <name>id</name>
          <type>UInt64</type>
        </attribute>
        <attribute>
          <name>code</name>
          <type>String</type>
        </attribute>
      </key>
      省略…
    </structure>
    省略…


vim /etc/clickhouse-server/complex_key_hashed_dictionary.xml

<!-- 通过手动创建的hashed字典配置 -->
<dictionaries>
    <dictionary>
        <name>test_complex_key_hashed_dict</name>
        <source>
            <!-- 准备好的测试数据 -->
            <file>
                <path>/etc/clickhouse-server/dict_data/organization.csv</path>
                <format>CSV</format>
            </file>
        </source>
        <layout>
              <complex_key_hashed/>
        </layout>
        <!-- 与测试数据的结构对应 -->
        <structure>
            <key>
              <attribute>
                <name>id</name>
                <type>UInt64</type>
              </attribute>
              <attribute>
                <name>code</name>
                <type>String</type>
              </attribute>
            </key>
            <attribute>
                <name>name</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
        </structure>
        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
    </dictionary>
</dictionaries>

SELECT name,type,key.names,key.types,attribute.names,attribute.types,origin FROM system.dictionaries;

┌─name─────────────────────────┬─type─┬─key.names─────┬─key.types───────────┬─attribute.names─┬─attribute.types─────┬─origin───────────────────────────────────────────────────┐
│ test_complex_key_hashed_dict │      │ ['id','code'] │ ['UInt64','String'] │ ['name']        │ ['String']          │ /etc/clickhouse-server/complex_key_hashed_dictionary.xml │
└──────────────────────────────┴──────┴───────────────┴─────────────────────┴─────────────────┴─────────────────────┴──────────────────────────────────────────────────────────┘

3.4.6 complex_key_cache

complex_key_cache字典同样与cache字典的特性完全相同，只是将单个数值型key替换成了复合型。

vim /etc/clickhouse-server/complex_key_cache_dictionary.xml

<dictionaries>
  <dictionary>
    <name>test_complex_key_cache_dict</name>
    <source>
        <!-- 本地文件需要通过executable形式 -->
        <executable>
            <command>cat /etc/clickhouse-server/dict_data/organization.csv</command>
            <format>CSV</format>
        </executable>
    </source>
    <layout>
      <complex_key_cache>
            <size_in_cells>10000</size_in_cells>
      </complex_key_cache>
    </layout>
        <!-- 与测试数据的结构对应 -->
        <structure>
            <!-- 复合型Key -->
            <key>
              <attribute>
                <name>id</name>
                <type>UInt64</type>
              </attribute>
              <attribute>
                <name>code</name>
                <type>String</type>
              </attribute>
            </key>
            <attribute>
                <name>name</name>
                <type>String</type>
                <null_value></null_value>
            </attribute>
        </structure>
        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
  </dictionary>
</dictionaries>


SELECT name,type,key.names,key.types,attribute.names,attribute.types,origin FROM system.dictionaries;

SELECT
    name,
    type,
    key.names,
    key.types,
    attribute.names,
    attribute.types,
    origin
FROM system.dictionaries

Query id: 187447fa-d02a-4b26-98c7-259c2b9243f5

┌─name─────────────────────────┬─type─┬─key.names─────┬─key.types───────────┬─attribute.names─┬─attribute.types─────┬─origin───────────────────────────────────────────────────┐
│ test_complex_key_cache_dict  │      │ ['id','code'] │ ['UInt64','String'] │ ['name']        │ ['String']          │ /etc/clickhouse-server/complex_key_cache_dictionary.xml  │
└──────────────────────────────┴──────┴───────────────┴─────────────────────┴─────────────────┴─────────────────────┴──────────────────────────────────────────────────────────┘

3.4.7 ip_trie

虽然同为复合型key的字典，但ip_trie字典却较为特殊，因为它只能指定单个String类型的字段，用于指代IP前缀。ip_trie字典的数据在内存中使用trie树结构保存，且专门用于IP前缀查询的场景，例如通过IP前缀查询对应的ASN信息。

vim /etc/clickhouse-server/ip_trie_dictionary.xml

<dictionaries>
  <dictionary>
    <name>test_ip_trie_dict</name>
    <source><file>
      <path>/etc/clickhouse-server/dict_data/asn.csv</path>
      <format>CSV</format>
      </file>
    </source>
    <layout>
      <ip_trie/>
    </layout>
    <structure>
      <!-- 虽然是复合类型,但是只能设置单个String类型的字段 -->
      <key>
        <attribute>
          <name>prefix</name>
          <type>String</type>
        </attribute>
      </key>
      <attribute>
        <name>asn</name>
        <type>String</type>
        <null_value></null_value>
      </attribute>
      <attribute>
        <name>country</name>
        <type>String</type>
        <null_value></null_value>
        </attribute>
    </structure>
        <lifetime>
            <min>300</min>
            <max>360</max>
        </lifetime>
  </dictionary>
</dictionaries>

SELECT name,type,key.names,key.types,attribute.names,attribute.types,origin FROM system.dictionaries;

SELECT
    name,
    type,
    key.names,
    key.types,
    attribute.names,
    attribute.types,
    origin
FROM system.dictionaries

Query id: 2baf7779-3759-4bf9-8302-bb742a29c917

┌─name─────────────────────────┬─type─┬─key.names─────┬─key.types───────────┬─attribute.names───┬─attribute.types─────┬─origin───────────────────────────────────────────────────┐
│ test_ip_trie_dict            │      │ ['prefix']    │ ['String']          │ ['asn','country'] │ ['String','String'] │ /etc/clickhouse-server/ip_trie_dictionary.xml            │
└──────────────────────────────┴──────┴───────────────┴─────────────────────┴───────────────────┴─────────────────────┴──────────────────────────────────────────────────────────┘

3.5 数据源

扩展字典数据源使用 source 元素定义，它指定了字典的数据从何而来。扩展字典支持3大类共计9种数据源，接下来会介绍这些数据源的使用方式。

3.5.1 文件类型

3.5.1.1 本地文件

本地文件使用file元素定义。其中，path表示数据文件的绝对路径，而format表示数据格式，例如CSV或者TabSeparated等。它的完整配置如下所示。

<source>
  <file>
    <path>/etc/clickhouse-server/dict_data/asn.csv</path>
    <format>CSV</format>
  </file>
</source>

3.5.1.2 可执行文件

可执行文件数据源属于本地文件的变种，它需要通过cat命令访问数据文件。对于cache和complex_key_cache类型的字典，必须使用此类型的文件数据源。可执行文件使用executable元素定义。其中，command表示数据文件的绝对路径，format表示数据格式，例如CSV或者TabSeparated等。

<source>
  <executable>
    <command>cat /etc/clickhouse-server/dict_data/asn.csv</command>
    <format>CSV</format>
  </executable>
</source>

3.5.1.3 远程文件

远程文件与可执行文件类似，只是它将cat命令替换成了post请求，支持HTTP与HTTPS协议。远程文件使用http元素定义。其中，url表示远程数据的访问地址，format表示数据格式，例如CSV或者TabSeparated。

<source>
  <http>
    <url>http://10.37.129.6/organization.csv</url>
    <format>CSV</format>
  </http>
</source>

3.5.2 数据库类型

数据库类型的数据源比文件类型更适合在正式的生产环境中使用。目前扩展字典支持MySQL、ClickHouse本身及MongoDB三种数据库。接下来会分别介绍它们的创建方法。

3.5.2.1 MySQL

MySQL数据源支持从指定的数据库中提取数据，作为其字典的数据来源。首先，需要准备源头数据，执行下面的语句在MySQL中创建测试表：

-- 创建MySQL 表
CREATETABLE 't_organization'(
  `id` int(11) NOTNULLAUTO_INCREMENT, 
  `code` varchar(40) DEFAULT NULL,
  `name` varchar(60) DEFAULT NULL,
  `updatetime` datetime DEFAULT NULL,
  PRIMARYKEY(`id`)
)ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=utf8;

-- 插入数据
INSERT INTO t_organization(code,name,updatetime) VALUES ('a0001','研发部',NOW()),('a0002','产品部',NOW());

完成上述准备之后，就可以配置MySQL数据源的字典了。现在仿照flat字典进行配置，创建一个名为 test_mysql_dictionary.xml 的配置文件，将source替换成MySQL数据源：

<dictionaries>
  <dictionary>
    <name>test_mysql_dict</name>
    <source>
      <mysql>
        <!-- 数据库端口 -->
        <port>3306</port>
                <!-- 数据库用户名 -->
        <user>root</user>
        <!-- 数据库密码 -->
        <password></password>
          <!-- 数据库host地址，支持MySQL集群 -->
        <replica>
          <!-- 数据库host地址-->
          <host>10.37.129.2</host>
          <!-- 优先级 -->
          <priority>1</priority>
        </replica>
        <!-- database数据库 -->
        <db>test</db>
        <!-- 字典对应的数据表 -->
        <table>t_organization</table>

        <!-- 非必填项 -->
        <!-- 查询table时的过滤条件 -->
                <where>id=1</where>
        <!-- 指定一条SQL语句，用于在数据更新时判断是否需要更 -->
        <invalidate_query>SQL_QUERY</invalidate_query> 
      </mysql>
    </source>
            ...
  </dictionary>
</dictionaries>

3.5.2.2 ClickHouse

扩展字典支持将ClickHouse数据表作为数据来源，在配置之前同样需要准备数据源的测试数据，执行下面的语句在ClickHouse中创建测试表并写入测试数据：

-- 创建 clickhouse 表
CREATE TABLE t_organization(
  ID UInt64,
  Code String,
  Name String,
  UpdateTime DateTime
) ENGINE = TinyLog;

-- 插入数据
INSERT INTO t_organization VALUES 
(1,'a0001','研发部',NOW()),
(2,'a0002','产品部',NOW()),
(3,'a0003','数据部',NOW()),
(4,'a0004','测试部',NOW()),
(5,'a0005','运维部',NOW()),
(6,'a0006','规划部',NOW()),
(7,'a0007','市场部',NOW());

ClickHouse数据源的配置与MySQL数据源极为相似，所以我们可以仿照MySQL数据源的字典配置，创建一个名为test_ch_dictionary.xml的配置文件，将source替换成ClickHouse数据源：


<dictionaries>
  <dictionary>
    <name>test_ch_dict</name>
    <source>
      <clickhouse>
        <!-- 数据库host地址 -->
        <host>10.37.129.6</host>
        <!-- 数据库端口 -->
        <port>9000</port>
                <!-- 数据库用户名 -->
        <user>default</user>
        <!-- 数据库密码 -->
        <password></password>
        <!-- database数据库 -->
        <db>default</db>
        <!-- 字典对应的数据表 -->
        <table>t_organization</table>

        <!-- 非必填 -->
        <!-- 查询table时的过滤条件 -->
        <where>id=1</where>
        <!-- 指定一条SQL语句，用于在数据更新时判断是否需要更 -->
        <invalidate_query>SQL_QUERY</invalidate_query>
      </clickhouse>
    </source>
    ...
  </dictionary>
</dictionaries>

3.5.2.3 MongoDB

MongoDB数据源，执行下面的语句，MongoDB会自动创建相应的schema并写入数据：

db.t_organization.insertMany(
  [
    {id:1,code:'a0001',name:'研发部'},
    {id:2,code:'a0002',name:'产品部'},
    {id:3,code:'a0003',name:'数据部'},
    {id:4,code:'a0004',name:'测试部'}
  ]
)

完成上述准备之后就可以配置MongoDB数据源的字典了，同样仿照MySQL字典配置，创建一个名为test_mongodb_dictionary.xml的配置文件，将source替换成mongodb数据源：

<dictionaries>
  <dictionary>
    <name>test_mongodb_dict</name>
      <source>
        <mongodb>
          <!-- 数据库host地址 -->
          <host>10.37.129.2</host>
          <!-- 数据库端口 -->
          <port>27017</port>
          <!-- 数据库用户名 -->
          <user></user>
          <!-- 数据库密码 -->
          <password></password>
          <!-- database数据库 -->
          <db>test</db>
          <!-- 与字典对应的collection的名称 -->
          <collection>t_organization</collection>
        </mongodb>
      </source>
      ...
  </dictionary>
</dictionaries>

3.5.3 其他类型

除了上述已经介绍过的两类数据源之外，扩展字典还支持通过ODBC的方式连接PostgreSQL和MSSQLServer数据库作为数据源。它们的配置方式与数据库类型数据源类似。

3.6 数据更新策略

扩展字典支持数据的在线更新，更新后无须重启服务。字典数据的更新频率由配置文件中的lifetime元素指定，单位为秒：

<lifetime>
  <min>300</min>
  <max>360</max>
</lifetime>

min与max分别指定了更新间隔的上下限。
ClickHouse 会在这个时间区间内随机触发更新动作，这样能够有效错开更新时间，避免所有字典在同一时间内爆发性的更新。
当min和max都是0的时候，将禁用字典更新。
对于cache字典而言，lifetime 还代表了它的缓存失效时间。

字典内部拥有版本的概念，在数据更新的过程中，旧版本的字典将持续提供服务，只有当更新完全成功之后，新版本的字典才会替代旧版本。所以更新操作或者更新时发生的异常，并不会对字典的使用产生任何影响。

不同类型的字典数据源，更新机制也稍有差异。总体来说，扩展字典目前并不支持增量更新。但部分数据源能够依照标识判断，只有在源数据发生实质变化后才实施更新动作。这个判断源数据是否被修改的标识，在字典内部称为 previous，它保存了一个用于比对的值。ClickHouse的后台进程每隔5秒便会启动一次数据刷新的判断，依次对比每个数据字典中前后两次previous的值是否相同。若相同，则代表无须更新数据；若不同且满足更新频率，则代表需要更新数据。而对于previous值的获取方式，不同的数据源有不同的实现逻辑，下面详细介绍。

文件数据源

文件类型的数据源的previous来自修改文件的系统时间，当前后两次修改时间不同时，才会进行更新。


[root@node3 opt]# stat multisqls
  文件："multisqls"
  大小：81            块：8          IO 块：4096   普通文件
设备：803h/2051d    Inode：2152750551  硬链接：1
权限：(0644/-rw-r--r--)  Uid：(    0/    root)   Gid：(    0/    root)
最近访问：2021-04-26 15:19:33.637603807 +0800
最近更改：2021-04-26 15:18:43.166934380 +0800
最近改动：2021-04-26 15:18:43.209934098 +0800
创建时间：-

MySQL(InnoDB)、ClickHouse、ODBC

对于MySQL(InnoDB)、ClickHouse和ODBC数据源，previous 来源于invalidate_query中定义的 SQL 语句。


<source>
    <mysql>
    <!-- 如果前后两次 updatetime 不一致，则说明源数据发生了变化需要更新数据字典 -->
    <invalidate_query>
      select updatetime from t_organization where id = 8
    </invalidate_query>
  </mysql>
</source>

MySQL（MyIMyISAM）

如果 MySQL 使用的是MyIMyISAM表引擎，可以使用 show table status 查询修改时间，进而判断是否需要更新字典

其余数据源

除了前面介绍的两种数据源，其他数据源无法通过判断标识符更新的情况那么只能根据 lifetime 设置的时间来进行更新。

数据字典除了被动更新之外还可以进行主动更新：

-- 更新全部字典
system reload dictionaries;
-- 更新指定字典
system reload dictionary dict_name;

3.7 元数据介绍

通过 system.dictionaries 系统表可以查看到扩展字典的元数据信息。

desc dictionaries;

name	type	comment
database	String	所在数据库
name	String	字典名称，在使用字典函数时需要使用字典名称访问数据
uuid	UUID
status	Enum8( ‘NOT_LOADED’ = 0, ‘LOADED’ = 1, ‘FAILED’ = 2, ‘LOADING’ = 3, ‘FAILED_AND_RELOADING’ = 4, ‘LOADED_AND_RELOADING’ = 5, ‘NOT_EXIST’ = 6 )
origin	String
type	String	字典类型
key.names	Array(String)	Key 名称
key.types	Array(String)	Key 数据类型
attribute.names	Array(String)
attribute.types	Array(String)
bytes_allocated	UInt64	已经载入数据占用内存字节数
query_count	UInt64	字典被查询次数
hit_rate	Float64	命中率
element_count	UInt64	载入行数
load_factor	Float64	数据加载率
source	String	数据源信息
lifetime_min	UInt64	生命周期最小时间
lifetime_max	UInt64	生命周期最大时间
loading_start_time	DateTime	最初加载字典时间
last_successful_update_time	DateTime	最后更新字典时间
loading_duration	Float32	加载字典消耗时间
last_exception	String	加载字典过程中的异常信息，

3.8 基本操作

数据字典只能通过字典函数获取，以下为查询示例：

-- 使用 dictGet 函数查询数据
select dictGet('test_flat_dict','name',toUInt64(1)) as dept_name;

┌─dept_name─┐
│ 研发部    │
└───────────┘

-- 使用 dictGet 函数查询复合key数据
select dictGet('test_ip_trie_dict','asn',tuple(IPv4StringToNum('82.118.230.0'))) as asn;

┌─asn─────┐
│ AS42831 │
└─────────┘

3.8.1 字典函数

其他数据字典函数使用方式与dictGet 基本相同，下面列出一些常用的数据字典函数：

功能	函数
获取整型数据的函数	dictGetUInt8 , dictGetUInt16 , dictGetUInt32 , dictGetUInt64 dictGetInt8,dictGetInt16 , dictGetInt32 , dictGetInt64
获取浮点数据的函数	dictGetFloat32 , dictGetFloat64
获取日期数据函数	dictGetDate , dictGetDateTime
获取字符串数据函数	dictGetString , dictGetUUID

-- 使用字典函数查询数据
select dictGetString('test_flat_dict','name',toUInt64(1)) as dept_name;
┌─dept_name─┐
│ 研发部     │
└───────────┘

select dictGet('test_flat_dict',('name','code'),toUInt64(2)) as dept_name;
┌─dept_name──────────┐
│ ('产品部','a0002')  │
└────────────────────┘

3.8.2 字典表

除了字典函数还可以使用字典表查询外部字典的数据，字典表使用的表引擎是 DIctionary。

-- 创建字典表
create table t_test_flat_dict(
  id UInt64,
  code String,
  name String
) ENGINE=Dictionary(test_flat_dict)

-- 查询数据
select * from t_test_flat_dict;

┌─id─┬─code──┬─name───┐
│  1 │ a0001 │ 研发部 │
│  2 │ a0002 │ 产品部 │
│  3 │ a0003 │ 数据部 │
│  4 │ a0004 │ 测试部 │
│  5 │ a0005 │ 运维部 │
│  6 │ a0006 │ 规划部 │
│  7 │ a0007 │ 市场部 │
└────┴───────┴────────┘

3.8.2 使用 DDL 创建字典

-- 创建字典
create dictionary d_test_flat_dict(
  id UInt64,
  code String,
  name String
)
primary key id 
layout(FLAT())
source(file(path '/var/lib/clickhouse/user_files/organization.csv' format CSV))
lifetime(300);

-- 查询数据
select * from d_test_flat_dict;

┌─id─┬─code──┬─name───┐
│  1 │ a0001 │ 研发部 │
│  2 │ a0002 │ 产品部 │
│  3 │ a0003 │ 数据部 │
│  4 │ a0004 │ 测试部 │
│  5 │ a0005 │ 运维部 │
│  6 │ a0006 │ 规划部 │
│  7 │ a0007 │ 市场部 │
└────┴───────┴────────┘

-- 删除字典
drop dictionary d_test_flat_dict;

hnbian

https://www.hnbian.cn/posts/481cbc74.html

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 hnbian !

clickhouse olap

ClickHouse表引擎 1.MergeTree 建表方式与分区规则

1. MergeTree 介绍表引擎是 ClickHouse 设计实现中的一大特色。可以说，是表引擎决定了一张数据表最终的“性格”，比如数据表拥有何种特性、数据以何种形式被存储以及如何被加载。 ClickHouse拥有非常庞大的表引擎体

2021-05-12 clickhouse

clickhouse olap

ClickHouse系列教程 7. DML 相关操作介绍

1. 数据的写入 ClickHouse内部所有的数据操作都是面向Block数据块的，所以INSERT查询最终会将数据转换为Block数据块。 INSERT语句在单个数据块的写入过程中是具有原子性的。在默认的情况下，每个数据块最多可以写

2021-04-28 clickhouse

clickhouse olap

ClickHouse系列教程 8. 数据字典相关内容

1. 数据字典介绍

2. 内置字典

2.1 内置字典配置说明

2.1.1 path_to_regions_hierarchy_file

2.1.2 path_to_regions_names_files

2.2 使用内置字典

3. 外部扩展字典

3.1 配置文件介绍

3.2 数据结构

3.2.1 key

3.2.2 attribute

3.3 准备测试数据

3.4 扩展字典的类型

3.4.1 flat

3.4.2 hashed

3.4.3 range_hashed

3.4.4 cache

3.4.5 complex_key_hashed

3.4.6 complex_key_cache

3.4.7 ip_trie

3.5 数据源

3.5.1 文件类型

3.5.1.1 本地文件

3.5.1.2 可执行文件

3.5.1.3 远程文件

3.5.2 数据库类型

3.5.2.1 MySQL

3.5.2.2 ClickHouse

3.5.2.3 MongoDB

3.5.3 其他类型

3.6 数据更新策略

3.7 元数据介绍

3.8 基本操作

3.8.1 字典函数

3.8.2 字典表

你的赏识是我前进的动力