0%

使用正则表达式筛选日志中的表名或SQL语句

近几天收到学弟的求助,具体是关于从日志中筛选数据的,帮忙写了几则正则,总结记录一下。

要求

  1. 一开始的要求:

    在日志文件中,一行为一条日志记录,从这些日志中,筛选出SQL语句涉及到的表名。

  2. 修改后的要求:

    从日志中,筛选SQL语句

编写正则表达式

首先看一下日志格式:

1
2
3
4
Info 2020-04-19 20:02:36.114+08:00 logger.go:81 10.11.39.230 ad.tetris.clue_binlog - default canary lf 0 _podName=dp-8d1fdf3061-d474b99cc-hggd5 _language=go _process=186 _msg=GORM LOG SQL:SELECT * FROM `instance_consult`  WHERE (consult_id = '1663479833969672') LIMIT 1 Cost:1.09ms
Info 2020-04-19 20:02:36.144+08:00 logger.go:81 10.11.39.230 ad.tetris.clue_binlog - default canary lf 0 _podName=dp-8d1fdf3061-d474b99cc-hggd5 _process=186 _language=go _msg=GORM LOG SQL:SELECT * FROM `clue` WHERE `clue`.`clue_id` = '1664399848891405' ORDER BY `clue`.`clue_id` ASC LIMIT 1 Cost:1.51ms
Info 2020-04-19 20:02:36.149+08:00 logger.go:81 10.11.39.230 ad.tetris.clue_binlog - default canary lf 0 _process=186 _podName=dp-8d1fdf3061-d474b99cc-hggd5 _language=go _msg=GORM LOG SQL:SELECT * FROM `instance_form` WHERE `instance_form`.`instance_id` = '1662691860828215' ORDER BY `instance_form`.`instance_id` ASC LIMIT 1 Cost:1.34ms
Info 2020-04-19 20:02:36.144+08:00 logger.go:81 10.11.39.230 ad.tetris.clue_binlog - default canary lf 0 _podName=dp-8d1fdf3061-d474b99cc-hggd5 _process=186 _language=go _msg=GORM LOG SQL:SELECT * FROM `clue`, `test1`, `test_table` WHERE `clue`.`clue_id` = '1664399848891405' ORDER BY `clue`.`clue_id` ASC LIMIT 1 Cost:1.51ms

正则我也是即用即学(不想记),Google上搜个正则表达式30分钟教程,随便一个点进去就行了。

开工!

对应要求1的正则表达式

由于要得到表名(学弟说不考虑join的情况),那么一般就是from table a, table b where的情况了,考虑多表情况下,正则如下:

1
2
# apply only to "from ... where" situation
regex = r"(\bFROM\s)+(`\w+`,?\s?)+(\s+\w?\sWHERE)+"

加上(),分组捕获,由于这里没有多种情况,所以不考虑捕获分组或者非捕获分组,感兴趣的可以查看《正则之括号的捕获和非捕获属性》这篇文章;

示例代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
import re


# apply only to "from ... where" situation
regex = r"(\bFROM\s)+(`\w+`,?\s?)+(\s+\w?\sWHERE)+"
# Compile match pattern
pattern = re.compile(regex)

res = set()
# The file path is changed according to the situation
with open("./log.txt", "r") as f:
while True:
line = f.readline().replace("\n", "")
if not line:
break
# Just get the table name(or tables' name);
# Because of the multi-table case, group(0) is used to get the entire matching string
find = re.search(regex, line).group(0)
if find:
find = find.replace("FROM ", "").replace(" WHERE", "")
# Multi-table situation: from table1, table2, table3 ... where
if "," in find:
names = find.split(", ")
for name in names:
res.add(name.replace("`", ""))
else:
res.add(find.replace("`", ""))

print(res)

在这里解释一下,find = find.replace("FROM ", "").replace(" WHERE", "")这条语句,使用replace进行替换而非正则的sub,原因是:对于一般固定不变的常规字符串进行替换,使用replace速度最快;详情可自行查阅python中replace与re.sub运行效率比较

对应要求2的正则表达式

要求2仅要求筛选出SQL语句,并写入文件,那就更方便了,只要匹配以SQL:开头的日志记录即可,代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import re


regex = r"(SQL:)(.+)(\sCost)"
# Compile match pattern
pattern = re.compile(regex)

res = set()
try:
# The file path is changed according to the situation
with open("./log.txt", "r") as f:
while True:
line = f.readline().replace("\n", "")
if not line:
break
# Parenthesis is used for group capture and the 2th group is the SQL statement
find = pattern.search(line)
if find:
res.add(find.group(2))
except Exception as e:
print("Reason:", e)

try:
with open("./sql_statements.txt", "a") as f:
for sql in res:
f.write(sql)
except Exception as e:
print("Reason:", e)

意外状况

要筛选出像这样的语句:

1
2
3
SELECT `id` FROM `form_element`  WHERE (id = '13126454')
SELECT `id` FROM `form_element` WHERE (id = '13126418')
SELECT `id` FROM `form_element` WHERE (id = '13126429')

不考虑key值,这类的语句仅保留一条。其实和要求1类似,稍微修改一下即可,正则规则如下:

1
(SELECT)\s(`\w+`.?\s?)+\s(FROM)\s(`\w+.?\s?`)+\s+(WHERE)\s+\(.+\)

尾巴

本篇文章就到这里,感谢阅读。

------ 本文结束 ------