使用正则表达式筛选日志中的表名或SQL语句

近几天收到学弟的求助，具体是关于从日志中筛选数据的，帮忙写了几则正则，总结记录一下。

要求

一开始的要求：
在日志文件中，一行为一条日志记录，从这些日志中，筛选出SQL语句涉及到的表名。
修改后的要求：
从日志中，筛选SQL语句

编写正则表达式

首先看一下日志格式：

Info 2020-04-19 20:02:36.114+08:00 logger.go:81 10.11.39.230 ad.tetris.clue_binlog - default canary lf 0 _podName=dp-8d1fdf3061-d474b99cc-hggd5 _language=go _process=186 _msg=GORM LOG SQL:SELECT * FROM `instance_consult`  WHERE (consult_id = '1663479833969672') LIMIT 1 Cost:1.09ms
Info 2020-04-19 20:02:36.144+08:00 logger.go:81 10.11.39.230 ad.tetris.clue_binlog - default canary lf 0 _podName=dp-8d1fdf3061-d474b99cc-hggd5 _process=186 _language=go _msg=GORM LOG SQL:SELECT * FROM `clue`  WHERE `clue`.`clue_id` = '1664399848891405' ORDER BY `clue`.`clue_id` ASC LIMIT 1 Cost:1.51ms
Info 2020-04-19 20:02:36.149+08:00 logger.go:81 10.11.39.230 ad.tetris.clue_binlog - default canary lf 0 _process=186 _podName=dp-8d1fdf3061-d474b99cc-hggd5 _language=go _msg=GORM LOG SQL:SELECT * FROM `instance_form`  WHERE `instance_form`.`instance_id` = '1662691860828215' ORDER BY `instance_form`.`instance_id` ASC LIMIT 1 Cost:1.34ms
Info 2020-04-19 20:02:36.144+08:00 logger.go:81 10.11.39.230 ad.tetris.clue_binlog - default canary lf 0 _podName=dp-8d1fdf3061-d474b99cc-hggd5 _process=186 _language=go _msg=GORM LOG SQL:SELECT * FROM `clue`, `test1`, `test_table`  WHERE `clue`.`clue_id` = '1664399848891405' ORDER BY `clue`.`clue_id` ASC LIMIT 1 Cost:1.51ms

正则我也是即用即学（不想记），Google上搜个正则表达式30分钟教程，随便一个点进去就行了。

开工！

对应要求1的正则表达式

由于要得到表名（学弟说不考虑join的情况），那么一般就是from table a, table b where的情况了，考虑多表情况下，正则如下：

1 2	# apply only to "from ... where" situation regex = r"(\bFROM\s)+(`\w+`,?\s?)+(\s+\w?\sWHERE)+"

加上()，分组捕获，由于这里没有多种情况，所以不考虑捕获分组或者非捕获分组，感兴趣的可以查看《正则之括号的捕获和非捕获属性》这篇文章；

示例代码如下：

import re


# apply only to "from ... where" situation
regex = r"(\bFROM\s)+(`\w+`,?\s?)+(\s+\w?\sWHERE)+"
# Compile match pattern
pattern = re.compile(regex) 

res = set()
# The file path is changed according to the situation
with open("./log.txt", "r") as f:
    while True:
        line = f.readline().replace("\n", "")
        if not line:
            break
        # Just get the table name(or tables' name);
        # Because of the multi-table case, group(0) is used to get the entire matching string
        find = re.search(regex, line).group(0)
        if find:
            find = find.replace("FROM ", "").replace("  WHERE", "")
            # Multi-table situation: from table1, table2, table3 ... where
            if "," in find:
                names = find.split(", ")
                for name in names:
                    res.add(name.replace("`", ""))
            else:
                res.add(find.replace("`", ""))

print(res)

在这里解释一下，find = find.replace("FROM ", "").replace(" WHERE", "")这条语句，使用replace进行替换而非正则的sub，原因是：对于一般固定不变的常规字符串进行替换，使用replace速度最快；详情可自行查阅python中replace与re.sub运行效率比较。

对应要求2的正则表达式

要求2仅要求筛选出SQL语句，并写入文件，那就更方便了，只要匹配以SQL:开头的日志记录即可，代码如下：

import re


regex = r"(SQL:)(.+)(\sCost)"
# Compile match pattern
pattern = re.compile(regex)

res = set()
try:
    # The file path is changed according to the situation
    with open("./log.txt", "r") as f:
        while True:
            line = f.readline().replace("\n", "")
            if not line:
                break
            # Parenthesis is used for group capture and the 2th group is the SQL statement
            find = pattern.search(line)
            if find:
                res.add(find.group(2))
except Exception as e:
    print("Reason:", e)

try:
    with open("./sql_statements.txt", "a") as f:
        for sql in res:
            f.write(sql)
except Exception as e:
    print("Reason:", e)

意外状况

要筛选出像这样的语句：

1
2
3

SELECT `id` FROM `form_element`  WHERE (id = '13126454')
SELECT `id` FROM `form_element`  WHERE (id = '13126418')
SELECT `id` FROM `form_element`  WHERE (id = '13126429')

不考虑key值，这类的语句仅保留一条。其实和要求1类似，稍微修改一下即可，正则规则如下：

1	(SELECT)\s(`\w+`.?\s?)+\s(FROM)\s(`\w+.?\s?`)+\s+(WHERE)\s+\(.+\)

尾巴

本篇文章就到这里，感谢阅读。