Last edited by modao on 2023-11-04 15:23
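A few days ago a video led me to a novel site. Its content has little substance, yet it uses CSS anti-scraping and OB obfuscation, so I read through its algorithm. After understanding it and writing a script, I felt that this site is fairly easy to reverse, which makes it a good target for a "perfect crack" and a beginner-level write-up. By "perfect" I mean: given only the HTML link of one chapter, the script retrieves that chapter's content.
Sample address (Base64-encoded): aHR0cHM6Ly9nLmhvbmdzaHUuY29tL2NvbnRlbnQvMTIxMTAyLzI3NzM2MDkuaHRtbA==
Tech stack: ts-node and typescript; babel for AST processing; superagent for network requests; cheerio for parsing HTML text.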
Algorithm analysis
The HTML structure of the novel content is simple: each paragraph is a p tag, with some out-of-place span tags in the middle of the text:
<div class="rdtext"fsize="16">
<p><span class="context_kw6"></span>叫顾凡<span class="context_kw0"></span>顾<span class="context_kw7"></span>得<span class="context_kw4"></span>顾<span class="context_kw0"></span>平凡<span class="context_kw4"></span>凡(some text omitted)</p>
</div>
Some characters of the novel are missing from the markup and have to be rendered through the CSS ::before pseudo-element, for example:
.context_kw0::before { content: ","; }
This leaves the reading experience untouched while making reverse engineering harder.
Facing this sample, I could think of only a few entry points: a global search for context_kw and document, on the guess that the JS code has to create a style tag dynamically and add CSS rules. It turned out that context_kw appears only once in the JS code of the HTML document, inside a huge array of constant strings. With luck that constant string would not have been obfuscated, and context_kw would be visible directly in the business code. Anyone familiar with the usual OB obfuscation patterns will immediately realize that this code is doing something crucial that it would rather keep hidden.
First, restore the code through the AST (the cff, switchCFF and other functions used here all come from my open-source project):
import * as parser from '@babel/parser';
import { renameVars } from './rename_vars';
import generator from '@babel/generator';
import { getFile, writeOutputToFile } from './file_utils';
import { memberExpComputedToFalse } from './member_exp_computed_to_false';
import { translateLiteral } from './translate_literal';
import traverse from '@babel/traverse';
import { Node, isIdentifier, isNumericLiteral, stringLiteral, isStringLiteral } from '@babel/types';
import { cff } from './remove_cff';
import { switchCFF } from './remove_switch_cff';

const jsCode = getFile('src/inputs/小说.ts');
const ast = parser.parse(jsCode);

// This code is incorrect if there is more than one constant table
function restoreStringLiteral (ast: Node, stringLiteralFuncs: string[], getStringArr: (idx: number) => string) {
  // Collect the variables involved in constant-string hiding
  traverse(ast, {
    VariableDeclarator (path) {
      const vaNode = path.node;
      if (!isIdentifier(vaNode.init) || !isIdentifier(vaNode.id)) return;
      if (stringLiteralFuncs.includes(vaNode.init.name)) {
        stringLiteralFuncs.push(vaNode.id.name);
      }
    }
  });
  traverse(ast, {
    CallExpression (path) {
      const cNode = path.node;
      if (!isIdentifier(cNode.callee)) return;
      const varName = cNode.callee.name;
      if (!stringLiteralFuncs.includes(varName)) return;
      const literalNode = cNode.arguments[0];
      if (cNode.arguments.length !== 1 || (!isNumericLiteral(literalNode) && !isStringLiteral(literalNode))) return;
      const idx = Number(literalNode.value);
      path.replaceWith(stringLiteral(getStringArr(idx)));
    }
  });
}
// The JS code has to be run by hand here to obtain the final big array
restoreStringLiteral(ast, ['_0x0a9e'], (idx: number) => {
  return [
    'pad', 'clamp', 'sigBytes', 'words', 'BXNBf', 'OMxlD', 'GhFlG', 'JxsFw', 'iksgN', 'qDbwG', 'prototype', 'spzgJ', 'test',
    'lo1c0tQyRk7E/Lr2p3puiAKrzgb8Absq4EWawXjoVfP230ItoMvvmsg3H8ccHG1u1qA+T/T4f3Rwi5j40osnuhQGtUj0w5rjN5FglNam4JRHNS126MHWX6+Zk/Aez8M7WttDCxtn6N6/pwWRtVat6vPkvmw9ETifmJ5C94R9hoGnDvNjntiKW6m5HPr+b/j0IvHCUJz8pX4ofi12NyD5aA==',
    'enc', 'Latin1', 'parse', 'B79CD410AF398F7A', 'window', 'location', 'href', '146385F634C9CB00', 'ZeroPadding', 'toString', 'Utf8', 'split', 'length',
    'createElement', 'style', 'type', 'text/css', 'setAttribute', 'link', 'getElementsByTagName', 'gHLRp', 'CbiRt', 'oKMpY', 'parentNode', 'head', 'appendChild',
    '4|1|2|5|3|0', 'fromCharCode', 'NQvuJ', 'TYKEL', 'undefined', 'tfwZU', 'ffVsL', 'styleSheets', 'addRule', '.context_kw', '::before', 'content: "', 'insertRule', '::before{content: "'
  ][idx];
});

cff(ast);
switchCFF(ast);

memberExpComputedToFalse(ast);
renameVars(
  ast,
  (name: string) => name.substring(0, 3) === '_0x',
  {}
);
translateLiteral(ast);

const { code } = generator(ast);
writeOutputToFile('小说_out.js', code);
Once the code is restored, the analysis goes smoothly; dynamic debugging is not even needed.
The code that adds the CSS rules rendering the missing text to the style sheet:
for (var i = 0; i < words.length; i++) {
  try {
    document.styleSheets[0].addRule(".context_kw" + i + "::before", "content: \"" + words[i] + "\"");
  } catch (v72) {
    document.styleSheets[0].insertRule(".context_kw" + i + "::before{content: \"" + words[i] + "\"}", document.styleSheets[0].cssRules.length);
  }
}
The AES decryption is shown below. data is the encrypted payload and differs for every HTML link. My guess is that a template engine first renders the constant strings and OB obfuscation is applied afterwards. Anyone interested could set up a node/django backend to implement this scheme; doing it with node is probably the easiest.
var data = "<encrypted data>";
var keywords = CryptoJS.enc.Latin1.parse("B79CD410AF398F7A");
var iv = '';
try {
  // There is even an environment check: a node environment ends up with the wrong iv
  if (top.window.location.href != window.location.href) {
    top.window.location.href = window.location.href;
  }
  iv = CryptoJS.enc.Latin1.parse('E91BF3347F7D1274');
} catch (v44) {
  iv = CryptoJS.enc.Latin1.parse("146385F634C9CB00");
}
var decrypted = CryptoJS.AES.decrypt(data, keywords, {
  'iv': iv,
  'padding': CryptoJS.pad.ZeroPadding
});
var secWords = decrypted.toString(CryptoJS.enc.Utf8).split(',');
var words = new Array(secWords.length);
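For readers who would rather not depend on crypto-js, the same decryption can probably be reproduced with Node's built-in crypto module. The following is only a minimal sketch under stated assumptions: crypto-js is taken to use its defaults where the snippet above does not override them (CBC mode, Base64 ciphertext string), the 16-character key and iv are treated as 16 raw bytes (AES-128), and the names decryptZeroPadded and encryptZeroPadded are hypothetical. The encrypt helper exists only to produce data for a round trip.

```typescript
import { createCipheriv, createDecipheriv } from 'crypto';

// Hypothetical helper: AES-128-CBC decryption with zero padding.
// Assumes the ciphertext is Base64 and key/iv are 16 Latin1 characters each.
export function decryptZeroPadded (dataBase64: string, keyLatin1: string, ivLatin1: string): string {
  const key = Buffer.from(keyLatin1, 'latin1');
  const iv = Buffer.from(ivLatin1, 'latin1');
  const decipher = createDecipheriv('aes-128-cbc', key, iv);
  decipher.setAutoPadding(false); // zero padding is stripped manually below
  const plain = Buffer.concat([
    decipher.update(Buffer.from(dataBase64, 'base64')),
    decipher.final()
  ]);
  // Drop the trailing 0x00 bytes that zero padding appended
  let end = plain.length;
  while (end > 0 && plain[end - 1] === 0) end--;
  return plain.subarray(0, end).toString('utf8');
}

// Hypothetical counterpart, used here only to build test data.
export function encryptZeroPadded (text: string, keyLatin1: string, ivLatin1: string): string {
  const raw = Buffer.from(text, 'utf8');
  const padded = Buffer.alloc(Math.ceil(raw.length / 16) * 16); // zero-filled
  raw.copy(padded);
  const cipher = createCipheriv(
    'aes-128-cbc',
    Buffer.from(keyLatin1, 'latin1'),
    Buffer.from(ivLatin1, 'latin1')
  );
  cipher.setAutoPadding(false);
  return Buffer.concat([cipher.update(padded), cipher.final()]).toString('base64');
}
```

The key and iv passed in a round trip can be any 16-character strings; the values from the site are only needed when decrypting its real data.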
The algorithm that builds the words array (this code was protected by control-flow flattening):
for (var i = 0; i < secWords.length; i++) {
  var v50 = secWords[i];
  var v51 = function (v52) {
    var v53 = {
      'NQvuJ': function v54(v55, v56) { return v55 % v56; },
      'TYKEL': function v57(v58, v59) { return v58 - v59; },
      'lHuwG': function v60(v61, v62) { return v61 - v62; }
    };
    return v52 % 2 ? v52 - 2 : v52 - 4;
  };
  var v63 = function (v64) {
    var v65 = {
      'HYHqw': function v66(v67, v68) { return v67 + v68; },
      'tfwZU': function v69(v70, v71) { return v70 * v71; },
      'ffVsL': "undefined"
    };
    // Environment check
    return v64 + 3 * +!(typeof document === "undefined");
  };
  v50 = v51(v50);
  v50 = v63(v50);
  words[i] = String.fromCharCode(v50);
}
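Stripped of the dead helper objects, the transformation boils down to a small piece of arithmetic: odd codes lose 2, even codes lose 4, and 3 is added back when document exists. A minimal sketch follows; the names decodeCharCode and decodeWords and the hasDocument parameter are hypothetical and only model the environment check.

```typescript
// Hypothetical helper: mirrors v51 followed by v63 from the restored code.
// hasDocument models `typeof document !== "undefined"` (true in a browser).
export function decodeCharCode (code: number, hasDocument = true): number {
  const shifted = code % 2 ? code - 2 : code - 4;
  return shifted + 3 * Number(hasDocument);
}

// Turns the decrypted, comma-separated numbers into the characters
// that the ::before rules are meant to display.
export function decodeWords (secWords: string[]): string[] {
  return secWords.map((s) => String.fromCharCode(decodeCharCode(Number(s))));
}

// In the browser case an odd code ends up at code + 1 and an even code at code - 1.
console.log(decodeWords(['65291'])); // [ ',' ] (U+FF0C)
```

With hasDocument set to false the result is off by 3, which is what a plain node run of the original code would produce.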
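At this point the algorithm is clear enough to write the exploit script.
Perfect reverse engineering
Is a perfect reverse really necessary? Probably not, but writing the complete reversing code for this site yourself should still improve your programming skills by a tiny amount (I should not be the only one spending precious youth on such a small matter). The algorithm is simple, but everything gets more tedious once you aim for a perfect crack. There are many simple problems to solve:
- python or nodejs? Only one request has to be sent and no Cookie has to be maintained, so I chose ts-node, typescript and superagent.
- How to parse the HTML document? cheerio. Its API is deliberately modeled on jQuery and it is very pleasant to use.
- How to obtain data, key, iv? The earlier approach of running the JS code by hand to obtain the final big array has to be extended.
- How to replace <span> with text? With cheerio this is simple:
const htmlHasOnlyP = inputHTML.replaceAll(/<span.*?<\/span>/g, (span) => {
  const idx = Number((span.match(/\d+/) || [])[0]);
  return words[idx];
});
const $ = cheerio.load(htmlHasOnlyP);
return $('p').text().split('\n').map((txt) => txt.trim()).join(os.EOL);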
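The regular expression above takes the first run of digits found in any span. A slightly stricter variant could capture the index from the class name itself and leave unknown spans untouched. This is only a sketch: the name fillSpans is hypothetical, and it assumes every placeholder has exactly the form `<span class="context_kw N"></span>` (without the space) seen in the sample HTML.

```typescript
// Hypothetical helper: replaces each placeholder span with words[N],
// where N is captured from the class name context_kwN.
export function fillSpans (html: string, words: string[]): string {
  return html.replace(
    /<span class="context_kw(\d+)"><\/span>/g,
    (whole: string, idx: string) => {
      const word = words[Number(idx)];
      // Keep the span when the index is out of range, so gaps stay visible
      return word === undefined ? whole : word;
    }
  );
}

const demo = '<p><span class="context_kw1"></span>A<span class="context_kw0"></span>B</p>';
console.log(fillSpans(demo, [',', '。'])); // <p>。A,B</p>
```

Leaving unmatched spans in place makes a wrong key or iv easier to notice than silently inserting the string "undefined".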
Finally, a few words on how the third item above is implemented. As mentioned in the previous section, data, key, iv are presumably rendered by a template engine first and then obfuscated with OB. Our process therefore runs in reverse: first deobfuscate through the AST; then, assuming all constant strings have been restored, match the AST nodes of the declaration or assignment statements of data, key, iv; finally read out the constant strings.
The most troublesome step is undoing the constant-string hiding. Limited by my programming ability, I cannot write an AST script that works everywhere, so I might as well let go and write custom code for each site that uses OB, expressing more "business-specific" features (those that apply only to the current site) in code. A sample of this novel site's constant-string hiding code:
var _0xa9e0 = ['JxsFw', 'iksgN', 'qDbwG', 'prototype', 'spzgJ', 'test',
  'lo1c0tQyRk7E/Lr2p3puiAKrzgb8Absq4EWawXjoVfP230ItoMvvmsg3H8ccHG1u1qA+T/T4f3Rwi5j40osnuhQGtUj0w5rjN5FglNam4JRHNS126MHWX6+Zk/Aez8M7WttDCxtn6N6/pwWRtVat6vPkvmw9ETifmJ5C94R9hoGnDvNjntiKW6m5HPr+b/j0IvHCUJz8pX4ofi12NyD5aA==',
  'enc', 'Latin1', 'parse', 'B79CD410AF398F7A', 'window', 'location', 'href', '146385F634C9CB00', 'ZeroPadding', 'toString', 'Utf8', 'split', 'length',
  'createElement', 'style', 'type', 'text/css', 'setAttribute', 'link', 'getElementsByTagName', 'gHLRp', 'CbiRt', 'oKMpY', 'parentNode', 'head', 'appendChild',
  '4|1|2|5|3|0', 'fromCharCode', 'NQvuJ', 'TYKEL', 'undefined', 'tfwZU', 'ffVsL', 'styleSheets', 'addRule', '.context_kw', '::before', 'content:\x20\x22', 'insertRule', '::before{content:\x20\x22',
  'pad', 'clamp', 'sigBytes', 'words', 'BXNBf', 'OMxlD', 'GhFlG'];
(function (_0x149720, _0x36191f) {
  var _0x19a768 = function (_0x5065e2) {
    while (--_0x5065e2) {
      _0x149720['push'](_0x149720['shift']());
    }
  };
  _0x19a768(++_0x36191f);
}(_0xa9e0, 0x1a9));
var _0x0a9e = function (_0x2b4d76, _0x47bf96) {
  _0x2b4d76 = _0x2b4d76 - 0x0;
  var _0x4230d8 = _0xa9e0[_0x2b4d76];
  return _0x4230d8;
};
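What we need to do is mainly:
- Get the offset, which is 0x0 in the example above.
- Get the rotate count, which is 0x1a9 in the example above. This is needed to obtain the final value of the big array.
- Get the big array.
- Call the restoreStringLiteral function implemented above.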
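To make the roles of the rotate count and the offset concrete, the sketch below imitates what the hiding code does at runtime. It is an illustration only: rotateLikeIIFE, rotateBySlice and lookup are hypothetical names, and the loop assumes the pattern seen in the sample, where the helper is called with the incremented count and runs `while (--n)`, which amounts to exactly count left rotations.

```typescript
// Hypothetical helper: replays the IIFE's push/shift loop on a copy of the array.
export function rotateLikeIIFE<T> (source: T[], count: number): T[] {
  if (!Number.isInteger(count) || count < 0) throw new Error('count must be a non-negative integer');
  const arr = [...source];
  if (arr.length === 0) return arr;
  let n = count + 1; // the sample passes ++count to the inner helper
  while (--n) {
    arr.push(arr.shift() as T); // one left rotation per iteration
  }
  return arr;
}

// Equivalent closed form: a left rotation by count modulo the length.
export function rotateBySlice<T> (source: T[], count: number): T[] {
  if (source.length === 0) return [];
  const k = count % source.length;
  return [...source.slice(k), ...source.slice(0, k)];
}

// Hypothetical helper: what the string-hiding function returns for a given index.
export function lookup (rotated: string[], offset: number, idx: number): string {
  return rotated[idx - offset];
}

// The sample array has 54 entries and 0x1a9 % 54 === 47,
// so the entry at position 47 ('pad') becomes index 0 after rotation.
const table = Array.from({ length: 54 }, (_, i) => `s${i}`);
console.log(lookup(rotateLikeIIFE(table, 0x1a9), 0x0, 0)); // s47
```

Using the closed form avoids replaying hundreds of push/shift operations, at the cost of relying on the loop having exactly this shape.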
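The sample from this site was generated by an old version of OB, so the situation is relatively simple. The strategy I chose for this implementation:
- For the offset, the body of the function _0x0a9e is recognized only if it has the three-statement structure shown above; the offset value 0x0 is then taken from the first statement.
- The rotate count 0x1a9 is obtained by matching an immediately invoked function that has two arguments, the second of which is a NumericLiteral.
- Reusing the logic of the second item, the first argument _0xa9e0 of the immediately invoked function is taken to be the name of the big array, and the corresponding declaration statement is matched by that name string.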
The function is named autoRestoreStringLiteralViaIIFE. As the name suggests, the entry point I chose here is the immediately invoked function. The code:
import {
  isArrayExpression, isBlockStatement, isCallExpression, isExpressionStatement, isFunctionExpression,
  isIdentifier, isNumericLiteral, isReturnStatement, isStringLiteral, isVariableDeclaration,
  Node, stringLiteral, File
} from '@babel/types';
import traverse from '@babel/traverse';
import { strict as assert } from 'assert';
import generator from '@babel/generator';
// This code is incorrect if there is more than one constant table
export function restoreStringLiteral (ast: Node, stringLiteralFuncs: string[], getStringArr: (idx: number) => string) {
  // Collect the variables involved in constant-string hiding
  traverse(ast, {
    VariableDeclarator (path) {
      const vaNode = path.node;
      if (!isIdentifier(vaNode.init) || !isIdentifier(vaNode.id)) return;
      if (stringLiteralFuncs.includes(vaNode.init.name)) {
        stringLiteralFuncs.push(vaNode.id.name);
      }
    }
  });
  traverse(ast, {
    CallExpression (path) {
      const cNode = path.node;
      if (!isIdentifier(cNode.callee)) return;
      const varName = cNode.callee.name;
      if (!stringLiteralFuncs.includes(varName)) return;
      const literalNode = cNode.arguments[0];
      if (cNode.arguments.length !== 1 || (!isNumericLiteral(literalNode) && !isStringLiteral(literalNode))) return;
      const idx = Number(literalNode.value);
      path.replaceWith(stringLiteral(getStringArr(idx)));
    }
  });
}
// Left-rotates the array by count positions (count may exceed the length)
export function rotateArray<T> (a: T[], count: number) {
  count %= a.length;
  return [...a.slice(count), ...a.slice(0, count)];
}
export function autoRestoreStringLiteralViaIIFE (ast: File) {
  let constArrName = '';
  const INITIAL_SHIFT_NUM = -1234567;
  let shiftNum = INITIAL_SHIFT_NUM;
  // Find the top-level IIFE called with (identifier, numeric literal)
  ast.program.body.findIndex((bodyItem) => {
    if (!isExpressionStatement(bodyItem) || !isCallExpression(bodyItem.expression) ||
      !isFunctionExpression(bodyItem.expression.callee) || bodyItem.expression.arguments.length !== 2) return false;
    const [arg0, arg1] = bodyItem.expression.arguments;
    if (!isIdentifier(arg0) || !isNumericLiteral(arg1)) return false;
    constArrName = arg0.name;
    shiftNum = arg1.value;
    return true;
  });
  assert.ok(constArrName);
  assert.notEqual(shiftNum, INITIAL_SHIFT_NUM);

  let constArrContent: string[] = [];
  let stringHideVarName = '';
  let globalOffset = 0;
  traverse(ast, {
    VariableDeclaration (path) {
      const decl = path.node.declarations[0];
      if (!isIdentifier(decl.id)) return;
      // The declaration of the big array
      if (decl.id.name === constArrName && isArrayExpression(decl.init)) {
        constArrContent = decl.init.elements.map((item) => {
          assert.ok(isStringLiteral(item));
          return item.value;
        });
      }
      // The string-hiding function: two params and exactly three statements
      if (isFunctionExpression(decl.init)) {
        if (decl.init.params.length !== 2 || !isBlockStatement(decl.init.body) || decl.init.body.body.length !== 3) return;
        const [s1, s2, s3] = decl.init.body.body;
        if (!isExpressionStatement(s1) || !isVariableDeclaration(s2) || !isReturnStatement(s3)) return;

        path.traverse({
          BinaryExpression (path) {
            assert.ok(isNumericLiteral(path.node.right));
            globalOffset = path.node.right.value;
          }
        });

        const { code } = generator(s2);
        if (!code.includes(constArrName)) return;
        stringHideVarName = decl.id.name;
      }
    }
  });
  constArrContent = rotateArray(constArrContent, shiftNum);

  restoreStringLiteral(ast, [stringHideVarName], (idx: number) => {
    return constArrContent[idx - globalOffset];
  });
}
Complete code:
import { isIdentifier, isStringLiteral } from '@babel/types';
import traverse from '@babel/traverse';
import CryptoJS from 'crypto-js';
import superagent from 'superagent';
import * as cheerio from 'cheerio';
import * as parser from '@babel/parser';
import fs from 'fs';
import { strict as assert } from 'assert';
import path from 'path';
import { autoRestoreStringLiteralViaIIFE } from '../restoreStringLiteral';
import os from 'os';
import { getLegalFileName } from '../file_utils';
export function parseHTML (inputHTML: string, data: string, keywordsEnc: string, ivEnc: string) {
  const keywords = CryptoJS.enc.Latin1.parse(keywordsEnc);
  const iv = CryptoJS.enc.Latin1.parse(ivEnc);
  const decrypted = CryptoJS.AES.decrypt(data, keywords, {
    iv,
    padding: CryptoJS.pad.ZeroPadding
  });
  const secWords = decrypted.toString(CryptoJS.enc.Utf8).split(',');
  // console.log('secWords', secWords); // dbg
  const words: string[] = [];
  for (let i = 0; i < secWords.length; i++) {
    let v50 = Number(secWords[i]);
    const v51 = function (v52: number) {
      return v52 % 2 ? v52 - 2 : v52 - 4;
    };
    const v63 = function (v64: number) {
      return v64 + 3 * +true;
    };
    v50 = v51(v50);
    v50 = v63(v50);
    words[i] = String.fromCharCode(v50);
  }
  // console.log(words); // dbg

  const htmlHasOnlyP = inputHTML.replaceAll(/<span.*?<\/span>/g, (span) => {
    const idx = Number((span.match(/\d+/) || [])[0]);
    return words[idx];
  });
  const $ = cheerio.load(htmlHasOnlyP);
  return $('p').text().split('\n').map((txt) => txt.trim()).join(os.EOL);
}
export function getDataFromJSCode (jsCode: string) {
  const ast = parser.parse(jsCode);

  autoRestoreStringLiteralViaIIFE(ast);

  let data = '';
  let keywordsEnc = '';
  let ivEnc = '';
  traverse(ast, {
    VariableDeclarator (path) {
      const idNode = path.node.id;
      if (!isIdentifier(idNode)) return;
      if (idNode.name === 'keywords') {
        path.traverse({
          CallExpression (path) {
            const args = path.node.arguments;
            assert.equal(args.length, 1);
            assert.ok(isStringLiteral(args[0]));
            keywordsEnc = args[0].value;
          }
        });
      }
      if (idNode.name === 'data') {
        assert.ok(isStringLiteral(path.node.init));
        data = path.node.init.value;
      }
    },
    AssignmentExpression (path) {
      // Only the first assignment to iv is used
      if (!isIdentifier(path.node.left) || path.node.left.name !== 'iv' || ivEnc) return;
      path.traverse({
        CallExpression (path) {
          const args = path.node.arguments;
          assert.equal(args.length, 1);
          assert.ok(isStringLiteral(args[0]));
          ivEnc = args[0].value;
        }
      });
    }
  });
  return { data, keywordsEnc, ivEnc };
}
export function getDataFromHTML ($: cheerio.CheerioAPI) {
  const scriptTags = $('script');
  let data = '', keywordsEnc = '', ivEnc = '';
  scriptTags.each((i, el) => {
    const jsCode = $(el).text();
    // Skip script tags that do not use CryptoJS; stop at the first one that does
    if (!jsCode.includes('CryptoJS')) return true;
    const res = getDataFromJSCode(jsCode);
    data = res.data;
    keywordsEnc = res.keywordsEnc;
    ivEnc = res.ivEnc;
    return false;
  });
  return { data, keywordsEnc, ivEnc };
}

export function getInputHTML (text: string) {
  const $ = cheerio.load(text);
  return { $, inputHTML: $('.rdtext').html() || '' };
}
function main () {
  const novelURLs: string[] = [
    // List of chapter HTML links, simply processed one by one
  ];
  const headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not_A Brand";v="99", "Google Chrome";v="109", "Chromium";v="109"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'Referer': '<referer>',
    'Referrer-Policy': 'strict-origin-when-cross-origin'
  };
  novelURLs.forEach(async (novelURL) => {
    const resp = await superagent.get(novelURL).set(headers);
    console.log('resp.text', resp.text.substring(0, 100)); // dbg
    const { $, inputHTML } = getInputHTML(resp.text);
    const { data, keywordsEnc, ivEnc } = getDataFromHTML($);
    const resultNovelText = parseHTML(inputHTML, data, keywordsEnc, ivEnc);
    console.log('resultNovelText', resultNovelText); // dbg
    const fileName = path.resolve('src', 'outputs', getLegalFileName(
      `${$('title').html()}.txt`,
      `${novelURL.substring(novelURL.lastIndexOf('/') + 1)}.txt`
    ));
    fs.writeFileSync(fileName, resultNovelText, 'utf-8');
  });
}
if (require.main === module) { main(); }
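A side note on main: forEach with an async callback starts all requests at once and leaves rejected promises unhandled. If that matters, one option is to process the links one at a time and collect failures. The following is only a sketch of that idea; processSequentially and ChapterResult are hypothetical names, and handle stands for whatever per-chapter work is wanted (request, parse, write).

```typescript
// Hypothetical result record for one chapter link.
export interface ChapterResult {
  url: string;
  ok: boolean;
  error?: string;
}

// Hypothetical helper: runs handle for each URL in order and records
// failures instead of letting one rejected promise go unhandled.
export async function processSequentially (
  urls: string[],
  handle: (url: string) => Promise<void>
): Promise<ChapterResult[]> {
  const results: ChapterResult[] = [];
  for (const url of urls) {
    try {
      await handle(url); // the next request starts only after this one settles
      results.push({ url, ok: true });
    } catch (err) {
      results.push({ url, ok: false, error: err instanceof Error ? err.message : String(err) });
    }
  }
  return results;
}
```

Sequential processing is slower than firing every request at once, but it is gentler on the site and makes it easy to see which chapter failed.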
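The end.
Note: if you repost this, please credit 大神论坛 as the source (with the address of this post) and include the author information.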