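Mechanize, a Ruby library for interacting with websites: scraping, logging in, and so on;
OCR, a handy tool for recognizing CAPTCHAs.
1. Simple username/password login, no CAPTCHA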
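| require 'mechanize'
@agent = Mechanize.new
page = @agent.get('login_url')   # replace 'login_url' with the real login page URL
form = page.forms.first          # assumes the login form is the first form on the page
form.username = 'username'       # accessors are named after the form's input names
form.password = 'password'
form.submit
|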
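If the login page carries more than one form (a search box, for example), taking the first form may pick the wrong one. A minimal sketch of choosing the form by its password field is shown below. The helper name `submit_login` and the default field names are assumptions for illustration, and `page` is only assumed to behave like a `Mechanize::Page` (responds to `forms`, whose items respond to `has_field?`, `[]=` and `submit`).

```ruby
# Hypothetical helper: fill in and submit the first form that has a password field.
# `page` is expected to behave like a Mechanize::Page.
def submit_login(page, username, password,
                 user_field: 'username', pass_field: 'password')
  # Pick the form by content instead of by position on the page
  form = page.forms.find { |f| f.has_field?(pass_field) }
  raise ArgumentError, "no form with a '#{pass_field}' field" unless form

  form[user_field] = username
  form[pass_field] = password
  form.submit
end

# Usage with Mechanize (needs the gem and network access):
#   require 'mechanize'
#   agent = Mechanize.new
#   result_page = submit_login(agent.get('login_url'), 'username', 'password')
```

Setting fields through `form[name] = value` avoids relying on generated accessors when the input names are not valid Ruby method names.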
2. Login with a CAPTCHA
First install tesseract
| brew install tesseract
gem install tesseract-ocr
|
Recognition may not succeed on the first try, so loop until a 4-digit numeric code is found.
| images = page.search('.checkcode img')
vali_code = get_vali_code(images.first['src'])   # pass the src string, not the attribute object
|
| require 'tesseract'

def get_vali_code(src)
  # Build the OCR engine once instead of on every retry
  engine = Tesseract::Engine.new { |e|
    e.language  = :eng
    e.blacklist = '|'
  }
  code = ''
  loop do
    # Re-download the image on each attempt and hand its raw bytes to the engine
    image_string = @agent.get(src).body
    code = engine.text_for(image_string).strip[/\d+/].to_s
    break if code.size == 4
  end
  code
end
|
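The loop above keeps retrying until OCR yields four digits, which can spin forever if the image is never readable. A minimal sketch of a bounded variant follows. The names `four_digit_code`, `retry_for_code` and `max_attempts` are hypothetical, and the OCR call is passed in as a block so the retry logic runs without Tesseract installed. Unlike the original, it accepts any isolated run of exactly four digits rather than only the first digit run.

```ruby
# Return a run of exactly four digits from OCR output, or nil if there is none.
def four_digit_code(text)
  text.to_s[/(?<!\d)\d{4}(?!\d)/]
end

# Call the block up to max_attempts times; the block should return raw OCR text.
# Returns the first 4-digit code found, or nil when every attempt fails.
def retry_for_code(max_attempts: 10)
  max_attempts.times do
    code = four_digit_code(yield)
    return code if code
  end
  nil
end

# Usage with the engine and agent from the post (needs both gems and network access):
#   code = retry_for_code { engine.text_for(@agent.get(src).body) }
```

Returning `nil` after the last attempt lets the caller decide whether to reload the login page or give up.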
3. Some useful snippets
GET and POST
| page1 = @agent.get(url)
page2 = @agent.post(url, params)
|
Find elements
| page.search('.checkcode img')
|
Current URL
Working with links
Iterate
| page.links.each {|link| }
|
Get the address
Click
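The link operations listed above (iterate, get the address, click) can be sketched with two small helpers. The helper names `link_table` and `click_first` are hypothetical. `page` is only assumed to behave like a `Mechanize::Page`, whose `links` respond to `text`, `href` and `click`.

```ruby
# Collect [text, href] pairs for every link on the page.
def link_table(page)
  page.links.map { |link| [link.text.to_s.strip, link.href] }
end

# Click the first link whose text matches `pattern`.
# Returns whatever #click returns (the next page in Mechanize), or nil if no link matches.
def click_first(page, pattern)
  link = page.links.find { |l| l.text.to_s =~ pattern }
  link && link.click
end

# Usage with Mechanize (needs the gem and network access):
#   page = @agent.get(url)
#   puts page.uri                       # URL of the current page
#   link_table(page).each { |text, href| puts "#{text} -> #{href}" }
#   next_page = click_first(page, /next/i)
```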
4. Reference
http://asciicasts.com/episodes/191-mechanize
https://github.com/meh/ruby-tesseract-ocr