Interacting with Websites Using Mechanize and OCR

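Mechanize is a Ruby library for interacting with websites: scraping pages, logging in, and so on. OCR (optical character recognition) is a handy way to read CAPTCHAs.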

1. Simple username and password login, no CAPTCHA

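require 'mechanize'

@agent = Mechanize.new
page = @agent.get("login_url")
# Take the login form from the page (assumed here to be the first form).
form = page.forms.first
form.username = 'username'
form.password = 'password'
form.submit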

2. Login with a CAPTCHA

First, install Tesseract and its Ruby binding:

brew install tesseract
gem install tesseract-ocr

Recognition does not always succeed on the first try, so loop until a 4-digit numeric code is found.

# Locate the CAPTCHA image on the page and pass its src (a String) to the helper.
images = page.search('.checkcode img')
vali_code = get_vali_code(images.first['src'])
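require 'tesseract'

# Download the CAPTCHA at src and OCR it, retrying until 4 digits come back.
def get_vali_code(src)
  # Build the engine once instead of on every attempt.
  engine = Tesseract::Engine.new do |e|
    e.language  = :eng
    e.blacklist = '|'   # characters the engine should never output
  end

  code = ''
  loop do
    # Fetch the image on each attempt.
    image_string = @agent.get(src).body_io.string
    # Keep the first run of digits in the recognized text ('' if none).
    code = engine.text_for(image_string).strip.match(/\d+/).to_s
    break if code.size == 4
  end
  code
end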
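
The loop above never exits if recognition keeps failing. A minimal sketch of a bounded variant follows, using only plain Ruby. The helper name `four_digit_code` and the `max_attempts` parameter are hypothetical, and the block passed in stands in for the "download the image and run OCR" step, so the retry logic can be tried without Tesseract installed.

```ruby
# Hypothetical helper: retry an OCR step until it yields a 4-digit code,
# giving up after max_attempts instead of looping forever.
# The block stands in for "download the image and run OCR on it".
def four_digit_code(max_attempts: 10)
  max_attempts.times do
    text = yield.to_s                 # raw OCR output, possibly noisy
    code = text.strip[/\d+/].to_s     # first run of digits, "" if none
    return code if code.size == 4
  end
  nil                                 # caller decides what to do on failure
end

# Usage with canned OCR results in place of a real engine.
fake_results = ["12", "a|b", " 4821\n"]
code = four_digit_code(max_attempts: 5) { fake_results.shift }
puts code.inspect   # prints "4821"
```

With a real engine, the block body would be something like the download and `text_for` call from get_vali_code, and a nil result could be turned into an error or a manual prompt.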

3. Some useful snippets

GET and POST

page1 = @agent.get(url)
page2 = @agent.post(url, params)
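
For the post call, params is typically a Hash of field names to values, which is sent as a URL-encoded form body. As a rough illustration of that encoding, here is a sketch using only Ruby's standard library. The helper name `form_body` and the field names are hypothetical, and this is not Mechanize's own code path.

```ruby
require 'uri'

# Hypothetical helper: encode a Hash of form fields the way an
# application/x-www-form-urlencoded request body is written.
def form_body(params)
  URI.encode_www_form(params)
end

params = { 'username' => 'alice', 'code' => '4821' }
puts form_body(params)   # prints username=alice&code=4821
```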

Find an element

page.search('.checkcode img')

Current URL

page.uri.to_s

Working with links

Iterate

page.links.each { |link| puts link.text }

Get the URL

link.href

Click

link.click
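
Note that link.href is the attribute as written in the HTML, so it may be a relative path. If an absolute URL is needed (for logging or de-duplication, say), one option is to resolve it against the current page URL. A minimal sketch with Ruby's standard URI module follows. The helper name `absolute_href` and the example URLs are hypothetical.

```ruby
require 'uri'

# Hypothetical helper: turn a possibly relative href into an absolute URL,
# using the URL of the page the link was found on as the base.
def absolute_href(page_url, href)
  URI.join(page_url, href).to_s
end

base = 'http://example.com/news/index.html'
puts absolute_href(base, 'item/42')   # prints http://example.com/news/item/42
puts absolute_href(base, '/login')    # prints http://example.com/login
```

In the snippets above, `page.uri.to_s` would supply the base and `link.href` the second argument.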

4. References

http://asciicasts.com/episodes/191-mechanize
https://github.com/meh/ruby-tesseract-ocr