Automatically detecting text encoding in Python
Universal Encoding Detector
http://chardet.feedparser.org/
Derived from the auto-detection code in Mozilla.
I tried it out, and it works quite well.
Along the way I also looked at http://www.feedparser.org/
and found there is also a Universal Feed Parser,
used for parsing RSS and ATOM data.
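For a quick feel of its API, here is a minimal sketch (the feed URL is just a placeholder, not a real one):

import feedparser

d = feedparser.parse('http://example.com/rss.xml')   # hypothetical feed URL
print d.feed.title                                    # feed-level title
for entry in d.entries:                               # one entry per RSS/ATOM item
    print entry.title, entry.link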
Example: Using the detect function
The detect function takes one argument, a non-Unicode string. It returns a dictionary containing the auto-detected character encoding and a confidence level from 0 to 1.
>>> import urllib
>>> rawdata = urllib.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
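The detected encoding can then be passed straight to decode(); continuing the session above (a minimal sketch):

>>> encoding = chardet.detect(rawdata)['encoding']
>>> text = rawdata.decode(encoding)   # raw bytes -> Unicode string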
Example: Detecting encoding incrementally
import urllib
from chardet.universaldetector import UniversalDetector

usock = urllib.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    detector.feed(line)           # feed the detector one line at a time
    if detector.done: break       # stop as soon as it is confident enough
detector.close()
usock.close()
print detector.result
{'encoding': 'EUC-JP', 'confidence': 0.99}
Example: Detecting encodings of multiple files
import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print filename.ljust(60),
    detector.reset()                      # reuse the same detector for each file
    for line in open(filename, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    print detector.result
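As a follow-up, a minimal sketch of using the detected encoding to re-save a file as UTF-8 ('page.xml' is a hypothetical file name):

import codecs
import chardet

rawdata = open('page.xml', 'rb').read()        # hypothetical input file
result = chardet.detect(rawdata)
text = rawdata.decode(result['encoding'])      # decode using the detected encoding
out = codecs.open('page.utf8.xml', 'w', 'utf-8')
out.write(text)                                # write back out as UTF-8
out.close()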