mò mã¸Ec@sßdZdkZdkZdkZdkZdklZlZdefd„ƒYZ dfd„ƒYZ d„Z dfd „ƒYZ d e eifd „ƒYZ d e eifd „ƒYZd„Zedjo eƒndS(sÃA simple "pull API" for HTML parsing, after Perl's HTML::TokeParser. Examples This program extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the ... tags: import pullparser, sys f = file(sys.argv[1]) p = pullparser.PullParser(f) for token in p.tags("a"): if token.type == "endtag": continue url = dict(token.attrs).get("href", "-") text = p.get_compressed_text(endat=("endtag", "a")) print "%s %s" % (url, text) This program extracts the from the document: import pullparser, sys f = file(sys.argv[1]) p = pullparser.PullParser(f) if p.get_tag("title"): title = p.get_compressed_text() print "Title: %s" % title Copyright 2003-2006 John J. Lee <jjl@pobox.com> Copyright 1998-2001 Gisle Aas (original libwww-perl code) This code is free software; you can redistribute it and/or modify it under the terms of the BSD or ZPL 2.1 licenses. N(���s���unescapes���unescape_charreft���NoMoreTokensErrorc�����������B���s���t��Z�RS(���N(���t���__name__t ���__module__(����(����(����t4���/data/zmath/zope/lib/python/mechanize/_pullparser.pyR����*���s����t���Tokenc�����������B���s>���t��Z�d��Z�e�d�„�Z�d�„��Z�d�„��Z�d�„��Z�d�„��Z�RS(���sk��Represents an HTML tag, declaration, processing instruction etc. Behaves as both a tuple-like object (ie. iterable) and has attributes .type, .data and .attrs. >>> t = Token("starttag", "a", [("href", "http://www.python.org/")]) >>> t == ("starttag", "a", [("href", "http://www.python.org/")]) True >>> (t.type, t.data) == ("starttag", "a") True >>> t.attrs == [("href", "http://www.python.org/")] True Public attributes type: one of "starttag", "endtag", "startendtag", "charref", "entityref", "data", "comment", "decl", "pi", after the corresponding methods of HTMLParser.HTMLParser data: For a tag, the tag name; otherwise, the relevant data carried by the tag, as a string attrs: list of (name, value) pairs representing HTML attributes (or None if token does not represent an opening tag) c���������C���s���|�|��_��|�|��_�|�|��_�d��S(���N(���t���typet���selft���datat���attrs(���R���R���R���R���(����(����R���t���__init__E���s����  c���������C���s���t��|��i�|��i�|��i�f�ƒ�S(���N(���t���iterR���R���R���R���(���R���(����(����R���t���__iter__I���s����c���������C���sO���|�\�}�}�}�|��i�|�j�o(�|��i�|�j�o�|��i�|�j�o�t�Sn�t�Sd��S(���N(���t���otherR���R���R���R���t���Truet���False(���R���R ���R���R���R���(����(����R���t���__eq__K���s����0c���������C���s���|��i�|�ƒ� S(���N(���R���R���R ���(���R���R ���(����(����R���t���__ne__S���s����c���������C���s<���d�i��t�t�|��i�|��i�|��i�g�ƒ�ƒ�}�|��i�i �d�|�S(���Ns���, s���(%s)( ���t���joint���mapt���reprR���R���R���R���t���argst ���__class__R���(���R���R���(����(����R���t���__repr__T���s����*( ���R���R���t���__doc__t���NoneR ���R ���R���R���R���(����(����(����R���R���,���s ��� �    c���������o���s9���x2�y�|��|�|�Ž��VWq�|�j �o �t�‚�q�Xq�Wd��S(���Ni���(���t���fnR���t���kwdst ���exceptiont ���StopIteration(���R���R���R���R���(����(����R���t���iter_until_exceptionX���s ������t���_AbstractParserc�����������B���s��t��Z�d�Z�e�i�d�ƒ�Z�h��d�d�<d�d�<d�d��d�„�Z�d�„��Z�d �„��Z �d �„��Z �d �„��Z �d �„��Z �d �„��Z �d�„��Z�d��d�„�Z�d�„��Z�d�„��Z�d�„��Z�d�„��Z�d�„��Z�d�„��Z�d�„��Z�d�„��Z�d�„��Z�d�„��Z�d�„��Z�d�„��Z�d�„��Z�RS(���Ni���s���\s+t���imgt���altt���applett���asciic���������C���sK���|�|��_�g��|��_�|�|��_�|�|��_�|�d�j�o �t�i �}�n�|�|��_ �d�S(���s†�� fh: file-like object (only a .read() method is required) from which to read HTML to be parsed textify: mapping used by .get_text() and .get_compressed_text() methods to represent opening tags as text encoding: encoding used to encode numeric character references by .get_text() and .get_compressed_text() ("ascii" by default) entitydefs: mapping like {"amp": "&", ...} containing HTML entity definitions (a sensible default is used). This is used to unescape entities in .get_text() (and .get_compressed_text()) and attribute values. If the encoding can not represent the character, the entity reference is left unescaped. Note that entity references (both numeric - e.g. { or ઼ - and non-numeric - e.g. &) are unescaped in attribute values and the return value of .get_text(), but not in data outside of tags. Instead, entity references outside of tags are represented as tokens. This is a bit odd, it's true :-/ If the element name of an opening tag matches a key in the textify mapping then that tag is converted to text. The corresponding value is used to specify which tag attribute to obtain the text from. textify maps from element names to either: - an HTML attribute name, in which case the HTML attribute value is used as its text value along with the element name in square brackets (eg."alt text goes here[IMG]", or, if the alt attribute were missing, just "[IMG]") - a callable object (eg. a function) which takes a Token and returns the string to be used as its text value If textify has no key for an element name, nothing is substituted for the opening tag. Public attributes: encoding and textify: see above N( ���t���fhR���t���_fht ���_tokenstackt���textifyt���encodingt ���entitydefsR���t���htmlentitydefst���name2codepointt ���_entitydefs(���R���R#���R&���R'���R(���(����(����R���R ���c���s����'�      c���������C���s���|��S(���N(���R���(���R���(����(����R���R ���“���s����c���������G���s���t��|��i�t�|�Œ�S(���N(���R���R���t���get_tagR����t���names(���R���R-���(����(����R���t���tags•���s����c���������G���s���t��|��i�t�|�Œ�S(���N(���R���R���t ���get_tokenR����t ���tokentypes(���R���R0���(����(����R���t���tokens˜���s����c���������C���s1���y�|��i�ƒ��SWn�t�j �o�t�ƒ��‚�n�Xd��S(���N(���R���R/���R����R���(���R���(����(����R���t���next›���s����c���������G���s���x†�xH�|��i�o=�|��i�i�d�ƒ�}�|�o�|�i�|�j�o�|�SqI�q�|�Sq�W|��i�i�|��i�ƒ�}�|�p �t �ƒ��‚�n�|��i �|�ƒ�q�Wd�S(���s<��Pop the next Token object from the stack of parsed tokens. If arguments are given, they are taken to be token types in which the caller is interested: tokens representing other elements will be skipped. Element names must be given in lower case. Raises NoMoreTokensError. i���i����N( ���R���R%���t���popt���tokenR0���R���R$���t���readt���chunkR���R����t���feed(���R���R0���R���R4���(����(����R���R/���¡���s���� ����    c���������C���s���|��i�i�d�|�ƒ�d�S(���s!���Push a Token back onto the stack.i����N(���R���R%���t���insertR4���(���R���R4���(����(����R���t ���unget_token¸���s�����c���������G���s_���xX�|��i�ƒ��}�|�i�d�d�d�g�j�o�q�n�|�o�|�i�|�j�o�|�SqV�q�|�Sq�Wd�S(���sA��Return the next Token that represents an opening or closing tag. If arguments are given, they are taken to be element names in which the caller is interested: tags representing other elements will be skipped. Element names must be given in lower case. Raises NoMoreTokensError. i���t���starttagt���endtagt ���startendtagN(���R���R/���t���tokR���R-���R���(���R���R-���R=���(����(����R���R,���¼���s���� ���  c��� ������C���s��g��}�d �}�xýy�|��i�ƒ��}�Wn,�t�j �o �|�o�|��i�|�ƒ�n�Pn�X|�i�d�j�o�|�i�|�i �ƒ�q�|�i�d�j�o0�t �d�|�i �|��i �|��i �ƒ�}�|�i�|�ƒ�q�|�i�d�j�o&�t�|�i �|��i �ƒ�}�|�i�|�ƒ�q�|�i�d�d�d�g�j�o|�i �}�|�i�d�d�g�j�o®�|��i�i�|�ƒ�}�|�d �j �o‹�t�|�ƒ�o�|�i�|�|�ƒ�ƒ�qÊ|�i�d �j �oS�x5�|�i�D]*�\�}�}�|�|�j�o�|�i�|�ƒ�q}q}W|�i�d �|�i�ƒ��ƒ�qÊqÎn�|�d �j�p�|�|�i�|�f�j�o�|��i�|�ƒ�Pqq�q�Wd �i�|�ƒ�S( ���s¯��Get some text. endat: stop reading text at this tag (the tag is included in the returned text); endtag is a tuple (type, name) where type is "starttag", "endtag" or "startendtag", and name is the element name of the tag (element names must be given in lower case) If endat is not given, .get_text() will stop at the next opening or closing tag, or when there are no more tokens (no exception is raised). Note that .get_text() includes the text representation (if any) of the opening tag, but pushes the opening tag back onto the stack. As a result, if you want to call .get_text() again, you need to call .get_tag() first (unless you want an empty string returned when you next call .get_text()). Entity references are translated using the value of the entitydefs constructor argument (a mapping from names to characters like that provided by the standard module htmlentitydefs). Named entity references that are not in this mapping are left unchanged. The textify attribute is used to translate opening tags into text: see the class docstring. i���R���t ���entityrefs���&%s;t���charrefR:���R;���R<���s���[%s]t����N(���t���textR���R=���R���R/���R����R9���R���t���appendR���t���unescapeR+���R'���t���tt���unescape_charreft���tag_nameR&���t���getR ���t���callableR���t���kt���vt���uppert���endatR���( ���R���RL���RA���R=���RF���RD���RJ���RI���R ���(����(����R���t���get_textÐ���sH��������    �  ## c���������O���s1���|��i�|�|�Ž��}�|�i�ƒ��}�|��i�i�d�|�ƒ�S(���s²��� As .get_text(), but collapses each group of contiguous whitespace to a single space character, and removes all initial and trailing whitespace. t��� N(���R���RM���R���R���RA���t���stript ���compress_ret���sub(���R���R���R���RA���(����(����R���t���get_compressed_text ��s����� c���������C���s ���|��i�i�t�d�|�|�ƒ�ƒ�d��S(���NR<���(���R���R%���RB���R���t���tagR���(���R���RS���R���(����(����R���t���handle_startendtag��s����c���������C���s ���|��i�i�t�d�|�|�ƒ�ƒ�d��S(���NR:���(���R���R%���RB���R���RS���R���(���R���RS���R���(����(����R���t���handle_starttag��s����c���������C���s���|��i�i�t�d�|�ƒ�ƒ�d��S(���NR;���(���R���R%���RB���R���RS���(���R���RS���(����(����R���t ���handle_endtag��s����c���������C���s���|��i�i�t�d�|�ƒ�ƒ�d��S(���NR?���(���R���R%���RB���R���t���name(���R���RW���(����(����R���t���handle_charref��s����c���������C���s���|��i�i�t�d�|�ƒ�ƒ�d��S(���NR>���(���R���R%���RB���R���RW���(���R���RW���(����(����R���t���handle_entityref��s����c���������C���s���|��i�i�t�d�|�ƒ�ƒ�d��S(���NR���(���R���R%���RB���R���R���(���R���R���(����(����R���t ���handle_data ��s����c���������C���s���|��i�i�t�d�|�ƒ�ƒ�d��S(���Nt���comment(���R���R%���RB���R���R���(���R���R���(����(����R���t���handle_comment"��s����c���������C���s���|��i�i�t�d�|�ƒ�ƒ�d��S(���Nt���decl(���R���R%���RB���R���R]���(���R���R]���(����(����R���t ���handle_decl$��s����c���������C���s���|��i�i�t�d�|�ƒ�ƒ�d��S(���NR]���(���R���R%���RB���R���R���(���R���R���(����(����R���t ���unknown_decl&��s����c���������C���s���|��i�i�t�d�|�ƒ�ƒ�d��S(���Nt���pi(���R���R%���RB���R���R���(���R���R���(����(����R���t ���handle_pi*��s����c���������C���s���t��|�|��i�|��i�ƒ�S(���N(���RC���RW���R���R+���R'���(���R���RW���(����(����R���t ���unescape_attr-��s����c���������C���s=���g��}�x0�|�D](�\�}�}�|�i�|�|��i�|�ƒ�f�ƒ�q �W|�S(���N(���t ���escaped_attrsR���t���keyt���valRB���R���Rb���(���R���R���Rc���Rd���Re���(����(����R���t���unescape_attrs/��s �����  (���R���R���R6���t���ret���compileRP���R���R ���R ���R.���R1���R2���R/���R9���R,���RM���RR���RT���RU���RV���RX���RY���RZ���R\���R^���R_���Ra���Rb���Rf���(����(����(����R���R���`���s0���$0        ;           t ���PullParserc�����������B���s���t��Z�d�„��Z�d�„��Z�RS(���Nc���������O���s'���t��i��i�|��ƒ�t�i�|��|�|�Ž�d��S(���N(���t ���HTMLParserR ���R���R���R���R���(���R���R���R���(����(����R���R ���6��s����c���������C���s ���|��i�|�ƒ�S(���N(���R���Rb���RW���(���R���RW���(����(����R���RC���9��s����(���R���R���R ���RC���(����(����(����R���Ri���5��s��� t���TolerantPullParserc�����������B���s#���t��Z�d�„��Z�d�„��Z�d�„��Z�RS(���Nc���������O���s'���t��i�i�|��ƒ�t�i�|��|�|�Ž�d��S(���N(���t���sgmllibt ���SGMLParserR ���R���R���R���R���(���R���R���R���(����(����R���R ���?��s����c���������C���s/���|��i�|�ƒ�}�|��i�i�t�d�|�|�ƒ�ƒ�d��S(���NR:���(���R���Rf���R���R%���RB���R���RS���(���R���RS���R���(����(����R���t���unknown_starttagB��s����c���������C���s���|��i�i�t�d�|�ƒ�ƒ�d��S(���NR;���(���R���R%���RB���R���RS���(���R���RS���(����(����R���t���unknown_endtagE��s����(���R���R���R ���Rn���Ro���(����(����(����R���Rk���>��s���  c����������C���s���d��k��}�d��k�}��|�i�|��ƒ�S(���N(���t���doctestt ���_pullparsert���testmod(���Rq���Rp���(����(����R���t���_testI��s����t���__main__(���R���Rg���R)���Rl���Rj���t���_htmlRC���RE���t ���ExceptionR����R���R���R���Ri���Rm���Rk���Rs���R���( ���Rl���R���RC���R����RE���R)���Rg���R���Rj���Rk���R���Rs���Ri���(����(����R���t���?"���s���, Õ