Class Undress::Grammar

  1. lib/undress/grammar.rb
Parent: Object

Grammars give you a DSL to declare how to convert an HTML document into a different markup language.

Public class methods

default () {|element| ...}

Set a default rule for unrecognized tags.

Unless you set a default rule of your own, the grammar drops unrecognized tags and outputs only their contents.
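The fallback can be pictured with a small self-contained sketch. `FallbackNode` and `FallbackGrammar` are hypothetical stand-ins (they are not part of Undress or Hpricot), plain strings play the role of text nodes, and the use of `instance_exec` is an assumption about how such a handler could be run in the grammar's context.

```ruby
# Hypothetical stand-ins: FallbackNode and FallbackGrammar are not part of Undress.
FallbackNode = Struct.new(:name, :children)

class FallbackGrammar
  # Route every call to an undefined tag method through the handler.
  def self.default(&handler)
    define_method(:method_missing) do |tag, node, *args|
      instance_exec(node, &handler)
    end
  end

  # Convert the children of a node.
  def content_of(node)
    process(node.children)
  end

  # Strings stand in for text nodes; anything else is dispatched by tag name.
  def process(nodes)
    nodes = [nodes] unless nodes.is_a?(Array)
    nodes.map { |n| n.is_a?(String) ? n : send(n.name.to_sym, n) }.join
  end

  # Drop the unknown tag and keep only its contents.
  default { |element| content_of(element) }
end

tree = FallbackNode.new("blink", ["Hello ", FallbackNode.new("marquee", ["World"])])
puts FallbackGrammar.new.process(tree)
```

Neither tag has a rule here, so both fall through to the default handler and only the text survives.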

# File lib/undress/grammar.rb, line 30
    def self.default(&handler) # :yields: element
      define_method :method_missing do |tag, node, *args|

post_processing (regexp, replacement = nil) {|matched_string| ...}

Add a post-processing rule to your parser.

This takes a regular expression that is applied to the output after all nodes have been processed. Pass either a replacement string or a block; whichever you give is handed to String#gsub.

post_processing(/\n\n+/, "\n\n") # compress more than two newlines
post_processing(/whatever/) { ... }
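As a rough illustration of how such rules could be applied, the sketch below walks a hash of rules and calls String#gsub for each one. The helper name `apply_post_processing` and the plain-hash rule store are assumptions made for this example, not the library's internals.

```ruby
# Hypothetical helper: applies each rule in order with String#gsub.
# A rule value is either a replacement string or a callable.
def apply_post_processing(output, rules)
  rules.inject(output) do |text, (regexp, replacement)|
    if replacement.respond_to?(:call)
      text.gsub(regexp) { |matched| replacement.call(matched) }
    else
      text.gsub(regexp, replacement)
    end
  end
end

rules = {
  /\n\n+/    => "\n\n",                            # collapse runs of blank lines
  /whatever/ => lambda { |matched| matched.upcase } # block-style replacement
}

puts apply_post_processing("one\n\n\n\ntwo whatever", rules)
```

Storing the string and the block under the same key is what lets a single loop handle both forms.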

# File lib/undress/grammar.rb, line 44
    def self.post_processing(regexp, replacement = nil, &handler) #:yields: matched_string
      post_processing_rules[regexp] = replacement || handler
    end

pre_processing (selector) {|element| ...}

Add a pre-processing rule to your parser.

This lets you mutate the DOM before any rule defined with rule_for is applied. Pass a CSS or XPath selector and a block; the block receives each matching Hpricot element.

pre_processing "ul.toc" do |element|
  element.swap("<p>[[toc]]</p>")
end

This replaces every unordered list with the class toc with a paragraph containing the code [[toc]].

# File lib/undress/grammar.rb, line 60
    def self.pre_processing(selector, &handler) # :yields: element
      pre_processing_rules[selector] = handler
    end

rule_for (*tags) {|element| ...}

Add a parsing rule for a group of HTML tags.

rule_for :p do |element|
  "<this was a paragraph>#{content_of(element)}</this was a paragraph>"
end

This replaces your <p> tags with <this was a paragraph> tags, without altering the contents.

The element yielded to the block is an Hpricot element for the given tag.
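To show how one rule can cover several tags, here is a minimal sketch built on define_method. `RuleNode`, `RuleGrammar` and `StarGrammar` are hypothetical names for this example only, and strings stand in for Hpricot text nodes.

```ruby
# Hypothetical stand-ins: RuleNode and RuleGrammar are not part of Undress.
RuleNode = Struct.new(:name, :children)

class RuleGrammar
  # Define one instance method per tag; every listed tag shares the handler.
  def self.rule_for(*tags, &handler)
    tags.each { |tag| define_method(tag.to_sym, &handler) }
  end

  # Convert the children of a node.
  def content_of(node)
    process(node.children)
  end

  # Strings stand in for text nodes; elements are dispatched by tag name.
  def process(nodes)
    nodes = [nodes] unless nodes.is_a?(Array)
    nodes.map { |n| n.is_a?(String) ? n : send(n.name.to_sym, n) }.join
  end
end

class StarGrammar < RuleGrammar
  rule_for(:b, :strong) { |element| "**#{content_of(element)}**" }
  rule_for(:p)          { |element| "#{content_of(element)}\n\n" }
end

html = RuleNode.new("p", ["Some ", RuleNode.new("strong", ["bold"]), " text"])
puts StarGrammar.new.process(html)
```

Because each rule becomes an ordinary instance method, the block can call helpers such as content_of directly.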

# File lib/undress/grammar.rb, line 20
    def self.rule_for(*tags, &handler) # :yields: element
      tags.each do |tag|
        define_method tag.to_sym, &handler
      end
    end

whitelist_attributes (*attrs)

Set the list of attributes you wish to whitelist.

Any attribute not in this list at the moment of parsing will be ignored by the parser. The method Grammar#attributes(node) will return a hash of the filtered attributes. Read its documentation for more details.

whitelist_attributes :id, :class, :lang

# File lib/undress/grammar.rb, line 71
    def self.whitelist_attributes(*attrs)
      @whitelisted_attributes = attrs
    end

Public instance methods

attributes (node)

Hash of attributes, according to the white list. By default, no attributes are whitelisted, so you must set which ones to whitelist on each grammar.

Supposing you set :id and :class as your whitelisted_attributes, and you have a node representing this HTML:

<p lang="en" class="greeting">Hello World</p>

Then the method would return:

{ :class => "greeting" }

You can override this method in each grammar and call super if you want to represent attributes consistently across all nodes (for example, Textile always shows class and id inside parentheses).
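The filtering described above can be reproduced in a short sketch. `AttrDemo` is a hypothetical class, and a plain hash stands in for the attributes of an Hpricot node; only the whitelist-then-filter idea is taken from the text.

```ruby
# Hypothetical stand-in: AttrDemo is not part of Undress.
class AttrDemo
  # Store the whitelist on the class.
  def self.whitelist_attributes(*attrs)
    @whitelisted_attributes = attrs
  end

  # Nothing is whitelisted until whitelist_attributes is called.
  def self.whitelisted_attributes
    @whitelisted_attributes || []
  end

  # Keep only whitelisted keys, converting them to symbols.
  def attributes(raw)
    raw.inject({}) do |attrs, (key, value)|
      attrs[key.to_sym] = value if self.class.whitelisted_attributes.include?(key.to_sym)
      attrs
    end
  end

  whitelist_attributes :id, :class
end

# Stands in for <p lang="en" class="greeting">Hello World</p>
raw = { "lang" => "en", "class" => "greeting" }
p AttrDemo.new.attributes(raw)
```

The lang attribute is dropped because it is not whitelisted, and id is absent because the element never had one.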

# File lib/undress/grammar.rb, line 156
    def attributes(node)
      node.attributes.inject({}) do |attrs,(key,value)|
        attrs[key.to_sym] = value if whitelisted_attributes.include?(key.to_sym)
        attrs
      end
    end

content_of (node)

Get the result of parsing the contents of a node.

# File lib/undress/grammar.rb, line 129
    def content_of(node)
      process(node.respond_to?(:children) ? node.children : node)
    end

process (nodes)

Process a DOM node, converting it to your markup language according to your defined rules. If the node is a text node, its string representation is returned; otherwise the rule defined for its tag is called.

# File lib/undress/grammar.rb, line 104
    def process(nodes)
      Array(nodes).map do |node|
        if node.text?
          node.to_html
        elsif node.elem?
          send node.name.to_sym, node

surrounded_by_whitespace? (node)

Helper method that tells you whether the given DOM node is immediately preceded or followed by whitespace.
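A minimal sketch of the check, using hypothetical `FakeText` and `FakeElem` structs in place of Hpricot nodes. The nil guards for missing siblings are an addition made for this example, not something the text above promises.

```ruby
# Hypothetical stand-ins for Hpricot text and element nodes.
FakeText = Struct.new(:content) do
  def text?; true; end
  def to_s; content; end
end

FakeElem = Struct.new(:previous, :next) do
  def text?; false; end
end

# True when the sibling before the node ends in whitespace or the sibling
# after it starts with whitespace. Missing siblings count as "no whitespace"
# (an assumption of this sketch).
def surrounded_by_whitespace?(node)
  before = node.previous
  after  = node.next
  !!((before && before.text? && before.to_s =~ /\s+$/) ||
     (after && after.text? && after.to_s =~ /^\s+/))
end

spaced = FakeElem.new(FakeText.new("Hello "), FakeText.new("world"))
tight  = FakeElem.new(FakeText.new("Hello"), FakeText.new("world"))

puts surrounded_by_whitespace?(spaced)
puts surrounded_by_whitespace?(tight)
```

A grammar can use a check like this to decide whether inline markup needs extra delimiters when it sits directly against other text.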

# File lib/undress/grammar.rb, line 135
    def surrounded_by_whitespace?(node)
      (node.previous.text? && node.previous.to_s =~ /\s+$/) ||
        (node.next.text? && node.next.to_s =~ /^\s+/)
    end