This blog has moved here.

Friday, March 30, 2012

Escape HTML Entities Except for Some Well Known Tags

We are currently developing an APEX application and one of the requirement is to let users enter formatted text data into the system via a CKEDITOR control.


As it may be noticed above, not all the formatting options are enabled but just the possibility to apply bold, italic and underlined styling and, in addition, to create ordered and unordered lists. This means that the data we store into the database has html tags within and it must be echoed back as it is into the browser when the page which renders that data is about to be displayed.

Of course, the first thing you must take into consideration is the security issue: we don't want to render our pages with unsecure tags therefore we must sanitize the input. What we want is to remove all tags which are not into our whitelist. It seems like a very simple problem but the solution is not so obvious.

Our first attempt was to use a smart regular expression to do the job. It turned out that the expression was simply too complicated and very difficult to maintain. Put into the mix the fact that the Oracle regexp engine doesn't have all the goodies you'd expect if you are a Perl or Ruby developer and you quickly end up in disapointment.

The actual solution we implemented (which seems to work pretty well) is to parse the html content and to remove the offending tags. In order to parse we used just an ordinary xmltype. Let's see an example:
select regexp_replace( 
        deletexml(
          xmltype('<post>' ||  
                  'You have <b>been</b> <u>hacked</u> ' ||
                  'by <br/><a href="http://www.youtube.com/watch?v=CjPfBxJboJI">Gurin Remus Adrian</a>.' ||
                  '</post>'),
          '/post/*[(name() != "b") and (name() != "u")]'),
        '(^<post>)|(</post>$)',
        '') 
  from dual;

CLEAN_HTML                            
--------------------------------------
You have <b>been</b><u>hacked</u> by .
Great! Having that it's quite easy to write a PL/SQL function to embed the above logic:
  function strip_html_tags(html_in in clob, except_tags_in in varchar2 := null) return clob as
    l_clean_html clob;
    l_filter varchar2(32767) := '/post';
    l_predicates varchar2(32767);
    l_array apex_application_global.vc_arr2;
  begin
    if except_tags_in is null then
      l_filter := l_filter || '/*[(name() != "post")]';
    else
      l_array := apex_util.string_to_table(except_tags_in, ',');
      for i in 1..l_array.count loop
        if l_array(i) is not null then
          l_predicates := l_predicates || case when i > 1 then ' and ' else '' end || '(name() != "' || l_array(i) || '")';
        end if;
      end loop;
      if l_predicates is not null then
       l_filter := l_filter || '/*[' || l_predicates || ']';
      end if;
    end if;
    select regexp_replace( 
              deletexml(
                xmltype('<post>' || html_in || '</post>'),
                l_filter),
              '(^<post>)|(</post>$)',
           '') into l_clean_html
      from dual;
    return l_clean_html;
  end;
The only problem we had using this approach was with German characters or other HTML character entities. For example, by default CKEDIT sends the German umlaut character as &uuml; and, apparently, the Oracle XML parser is not smart enough to recognize this notation. It simply fails with LPX-00118 error. In fact, if you think about it, it's not so unusual given that we are talking about a XML parser, not an HTML one. However, if you encode umlaut with its numerical notation which is &#252; then everything is working like a charm. This boils down to properly configure CKEDITOR in order to force it to encode this characters with their numerical notation. The key property is:
entities_processNumerical": 'force'
Mission accomplished!