trace-viewer/third_party/Paste/docs/url-parsing-with-wsgi.txt - platform/external/chromium-trace - Git at Google

 URL Parsing With WSGI And Paste
 +++++++++++++++++++++++++++++++

 :author: Ian Bicking <ianb@colorstudy.com>
 :revision: $Rev$
 :date: $LastChangedDate$

 .. contents::

 Introduction and Audience
 =========================

 This document is intended for web framework authors and integrators,
 and people who want to understand the internal architecture of Paste.

 .. include:: include/contact.txt

 URL Parsing
 ===========

 .. note::

    Sometimes people use "URL", and sometimes "URI".  I think URLs are
    a subset of URIs.  But in practice you'll almost never see URIs
    that aren't URLs, and certainly not in Paste.  URIs that aren't
    URLs are abstract Identifiers, that cannot necessarily be used to
    Locate the resource.  This document is *all* about locating.

 Most generally, URL parsing is about taking a URL and determining what
 "resource" the URL refers to.  "Resource" is a rather vague term,
 intentionally.  It's really just a metaphor -- in reality there aren't
 any "resources" in HTTP; there are only requests and responses.

 In Paste, everything is about WSGI.  But that can seem too fancy.
 There are four core things involved: the *request* (personified in the
 WSGI environment), the *response* (personified inthe
 ``start_response`` callback and the return iterator), the WSGI
 application, and the server that calls that application.  The
 application and request are objects, while the server and response are
 really more like actions than concrete objects.

 In this context, URL parsing is about mapping a URL to an
 *application* and a *request*.  The request actually gets modified as
 it moves through different parts of the system.  Two dictionary keys
 in particular relate to URLs -- ``SCRIPT_NAME`` and ``PATH_INFO`` --
 but any part of the environment can be modified as it is passed
 through the system.

 Dispatching
 ===========

 .. note::

    WSGI isn't object oriented?  Well, if you look at it, you'll notice
    there's no objects except built-in types, so it shouldn't be a
    surprise.  Additionally, the interface and promises of the objects
    we do see are very minimal.  An application doesn't have any
    interface except one method -- ``__call__`` -- and that method
    *does* things, it doesn't give any other information.

 Because WSGI is action-oriented, rather than object-oriented, it's
 more important what we *do*.  "Finding" an application is probably an
 intermediate step, but "running" the application is our ultimate goal,
 and the only real judge of success.  An application that isn't run is
 useless to us, because it doesn't have any other useful methods.

 So what we're really doing is *dispatching* -- we're handing the
 request and responsibility for the response off to another object
 (another actor, really).  In the process we can actually retain some
 control -- we can capture and transform the response, and we can
 modify the request -- but that's not what the typical URL resolver will
 do.

 Motivations
 ===========

 The most obvious kind of URL parsing is finding a WSGI application.

 Typically when a framework first supports WSGI or is integrated into
 Paste, it is "monolithic" with respect to URLs.  That is, you define
 (in Paste, or maybe in Apache) a "root" URL, and everything under that
 goes into the framework.  What the framework does internally, Paste
 does not know -- it probably finds internal objects to dispatch to,
 but the framework is opaque to Paste.  Not just to Paste, but to
 any code that isn't in that framework.

 That means that we can't mix code from multiple frameworks, or as
 easily share services, or use WSGI middleware that doesn't apply to
 the entire framework/application.

 An example of someplace we might want to use an "application" that
 isn't part of the framework would be uploading large files.  It's
 possible to keep track of upload progress, and report that back to the
 user, but no framework typically is capable of this.  This is usually
 because the POST request is completely read and parsed before it
 invokes any application code.

 This is resolvable in WSGI -- a WSGI application can provide its own
 code to read and parse the POST request, and simultaneously report
 progress (usually in a way that *another* WSGI application/request can
 read and report to the user on that progress).  This is an example
 where you want to allow "foreign" applications to be intermingled with
 framework application code.

 Finding Applications
 ====================

 OK, enough theory.  How does a URL parser work?  Well, it is a WSGI
 application, and a WSGI server, in the typical "WSGI middleware"
 style.  Except that it determines which application it will serve
 for each request.

 Let's consider Paste's ``URLParser`` (in ``paste.urlparser``).  This
 class takes a directory name as its only required argument, and
 instances are WSGI applications.

 When a request comes in, the parser looks at ``PATH_INFO`` to see
 what's left to parse.  ``SCRIPT_NAME`` represents where we are *now*;
 it's the part of the URL that has been parsed.

 There's a couple special cases:

 The empty string:

     URLParser serves directories.  When ``PATH_INFO`` is empty, that
     means we got a request with no trailing ``/``, like say ``/blog``
     If URLParser serves the ``blog`` directory, then this won't do --
     the user is requesting the ``blog`` *page*.  We have to redirect
     them to ``/blog/``.

 A single ``/``:

     So, we got a trailing ``/``.  This means we need to serve the
     "index" page.  In URLParser, this is some file named ``index``,
     though that's really an implementation detail.  You could create
     an index dynamically (like Apache's file listings), or whatever.

 Otherwise we get a string like ``/path...``.  Note that ``PATH_INFO``
 *must* start with a ``/``, or it must be empty.

 URLParser pulls off the first part of the path.  E.g., if
 ``PATH_INFO`` is ``/blog/edit/285``, then the first part is ``blog``.
 It appends this to ``SCRIPT_NAME``, and strips it off ``PATH_INFO``
 (which becomes ``/edit/285``).

 It then searches for a file that matches "blog".  In URLParser, this
 means it looks for a filename which matches that name (ignoring the
 extension).  It then uses the type of that file (determined by
 extension) to create a WSGI application.

 One case is that the file is a directory.  In that case, the
 application is *another* URLParser instance, this time with the new
 directory.

 URLParser actually allows per-extension "plugins" -- these are just
 functions that get a filename, and produce a WSGI application.  One of
 these is ``make_py`` -- this function imports the module, and looks
 for special symbols; if it finds a symbol ``application``, it assumes
 this is a WSGI application that is ready to accept the request.  If it
 finds a symbol that matches the name of the module (e.g., ``edit``),
 then it assumes that is an application *factory*, meaning that when
 you call it with no arguments you get a WSGI application.

 Another function takes "unknown" files (files for which no better
 constructor exists) and creates an application that simply responds
 with the contents of that file (and the appropriate ``Content-Type``).

 In any case, ``URLParser`` delegates as soon as it can.  It doesn't
 parse the entire path -- it just finds the *next* application, which
 in turn may delegate to yet another application.

 Here's a very simple implementation of URLParser::

     class URLParser(object):
         def __init__(self, dir):
             self.dir = dir
         def __call__(self, environ, start_response):
             segment = wsgilib.path_info_pop(environ)
             if segment is None: # No trailing /
                 # do a redirect...
             for filename in os.listdir(self.dir):
                 if os.path.splitext(filename)[0] == segment:
                     return self.serve_application(
                         environ, start_response, filename)
             # do a 404 Not Found
         def serve_application(self, environ, start_response, filename):
             basename, ext = os.path.splitext(filename)
             filename = os.path.join(self.dir, filename)
             if os.path.isdir(filename):
                 return URLParser(filename)(environ, start_response)
             elif ext == '.py':
                 module = import_module(filename)
                 if hasattr(module, 'application'):
                     return module.application(environ, start_response)
                 elif hasattr(module, basename):
                     return getattr(module, basename)(
                         environ, start_response)
             else:
                 return wsgilib.send_file(filename)

 Modifying The Request
 =====================

 Well, URLParser is one kind of parser.  But others are possible, and
 aren't too hard to write.

 Lets imagine a URL like ``/2004/05/01/edit``.  It's likely that
 ``/2004/05/01`` doesn't point to anything on file, but is really more
 of a "variable" that gets passed to ``edit``.  So we can pull them off
 and put them somewhere.  This is a good place for a WSGI extension.
 Lets put them in ``environ["app.url_date"]``.

 We'll pass one other applications in -- once we get the date (if any)
 we need to pass the request onto an application that can actually
 handle it.  This "application" might be a URLParser or similar system
 (that figures out what ``/edit`` means).

 ::

     class GrabDate(object):
         def __init__(self, subapp):
             self.subapp = subapp
         def __call__(self, environ, start_response):
             date_parts = []
             while len(date_parts) < 3:
                first, rest = wsgilib.path_info_split(environ['PATH_INFO'])
                try:
                    date_parts.append(int(first))
                    wsgilib.path_info_pop(environ)
                except (ValueError, TypeError):
 	           break
             environ['app.date_parts'] = date_parts
             return self.subapp(environ, start_response)

 This is really like traditional "middleware", in that it sits between
 the server and just one application.

 Assuming you put this class in the ``myapp.grabdate`` module, you
 could install it by adding this to your configuration::

     middleware.append('myapp.grabdate.GrabDate')

 Object Publishing
 =================

 Besides looking in the filesystem, "object publishing" is another
 popular way to do URL parsing.  This is pretty easy to implement as
 well -- it usually just means use ``getattr`` with the popped
 segments.  But we'll implement a rough approximation of `Quixote's
 <http://www.mems-exchange.org/software/quixote/>`_ URL parsing::

     class ObjectApp(object):
         def __init__(self, obj):
             self.obj = obj
         def __call__(self, environ, start_response):
             next = wsgilib.path_info_pop(environ)
             if next is None:
                 # This is the object, lets serve it...
                 return self.publish(obj, environ, start_response)
             next = next or '_q_index' # the default index method
             if next in obj._q_export and getattr(obj, next, None):
                 return ObjectApp(getattr(obj, next))(
                     environ, start_reponse)
             next_obj = obj._q_traverse(next)
             if not next_obj:
                 # Do a 404
             return ObjectApp(next_obj)(environ, start_response)

         def publish(self, obj, environ, start_response):
             if callable(obj):
                 output = str(obj())
             else:
                 output = str(obj)
             start_response('200 OK', [('Content-type', 'text/html')])
             return [output]

 The ``publish`` object is a little weak, and functions like
 ``_q_traverse`` aren't passed interesting information about the
 request, but this is only a rough approximation of the framework.
 Things to note:

 * The object has standard attributes and methods -- ``_q_exports``
   (attributes that are public to the web) and ``_q_traverse``
   (a way of overriding the traversal without having an attribute for
   each possible path segment).

 * The object isn't rendered until the path is completely consumed
   (when ``next`` is ``None``).  This means ``_q_traverse`` has to
   consume extra segments of the path.  In this version ``_q_traverse``
   is only given the next piece of the path; Quixote gives it the
   entire path (as a list of segments).

 * ``publish`` is really a small and lame way to turn a Quixote object
   into a WSGI application.  For any serious framework you'd want to do
   a better job than what I do here.

 * It would be even better if you used something like `Adaptation
   <http://www.python.org/peps/pep-0246.html>`_ to convert objects into
   applications.  This would include removing the explicit creation of
   new ``ObjectApp`` instances, which could also be a kind of fall-back
   adaptation.

 Anyway, this example is less complete, but maybe it will get you
 thinking.
	URL Parsing With WSGI And Paste
	+++++++++++++++++++++++++++++++

	:author: Ian Bicking <ianb@colorstudy.com>
	:revision: $Rev$
	:date: $LastChangedDate$

	.. contents::

	Introduction and Audience
	=========================

	This document is intended for web framework authors and integrators,
	and people who want to understand the internal architecture of Paste.

	.. include:: include/contact.txt

	URL Parsing
	===========

	.. note::

	Sometimes people use "URL", and sometimes "URI". I think URLs are
	a subset of URIs. But in practice you'll almost never see URIs
	that aren't URLs, and certainly not in Paste. URIs that aren't
	URLs are abstract Identifiers, that cannot necessarily be used to
	Locate the resource. This document is all about locating.

	Most generally, URL parsing is about taking a URL and determining what
	"resource" the URL refers to. "Resource" is a rather vague term,
	intentionally. It's really just a metaphor -- in reality there aren't
	any "resources" in HTTP; there are only requests and responses.

	In Paste, everything is about WSGI. But that can seem too fancy.
	There are four core things involved: the request (personified in the
	WSGI environment), the response (personified inthe
	``start_response`` callback and the return iterator), the WSGI
	application, and the server that calls that application. The
	application and request are objects, while the server and response are
	really more like actions than concrete objects.

	In this context, URL parsing is about mapping a URL to an
	application and a request. The request actually gets modified as
	it moves through different parts of the system. Two dictionary keys
	in particular relate to URLs -- ``SCRIPT_NAME`` and ``PATH_INFO`` --
	but any part of the environment can be modified as it is passed
	through the system.

	Dispatching
	===========

	.. note::

	WSGI isn't object oriented? Well, if you look at it, you'll notice
	there's no objects except built-in types, so it shouldn't be a
	surprise. Additionally, the interface and promises of the objects
	we do see are very minimal. An application doesn't have any
	interface except one method -- ``__call__`` -- and that method
	does things, it doesn't give any other information.

	Because WSGI is action-oriented, rather than object-oriented, it's
	more important what we do. "Finding" an application is probably an
	intermediate step, but "running" the application is our ultimate goal,
	and the only real judge of success. An application that isn't run is
	useless to us, because it doesn't have any other useful methods.

	So what we're really doing is dispatching -- we're handing the
	request and responsibility for the response off to another object
	(another actor, really). In the process we can actually retain some
	control -- we can capture and transform the response, and we can
	modify the request -- but that's not what the typical URL resolver will
	do.

	Motivations
	===========

	The most obvious kind of URL parsing is finding a WSGI application.

	Typically when a framework first supports WSGI or is integrated into
	Paste, it is "monolithic" with respect to URLs. That is, you define
	(in Paste, or maybe in Apache) a "root" URL, and everything under that
	goes into the framework. What the framework does internally, Paste
	does not know -- it probably finds internal objects to dispatch to,
	but the framework is opaque to Paste. Not just to Paste, but to
	any code that isn't in that framework.

	That means that we can't mix code from multiple frameworks, or as
	easily share services, or use WSGI middleware that doesn't apply to
	the entire framework/application.

	An example of someplace we might want to use an "application" that
	isn't part of the framework would be uploading large files. It's
	possible to keep track of upload progress, and report that back to the
	user, but no framework typically is capable of this. This is usually
	because the POST request is completely read and parsed before it
	invokes any application code.

	This is resolvable in WSGI -- a WSGI application can provide its own
	code to read and parse the POST request, and simultaneously report
	progress (usually in a way that another WSGI application/request can
	read and report to the user on that progress). This is an example
	where you want to allow "foreign" applications to be intermingled with
	framework application code.

	Finding Applications
	====================

	OK, enough theory. How does a URL parser work? Well, it is a WSGI
	application, and a WSGI server, in the typical "WSGI middleware"
	style. Except that it determines which application it will serve
	for each request.

	Let's consider Paste's ``URLParser`` (in ``paste.urlparser``). This
	class takes a directory name as its only required argument, and
	instances are WSGI applications.

	When a request comes in, the parser looks at ``PATH_INFO`` to see
	what's left to parse. ``SCRIPT_NAME`` represents where we are now;
	it's the part of the URL that has been parsed.

	There's a couple special cases:

	The empty string:

	URLParser serves directories. When ``PATH_INFO`` is empty, that
	means we got a request with no trailing ``/``, like say ``/blog``
	If URLParser serves the ``blog`` directory, then this won't do --
	the user is requesting the ``blog`` page. We have to redirect
	them to ``/blog/``.

	A single ``/``:

	So, we got a trailing ``/``. This means we need to serve the
	"index" page. In URLParser, this is some file named ``index``,
	though that's really an implementation detail. You could create
	an index dynamically (like Apache's file listings), or whatever.

	Otherwise we get a string like ``/path...``. Note that ``PATH_INFO``
	must start with a ``/``, or it must be empty.

	URLParser pulls off the first part of the path. E.g., if
	``PATH_INFO`` is ``/blog/edit/285``, then the first part is ``blog``.
	It appends this to ``SCRIPT_NAME``, and strips it off ``PATH_INFO``
	(which becomes ``/edit/285``).

	It then searches for a file that matches "blog". In URLParser, this
	means it looks for a filename which matches that name (ignoring the
	extension). It then uses the type of that file (determined by
	extension) to create a WSGI application.

	One case is that the file is a directory. In that case, the
	application is another URLParser instance, this time with the new
	directory.

	URLParser actually allows per-extension "plugins" -- these are just
	functions that get a filename, and produce a WSGI application. One of
	these is ``make_py`` -- this function imports the module, and looks
	for special symbols; if it finds a symbol ``application``, it assumes
	this is a WSGI application that is ready to accept the request. If it
	finds a symbol that matches the name of the module (e.g., ``edit``),
	then it assumes that is an application factory, meaning that when
	you call it with no arguments you get a WSGI application.

	Another function takes "unknown" files (files for which no better
	constructor exists) and creates an application that simply responds
	with the contents of that file (and the appropriate ``Content-Type``).

	In any case, ``URLParser`` delegates as soon as it can. It doesn't
	parse the entire path -- it just finds the next application, which
	in turn may delegate to yet another application.

	Here's a very simple implementation of URLParser::

	class URLParser(object):
	def __init__(self, dir):
	self.dir = dir
	def __call__(self, environ, start_response):
	segment = wsgilib.path_info_pop(environ)
	if segment is None: # No trailing /
	# do a redirect...
	for filename in os.listdir(self.dir):
	if os.path.splitext(filename)[0] == segment:
	return self.serve_application(
	environ, start_response, filename)
	# do a 404 Not Found
	def serve_application(self, environ, start_response, filename):
	basename, ext = os.path.splitext(filename)
	filename = os.path.join(self.dir, filename)
	if os.path.isdir(filename):
	return URLParser(filename)(environ, start_response)
	elif ext == '.py':
	module = import_module(filename)
	if hasattr(module, 'application'):
	return module.application(environ, start_response)
	elif hasattr(module, basename):
	return getattr(module, basename)(
	environ, start_response)
	else:
	return wsgilib.send_file(filename)

	Modifying The Request
	=====================

	Well, URLParser is one kind of parser. But others are possible, and
	aren't too hard to write.

	Lets imagine a URL like ``/2004/05/01/edit``. It's likely that
	``/2004/05/01`` doesn't point to anything on file, but is really more
	of a "variable" that gets passed to ``edit``. So we can pull them off
	and put them somewhere. This is a good place for a WSGI extension.
	Lets put them in ``environ["app.url_date"]``.

	We'll pass one other applications in -- once we get the date (if any)
	we need to pass the request onto an application that can actually
	handle it. This "application" might be a URLParser or similar system
	(that figures out what ``/edit`` means).

	::

	class GrabDate(object):
	def __init__(self, subapp):
	self.subapp = subapp
	def __call__(self, environ, start_response):
	date_parts = []
	while len(date_parts) < 3:
	first, rest = wsgilib.path_info_split(environ['PATH_INFO'])
	try:
	date_parts.append(int(first))
	wsgilib.path_info_pop(environ)
	except (ValueError, TypeError):
	break
	environ['app.date_parts'] = date_parts
	return self.subapp(environ, start_response)

	This is really like traditional "middleware", in that it sits between
	the server and just one application.

	Assuming you put this class in the ``myapp.grabdate`` module, you
	could install it by adding this to your configuration::

	middleware.append('myapp.grabdate.GrabDate')

	Object Publishing
	=================

	Besides looking in the filesystem, "object publishing" is another
	popular way to do URL parsing. This is pretty easy to implement as
	well -- it usually just means use ``getattr`` with the popped
	segments. But we'll implement a rough approximation of `Quixote's
	<http://www.mems-exchange.org/software/quixote/>`_ URL parsing::

	class ObjectApp(object):
	def __init__(self, obj):
	self.obj = obj
	def __call__(self, environ, start_response):
	next = wsgilib.path_info_pop(environ)
	if next is None:
	# This is the object, lets serve it...
	return self.publish(obj, environ, start_response)
	next = next or '_q_index' # the default index method
	if next in obj._q_export and getattr(obj, next, None):
	return ObjectApp(getattr(obj, next))(
	environ, start_reponse)
	next_obj = obj._q_traverse(next)
	if not next_obj:
	# Do a 404
	return ObjectApp(next_obj)(environ, start_response)

	def publish(self, obj, environ, start_response):
	if callable(obj):
	output = str(obj())
	else:
	output = str(obj)
	start_response('200 OK', [('Content-type', 'text/html')])
	return [output]

	The ``publish`` object is a little weak, and functions like
	``_q_traverse`` aren't passed interesting information about the
	request, but this is only a rough approximation of the framework.
	Things to note:

	* The object has standard attributes and methods -- ``_q_exports``
	(attributes that are public to the web) and ``_q_traverse``
	(a way of overriding the traversal without having an attribute for
	each possible path segment).

	* The object isn't rendered until the path is completely consumed
	(when ``next`` is ``None``). This means ``_q_traverse`` has to
	consume extra segments of the path. In this version ``_q_traverse``
	is only given the next piece of the path; Quixote gives it the
	entire path (as a list of segments).

	* ``publish`` is really a small and lame way to turn a Quixote object
	into a WSGI application. For any serious framework you'd want to do
	a better job than what I do here.

	* It would be even better if you used something like `Adaptation
	<http://www.python.org/peps/pep-0246.html>`_ to convert objects into
	applications. This would include removing the explicit creation of
	new ``ObjectApp`` instances, which could also be a kind of fall-back
	adaptation.

	Anyway, this example is less complete, but maybe it will get you
	thinking.