<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>VendAsta Technologies &#187; Jason A. Collins</title>
	<atom:link href="http://www.vendasta.com/author/jason-collins/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.vendasta.com</link>
	<description>White Label SMB Solutions for Publishers</description>
	<lastBuildDate>Tue, 24 Jan 2012 17:31:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Incremental data backups on Google App Engine using Fantasm and datastore namespaces.</title>
		<link>http://www.vendasta.com/2011/09/02/incremental-data-backups-on-app-engine-using-fantasm/</link>
		<comments>http://www.vendasta.com/2011/09/02/incremental-data-backups-on-app-engine-using-fantasm/#comments</comments>
		<pubDate>Fri, 02 Sep 2011 15:30:27 +0000</pubDate>
		<dc:creator>Jason A. Collins</dc:creator>
				<category><![CDATA[Technical]]></category>
		<category><![CDATA[data backup]]></category>
		<category><![CDATA[fantasm]]></category>
		<category><![CDATA[finite state machines]]></category>
		<category><![CDATA[google app engine]]></category>

		<guid isPermaLink="false">http://www.vendasta.com/?p=1876</guid>
		<description><![CDATA[Data reliability is essential for any application. Applications running on Google App Engine have well-protected data out-of-the-box: Google provides redundant disks and data replication between data centres. One can rest quite assured that your data is safe from hardware failure. However, a final risk remains: you. If you, or your team, develop a script with &#8230; <a href="http://www.vendasta.com/2011/09/02/incremental-data-backups-on-app-engine-using-fantasm/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Data  reliability is essential for any application. Applications running on  Google App Engine have well-protected data out-of-the-box: Google  provides redundant disks and data replication between data centres. One  can rest quite assured that your data is safe from hardware failure.</p>
<p>However, a final risk remains: you. If you, or your team, develop a script with a bug, it can wreak havoc on your own data. With tools and techniques like continuations and Map-Reduce, you can accidentally do widespread damage. Of course, everything should be well tested, but there is always the possibility that something unexpected occurs. It remains vital to have a data backup, even in a safe environment like App Engine.</p>
<h3>General approach to an incremental backup.</h3>
<p>First, a word on my mindset: I’m of the opinion that Google protects my data from hardware failure and hackers better than I could, since they have considerably more resources on this front than I will ever have. As such, I really only need to keep snapshots of my data around so that I can recover if my own code goes crazy. Since Google provides great performance and hardware redundancy, it makes an ideal home for my backup storage. So the balance of this article will focus on making an incremental backup of your data <em>in the same App Engine application</em>.</p>
<p><a href="http://code.google.com/p/fantasm">Fantasm</a>, developed by <a href="http://www.vendasta.com/">VendAsta Technologies</a>,   is an open-source finite state machine engine for building  taskqueue-based workflows on Google App Engine. In particular, it allows  you to write resilient workflows with automatic retry mechanisms  without worrying about the details of the highly flexible Task Queue  API. Most importantly, Fantasm’s continuation feature makes it a piece  of cake to walk across very large datasets &#8211; perfect for constructing a  snapshot of your data. If you’re new to Fantasm, please start with this <a href="http://code.google.com/appengine/articles/fantasm.html">article</a>.</p>
<h3>Implementation.</h3>
<p>To begin, we’ll construct a finite state machine for our backup in <code>fsm.yaml</code>:</p>
<pre>- name: Backup

 context_types:
   lastBackup: datetime

 states:

 - name: SelectBackup
   initial: True
   action: SelectBackup
   transitions:
   - event: ok
     to: EnumerateBackupModels

 - name: EnumerateBackupModels
   continuation: True
   action: EnumerateBackupModels
   transitions:
   - event: ok
     to: BackupEntity

 - name: BackupEntity
   continuation: True
   action: BackupEntity
   final: True</pre>
<p>There are three states to this machine with a simple linear transition  between each. The graphic representation looks like the following:<br />
<img class="aligncenter size-full wp-image-1897" title="Diagram of finite state machine for backup." src="/wp-content/uploads/backup-finite-state-machine.png" alt="Diagram of finite state machine for backup." width="275" height="590" /><br />
In the first state, <code>SelectBackup</code>, we will do a bit of date math to determine which backup we are working with.</p>
<pre>import datetime

from fantasm.action import FSMAction

FULL_BACKUP_ON_DAY_OF_WEEK = 0 # 0 = Monday, 1 = Tuesday, ...

class SelectBackup(FSMAction):

   def execute(self, context, obj, utcnow=None):
       utcnow = utcnow or datetime.datetime.utcnow()
       fullBackupDay = utcnow - datetime.timedelta(days=utcnow.weekday()) \
                       + datetime.timedelta(days=FULL_BACKUP_ON_DAY_OF_WEEK)
       # above date might be in the future, if so, we need to back it up a week
       if fullBackupDay &gt; utcnow:
           fullBackupDay = fullBackupDay - datetime.timedelta(days=7)
       context['backupId'] = 'backup-' + fullBackupDay.strftime('%Y-%m-%d')
       return 'ok'</pre>
<p>The <code>SelectBackup</code> state selects the most recent Monday (today, if today is a Monday), and places that date on the context as a string like <code>backup-2011-03-21</code>. Control then passes to the next state, <code>EnumerateBackupModels</code>.</p>
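<p>The same date math can be checked outside App Engine. What follows is a minimal standalone sketch rather than the code from the Fantasm example: <code>backup_id_for</code> is a hypothetical helper name, and it uses modular arithmetic in place of the "back it up a week" adjustment, which should give the same date.</p>

```python
import datetime

FULL_BACKUP_ON_DAY_OF_WEEK = 0  # 0 = Monday, as in the article


def backup_id_for(now, full_backup_weekday=FULL_BACKUP_ON_DAY_OF_WEEK):
    """Return the backup ID for the most recent full-backup weekday.

    backup_id_for is a hypothetical helper, not part of Fantasm.
    """
    # Days elapsed since the last full-backup weekday (0 if that is today).
    days_since = (now.weekday() - full_backup_weekday) % 7
    full_backup_day = now - datetime.timedelta(days=days_since)
    return 'backup-' + full_backup_day.strftime('%Y-%m-%d')


wednesday = datetime.datetime(2011, 3, 23, 1, 0)
print(backup_id_for(wednesday))  # backup-2011-03-21
```

<p>Every invocation from Monday through Sunday of the same week produces the same ID, which is what lets the later states treat those runs as increments of one snapshot.</p>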
<p>The <code>EnumerateBackupModels</code> state is demarcated as a &#8220;continuation&#8221; state. In this context, you can think of this as a loop. However, because of the way that Fantasm and Task Queue work, each iteration of the loop executes in parallel. So, in <code>EnumerateBackupModels</code>, we’re going to loop through each of the models that we want to back up and get parallel processes running for each.</p>
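<pre>from google.appengine.ext import db

# Account and Company are the application's own db.Model classes
BACKUP_CONFIG = (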
    # MODEL       MODIFIED DATE      BACKUP BATCH SIZE
   (Account,      'modifiedTime',      1),
   (Company,      'modifiedTime',     10),
)

# create some mappings to make the later code more readable
BACKUP_CLASS = dict( (m[0].__name__, m[0]) for m in BACKUP_CONFIG )
BACKUP_INCREMENTAL_PROPERTY = dict( (m[0].__name__, m[1]) for m in BACKUP_CONFIG )
BACKUP_BATCH_SIZE = dict( (m[0].__name__, m[2]) for m in BACKUP_CONFIG )
BACKUP_MODELS = sorted(BACKUP_CLASS.keys())

class _Backup(db.Model):
   backupId = db.StringProperty(required=True)
   model = db.StringProperty(required=True)
   lastBackup = db.DateTimeProperty()

class EnumerateBackupModels(FSMAction):

   def continuation(self, context, obj, token=None):
       if not token:
           obj['model'] = BACKUP_MODELS[0]
           return BACKUP_MODELS[1] if len(BACKUP_MODELS) &gt; 1 else None
       else:
           # find next in list
           for i in range(0, len(BACKUP_MODELS)):
               if BACKUP_MODELS[i] == token:
                   obj['model'] = BACKUP_MODELS[i]
                   return BACKUP_MODELS[i+1] if i &lt; len(BACKUP_MODELS)-1 else None
       return None # occurs if token passed in is not found in list - shouldn't happen

   def execute(self, context, obj):
       backupId = context['backupId']
       model = obj['model']

       def tx():
           keyName = '%s:%s' % (backupId, model)
           entry = _Backup.get_by_key_name(keyName)

           if not entry:
               entry = _Backup(key_name=keyName, backupId=backupId,
                               model=model, lastBackup=None)
           else:
               context['lastBackup'] = entry.lastBackup # get the lastBackup time

           entry.lastBackup = datetime.datetime.utcnow() # update to now
           entry.put()

       db.run_in_transaction(tx)
       context['model'] = model
       return 'ok'</pre>
<ul>
<li><code>BACKUP_CONFIG</code> This is simply a list of models that we’re interested in backing up. We build some maps (<code>BACKUP_CLASS</code>, <code>BACKUP_INCREMENTAL_PROPERTY</code>, etc.) over this configuration to make the rest of the code more readable.</li>
<li><code>_Backup</code> This is a model that will track the backup ID for each model that we’ve backed up. That is, we can refer here to see what we’ve backed up in  the past.</li>
<li><code>continuation</code> The continuation method is simply responsible for returning a sequence of tokens, in this case the strings from <code>BACKUP_MODELS</code>. The token returned from one invocation of continuation is passed as an argument to the next invocation, which is how we walk across the list. Finally, our execute method needs to know which model we’re working on, so we store it on <code>obj</code>, making it available to the execute method.</li>
<li><code>execute</code> Using the <code>backupId</code> (from <code>SelectBackup</code> state) and the model (from the <code>continuation</code> method), we ensure that we have an entity created in the <code>_Backup</code> tracking model. If a <code>_Backup</code> entity already existed for the given <code>backupId</code>, we add the time of the last backup to the context so that we can perform incremental backup. Next, we update the entity’s <code>lastBackup</code> to now since we’re working on a backup right now. Finally, we add the model to the context to make it available to subsequent machine states.</li>
</ul>
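<p>To make the token hand-off concrete, here is a small simulation in plain Python. It is only a sketch under stated assumptions: the <code>walk</code> driver is hypothetical and runs the calls one after another, whereas Fantasm queues a task per token; the model names are made up; and, unlike the method above, this version raises an error on an unknown token instead of returning <code>None</code>.</p>

```python
# Hypothetical model names; the article's list is built from BACKUP_CONFIG.
BACKUP_MODELS = sorted(['Company', 'Account', 'Invoice'])


def continuation(obj, token=None):
    """Store the current model on obj and return the token for the next call."""
    # No token means this is the first call, so start at the head of the list.
    index = BACKUP_MODELS.index(token) if token else 0
    obj['model'] = BACKUP_MODELS[index]
    next_index = index + 1
    return BACKUP_MODELS[next_index] if next_index != len(BACKUP_MODELS) else None


def walk():
    """Hypothetical driver: call continuation until it returns no token."""
    visited = []
    token = None
    while True:
        obj = {}  # each invocation gets a fresh obj
        token = continuation(obj, token)
        visited.append(obj['model'])
        if token is None:
            break
    return visited


print(walk())  # ['Account', 'Company', 'Invoice']
```

<p>Each call handles exactly one model and names the next one, so no invocation ever needs to see the whole list of work at once.</p>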
<p>At this point, we have a machine in flight for each of the models in the <code>BACKUP_MODELS</code> list. All of these machines now move to the next (and final) state, <code>BackupEntity</code>. The <code>BackupEntity</code> state is also demarcated as a continuation state. Here, we are going to  query for each of the entities for the model in question and copy the  data to a backup entity. Google App Engine namespaces (as part of the <a href="http://code.google.com/appengine/docs/python/multitenancy/">Multitenancy</a> feature) provide a very convenient mechanism to store backup entities.  We can name a namespace for the backup ID and simply store the entity  into that namespace. This has the further benefit of not polluting our  view of the data in the App Engine console data viewer.</p>
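<pre>from google.appengine.ext import blobstore
from fantasm.action import DatastoreContinuationFSMAction

class BackupEntity(DatastoreContinuationFSMAction):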

   def getQuery(self, context, obj):
       model = context['model']
       query = 'SELECT * FROM %s' % model

       lastBackup = context.get('lastBackup')
       if lastBackup and BACKUP_INCREMENTAL_PROPERTY[model]:
           query = query + ' WHERE %s &gt;= :1' % BACKUP_INCREMENTAL_PROPERTY[model]
           return db.GqlQuery(query, lastBackup)
       else:
           return db.GqlQuery(query)

   def getBatchSize(self, context, obj):
       model = context['model']
       batchSize = BACKUP_BATCH_SIZE[model]
       return batchSize

   def execute(self, context, obj):
       if not obj['results']:
           # query may return no results, terminate if so
           return None

       model = context['model']
       backupId = context['backupId']
       entities = obj['results']

       backupEntities = []
       for originalEntity in entities:

           # build a key with same path, but in different namespace
           originalKeyPath = originalEntity.key().to_path()
           newKey = db.Key.from_path(*originalKeyPath, **{'namespace': backupId})

           # copy over the property values
           kwargs = {}
           for prop in originalEntity.properties().values():
               if isinstance(prop, (db.ReferenceProperty,
                                    blobstore.BlobReferenceProperty)):
                   # avoid the dereference/auto-lookup
                   datastoreValue = prop.get_value_for_datastore(originalEntity)
               else:
                   datastoreValue = getattr(originalEntity, prop.name, None)
               kwargs[prop.name] = datastoreValue

           backupModelClass = BACKUP_CLASS[model]
           backupEntity = backupModelClass(key=newKey, **kwargs)
           backupEntities.append(backupEntity)

       db.put(backupEntities)</pre>
<ul>
<li><code>getQuery</code> Since <code>BackupEntity</code> inherits from <code>DatastoreContinuationFSMAction</code>, we don’t need to implement continuation; we only need to implement <code>getQuery</code>. <code>getQuery</code> uses the model on the context to construct a simple query to fetch all the entities for that model. If our configuration states that the current model has a modified timestamp (e.g., a <code>db.DateTimeProperty(auto_now=True)</code>), we can extend the query to only consider entities that have been updated since the last backup time, i.e., an incremental backup. The parent class handles the query cursor and fetching entities and spins up a parallel <code>execute</code> method for each fetch batch.</li>
<li><code>getBatchSize</code> Simply tells the parent class how many entities to fetch at a time. The number depends on the typical size of an entity for the current model; we need to ensure that a batch fits within the protocol buffer size limit because we <code>put()</code> them as a batch in the <code>execute</code> method.</li>
<li><code>execute</code> The <code>execute</code> method gets a list of entities on <code>obj['results']</code> (via the <code>DatastoreContinuationFSMAction</code> parent class). Looping through these entities, it creates a backup entity holding the data from the original. Most importantly, it uses a  slightly different key to place the backup entity on a different  namespace, named for the <code>backupId</code>. If an entity with an identical key already exists, the <code>put()</code> will overwrite that existing entity with the new backup entity.</li>
</ul>
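<p>The incremental behaviour described above can be sketched without the datastore. In this illustration, which is an assumption-laden stand-in and not the Fantasm example code, plain dicts play the role of the datastore, a <code>(namespace, key)</code> tuple plays the role of a namespaced key, and <code>incremental_backup</code> is a hypothetical function name.</p>

```python
import datetime


def incremental_backup(live, backups, backup_id, last_backup=None,
                       modified_property='modifiedTime'):
    """Copy entities modified at or after last_backup into the backup store.

    live maps key -> entity dict; backups maps (namespace, key) -> copy.
    Both are plain dicts standing in for the datastore.
    """
    copied = 0
    for key, entity in live.items():
        changed = last_backup is None or entity[modified_property] >= last_backup
        if changed:
            # Same key, different namespace; an existing copy is overwritten.
            backups[(backup_id, key)] = dict(entity)
            copied += 1
    return copied


t0 = datetime.datetime(2011, 3, 21, 1, 0)
live = {'acct-1': {'name': 'A', 'modifiedTime': t0},
        'acct-2': {'name': 'B', 'modifiedTime': t0}}
backups = {}
full = incremental_backup(live, backups, 'backup-2011-03-21')

# One entity changes after the first run; only it is copied the next time.
t1 = t0 + datetime.timedelta(hours=12)
live['acct-2'] = {'name': 'B2', 'modifiedTime': t1}
delta = incremental_backup(live, backups, 'backup-2011-03-21',
                           last_backup=t0 + datetime.timedelta(hours=1))
```

<p>The first call has no <code>last_backup</code> and copies both entities; the second copies only the changed one and overwrites its earlier copy, so the snapshot always holds the latest backed-up version of each entity.</p>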
<p>We now have a finite state machine that will back up our entities as often as we invoke the machine. The machine rolls over to a new snapshot each week (the first run of the week is a full backup), and performs incremental backups for invocations within the same week. We can simply set up a schedule in <code>cron.yaml</code> to kick off our backups:</p>
<pre>cron:
- description: Backup
 url: /fantasm/fsm/Backup/?method=POST
 schedule: every day 01:00</pre>
<h3>Building a machine for scrubbing old backups.</h3>
<p>Constructing a finite state machine to scrub old backups is straightforward and yields a machine that is very similar to <code>Backup</code>:</p>
<pre>- name: DeleteBackup

 context_types:
   daysOld: int
   backupDate: datetime

 states:

 - name: ComputeDate
   action: ComputeDate
   initial: True
   transitions:
   - event: ok
     to: SelectBackupToDelete

 - name: SelectBackupToDelete
   action: SelectBackupToDelete
   continuation: True
   final: True
   transitions:
   - event: ok
     to: DeleteBackupEntity

 - name: DeleteBackupEntity
   action: DeleteBackupEntity
   final: True
   continuation: True</pre>
<p>Fantasm allows arguments to be passed in to machines, as standard <code>GET</code> or <code>POST</code> arguments. In the <code>ComputeDate</code> state, we convert a <code>daysOld</code> parameter into an absolute datetime and add it to the context as <code>backupDate</code>. Note that the <code>context_types</code> definition in the above machine configuration allows <code>context['backupDate']</code> to be automatically cast to the correct data type in subsequent states.</p>
<pre>class ComputeDate(FSMAction):

   def execute(self, context, obj):
       daysOld = context['daysOld'] # automatically cast as an int
       context['backupDate'] = datetime.datetime.utcnow() - \
                               datetime.timedelta(days=daysOld)
       return 'ok'</pre>
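<p>Request arguments arrive as strings, which is why the cast matters. The sketch below imitates the idea in plain Python; it is not how Fantasm implements <code>context_types</code>, and <code>cast_context</code> and <code>compute_backup_date</code> are hypothetical names used only for illustration.</p>

```python
import datetime

# Hypothetical stand-in for the context_types mapping in fsm.yaml.
CONTEXT_TYPES = {'daysOld': int}


def cast_context(raw_args, context_types=CONTEXT_TYPES):
    """Cast string request arguments to their declared types."""
    return dict((name, context_types.get(name, str)(value))
                for name, value in raw_args.items())


def compute_backup_date(context, now):
    """Return the cutoff datetime: now minus daysOld days."""
    return now - datetime.timedelta(days=context['daysOld'])


context = cast_context({'daysOld': '60'})
cutoff = compute_backup_date(context, datetime.datetime(2011, 9, 2, 1, 0))
print(cutoff)  # 2011-07-04 01:00:00
```

<p>Without the cast, <code>datetime.timedelta(days='60')</code> would raise a <code>TypeError</code>, so declaring the type once in the machine configuration saves a conversion in every state that reads the value.</p>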
<p>The next state, <code>SelectBackupToDelete</code>, queries our <code>_Backup</code> model to find any backups that are older than <code>backupDate</code>.</p>
<pre>class SelectBackupToDelete(DatastoreContinuationFSMAction):

   def getQuery(self, context, obj):
       return _Backup.all().filter('lastBackup &lt;', context['backupDate'])

   def execute(self, context, obj):
       if not obj['result']:
           # we may get no result back at all, so we can terminate
           return None
       backupEntity = obj['result']
       context['model'] = backupEntity.model
       context['backupId'] = backupEntity.backupId
       db.delete(backupEntity)
       return 'ok'</pre>
<ul>
<li><code>getQuery</code> As a <code>DatastoreContinuationFSMAction</code>, we don’t need to implement the continuation method; we only need to implement <code>getQuery</code>. Here, we simply query <code>_Backup</code> for entries whose <code>lastBackup</code> is earlier than <code>backupDate</code>. Because the default batch size is 1, the execute method will be called (in parallel) for each <code>_Backup</code> entity retrieved.</li>
<li><code>execute</code> Using the <code>_Backup</code> entity retrieved, which <code>DatastoreContinuationFSMAction</code> stores on <code>obj['result']</code>, we add the model and <code>backupId</code> to the context. We can delete the entity from <code>_Backup</code> and pass control to the next state which will delete the actual entities.</li>
</ul>
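<p>The selection rule is easy to check by hand. This is a minimal sketch with plain dicts standing in for <code>_Backup</code> entities; the records and the <code>select_backups_to_delete</code> name are invented for the example.</p>

```python
import datetime


def select_backups_to_delete(backup_records, backup_date):
    """Return records whose lastBackup is strictly earlier than backup_date."""
    return [r for r in backup_records if backup_date > r['lastBackup']]


# Made-up tracking records, one per (backupId, model) pair.
records = [
    {'backupId': 'backup-2011-06-27', 'model': 'Account',
     'lastBackup': datetime.datetime(2011, 7, 3, 1, 0)},
    {'backupId': 'backup-2011-08-29', 'model': 'Account',
     'lastBackup': datetime.datetime(2011, 9, 2, 1, 0)},
]
cutoff = datetime.datetime(2011, 7, 4, 1, 0)
stale = select_backups_to_delete(records, cutoff)
```

<p>Only the June snapshot is selected. Note that the comparison is against the time of the <em>last</em> incremental run for a snapshot, not the Monday in its ID, so a snapshot is kept until its most recent update is older than the cutoff.</p>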
<p>The next and final state, <code>DeleteBackupEntity</code>, is another continuation that walks across all the entities of the given model in the namespace named by <code>backupId</code>, and deletes them.</p>
<pre>class DeleteBackupEntity(DatastoreContinuationFSMAction):

   def getQuery(self, context, obj):
       model = context['model']
       backupId = context['backupId']
       backupModelClass = BACKUP_CLASS[model]
       return backupModelClass.all(keys_only=True, namespace=backupId)

   def getBatchSize(self, context, obj):
       return 50

   def execute(self, context, obj):
       """ Actually delete the keys. """
       if obj['results']:
           db.delete(obj['results'])</pre>
<ul>
<li><code>getQuery</code> Using the model and <code>backupId</code> from the previous state, we construct a query for all the entities, <strong>taking care to ensure the query is constrained to the right namespace</strong>. Also, since we’re just going to be deleting, we only need to retrieve the keys.</li>
<li><code>getBatchSize</code> We can delete a number of entities at a time, e.g., 50.</li>
<li><code>execute</code> The parent class <code>DatastoreContinuationFSMAction</code> places the results of the query on <code>obj['results']</code>; these results are a list of <code>db.Key</code> that we can pass directly to <code>db.delete</code>.</li>
</ul>
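<p>As a rough illustration of deleting one namespace in batches, the sketch below again uses a dict keyed by <code>(namespace, key)</code> tuples in place of the datastore. The function name and the sample keys are assumptions, and the loop runs sequentially, whereas Fantasm would process the batches as separate tasks.</p>

```python
def delete_backup_entities(backups, backup_id, batch_size=50):
    """Delete every entry in the backup_id namespace, batch_size keys at a time.

    Returns the number of batches issued.
    """
    # Constrain the key list to the one namespace, like the keys-only query.
    keys = sorted(k for k in backups if k[0] == backup_id)
    batches = 0
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            del backups[key]
        batches += 1
    return batches


# 120 made-up entries in the old snapshot, plus one in a newer snapshot.
backups = dict((('backup-2011-06-27', 'acct-%03d' % i), {}) for i in range(120))
backups[('backup-2011-08-29', 'acct-000')] = {}
batches = delete_backup_entities(backups, 'backup-2011-06-27')
```

<p>With 120 entries and a batch size of 50, this takes three batches and leaves the entry in the newer namespace untouched, which is the reason the namespace constraint on the query matters so much.</p>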
<p>The only thing left is to invoke the <code>DeleteBackup</code> machine. We could use a <code>cron.yaml</code> entry to schedule it, but I’ll take the opportunity to highlight another Fantasm feature: <code>spawn</code>. <code>spawn</code> can be used to invoke other machines with given contexts. Calling <code>spawn</code> simply queues a task (actually a task for each context provided), so it can be called with little overhead. We can add the spawn call to the first state of our original <code>Backup</code> machine:</p>
<pre>class SelectBackup(FSMAction):
   def execute(self, context, obj, utcnow=None):
       utcnow = utcnow or datetime.datetime.utcnow()
       fullBackupDay = utcnow - datetime.timedelta(days=utcnow.weekday()) \
                       + datetime.timedelta(days=FULL_BACKUP_ON_DAY_OF_WEEK)
       # above date might be in the future, if so, we need to back it up a week
       if fullBackupDay &gt; utcnow:
           fullBackupDay = fullBackupDay - datetime.timedelta(days=7)
       context['backupId'] = 'backup-' + fullBackupDay.strftime('%Y-%m-%d')

       context.spawn('DeleteBackup',
                     [{'daysOld': 60}], # one machine is spawned per context in this list
                     countdown=4*60*60)

       return 'ok'</pre>
<p>Note that <code>spawn</code> also allows us to specify a <code>countdown</code>, which is the number of seconds in the future to start the machine. In this case, we are leaving some time to allow our <code>Backup</code> machine to complete before kicking off the <code>DeleteBackup</code> machine (though, practically speaking, they could run at the same time).</p>
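<p>For a sense of the timing, the arithmetic can be worked through directly. This is only a back-of-the-envelope sketch: <code>earliest_start</code> is a hypothetical helper, and it assumes the <code>Backup</code> machine reaches its first state right at the 01:00 cron time.</p>

```python
import datetime

COUNTDOWN_SECONDS = 4 * 60 * 60  # the countdown passed to spawn above


def earliest_start(spawned_at, countdown=COUNTDOWN_SECONDS):
    """Earliest time a task queued at spawned_at with this countdown can run."""
    return spawned_at + datetime.timedelta(seconds=countdown)


start = earliest_start(datetime.datetime(2011, 9, 2, 1, 0))
print(start)  # 2011-09-02 05:00:00
```

<p>So with the daily 01:00 schedule, the <code>DeleteBackup</code> machine would start no earlier than 05:00 the same day.</p>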
<p>The full source for this example can be found in the Fantasm project at <a href="http://code.google.com/p/fantasm">http://code.google.com/p/fantasm</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vendasta.com/2011/09/02/incremental-data-backups-on-app-engine-using-fantasm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Those guys are really good.</title>
		<link>http://www.vendasta.com/2008/02/26/those-guys-are-really-good/</link>
		<comments>http://www.vendasta.com/2008/02/26/those-guys-are-really-good/#comments</comments>
		<pubDate>Wed, 27 Feb 2008 03:48:47 +0000</pubDate>
		<dc:creator>Jason A. Collins</dc:creator>
				<category><![CDATA[People & Culture]]></category>

		<guid isPermaLink="false">http://vendasta.wordpress.com/?p=19</guid>
		<description><![CDATA[Overheard today in the common area of our building at Innovation Place: &#8220;You know, VendAsta. Those guys are really good.&#8221; I thought to myself, &#8220;Hey, that&#8217;s pretty cool.&#8221; As I eavesdropped some more, it became apparent that he was referring to foosball. Oh well, I guess reputation has to start somewhere&#8230;]]></description>
			<content:encoded><![CDATA[<p>Overheard today in the common area of our building at <a href="http://www.innovationplace.com/" target="_blank">Innovation Place</a>:</p>
<blockquote><p>&#8220;You know, VendAsta. Those guys are <em>really </em>good.&#8221;</p></blockquote>
<p>I thought to myself, &#8220;Hey, that&#8217;s pretty cool.&#8221;</p>
<p>As I eavesdropped some more, it became apparent that he was referring to <a href="http://brendanking.ca/2008/01/30/the-new-foosball-tables-arrived-today/" target="_blank">foosball</a>.</p>
<p>Oh well, I guess reputation has to start somewhere&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.vendasta.com/2008/02/26/those-guys-are-really-good/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

