Inside MSDN: Building the MSDN Aggregation System

Have you visited msdn2.microsoft.com? It’s the new online face of the MSDN® Developer Tools and Enterprise Server documentation. The infrastructure behind it includes a system developed by my team at Microsoft for aggregating information related to our content. By the time you read this, we expect that all of msdn.microsoft.com will have migrated to our new system and the msdn2 name will have been retired.
When my team set out to build our content aggregation system, we worked toward several high-level goals:

Keep data providers and consumers loosely coupled using a publish/subscribe model and asynchronous messaging interfaces.
Let individual usage scenarios define and manage the structure and meaning of their data. We also wanted to keep those definitions opaque to the system and to unrelated scenarios.
Make it easy to plug new data providers and consumers into the system regardless of how complex their own internal logic may be.

These constraints were necessary because the universe of content and data we wanted to associate with MSDN content is widely varied and we couldn’t enforce structure or taxonomy upon it. For instance, we wanted to be able to recommend blog posts that refer to our documentation and allow users to tag documents to build up a "folksonomy" that describes MSDN content.

Building the System
After agreeing on our design goals, we began looking for technologies to support them. It turned out that SQL Server™ Service Broker offered the asynchronous messaging support we needed and, since the message-queuing infrastructure is tightly integrated with the SQL Server database engine, our existing database backup, administration, and failover procedures could cover our messaging solution as well.
Fortunately, SQL Server Service Broker also supports significant scale-out. For example, service endpoints (the senders and receivers of messages) can be remote or local, and the database code does not need to know where endpoints are. This let us start out with all of our services on a single server and then move them to remote servers as loads increased.
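Relocating a service is then a configuration change rather than a code change. A minimal sketch (route name, service name, and server address here are all hypothetical) of pointing messages for a service at a remote broker instance:

-- Hypothetical sketch: route messages for a service to a remote broker
-- instance. The SEND code is unchanged; Service Broker follows the route.
CREATE ROUTE AggregatorServiceRoute
    WITH SERVICE_NAME = 'urn:msdn/aggregator/SomeService',
         ADDRESS = 'TCP://remote-server:4022';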
The common language runtime (CLR) integration in SQL Server 2005 allowed us to build a plug-in model for data providers and consumers and to implement complex business logic in the database layer. If you’ve written extended stored procedures to do things like pull data from a SOAP endpoint, you’ll appreciate how much more straightforward it is to implement this functionality using SQL Server CLR integration.
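To illustrate (this sample is not from our system; the procedure name and URL handling are purely illustrative), a CLR stored procedure that pulls data over HTTP takes only a few lines of C# and can be cataloged with EXTERNAL_ACCESS permission:

using System;
using System.Data.SqlTypes;
using System.Net;
using Microsoft.SqlServer.Server;

public partial class StoredProcedures
{
    // Illustrative CLR stored procedure: fetch the body of an HTTP resource
    // (an RSS feed, for instance) and send it back to the caller as an
    // informational message. Requires PERMISSION_SET = EXTERNAL_ACCESS.
    [SqlProcedure]
    public static void FetchUrl(SqlString url)
    {
        using (WebClient client = new WebClient())
        {
            string body = client.DownloadString(url.Value);
            // Pipe.Send(string) is limited to 4,000 characters.
            SqlContext.Pipe.Send(
                body.Length > 4000 ? body.Substring(0, 4000) : body);
        }
    }
}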
The public interfaces to our system are message-based, and since the system is built on top of Service Broker, we were able to provide all the benefits of reliable asynchronous messaging without a lot of additional work. We expose two Service Broker services (which are the logical endpoints for Service Broker messaging) as data input points for the system.
Briefly, services are communication endpoints that use queues as their backing stores. Services specify what types of messages they can process by bundling them into contracts. The glossary on Service Broker Objects explains further. The following are the two services that we expose:
The Job Service receives messages that provide the system with the information it needs to execute a Job. For us, a Job is a unit of work that brings data related to one or more MSDN documents into the system. This is how we support the pull model of data acquisition. A job message contains the information a JobWorker needs to operate, which will differ for various JobWorkers. The data retrieved by the JobWorker is then submitted back into the system via a second Service Broker service called the DataSubmission service.
The DataSubmission Service receives messages containing the data to be associated with MSDN content. DataSubmission messages can be submitted either by JobWorkers or by external data providers using a "push" model. In the latter case, data providers need to know the structure of the message envelope required by the DataSubmission service.
Message Structure and Validation

For either service, certain pieces of information used by the system to route messages and identify data sources must be included in all messages. So the message envelope needs to have a structure verifiable by the system, and the message payload needs to be structured according to a contract agreed upon by the provider and consumer while remaining opaque to the system.
We use XML for our messages. SQL Server Service Broker provides internal support for XML Schema Definitions (XSDs) to validate messages, so we adopted an XSD-based open content model to provide this multi-layer message validation. Using an open content model lets different consumers of XML data validate parts of an XML document while ignoring parts that aren’t relevant to their domain. Our system can validate the envelope data it needs without knowing the structure of the message payload, while data consumers can use domain-specific XSDs to validate the payload of messages they consume. This enables providers and consumers to agree on a schema-based contract defining their communications without the system (or other consumers) needing knowledge of that contract.
In the Job.xsd schema we’ve defined an optional jobData element that can contain children from any namespace (xs:any namespace="##any"). An XML document can contain any valid XML in the jobData element and still be validated by this schema. The system uses this schema to enforce the structure of the envelope data it needs while ignoring the message payload. Individual JobWorkers can then validate the message payload using a second XSD shared with the Job requester. Note that when defining an open content model schema outside of SQL Server 2005, the common pattern is to define the xs:any element with a processContents="lax" attribute, which indicates that the parser should make a best-effort attempt at validation but not fail if unable to validate. SQL Server 2005 does not support "lax" schema validation, so we specify processContents="skip" instead. Once the schema has been defined, we make it available to Service Broker and associate it with message types.
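Job.xsd itself is too long to reproduce here, but a minimal sketch of such a schema and its registration conveys the idea (apart from the jobData element and the identifying attributes discussed below, the names and details are reconstructed for illustration):

CREATE XML SCHEMA COLLECTION JobSchemas AS N'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:msdn/aggregator/job"
           elementFormDefault="qualified">
  <xs:element name="job">
    <xs:complexType>
      <xs:sequence>
        <!-- The payload is opaque to the system: any well-formed XML
             is allowed. SQL Server 2005 does not support
             processContents="lax", so we use "skip". -->
        <xs:element name="jobData" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:any namespace="##any" processContents="skip"
                      minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <!-- Envelope identification attributes -->
      <xs:attribute name="source" type="xs:string" use="required"/>
      <xs:attribute name="sourceDomain" type="xs:string" use="required"/>
      <xs:attribute name="id" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>';

-- Associate the schema with the message type so Service Broker
-- validates every JobRequestMessage against it.
CREATE MESSAGE TYPE [urn:msdn/aggregator/JobRequestMessage]
    VALIDATION = VALID_XML WITH SCHEMA COLLECTION JobSchemas;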
By creating the message type using WITH SCHEMA COLLECTION validation, we’re telling Service Broker to validate all messages of the specified type against our XML schema. If a message that does not conform to the schema is received, Service Broker discards it and returns an error message to the service that sent it. This allows downstream code to consume messages without needing to validate them again, as Service Broker guarantees that all messages of this type on the queue conform to our schema.
In our job schema, attributes on the root element carry the identification information we use throughout the system to uniquely identify data items. We use the combination of source, sourceDomain, and id values as our identifiers to mitigate namespace collisions across multiple providers. This protects against cases in which two different providers use the same identifier, an important safeguard since we can’t control how providers generate IDs. Envelope elements are generally used by scheduled jobs to pass and maintain information about when they should run, last-run status, and the like.
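Under this scheme, a job message might look like the following (the payload element and its namespace are hypothetical; only the jobData wrapper and the source, sourceDomain, and id attributes come from the schema described above):

<job xmlns="urn:msdn/aggregator/job"
     source="BlogCrawler" sourceDomain="blogs.msdn.com" id="feed-1234">
  <!-- Opaque payload, validated only by the JobWorker and its requester -->
  <jobData>
    <rssJob xmlns="urn:example/rssworker">
      <feedUri>http://blogs.msdn.com/SampleBlog/rss.xml</feedUri>
    </rssJob>
  </jobData>
</job>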
Job Worker Infrastructure

The MSDN Aggregation system supports a pull model of data acquisition through our JobWorker framework. A JobWorker is responsible for retrieving data related to MSDN content and adding it to the system via the data-submission messaging infrastructure. The framework is designed to make it relatively easy to plug new data providers into the system, supporting rapid implementation of new data-input scenarios. Our design needed to address ease of implementation as well as ease of deployment and configuration.
CLR integration enabled us to implement the JobWorker framework in C# using powerful object-oriented patterns while exposing its functionality to T-SQL and Service Broker code.
Before we look at how JobWorkers process messages, however, we need to get Service Broker set up to deliver those messages.
Service Broker Plumbing

We saw earlier how to define a Service Broker message type and associate an XML schema with it. Let’s take a look at the rest of the Service Broker setup needed to define our Job messaging infrastructure. We’ve already defined our JobRequest message. Now we’ll bundle this message into a Service Broker contract that specifies its directionality:
-- create Job contract
CREATE CONTRACT [urn:msdn/aggregator/JobRequests]
    ([urn:msdn/aggregator/JobRequestMessage] SENT BY INITIATOR);
This contract specifies that JobRequestMessages can be sent by the service delivering messages (the Initiator of the conversation).
Now we create the queues that will store these messages. We use unidirectional queues in our system, which means that a service receives messages from one queue and replies to another. This simplifies processing, as service activation code doesn’t need to filter out messages it doesn’t care about. Defining separate queues also enables scale-out, since the queues can be hosted either locally or remotely without any need to rewrite your application code. Service Broker itself doesn’t prevent you from sending and receiving on the same queue, however.
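The CREATE QUEUE statements themselves are minimal. A sketch (only the posting queue is named in the service definitions below; the reply-queue name is an assumption):

-- Queue backing the JobService; JobRequest messages land here
CREATE QUEUE [urn:msdn/aggregator/JobPostingQueue]
    WITH STATUS = ON;

-- Queue backing the posting service; receives acknowledgment and
-- end-of-dialog messages (the name here is an assumption)
CREATE QUEUE [urn:msdn/aggregator/JobReplyQueue]
    WITH STATUS = ON;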
Now that we have our contract and queues in place, we can define the services that use them:
-- Service for handling (receiving) JobRequest messages
CREATE SERVICE [urn:msdn/aggregator/JobService]
    AUTHORIZATION [dbo]
    ON QUEUE [urn:msdn/aggregator/JobPostingQueue]
    ([urn:msdn/aggregator/JobRequests]);

-- Service for posting JobRequest messages (used by clients)
CREATE SERVICE [urn:msdn/aggregator/JobPostingService]
    AUTHORIZATION [dbo]
    ON QUEUE [urn:msdn/aggregator/JobReplyQueue];
The JobPostingService isn’t bound to a contract, as it only initiates Dialog Conversations. Only the receiving end of a conversation (the target) needs to be bound to a contract when defining the service, but the contract does need to be shared by the services on both ends of a conversation. The service on the sending end of a conversation uses the contract at run time when sending messages to the target. Here’s an example:
-- Begin dialog with JobService. Sender needs to know about
-- shared contract when sending.
DECLARE @dialog_handle UNIQUEIDENTIFIER;

BEGIN DIALOG CONVERSATION @dialog_handle
    FROM SERVICE [urn:msdn/aggregator/JobPostingService]
    TO SERVICE 'urn:msdn/aggregator/JobService'
    ON CONTRACT [urn:msdn/aggregator/JobRequests];

-- Send message on dialog conversation
SEND ON CONVERSATION @dialog_handle
    MESSAGE TYPE [urn:msdn/aggregator/JobRequestMessage]
    (@message);
So now we’ve got a pair of endpoints that can send and receive messages (the services). We’ve defined the messages we can understand and bound them to the services using a contract, and we’ve defined the backing store used to manage messages. That puts the Service Broker plumbing for our Job processing in place. Now let’s look at how we actually process the messages that come through this plumbing.
JobWorker Interfaces

With the Service Broker messaging infrastructure set up, we defined a set of CLR interfaces that represent the contract JobWorkers must implement to participate in the framework.
JobWorkers implement the IJobWorker interface, which defines a single Execute method and several properties for maintaining Job state data. Execute takes an IJobContext parameter, which is an interface defining the Submit method used to publish data back into the system. We provide a default implementation of IJobContext that implements basic submission functionality, and JobWorkers are free to provide their own implementation if the default doesn’t meet their needs. By exposing IJobContext as a parameter to IJobWorker.Execute, we’re not only enforcing a behavioral contract but also guaranteeing that submission functionality is always available to implementers.
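The interface definitions aren’t reproduced in full here; a minimal sketch consistent with this description (member names beyond Execute and Submit are assumptions) looks something like this:

using System;
using System.Xml;

// Contract for publishing data back into the system.
public interface IJobContext
{
    // Submit data to the DataSubmission service.
    void Submit(XmlDocument dataSubmissionMessage);
}

// Contract a JobWorker implements to participate in the framework.
public interface IJobWorker
{
    // Properties for maintaining Job state data (names are illustrative).
    string LastRunStatus { get; set; }
    DateTime LastRunTime { get; set; }

    // message is the raw Job message received on the JobPosting queue.
    void Execute(IJobContext context, XmlDocument message);
}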
Consider a simple implementation of Execute that uses XSLT to transform input XML data into the structure required by the DataSubmission service. In this case the message contains the data itself, and the JobWorker only needs to transform it before submitting. In a more common scenario, where the JobWorker needs to acquire data from outside the system, the message parameter would contain the information the JobWorker needs to retrieve the data (such as the URI of an RSS feed). In any case, the pattern is that the message parameter is the raw Job message received on the JobPosting queue. Because our XSDs define an open content model, JobWorkers can validate the message data as long as they and the submitter of the Job message share an XSD defining the message payload.
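A sketch of such a transform-only JobWorker, using the interfaces above (the class and its details are hypothetical; System.Xml.Xsl does the heavy lifting):

using System;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical JobWorker for the transform-only scenario: the Job message
// already carries the data, so Execute reshapes it and submits the result.
public class TransformJobWorker : IJobWorker
{
    private XslCompiledTransform transform = new XslCompiledTransform();
    private string lastRunStatus;
    private DateTime lastRunTime;

    public string LastRunStatus
    {
        get { return lastRunStatus; }
        set { lastRunStatus = value; }
    }

    public DateTime LastRunTime
    {
        get { return lastRunTime; }
        set { lastRunTime = value; }
    }

    public TransformJobWorker(string stylesheetPath)
    {
        // Stylesheet that maps the payload into the DataSubmission structure.
        transform.Load(stylesheetPath);
    }

    public void Execute(IJobContext context, XmlDocument message)
    {
        XmlDocument submission = new XmlDocument();
        using (XmlWriter writer = submission.CreateNavigator().AppendChild())
        {
            transform.Transform(message, null, writer);
        }
        context.Submit(submission);
        lastRunTime = DateTime.UtcNow;
        lastRunStatus = "Succeeded";
    }
}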
Once a JobWorker has been coded and compiled, it needs to be cataloged to SQL Server so that it is available to database application code. Cataloging an assembly stores the assembly’s Microsoft intermediate language (MSIL) code in a SQL Server table as varbinary data, and this is what the SQL Server CLR host uses when loading our assembly. An example of the syntax for cataloging an assembly is shown here:
-- catalog assembly from file location
CREATE ASSEMBLY [VeryUsefulJobWorker]
    FROM 'd:\AggregationBinaries\VeryUsefulJobWorker.dll'
    WITH PERMISSION_SET = EXTERNAL_ACCESS;
For a more detailed description of this statement and its syntax, see msdn2.microsoft.com/library/ms189524.aspx.
When you execute this statement, the SQL Server CLR host loads the assembly specified in the FROM clause, checks that the assembly contains valid MSIL, and loads any dependent assemblies referenced by the assembly being cataloged. If the PERMISSION_SET specified is SAFE or EXTERNAL_ACCESS, additional type-safety verification is performed, and classes in the assemblies are checked for constructs such as writable static members and finalizer methods (neither of which is allowed in assemblies using these permission sets).
After an assembly has been cataloged, any procedures, functions, and the like that are to be exposed directly to database application code must themselves be cataloged using the appropriate CREATE statements, such as CREATE PROCEDURE. Since our JobWorkers don’t expose functionality to the database application code, we skip this step. If you’re wondering how useful JobWorkers really are if we don’t expose their functionality to database code, you’ll soon see.
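For comparison, exposing a CLR method to T-SQL after cataloging would look something like this (the procedure, class, and method names are hypothetical):

-- Hypothetical: expose a static method of a cataloged assembly as a
-- T-SQL stored procedure. Our JobWorkers skip this step.
CREATE PROCEDURE ExecuteVeryUsefulJob
    @message XML
AS EXTERNAL NAME [VeryUsefulJobWorker].[StoredProcedures].[ExecuteJob];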

2007-03-01

 
