At Data Applied, we use applied cryptography to solve a variety of problems: verifying license keys, encrypting data, authenticating users, signing code, safeguarding passwords, etc. Our security therefore depends on the proper use of cryptographic primitives.
Previously, I had a chance to work on an S/MIME precursor (used to encrypt e-mails) and on the security features of large products (Microsoft Exchange Server, Microsoft Dynamics CRM), and to interact with NSA cryptographers. Working on security-related projects has given me a chance to observe some design mistakes related to encryption. Here are a few you may find interesting.
1. Impersonation attack – Using encryption for authentication:
Let’s say you want to implement a ticket-based authentication mechanism. After being presented with a proof of someone’s identity (ex: a valid password, a social security number, an SMS message, etc.), your goal is to issue verifiable authentication tickets, each associated with a user account.
Often, the following solution is proposed: use a secret key to encrypt an account ID, and return the result as an authentication ticket. Later, when an authentication ticket is received, decrypt it using the secret key. If decryption succeeds, the ticket is deemed valid, and the decrypted account ID can be used. Unfortunately, this approach is 100% wrong. Read more…
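The full post explains the flaw; the short version is that encryption provides secrecy, not integrity, so "it decrypted" proves nothing about who produced the ticket. A minimal sketch of the standard alternative, authenticating tickets with an HMAC (the key and account IDs below are hypothetical, and a real system would also add an expiry time):

```python
import hmac
import hashlib

SECRET_KEY = b"server-side secret"  # hypothetical; store and rotate securely in practice

def issue_ticket(account_id):
    # Authenticate (rather than encrypt) the account ID with HMAC-SHA256.
    tag = hmac.new(SECRET_KEY, account_id.encode(), hashlib.sha256).hexdigest()
    return account_id + ":" + tag

def verify_ticket(ticket):
    # Recompute the MAC and compare in constant time; a forged or
    # tampered ticket fails verification instead of "decrypting fine".
    account_id, _, tag = ticket.rpartition(":")
    expected = hmac.new(SECRET_KEY, account_id.encode(), hashlib.sha256).hexdigest()
    return account_id if hmac.compare_digest(tag, expected) else None

ticket = issue_ticket("alice")
assert verify_ticket(ticket) == "alice"
assert verify_ticket("mallory:" + "0" * 64) is None  # forgery rejected
```

The account ID travels in the clear here; if it must also stay confidential, encrypt-then-MAC (or an authenticated mode such as AES-GCM) is the usual combination.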
Mono is an exciting project sponsored by Novell whose goal is to run .NET code on any platform, including Linux. While Java has long been portable across platforms, .NET code was for a long time restricted to running on Windows. Thanks to the Mono project, the rules of the game are changing. I should add that the guys at Mono I’ve talked to have always been eager to help. They’re also very passionate about what they are building. If you’re interested in Mono, don’t hesitate to ping them.
At Data Applied, we try to write code that runs well on commodity hardware, and is compatible with low-cost software (for example, we support SQL Server, but also MySQL). Running our server code on Mono would mean eliminating any dependency on ASP.NET, and therefore on Windows licensing fees. For a long time, we tried to make our server code run on Mono. In the end, however, we failed. Here is why.
1. Code parity problems:
Our data visualization client uses Silverlight technology. Silverlight includes a lean-and-mean subset of the .NET platform. This means that it does not include non-generic .NET collections (ex: System.Collections.Hashtable), nor classic XML classes (ex: System.Xml.XmlDocument). Instead, Silverlight developers are expected to use generic collections (ex: System.Collections.Generic.Dictionary&lt;TKey, TValue&gt;) and LINQ to XML classes (ex: System.Xml.Linq.XDocument). Read more…
At Data Applied, we support both MySQL and SQL Server. Because we must guarantee that our code works (and performs!) well using both platforms, we’ve had quite a few opportunities to compare the two (i.e. making our code work with both has been a *huge* pain).
I like SQL Server, but I prefer MySQL. Having spent 10 years at Microsoft, I have no intent to disparage SQL Server: some of its advanced enterprise features just can’t be beat. However, I find that working with MySQL is often much easier. Here is why.
1. Make paging easier to use:
Most high-performance applications working with large data sets must rely on paging to incrementally process and return results. For example, to get the next 1000 rows starting at position 5000 (given a stable order), MySQL lets you use the following syntax:
SELECT * FROM T WHERE ... ORDER BY ... LIMIT 1000 OFFSET 5000
With SQL Server, doing the same thing is a lot more cumbersome. Because the ROW_NUMBER() alias cannot be referenced directly in the WHERE clause of the same query, a subquery (or common table expression) is required. It’s pretty obvious which syntax is easier to use, and developers who want to implement paging may be discouraged by this intimidating alternative:
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY ...) AS RowNumber FROM T WHERE ...
) AS Numbered
WHERE RowNumber > 5000 AND RowNumber <= 6000
ORDER BY RowNumber
2. Avoid ridiculous overflows:
SQL Server suffers from ridiculous numeric overflows. For example, when Read more…
How can time series be predicted, for example to ensure adequate capacity planning? In this example, we analyze traffic logs from the popular bookmarking website Digg.com, and build a forecasting model. We also explain how the same technique can be applied to other types of data.
The following web traffic variables were made available by Digg.com:
- Date (day)
- Number of front page articles
- Total number of comments
- Total number of diggs
- Average number of comments per article
- Average number of diggs per article
- Standard deviation of comments per article
- Standard deviation of diggs per article
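The forecasting model itself is described in the full post; as a minimal illustration of the general idea, here is one very simple approach, a least-squares linear trend extrapolated forward (the daily counts below are made up, not real Digg data):

```python
def linear_trend_forecast(series, horizon):
    """Fit y = a + b*t by ordinary least squares, then extrapolate `horizon` steps."""
    n = len(series)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(series) / n
    # Slope: covariance of (t, y) divided by variance of t.
    b = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, series)) \
        / sum((t - t_mean) ** 2 for t in ts)
    a = y_mean - b * t_mean
    return [a + b * (n + h) for h in range(horizon)]

# Hypothetical daily comment totals for one week.
history = [100, 110, 121, 128, 140, 151, 159]
forecast = linear_trend_forecast(history, 3)
```

A real traffic model would also account for weekly seasonality (weekday vs. weekend patterns are pronounced in site traffic), which a plain linear trend ignores.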
How can clustering be used to identify similar groups of records? In this example, we analyze customer service satisfaction survey results for an IT company, and identify clusters of respondents to improve satisfaction. We also explain how the same technique can be applied to other types of data.
A total of 5500 customer satisfaction survey results were analyzed, with the following answers available (note: all questions were mandatory):
- What was your main reason for contacting technical support (ex: connectivity, maintenance, upgrade, backup)?
- How did you contact technical support (ex: email, phone, chat)?
- The amount of time I had to wait to speak to someone was reasonable (score 1-9)
- The technical support staff was knowledgeable (score 1-9)
- The technical support staff was helpful (score 1-9)
- The technical support staff was easy to understand (score 1-9)
- The technical support staff was able to help me solve my problem quickly (score 1-9)
- Was your problem resolved on the first contact to technical support (yes/no)?
- How would you rate your overall satisfaction (score 1-9)?
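The clustering method used on the survey answers is covered in the full post; a minimal sketch of the underlying idea is plain k-means on the numeric 1–9 scores (the respondent vectors below are hypothetical, not the actual survey data):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Tiny k-means: returns final centroids and a cluster label per point."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, pt in enumerate(points):
            dists = [sum((p - q) ** 2 for p, q in zip(pt, centroids[c]))
                     for c in range(k)]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical (wait time, helpfulness, overall satisfaction) scores.
surveys = [(1, 2, 1), (2, 1, 2), (8, 9, 9), (9, 8, 8)]
centroids, labels = kmeans(surveys, 2)  # separates unhappy vs. happy respondents
```

Categorical answers (contact reason, channel, yes/no resolution) would need to be encoded numerically, or handled with a distance measure that supports mixed types, before being fed to a method like this.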
How can the influence of numeric variables on others be measured and visualized? In this example, using economic data for different countries, we analyze correlations between variables and construct a representative graph. We also explain how the same technique can be applied to other types of data.
The following economic variables were obtained for 187 countries over a 55-year span (10287 rows total):
- Country name and code
- Exchange rate
- Purchasing power parity
- Real GDP
- Consumption share of GDP
- Government share of GDP
- Investment share of GDP
- Openness in constant prices
- …several others…
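The correlation measure behind such a graph is typically the Pearson coefficient between each pair of numeric variables; a minimal sketch (the GDP and consumption figures below are made up for illustration, not the real country data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # ranges from -1 (inverse) to +1 (direct)

# Hypothetical values for two variables across five country-years.
real_gdp = [1.0, 2.0, 3.0, 4.0, 5.0]
consumption_share = [0.9, 2.1, 2.9, 4.2, 5.1]
r = pearson(real_gdp, consumption_share)  # strongly positive
```

Computing this coefficient for every pair of variables and drawing an edge wherever its absolute value exceeds a threshold yields a correlation graph of the kind described above.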
How can associations and relationships between variables be identified? In this example, we analyze anonymized banking customer data, and identify hidden associations to increase profitability. We also explain how the same technique can be applied to other types of data.
The following anonymized variables were made available by a bank:
- Number of cars
- Number of children
- Region (ex: city)
- Credit limit
- Opened a savings account
- Checking account balance
- Customer since (in months)
- Number of support calls
- Mortgage loan
- Defaulted on payment
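Association analysis of this kind boils down to finding rules "customers with A also tend to have B" whose support and confidence clear a threshold; a minimal sketch restricted to single-item antecedents (the customer records below are hypothetical, not the bank's data):

```python
from itertools import combinations

def association_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Enumerate rules A -> B (single antecedent) meeting support/confidence thresholds."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            both = sum(1 for t in transactions if lhs in t and rhs in t)
            lhs_count = sum(1 for t in transactions if lhs in t)
            support = both / n                                 # fraction with both items
            confidence = both / lhs_count if lhs_count else 0  # P(rhs | lhs)
            if support >= min_support and confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical customer traits.
customers = [
    {"mortgage", "savings", "children"},
    {"mortgage", "savings"},
    {"savings", "children"},
    {"mortgage", "savings", "children"},
]
rules = association_rules(customers)  # e.g. mortgage -> savings
```

Real association miners (Apriori and its descendants) extend this to multi-item antecedents and prune the search space using the fact that any superset of an infrequent itemset is itself infrequent.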