News aggregator
Sightings: A Vennerable Challenge
Venn diagrams are a strange mix of structure and data visualization. In my latest Sightings column (PDF) for American Scientist, I use the example of a visualization challenge from last year to discuss different ways to show the same data about diagnosis techniques for autism in young children. This also sparked the launch of a new site feature: Ask EagerEyes.
Regular readers of this website will perhaps remember the Autism diagnosis accuracy redesign challenge originally posted on Nuit Blanche, which asked for better ways to show data that had been displayed in a Venn diagram. There were a number of interesting designs, and I also contributed one. All of these were clearly better than the original, but I was most interested in the ones that showed the structure of the data. So I ended up talking about my own redesign as well as Patrick Murphy's redesign of mine.
Ask EagerEyes is an extension of what I am trying to achieve with Sightings (which I took over from Felice Frankel in the summer): to make the scientific community aware of information visualization and to get a better understanding of the problems scientists are facing in analyzing their data and communicating their results. The autism case study is a great example of the need for a better understanding of these questions.
But Ask EagerEyes is not limited to scientific data. If you have any kind of data you would like to understand better, or have a visualization of data that you think could be improved, drop me a note. The reason this is not part of a forum is that I want to make it possible to discuss problems that involve confidential or otherwise unpublished data. Eventually, of course, I hope to be able to talk about the problems posed and solutions found on this site. Some of the questions might also become challenges for my dear readers.
Charts and Metaphors
What do pies, waffles, and donuts have in common? They're charts, or rather metaphors for popular charts. Why do we need to name charts after food? And what can we learn from this for getting the idea of visualization across more effectively?
The heatmap is another example. As I learned recently, people like to call treemaps "heatmaps." And perhaps they're not even that far off, given that most of these are based on entirely flat data, anyway. And a heatmap is easy to understand: it's a map. It shows where it's hot. We're looking for the hot spots. What's hard about that? A treemap on the other hand looks nothing like a tree, and when it's actually showing a hierarchy, it gets confusing.
Some images have recently been making the rounds that revert the pie metaphor back to actual food. There's Pie I have eaten/Pie I have not eaten, and then there's Mary & Matt's Chocolate Pie Chart (which, confusingly, is not actually a pie). In addition to providing distraction for a few seconds, these images made me think about the food connection. Why charts and food? Is it because of the (rather obvious) pie chart?
Metaphors for charts go back all the way to Nightingale's mid-1800s Coxcomb Plot - using a flower metaphor. The coxcomb was meant to communicate data to Queen Victoria of England, who Nightingale would have put to sleep with her tables of numbers. The chart not only showed the data impressively, it was also pretty to look at (prettier than a bar chart).
Metaphors are important. They make abstract concepts easier to grasp and remember, and we often use them without being aware of them (did you notice my metaphorical use of the word "grasp" a moment ago?).
Metaphors are also dangerous. They can become overly cute and we can get lost in making a visualization fit a particular metaphor. I don't know how many flower visualizations I have seen, from a single flower to whole blooming meadows showing everything from survey data to live conversations in newsgroups. Few (if any) of them were actually useful.
But to capture people's imagination, we need to be a bit more metaphoric. Perhaps we should rename the scatterplot into the salt'n'pepper plot. Or call parallel coordinates the spaghetti plot. That might even lead to new ideas: what would a sushi plot look like? Or a cupcake plot? What about the new tea leaf plot?
Visualization is not doing a good job marketing itself, and the names of visualization techniques certainly don't help. While forcing a particular metaphor won't help us, perhaps we should invest a bit of thought into calling our next method something a bit more marketable than "parallel coordinates." Can we make visualization sound interesting and fun without being stupid?
The Ethics of Business Presentations
I saw a presentation about business dashboard software by a guy from MicroStrategy yesterday, and started to wonder about the ethics of attribution in the business world. He showed a demo of a "bubble chart" that happened to be about fertility rate and life expectancy in different parts of the world over the last 20 years – in other words, Hans Rosling's example and visualization. There was no attribution; he made it sound like he had come up with that himself.
This was part of a small meeting of about 20 business folks, and it was sponsored by MicroStrategy. The presentation wasn't bad, though I did see a few of the things Stephen Few likes to jump on: gauges, flashy graphics, and he said something like "the most important thing of course is to have dashboard desktop-publishing quality output." To his credit, he cited Gene Zelazny's Say it With Charts, which has a food-pyramid style of recommendations for how many percent of your dashboard should be bars, lines, pies, etc. And if I remember correctly, 50% should be lines and bars, and only 5% pies. Most of his examples were also pretty reasonable, despite the occasional gratuitous animation. But I digress.
Their bubble chart is very similar to gapminder, showing size and position, and of course including playback controls. You can also split a bubble into smaller parts if it is aggregated, and that also looks exactly like in gapminder. But what ticked me off was the example: life expectancy vs. fertility rate by continent over time. Couldn't he have at least picked some other time-dependent data? I doubt that Rosling would mind, but still.
He did mention Stephen Few's and Edward Tufte's books at the end (which I found interesting, especially Few), but no mention of Rosling. They do use a lot of relatively current visualization ideas (including treemaps, which they call heatmaps), and that is certainly a good thing. And they can't give credit for everything in a presentation. But when they take such a big example almost verbatim, shouldn't there be at least a mention of the name? Or don't people do that in business? And is that considered ethically okay?
Pushing Data over Email
Email is still a useful transport mechanism for data (like Google Analytics reports), despite the availability of ftp, web services, etc. Some websites offer email delivery for cheap, while other forms of access can cost a lot of money. Email is also a push service, meaning you do not have to ask periodically if new data has arrived - if you do it right. Of course, that service is rather useless without an automated way to get that data into a database. Here is an introduction to the procmail program and the ancient art of the Unix mail filter.
Pushing Email
But first a bit of email theory: When you send an email, your mail program contacts the mail server of the recipient (or an intermediary mail server, which then does the same), and tells it the sender, recipient, content, etc. of the email. The email is thus 'pushed' to the recipient's mail server, a process that typically takes a few seconds (depending on the size of the email).
This is similar to the way a letter or parcel is sent. The letter is transported to the recipient's mailbox and left there. Just like we have to check the mailbox periodically to see if new mail has arrived, most people's email programs ask their mail server if there is new email every few minutes. This part is usually referred to as 'pull,' because it requires the recipient to periodically check and retrieve the message.
But what if we were to live right next to the mailbox? Or if we could ask the friendly postman to knock on our door when we get a letter? That's the way things work when you have an account directly on the server that also acts as your mail server: a new email appears in your mailbox the moment it is delivered. Not only is the email 'pushed' to the user, the mail server can also check if the user has left any instructions on what to do when a new email arrives, like filing it depending on the sender and subject, running it through a spam filter, etc.
Email Filters
The mechanism for filtering email is known as the .forward file. This file lives in the user's home directory and is called a "dot file" (because it starts with a period, it is not normally shown when listing the directory). As its name suggests, it can contain one or more addresses that each email to that user is to be forwarded to – this is how simple distribution lists and functional email roles (like support, office, abuse) used to be handled. In addition, the syntax also allows the specification of a program to hand the email over to.
There are several programs for this purpose (like the aptly-named vacation program), the best-known of which is procmail. Procmail uses a further dot file, .procmailrc, to specify conditions and actions to be performed when an email matches those conditions. This is a lot more flexible than a simple forward list, and provides actions such as moving an email to a particular folder, discarding it, or handing it over to yet another program.
The Setup
What I will describe in the following is a particular setup that I use for collecting data sent through email. To use this, you will have to have access to a Unix/Linux host, shell access (it's possible to do these things without shell access, but difficult to debug when things don't work), and some familiarity with the Unix command line. If you are running your own server, you have to have a mail transfer agent set up and accepting email.
I get the data emailed to a Gmail account, which stores it as a backup and forwards it to my actual data collection account. This is important because it means that I don't have to go to great lengths to check for errors (I can always re-forward my backup), and it also hides my data collection email address.
When an email is received by the data collector that matches the criteria (i.e., coming from a particular address and correct subject), it is run through a filter that extracts any attachments and stores them in a particular directory, and is then discarded. Of course, the same mechanism could also be used to do things like run a Twitter bot.
Other scripts are run in a cron job to pick up the files and push them into a database or do other things with them. The reason for this is that I want to do as little as possible in the actual mail filter, to reduce complexity and sources for errors.
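A minimal sketch of what such a pick-up script could look like, under loudly stated assumptions: the attachments are CSV files, the database is SQLite, and the table name `rows`, the `processed` subdirectory, and the function name `load_new_files` are all hypothetical. None of these details come from the setup described here.

```python
import csv
import os
import sqlite3


def load_new_files(data_dir: str, db_path: str) -> int:
    """Insert every row of every .csv file in data_dir, then archive the file.

    Returns the number of rows inserted. Table layout and directory names
    are assumptions made for this illustration only.
    """
    done_dir = os.path.join(data_dir, "processed")
    os.makedirs(done_dir, exist_ok=True)

    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rows (source TEXT, line INTEGER, content TEXT)"
    )

    inserted = 0
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if not (os.path.isfile(path) and name.endswith(".csv")):
            continue  # skip subdirectories and anything that is not a CSV file
        with open(path, newline="") as handle:
            for line_no, row in enumerate(csv.reader(handle)):
                conn.execute(
                    "INSERT INTO rows VALUES (?, ?, ?)",
                    (name, line_no, ",".join(row)),
                )
                inserted += 1
        conn.commit()
        # Move the file only after the commit, so a crash never loses data.
        os.replace(path, os.path.join(done_dir, name))

    conn.close()
    return inserted
```

Run from cron, a script like this can be re-run safely: files that were already loaded have been moved out of the way, so a second pass inserts nothing.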
Procmail
Like many Unix programs, procmail has been around for many, many years (the first version was released in December 1990). Its age does not take away from its functionality, but it does explain the cryptic definition file syntax. Documentation is also largely cryptic; there is a very readable introduction, but it's also rather long to cover all the program's options.
The first step in using procmail is to actually have it run for every email. The way this is done is with the following line in your .forward file:
"|/usr/bin/procmail"
Make sure the path for procmail is correct, but other than that, this is very simple. Now that procmail is being run on every email the user receives, we have to write the rules for it. This is what such a rule (stored in .procmailrc) looks like:
:0
* ^Subject: (Fwd: )?Analytics eagereyes.org
| bin/extractAttachments.py /tmp/data/www
The first line starts the rule and allows a number of flags (like c for making a copy of the mail, so it can be processed by further rules even if this one matches). The second line contains a condition, which is a regular expression. Any header field can be used here, and several conditions can be put on separate lines, which all have to be met for the rule to become active. The last line specifies the action, which in this case is to run a program. As the use of the pipe character (|) suggests, the program will receive the email through its standard input stream (stdin).
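To get a feel for which subject lines the condition accepts, here is a rough approximation in Python. It is only a sketch: procmail uses its own egrep-style matcher, so Python's `re` module stands in for it, and the helper name `rule_matches` is made up. The case-insensitive flag mirrors procmail's default behavior when the `D` flag is not given.

```python
import re

# The condition from the rule, without procmail's leading "* ".
CONDITION = r"^Subject: (Fwd: )?Analytics eagereyes.org"

# IGNORECASE approximates procmail's default case-insensitive matching;
# MULTILINE lets ^ anchor at the start of every header line.
PATTERN = re.compile(CONDITION, re.IGNORECASE | re.MULTILINE)


def rule_matches(headers: str) -> bool:
    """Return True if any header line satisfies the condition."""
    return PATTERN.search(headers) is not None


if __name__ == "__main__":
    samples = [
        "Subject: Analytics eagereyes.org",
        "Subject: Fwd: Analytics eagereyes.org",
        "Subject: Re: Analytics eagereyes.org",
    ]
    for sample in samples:
        print(rule_matches(sample), sample)
```

Note that the unescaped dot in `eagereyes.org` matches any character, so the condition is slightly more permissive than it looks. Writing `eagereyes\.org` would tighten it.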
Any program can be run in a procmail rule, just as if it were run from the shell. Arguments can also be passed, like in this case the directory for storing the data – which might be different for different data sources.
Testing the procmail rule can be done by running procmail and passing it an email through stdin. Here I am using the mbox file created by the ancient mail program. More user-friendly ones (like pine) are usually available, and can also export emails.
procmail < mbox
If the rule works and the script does its thing correctly, the attachment will now be a file in the right directory. The next step is to send a new email with attachment to the address and see if things work. If they do, the email will not show up in the inbox, only the attachment will appear in the directory (procmail is only run automatically on new emails).
The Attachment Extractor Script
The script below is a very simple Python program that extracts any attachments from an email and stores them in a directory. It makes good use of the excellent libraries that are included in Python, but performs no error checking whatsoever. If the email is not well-formed, or the attachment is not base64-encoded (which is very unusual in this day and age, however), it will silently fail. Despite that, it works well for its purpose, and lets me collect data that would otherwise require a lot of manual effort.
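The original script is not reproduced here. What follows is a minimal sketch of the same idea using Python's standard `email` package, not the author's code. It assumes the message arrives on stdin and the target directory is the first command-line argument, as in the procmail rule; the function name `extract_attachments` is hypothetical.

```python
import email
import os
import sys
from email import policy


def extract_attachments(raw: bytes, target_dir: str) -> list:
    """Save every attachment in the raw message to target_dir; return the paths."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    os.makedirs(target_dir, exist_ok=True)

    saved = []
    for part in msg.walk():
        filename = part.get_filename()
        if not filename:
            continue  # body text or a multipart container, not an attachment
        # decode=True undoes the transfer encoding (base64, quoted-printable).
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # basename() drops any directory components in the attachment name.
        path = os.path.join(target_dir, os.path.basename(filename))
        with open(path, "wb") as out:
            out.write(payload)
        saved.append(path)
    return saved


if __name__ == "__main__" and len(sys.argv) > 1:
    # procmail pipes the message to stdin; the directory is the first argument.
    extract_attachments(sys.stdin.buffer.read(), sys.argv[1])
```

Like the script described above, this sketch does very little error checking. A file with the same name as an earlier attachment is simply overwritten, which may or may not be what you want.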
Design Workshop Questions
Jeff Heer asked me to talk more about the Design, Vision, and Visualization workshop at VisWeek, so here is a list of questions we came up with. While we were not able to discuss them at great length, I think they're very valid, and might lead to a better understanding about how to connect the design and visualization worlds.
The list is based on my notes and what I remember from the discussion, but I also filled in some additional thoughts. Feel free to add and/or disagree in the comments.
- How do designers work? What do they actually do?
- Few people in visualization know what designers actually do, how they work, how they discuss their work, etc. We need to convince people that design is not just about being artsy or making things pretty. A potential problem is also that visualization researchers might not want to abandon their overall designs, but only expect a designer to make things look better. An understanding is therefore needed of what each side can bring to the table, and of which part of the work being discussed is the actual subject of a discussion or collaboration.
- How can we connect with designers?
- The incentives that bring researchers to a conference do not work for designers, and vice versa. It's not easy to set up a forum where the two communities can meet. The best bet right now is to find somebody to work with at your own institution, though there are some roadblocks there, as well (see the other points).
- When do we need to go to first principles, perception, etc.?
- Designers can't answer all our questions, and often they don't know why something works or should be done a certain way. Sometimes it may be necessary to go back to the basics, perform experiments, etc., to establish how (and if) a particular technique works.
- What are the properties/characteristics of data?
- When designing a visualization, we have to take the characteristics of the data into account. Or something. Frankly, I don't remember what this point was about.
- What is the visual language of the users?
- Some fields have well-established visual languages, even if those may not always be optimal. Colors have a certain meaning in chemistry for example, and any attempt to color atoms differently is bound to fail. In some cases, the conventions are not that strict, however, and users can be convinced to abandon them for more effective designs. The question is when that is the case, and how to present a radical new design so that it has a chance of being accepted.
- What are the expectations about good and/or effective design?
- Design includes being useful, but that is not all there is to it. Depending on the people working together, the definition of what the goal is can vary quite a bit. It really boils down to an actual collaboration and understanding of the other's motivation and way of working, not just throwing something over a wall and expecting what gets thrown back to be perfect.
- Perceptually correct vs. aesthetically pleasing?
- Similar to the point above, a design that is based on perceptual principles might not be the most aesthetically pleasing, and vice versa. The question is what to insist on and what to accept even if it contradicts things we know.
- What are fundamental skills for anybody in visualization?
- This was the most intriguing question in my humble opinion: What does everybody in visualization need to know? It still strikes me as odd how few people have any kind of background (or even just interest) in design, photography, or art. I think we need to learn a lot more about visual literacy and acquire more of an appreciation for how other disciplines communicate visually to improve our work. Others may well disagree, though ...
Image by Daniel Skrobak, used under creative commons.
Swing States
I always wondered how much those swing states actually swing. So I looked at the results of presidential elections over the last 100 years, and it's not easy to determine which states actually are swing states from just looking at their history. Rather, there seems to be a pattern of relative stability for a few election cycles, and then big, sweeping wins for one side.
The data for this chart was collected from the U.S. National Archives and Records Administration, which unfortunately does not provide this in a very usable format. The format also switches at some point, making things more work than necessary. I had originally collected the data in a year-by-state matrix, which turned out to be a poor choice. I used Hadley Wickham's reshape package for R to "melt" the data into a more useful format. That data was then fed to Tableau to produce this chart.
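The "melt" step can be sketched in a few lines of plain Python. This is not the reshape package and not the code used for the chart; it only illustrates the transformation from a year-by-state matrix to one row per (year, state) pair. The function name `melt` is borrowed from the package for readability, and the state and party labels in the usage note are made-up placeholders, not election results.

```python
def melt(matrix: dict) -> list:
    """Turn {year: {state: party}} into a list of (year, state, party) rows.

    Cells holding None (for example, a state that did not yet vote in
    that year) are dropped.
    """
    rows = []
    for year in sorted(matrix):
        for state in sorted(matrix[year]):
            party = matrix[year][state]
            if party is not None:
                rows.append((year, state, party))
    return rows


if __name__ == "__main__":
    # Placeholder values only; these are not real results.
    wide = {
        2000: {"StateA": "X", "StateB": None},
        2004: {"StateA": "X", "StateB": "Y"},
    }
    for row in melt(wide):
        print(row)
```

The long format has one observation per row, which is what tools like Tableau tend to expect: year, state, and winner each become a column that can be mapped to an axis or a color.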
I chose a red that is quite a bit brighter than the blue to make the two colors easier to differentiate. Blue, of course, represents Democrats, and red Republicans. There is also the Progressive Party of 1912 (not to be confused with the Progressive Party of 1924, but I still gave them the same color), as well as the "Dixiecrats" who only ran in 1948. They were all so short-lived that I didn't pay a lot of attention to them, but you can find them in the chart if you look closely.
You can see big, sweeping wins where one party takes over from the other, as in 1932, 1964, and 1968. Bear in mind though that each dot represents a state, not a fixed fraction of either the popular or the electoral vote (which can differ quite a bit, too). I have ideas for how to show these things, but haven't been able to do them in Tableau or Excel, and just don't have the time right now to write a program for this.
What is also interesting to see is how recently some states (like Alaska and Hawaii) became proper parts of the US, and that even "contiguous 48" states like Arizona and New Mexico were not represented 100 years ago. The District of Columbia is the only "state" to never change color, but there are a few that have fairly consistent records, like Vermont and Massachusetts.
The goal was to make a chart that would show the progression of state winners over time. The vertical time axis is not optimal, but due to the large number of states, there really is no other choice. This layout makes it possible to see each year as one unit, and also to follow each state separately (in the large version of the image, anyway).
So this is really more a starting point than a finished visualization. I don't think I really succeeded in showing the crucial structures here, and there is more information to be included (though I did not collect data on the number of electoral votes over time). The data is available below for you to try your hands on. Let me know what you come up with!
Data: Elections_1904-2004.zip
The New York Times Visualization Lab
The New York Times' new Visualization Lab uses IBM's Many Eyes technology. While it provides easy access to a wealth of visualization techniques and the possibility to comment, there is one major difference: only data provided by the NY Times can be used. The kind and quality of that data will determine the success of this new site.
I criticized Many Eyes for not having a business model, but figured that they would be able to survive within an organization as large as IBM. Looks like they had a strategy, after all. Martin Wattenberg has also worked with the NY Times (he had a paper at this year's InfoVis conference), and he has an interest in InfoVis for the Masses.
The big difference between Many Eyes and the NY Times VizLab is that users cannot upload their own data. That means that the offered data will be crucial for the success of this site – if it's not interesting, people won't bother going there. And if the data is coming from online sources (and easy to obtain, like the data that is there right now), there will be little difference between the NY Times site and Many Eyes itself.
But that is where the New York Times can offer a huge value-add: by supplying data that cannot be easily found on the web, but that is collected by (or on behalf of) the NY Times. I'm specifically thinking of data like exit poll results, where usually only a small number of cross-sections are published. It would be excellent to have such data available to find some interesting comparisons of voters based on a number of criteria.
The NY Times name will certainly drive traffic, but to make the site compelling and make people come back, an investment in good data will be needed.
Lessons Learned from Live-Blogging VisWeek 2008
VisWeek 2008 was an interesting set of conferences again. The live-blog is now archived, and here are a few thoughts on blogging a conference. I had a long summary written up, but it was mostly redundant with the live-blog, so it makes more sense to go there. I will write up further things at greater length over the next few weeks.
This was my first experiment with live-blogging, and it was quite interesting. I knew that Twitter's 140-character limit would be too little, but the postings grew a bit longer than I had originally expected. The Microblog box took up most of the visible frontpage, when it was really meant to only fill the top half on most screens. The postings were still fairly superficial, more pointers than descriptions of what the papers were really about.
The posting frequency reflected my level of interest and fatigue: I tend to need a break after three days of conference, which is why postings got sparse on Wednesday. It was also sometimes a challenge to write about the previous paper while listening to the next presentation, and I ended up only talking about one or two per session because of that. There were also some longer sessions that I attended (a workshop and a tutorial) during which I did not post anything.
Writing while listening also didn't give me any time to review and reflect on what I was writing. While that may be the way a lot of blogs work, it's certainly not my preferred way of writing (and this is not a blog, after all ;). I had to go back and correct typos and other mistakes a few times.
I only wrote about what was presented at the conference; I did not read the papers. The presentation certainly makes some things look more exciting than they really are, and may even hurt good work. Hadley commented on one entry that he was not excited about a paper I liked, and I've been contacted about another posting I made where I said that I found a paper less than exciting. Putting out my personal impressions opens the door for criticism, and also corrections.
One last thing I'm going to say about this is how easy it was to build the infrastructure for the live-blog using Drupal's Content Construction Kit and Views. I spent the most time tweaking the design of the box and display of the messages and feed. Setting up the new posting type etc. was really easy once I had figured out how to use Views.
If you missed the conference, you can re-live the drama and excitement in the VisWeek 2008 Liveblog Archive. Also, check out Carlos Scheidegger's visualization, etc. and Alark Joshi's Visualization Blog for more coverage.
Debunking the Cent Smear
A story is making the rounds recently that the Obama campaign has received many contributions with "odd" amounts (i.e., not whole dollars), which is supposedly proof that Obama was being funded by foreign money. Here is a quick look at the data, which shows some interesting patterns, but no evidence of foreign intervention.
The whole story is of course nonsensical: if people were really charging their foreign credit cards, they would still send whole dollar amounts, since amounts are always specified in the target currency. But the much stronger evidence that the argument is nonsense is in the following image (multiples of 10 are colored blue; multiples of 5 that are not multiples of 10 are green). Of the more than two million contributions, almost 94% were whole numbers, so the 0 cents case is not shown below.
As you can see, the distribution is very uneven (unlike what you would expect from the result of currency conversion). Multiples of five (and thus "round" cents) are much more common than values in between. The most common amount, though, is .95 – strange perhaps, but definitely done on purpose. The number .01 stands out (for the winner, presumably), and .08 quite obviously because of the year (I've read of people contributing $20.08 every month and the Obama Store also sells a lot of swag for that amount). "Odd" amounts in between are also explained by a list of cent "attributions" to a variety of blogs – and by rounding (when you buy something and you round the amount up to some nice number, so the difference becomes a contribution).
Interestingly, McCain's data looks quite different. Of the roughly 400,000 contributions, less than 0.2% have fractional parts. The only strong pattern is at .50; most of the others seem rather random.
Getting the Data
Finding this data was much more difficult than expected. The FEC publishes campaign contribution data, and it is possible to download their reports as a large file. It took me a lot of time to finally figure out their horrible COBOL-style file format and be sure (because I thought I was just missing something) that they were only reporting whole dollars. I had to get the actual filing data (at the very bottom of the FTP page) and wade through another horrible format (which also changed over time) to finally get to the data. It is a mystery to me why they only report whole numbers: with the number of contributions, those cents add up.
Thanks to Robert Morton, who pointed me to the right place in a comment below. I have updated the charts with that data, which has changed the overall numbers a bit, but hasn't had an impact on the patterns.
The Chart
The chart was made in Excel this time, because I had trouble getting Numbers to show me the right axis labels. I used the stacked bar chart idea with three columns, two of which were zero in each row. This way, it was easy to get different colors for multiples of 5 and 10. If there is any interest, I can make the parsed data and the Excel file available.
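The three-column trick can be sketched in Python as follows. This is an illustration under assumptions, not the processing used for the chart: amounts are assumed to arrive as decimal strings such as "20.08" (passing floats would invite rounding trouble), and the function names `cents_histogram` and `three_columns` are hypothetical.

```python
from collections import Counter
from decimal import Decimal


def cents_histogram(amounts) -> Counter:
    """Count contributions by their cent value, skipping whole-dollar amounts."""
    counts = Counter()
    for amount in amounts:
        cents = int(Decimal(amount) * 100) % 100
        if cents:
            counts[cents] += 1
    return counts


def three_columns(counts) -> list:
    """Return rows of (cents, tens, fives, other) for cents 1 through 99.

    Exactly one of the three count columns can be non-zero in each row,
    so a stacked bar chart draws each bar in a single series color.
    """
    rows = []
    for cents in range(1, 100):
        n = counts.get(cents, 0)
        if cents % 10 == 0:
            rows.append((cents, n, 0, 0))  # multiples of 10
        elif cents % 5 == 0:
            rows.append((cents, 0, n, 0))  # multiples of 5, but not of 10
        else:
            rows.append((cents, 0, 0, n))  # everything else
    return rows
```

Pasting the resulting rows into a spreadsheet and charting the last three columns as a stacked bar chart gives one color per class, without having to color individual bars by hand.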
VisWeek Live-Microblog now Live!
I just published the first two glimpses in the live microblog from VisWeek. The microblog appears as a box on the EagerEyes frontpage; you can't miss it (unless you're reading this in your RSS reader). There are links at the bottom of the box for more postings and for the RSS feed (glimpses do not appear in the main site feed).
The glimpses have comments enabled, so feel free to click through if you feel the urge to comment on something I write. Also, let me know how you like the microblog, what I should change, etc. I will keep tweaking this throughout the week.
But don't entirely ignore the stuff below the box, either: there will be an update or two there as well, though obviously most of my time and energy will be spent on other things.
Also check out Carlos Scheidegger's visualization, etc. blog for coverage of Vis sessions.
NY Times looks at Presidents and the Economy
The New York Times has an interesting interactive visualization on the influence of presidents on the economy. They ask, Can a President Tame the Business Cycle? The visualization they use is not bad, but would be much more readable if it used a better color scale.
What exactly is a "high" or "low" change? This is how the legend describes the different colors used, and it turns out that "low" sometimes means negative. The color scale as shown in the legend is continuous, but one with just a few values (maybe five on either side of zero) would have been much more readable. Also, it is kind of important if things go up or down, which is impossible to see in this chart. Where exactly is zero on the color scale? The bar chart has no such problem.
The answer here is a diverging color scale with two colors that are different enough so that it is easy to see which side of zero a value is. ColorBrewer has a number of color scales for such (and other) purposes.
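A small sketch of such a scale, with five classes on either side of zero. The hex values are illustrative, chosen to resemble a red-to-blue diverging scheme; for vetted palettes, consult ColorBrewer itself. The function name `diverging_color` and the `max_abs` parameter are assumptions for this example.

```python
# Illustrative ramps, lightest first; each has five classes.
NEGATIVE = ["#fddbc7", "#f4a582", "#d6604d", "#b2182b", "#67001f"]
POSITIVE = ["#d1e5f0", "#92c5de", "#4393c3", "#2166ac", "#053061"]
NEUTRAL = "#f7f7f7"


def diverging_color(value: float, max_abs: float) -> str:
    """Map a value to one of five color classes on its side of zero.

    max_abs is the magnitude that should receive the darkest color;
    values beyond it are clamped to that darkest class.
    """
    if value == 0 or max_abs <= 0:
        return NEUTRAL
    ramp = POSITIVE if value > 0 else NEGATIVE
    fraction = min(abs(value) / max_abs, 1.0)
    index = min(int(fraction * len(ramp)), len(ramp) - 1)
    return ramp[index]
```

Because the sign picks the hue and the magnitude picks the lightness, a reader can tell at a glance which side of zero a value falls on, and the small number of classes makes the legend easy to match against.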
What is good about the graphic is its interactivity and the amount of data: almost 60 years of data, and seven dimensions is quite a bit of stuff to work with. There is also quite a bit of level-of-detail, with a mouse-over tooltip and a way to "drill in" for the bar/line chart.
Of course, it would be great if all of the data they collected for graphics like this were immediately available through their API ...
Live-Microblog from VisWeek (InfoVis/VAST/Vis) 2008
As promised earlier, I will be live-blogging VisWeek 2008, which will take place next week in Columbus, OH. I will mostly attend InfoVis and VAST, with the odd Vis session and workshop thrown in. The live-blog will appear in a box at the top of the frontpage, and there will be a separate RSS feed for these posts. Coverage should start Sunday (October 19) morning, and there will also be pictures.
This is an experiment, and we'll see how much interest there will be. The idea is stolen from Mr. FlowingData, though in contrast I'm planning on actually following through. ;) I will write about the sessions I attend, papers I find notable, and any insights I think are worth sharing.
The plan is to write about 5-10 postings a day, depending on things I find interesting and how much time I have. I call these "glimpses," as a little play on "tweets." The reason I'm not using Twitter for this is that Twitter is just too limiting for any kind of meaningful comment, and I want to be able to post links without that abomination of tinyURL. I'm also still debating whether to activate comments on those glimpses, because they're not really meant to be full postings (i.e., there won't be a teaser and body, only the body) – but that's another thing I can't do with Twitter.
Having said that, I do use Twitter, and I will tweet things when I don't have my laptop with me, or the information does not seem relevant enough for this site.
I will also take pictures, as I have in the past (like at Vis 2004 and 2005, and a few other venues). I will try to post these quickly to Flickr, and then link there from here. Expect pictures especially from the social events, and perhaps a few taken during the day (I generally do not carry my camera with me all the time, especially because I'm not staying at the conference hotel).
To those of you going to Columbus, I hope to meet many of you! And to the rest: I hope you'll at least get an idea of what you're missing from the liveblog ...
Teaser image from the always brilliant xkcd (used under Creative Commons).
The Shaping of Information by Visual Metaphors
In January, my Ph.D. student Caroline Ziemkiewicz told me about an interesting observation she had made: in different papers comparing tree visualizations, treemaps came out as best, worst, or somewhere in the middle. One difference she noticed was how the questions were worded: when a levels metaphor was used, treemaps did badly; a containment metaphor, on the other hand, seemed to favor treemaps. So we decided to investigate – the result will be presented at InfoVis on Monday, October 20.
Containment questions use a metaphor of nesting, e.g., Find the directory that contains the most .png type files. Levels questions, on the other hand, use the more common node-link idea of a tree, e.g., Participants counted the number of levels in the tree.
The "smoking gun" is Table 1 in the paper, which shows a clear correlation between the number of containment questions and the ranking of treemaps. While the sample is small (five papers), the evidence is damning.
This is only the first step in this direction, but it gives pause to the idea that visualization is merely a conduit for information. The design and underlying metaphor of a technique actively shapes the way users understand the visualization, and its compatibility or incompatibility with the user's mental model determines how effective it is. The beauty of this work is that it is actual science: it is based on an initial observation, we set up an experiment to test our initial hypothesis, and we are drawing conclusions from its results.
Caroline is doing a lot of promising work in this direction, which will deepen our understanding of how visualization actually works (on the level of cognition, not just perception), and lead us to better visualization techniques and more effective evaluation.
For more details, you will have to see the talk at InfoVis (on Monday, October 20, in the Design session), and/or read the paper: The Shaping of Information by Visual Metaphors (PDF)
Sightings: Structures Smaller than Light
Proteins are inherently three-dimensional, complex structures. To understand them, we need to simplify them to focus on their main structural components. Jane Richardson has played a key role in the visual language that we use today when talking about proteins: ribbons and spirals. I interviewed her recently for the Sightings column in American Scientist.
If the name and work sound familiar, you may remember her capstone talk at Vis 2006. Jane Richardson is a professor at Duke University, where she is now working on visualizing proteins and atom configurations in virtual environments (among many other things).
The interview is available online: Structures Smaller than Light (PDF)
Popular vs. Electoral Votes Using Stacked Bar Charts
A few days ago, I looked at how the electoral college system amplifies the lead of the strongest candidate in a US presidential election. The way I made the chart (with the help of PhotoShop) created some interesting reactions, and finally led me to what I consider the best way to do it (using stacked bar charts). I also want to respond to a few comments about the kind of chart used and why I think it is the most effective way to show what it does.
My use of PhotoShop may have seemed silly, but I used it to stitch together the screenshots of the different parts of the chart (it is higher than my laptop screen), so it didn't seem so absurd to me. But there are of course much better ways to do this.
Jon Peltier wrote two postings on how to achieve the effect using overlapped bars (which are possible in Excel but not in Numbers), but making sure that the shorter bar is always visible. He also modified the technique to show thinner bars in front, so that the full length of both bars can be seen.
Jock Mackinlay asked me for the underlying data (which I later also added to the original posting) and made a similar chart in Tableau. He uses an interesting trick to add an additional series that is shown in front when the shorter bar would be hidden: if the value "in front" is greater than the one shown behind all other bars, that bar has the same length as the hidden one, otherwise it is zero.
Using Stacked Bar Charts
I'm using Mackinlay's idea to create the chart using stacked bar charts. Stacked bars are quite flexible, and I've used them to prototype a number of visualizations, including the Presidential Demographics applet (the key there was making parts of the bars invisible). They are also available in virtually any program that can draw charts, so this method should work with practically any program.
Here is a version of my table that shows the raw data (name, (popular) winner %, and electoral %) as well as the three columns that are going to be used for the stacked bar chart.
These three columns work like this: bar1 is green, and shows the electoral vote in case it is smaller than the popular vote (and it's zero otherwise); bar2 is blue, and shows the popular vote in both cases (meaning it's the same as the popular vote if bar1 is zero, or it's the difference between the popular and the electoral if it isn't); bar3 is green again, and shows the electoral vote in those cases where that is greater than the popular (i.e., the majority).
The formulas for these three are as follows: bar1: =IF(C2>B2,0,C2), bar2: =IF(D2>0,B2-C2,B2), and bar3: =IF(D2>0,0,C2-B2).
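For readers who prefer code to spreadsheet formulas, here is a minimal Python sketch of the same three columns. It assumes column B is the popular percentage, C the electoral percentage, and D the bar1 column, as the formulas above suggest. The function name and the sample values are made up for illustration and are not taken from the actual table.

```python
def stacked_segments(popular, electoral):
    """Mirror the three spreadsheet formulas for one candidate.

    popular   -- popular vote percentage (column B, assumed)
    electoral -- electoral vote percentage (column C, assumed)
    Returns (bar1, bar2, bar3), the stacked segments from left to right.
    """
    # bar1: =IF(C2>B2,0,C2) -> electoral vote, only if it is not larger than popular
    bar1 = 0 if electoral > popular else electoral
    # bar2: =IF(D2>0,B2-C2,B2) -> the rest of the popular vote, or all of it
    bar2 = popular - electoral if bar1 > 0 else popular
    # bar3: =IF(D2>0,0,C2-B2) -> the part of the electoral vote beyond the popular vote
    bar3 = 0 if bar1 > 0 else electoral - popular
    return bar1, bar2, bar3


# Hypothetical values, not real election results.
amplified = stacked_segments(popular=48, electoral=52)
dampened = stacked_segments(popular=55, electoral=50)
print(amplified)
print(dampened)
```

In both cases the three segments add up to the larger of the two percentages, which is why the stacked version looks the same as the overlapped bars.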
The resulting chart looks like this (again done in Numbers, certainly doable just as well in Excel, Tableau, or Open Office):
The only thing I did in PhotoShop (besides stitching) was to remove the third element from the legend. I also took up Jon Peltier's suggestion to only show the 50% and 100% lines, rather than shade the area behind the lower 50%. That makes for a cleaner chart that is easier to read, and focuses on the things I really wanted to do with this.
Making a Point with a Chart
I made this chart to answer two questions: Were there cases where the electoral vote was less than the popular vote (and which were those)? Which candidates were pushed over the 50% mark by the "amplification" from the electoral college system (and by how much)?
The whole point of this exercise was to make those cases stand out where the electoral vote was less than the popular one, and as I already described in my earlier posting, that was not doable with any other chart I tried. So in a way, this chart makes a point: it guides the viewer's attention to one specific criterion. It is not meant to be a generic chart to compare two series of numbers (that would be better done using pairs of bars).
Reader TV also commented on Jon Peltier's first posting that the chart goes against the convention of the stacked bar chart, in which the blue and green bars would be parts of a total shown by the combined length of both bars. That reading works here too, though, because the green part can be seen as the amplification of the popular vote, so both add up to the effective votes that counted for a candidate.
Almost exactly two years ago, I wrote about the difference between visualization and information graphics being that one murmurs and the other opines. This chart has a clear message, and it is focused on answering these two particular questions. That is why it turned out to be a good idea to not show more than two vertical lines, because reading the precise percentages is just not a priority.
Making such decisions makes a chart more focused, and thus stronger. While we want to provide the reader with the means to see different information in a visualization, I believe that we also need to make a clear point. If we don't do that, the viewer is confused and lost, and is not given a well-defined starting point for his or her own exploration.
A Fisheye Calendar at Yahoo!
What a difference 22 years make! In 1986, George Furnas published his paper, Generalized Fisheye Views, which described what was to become one of the first (and most prominent) focus+context techniques. One of the examples he used was a calendar that showed the current day in most detail, with less space for the surrounding ones. Yahoo! just started an opt-in beta of their new calendar that uses the same idea.
Furnas observed that people tend to represent their immediate environment (whether physically, temporally, or simply the focus of their current work) in more detail than things that are further away. He called that effect fisheye view, after the fisheye lens that is used in photography and that enlarges objects in the center of the image, while compressing towards the edges (that lens also uses a metaphor for its name, which describes the view fish supposedly see when looking straight up).
Among his examples, he used a calendar. Similar images have appeared in the visualization and human-computer-interaction (HCI) literature since, but Furnas was the first to do this (as far as I am aware).
Now 22 years later, Yahoo! has picked that idea up, made it prettier, and made it useful in its new calendar. The regular view looks just like any other monthly calendar, but when you click on a day, it zooms into that day to give it more room. As a side-effect you also get more space for the current week, as well as for the same day of the week in other weeks of that month (this is sometimes considered a bad thing, but at least for the current week it makes a lot of sense). In this view, you can enter new events and change existing ones (click the image below for full size).
While I don't think that this will lead to a revolution in how quickly companies pick up ideas from visualization and HCI research, it is a good thing that this is happening. Perhaps increased competition, expanded flexibility of programs in browsers, and the pressure to appear cutting-edge will turn more such flash-backs into useful products.
The Electoral College and Second Terms
The Electoral College is a key aspect of the US presidential elections. Its mechanics and distribution of electors are crucial for presidential campaigns and determine the so-called battleground states – and possibly also distort the will of the people. I was interested in this last effect, so I did a little analysis.
A presidential election in the US is essentially 51 separate elections (50 states plus the District of Columbia). All but two states have a winner-takes-all system, with Maine and Nebraska using a slightly more differentiated way of splitting up their delegates between the candidates. There are a number of consequences of this that I don't want to discuss in detail here, but what I was interested in was the boost this system gives to the strongest candidate.
There are two aspects to this. First, there is the relative majority: which candidate got the most votes? Splitting this up further, there is the popular vote (how many people voted for a particular candidate) and the electoral vote (how many electors voted for that candidate). My hypothesis was that the percentage of electoral votes the winner got would always be higher than the popular vote.
The other aspect is whether the candidate who wins is the candidate the absolute majority of people (i.e., more than 50%) voted for. In recent elections, with only two candidates from the two big parties, this has become almost synonymous with the previous question – any third-party candidate would only get a minuscule fraction of the popular vote and not a single electoral vote.
A Comparison Chart
So I came up with the following graphic to answer my question. The blue bars show the popular vote, the green ones electoral votes. Since I wanted to compare, I tried out a number of different configurations, but none made it easy to see the instances where the electoral vote would be smaller than the popular vote. So I ended up with a kind of stacking where the longer bar would be "behind" the shorter one. The idea was that the instances with electoral < popular would stand out.
As you can see, there were only three instances where the electoral percentage was lower than the popular one. The boost from the electoral system is quite astounding in many cases, easily adding 30 points and more to the popular vote.
The other thing the chart shows is where a candidate was elected with less than 50% of the popular vote. The shaded area marks the 50%, and you can see that there were quite a few presidents who were pushed across that mark by the electoral college system. The most recent is George W. Bush, but the list also includes Bill Clinton, Richard Nixon, and others.
This is really only meant to provide a data point for the discussion of the merits of the electoral system – the issue is far too complex to be boiled down to a few numbers. But I think this chart illustrates quite nicely what effect the current system has. For another data-centric discussion of how less than 1% change in popular vote could have changed the outcome of many of the past elections, see Mike Sheppard's How close were Presidential Elections?
Second Terms
Since I already had the data (which I scraped from Wikipedia), I got interested in looking at the second terms of presidents (or, in the case of FDR, the second, third, and fourth terms). Would a sitting president tend to gain or lose points? And what is the effect of the electoral college here? The following chart shows this data for presidents who got re-elected.
At first glance, it appears that most re-elected presidents did gain votes, and most of these gains were amplified by the electoral college (the losses, too). There are two notable exceptions, Andrew Jackson and Woodrow Wilson: in these two cases, a gain in one actually translated into a loss in the other. I have no explanation for how this was possible, especially in Wilson's case.
What is missing here is data about sitting presidents who did not get re-elected. But since I was mostly interested in popular vs. electoral, I did not collect this data. I will work on such a comparison for a future posting.
Charting Challenges
What surprised me was how hard it was to produce a good chart for what I considered a simple dataset and question. Putting pairs of bars next to each other was entirely ineffective: there was way too much noise, even with ample spacing between the pairs (which also created a huge chart). Neither Excel nor Numbers would let me specify negative distances between the bars to make them slide behind each other. I'm a bit surprised that this is so difficult; I'm sure I've seen charts with overlapping bars.
So I ended up creating stacked bar charts, with a few additional columns of data to generate the needed numbers. While that wasn't very difficult, it did defeat the point of doing this visually: if I could just look at the sign of the difference between the electoral and popular percentages, why bother with a chart? It still does provide a good way to present the data, especially the amplification of the stronger candidate.
While Numbers doesn't have nearly the power of Excel, I really like its approach to spreadsheets. It also produces much nicer charts, in my humble opinion. What it does not do, however, is let me change the color of an individual element of a chart – I ended up doing those in PhotoShop. Also, while Numbers lets me draw arbitrary shapes, there is no snapping to chart elements, only their outlines. That makes adding information like the 50% shaded area much more difficult than necessary.
While both Excel and Numbers do provide a large variety of chart types and settings, a lot of manual work is still necessary to make a chart really informative. And many things that should be very simple to do in these programs (including such advanced features as histograms) still require a lot of tweaking and the use of tools like PhotoShop.
See also: Presidential Demographics, Presidential Demographics II
The source data for these charts is available.
A better version of the chart using stacked bar charts is also available.
Two Years of EagerEyes
This site turns two today. There have been frantic periods of posting and periods of silence. There have been times when I thought nobody would read this and times when I had more than 50,000 visitors in a day. Here is a bit of history, some thoughts on what the site has accomplished, and what I am planning for the future.
The idea for the site goes back about five years. I wanted to write about visualization and art, and have a platform for a bit of outside-the-box thinking. In August 2004, I registered the domain. The idea came from another website called equaleyes. It had a nice ring to it, but something was missing. When I came up with "eager eyes," I immediately bought the domain name. I still haven't gotten much feedback on whether it sounds good or dorky or weird, but I like it. In any case, I'm stuck with it.
The misfits.
It took me another two years to get the site running. There were some embarrassing early versions with my custom made CMS and nonsensical articles. I also wanted to build something similar to Many-Eyes and Swivel for some time, until I realized that I simply couldn't do that by myself. Another thing that took forever was finding the right content management system. There are simply too many, and I was very picky about things like URLs, caching, etc. I spent way too much time on this, but I'm happy with my choice now.
The rebels.
What is the mission of this site? I guess to put it in as few words as possible, it is to shake up visualization. While this is still a very young field, it already seems to be set in its ways, and I don't think we should be at that point quite yet. In fact, I hope that this field stays alive and flexible for a long time, so it can grow and change. And I don't think we have even begun to understand how visualization works, let alone how we can use it for the most effective communication, representation, and insight.
The troublemakers.
This is not a blog. When people call this site my blog, I usually argue with them. The goal is to organically build a website over time that will have some more or less well-organized information about visualization methods, basics, applications, etc. Part of it is a blog, yes, but that part is filed under the blog category. The other articles do appear in the feed when they are published, but they are meant to have a much longer lifetime than the usual blog entry. They are also longer, better researched, and take a lot more work to put together.
The round pegs in the square holes.
This site is also about original thought and projects, rather than rehashing or pointing at what other people do. Because, let's face it, that is exactly what most blogs do, including a few visualization blogs. I have no interest in that. Of course, this means that I can't update this site every day. Projects like the ZIPScribble Map, the iTunes Store Visualization, the square pie chart redesign, Presidential Demographics, etc. take time. Plus, I also have a day job.
The ones who see things differently.
This site is about passion. I criticize what others do, and I can be very frank in my criticism. But I scare because I care. I want to get my readers' attention, and I want to point out things that I think are wrong. Some of my statements may be harsh, a bit more sweeping than is called for, and sometimes maybe wrong. But among reasonable adults, I think a frank and open discussion must be possible. And just as I am ready to dole out criticism, I am very receptive to what others have to say about my points of view.
They're not fond of rules.
At times, this site has been a bit of an echo chamber. There is the odd comment, but not a lot of discussions have started. I can't believe that the hundreds of people who visit this site every day all agree with what I am writing here. This is doubly true for my regular visitors who subscribe to the RSS or Atom feed. Why don't you say something? Don't be content with mere consumption! Let me know if you agree or disagree. Tell me what I missed. Tell me I'm wrong. Tell me what you think we need to talk about. Let's put all that fancy Web 2.0 technology to work.
And they have no respect for the status quo.
So the goal for the next year will be to make this site more open for discussion, and start a bit more of a conversation. I am not the person for cheap provocation to get discussions going; they have to happen naturally. But by providing the means and perhaps some starting points, I hope to foster more comments and discussions than have happened so far. I have recently changed the settings so comments appear immediately without my approval. That approval step was there after some initial problems with spam, but that is well under control now.
You can praise them, disagree with them, quote them,
I am not planning any radical changes, but there are a few things I want to do. One is a live-microblog from InfoVis and VAST in three weeks. Other things include more interactive visualization applets, more open-source visualization software, and potentially a discussion forum (if the current stream of comments continues). I have also been trying to talk people into contributing articles, but have not been successful so far.
disbelieve them, glorify or vilify them.
This is a non-commercial website. I make it a point not to have advertising, not to post affiliate links, and not to sell anything through this site. This is my naïve little contribution to a better world. I share because I care.
About the only thing you can't do is ignore them.
What has the site achieved? It has certainly helped me get recognized. As egotistical as it may seem, it is a great feeling to email an influential, senior InfoVis person to ask for a list of influences and then be told that he or she knows the site and is happy to oblige. I have also been greeted by strangers as "Mr. EagerEyes" at conferences and been told by others that the site has changed their view of visualization.
Because they change things.
That's a good start, but the next step is to get people to actually act on that, to ask questions and to demand more foundational work in InfoVis, more InfoVis for communication, and a deeper understanding of how it all works. All of that exists in one way or another, but it is not enough. We don't understand our own field, and we need to change that.
The Market Meltdown in Living Color
Images speak louder than words. A lot louder. It would be hard to find a more vivid and impressive visualization of what happened today on the New York Stock Exchange.
If you're wondering what that little green spot is: it's Barrick Gold, with a plus of 4.53%. Good for them.
You can take a look at the mess yourself at SmartMoney's Map of the Market. Or if we're already back to something resembling sanity, click the thumb below for a full-sized screenshot.
Here's another one, thanks to Michael Payne for the link! FinViz looks quite interesting, though their map is a bit overloaded. Looks great, though.
The Next YouTube for Charts: iCharts
There's new competition for Swivel and Many Eyes: iCharts. A good name, to be sure, but will they live up to their promise of being "YouTube for Charts" (a claim Swivel also made in the beginning)? A first look at their website suggests that they likely will not.
iCharts is more similar to Swivel than to Many Eyes, both in their limited choice of charts and because they are an independent start-up (whereas Many Eyes is run by IBM). In comparison to Swivel, their vision is quite limited though ("to bring charts online"), and Swivel also had a clear idea how they would eventually make money from the very beginning. I don't see anything resembling that on iCharts (correction: iCharts wants to offer certain features, like embedding of charts in PDF files, to premium users. That's a start, but I'm not convinced that that will be enough.).
Seymour Duncker, one of the co-founders, talks about "lousy-looking charts" in a TechCrunch50 presentation that they have embedded on their about page, but I don't see how their charts look any better than Excel did many years ago. In fact, the current Excel's charts are a lot prettier, and so are Swivel's. He also claims that there is no good way to create charts online and embed them in web pages, and that is simply not true.
iCharts offers the usual chart types: line charts, bar charts (including stacked), and pie charts (including a way to do concentric donut charts around a pie chart). There is some interaction with mouse-over labels and interactive filtering of axes. The latter is kind of pointless because you can really only zoom in on one axis, and there is no way to gain more insight into the data this way. It is possible to add annotations to the chart for explanations and to point to particular elements. This would be cool if it were possible for viewers to add those to the charts (like sense.us did), but currently they can only leave old-fashioned comments. Charts can of course be embedded in webpages, just like with Swivel and Many Eyes.
iCharts is clearly a very low-budget operation at the moment. Their whole website looks very basic and unfinished, including their staff pictures and the horrific video introduction on their about page. Given the artistic credentials of Tyron Montgomery, another co-founder, one would expect quite a bit more.
Another hint is their domain name: icharts.net (they also own the .org). The .com domain of the same name is owned by a domain squatter. It's not a good sign that they obviously don't have the money to buy that domain. It's not their fault that the domain is taken, or what it is used for, but it reflects badly on them (e.g., when people look for them and find the spam site). And if they become successful, the price for the .com domain will only go up.
They have just opened the site for a public beta, and it is obviously a bit early to tell what they will be able to do. But not only is what I can see right now not exciting, they also fail to present any kind of compelling vision. Bringing charts to the web was a good idea two years ago, but it's been done, and done better than what iCharts currently offers. And if it's only about the same three chart types all over again, I really don't see why the world needs another way to do this.
(Thanks for the link, Jorge.)
See also: Swivel vs. Many Eyes









