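Data journalism

Data journalism (Chinese: 数据新闻; in Taiwan also called 资料新闻学) is a journalistic process in which news stories are produced by analyzing and filtering large data sets. It often draws on open data that is freely available online and processes that data with open-source software.^[1] Data journalism aims to serve the public by helping consumers, managers, and politicians understand recurring patterns and make decisions based on the findings. In doing so, it gives journalists a new role in society.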

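Definition

According to information architect and multimedia journalist Mirko Lorenz, data-driven journalism is a complete workflow made up of the following elements: digging deep into data by cleaning and structuring it, filtering it by mining for specific information, and visualizing it to produce a story.^[2] The process can be extended with further steps so that the results serve individual interests as well as the broader public.

Data journalism trainer and writer Paul Bradshaw describes data-driven journalism in a similar way: data must first be "found", which may require working with tools such as MySQL or Python; then "interrogated", which requires an understanding of jargon and statistics; and finally "visualized" and "mashed" with the help of open-source tools.^[3]

A more results-oriented definition comes from data reporter and web strategist Henk van Ess (2012).^[4] He argues that data-driven journalism enables reporters to uncover untold stories, or to find new angles and complete a story, through a workflow of finding, processing, and presenting data (in any form) with available open-source tools. Van Ess holds that some data-driven workflows lead to products that fall outside the bounds of good storytelling, because the result emphasizes showing the problem rather than explaining it. In his view, a good data-driven production has several layers: it lets readers drill down to the relevant details and find personalized content that matters only to them, while also letting them zoom out to see the big picture.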

Reporting based on data

Telling stories based on the data is the primary goal. The findings from data can be transformed into any form of journalistic writing. Visualizations can be used to create a clear understanding of a complex situation. Furthermore, elements of storytelling can be used to illustrate what the findings actually mean from the perspective of someone who is affected by a development. This connection between data and story can be viewed as a "new arc" that spans the gap between developments that are relevant but poorly understood and a story that is verifiable, trustworthy, relevant and easy to remember.

Data quality

In many investigations the data that can be found may contain omissions or be misleading. A critical examination of data quality is therefore an important layer of data-driven journalism. In other cases the data might not be public or might not be in the right format for further analysis, for example when it is only available as a PDF. Here the process of data-driven journalism can turn into stories about data quality or about institutions refusing to provide the data. Because the practice as a whole is at an early stage of development, examinations of data sources, data sets, data quality and data format are an equally important part of this work.

Data-driven journalism and the value of trust

Based on the perspective of looking deeper into facts and drivers of events, there is a suggested change in media strategies: the idea is to move "from attention to trust". The creation of attention, which has been a pillar of media business models, has lost its relevance because reports of new events are often distributed faster via new platforms such as Twitter than through traditional media channels. Trust, on the other hand, can be understood as a scarce resource. While distributing information is much easier and faster via the web, the abundance of offerings creates costs to verify and check the content of any story, and this creates an opportunity. The idea of transforming media companies into trusted data hubs was described in an article cross-published in February 2011 on Owni.eu^[5] and Nieman Lab.^[6]

Process of data-driven journalism

The process of transforming raw data into stories is akin to refinement and transformation. The main goal is to extract information that recipients can act upon. The task of a data journalist is to extract what is hidden. This approach can be applied to almost any context, such as finances, health, environment or other areas of public interest.

Inverted pyramid of data journalism

In 2011, Paul Bradshaw introduced a model he called "The Inverted Pyramid of Data Journalism".

Steps of the process

In order to achieve this, the process should be split up into several steps. While the steps leading to results can differ, a basic distinction can be made by looking at six phases:

Find: Searching for data on the web
Clean: Process to filter and transform data, preparation for visualization
Visualize: Displaying the pattern, either as a static or animated visual
Publish: Integrating the visuals, attaching data to stories
Distribute: Enabling access on a variety of devices, such as the web, tablets and mobile
Measure: Tracking usage of data stories over time and across the spectrum of uses.

Description of the steps

Finding data

Data can be obtained directly from governmental databases such as data.gov, data.gov.uk and the World Bank Data API,^[7] but also by placing Freedom of Information requests with government agencies; some requests are made and aggregated on websites like the UK's What Do They Know. While there is a worldwide trend towards opening data, there are national differences as to what extent that information is freely available in usable formats. If the data is in a webpage, scrapers are used to generate a spreadsheet. Examples of scrapers are ScraperWiki, the Firefox plugin OutWit Hub, and Needlebase (note: Needlebase was scheduled to be retired on June 1, 2012^[8]). In other cases OCR software can be used to get data from PDFs.
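As a rough illustration of what a scraper does when it turns a web page into a spreadsheet, the sketch below reads an HTML table and writes its cells out as CSV using only the Python standard library. It is not how any of the tools named above work internally; the class `TableScraper`, the function `html_table_to_csv` and the sample page fragment are hypothetical names and data made up for this example.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard a cell left open by broken markup


def html_table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    scraper = TableScraper()
    scraper.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(scraper.rows)
    return out.getvalue()


# Hypothetical page fragment; a real scraper would download the page first.
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Incidents</th></tr>
  <tr><td>North</td><td>12</td></tr>
  <tr><td>South</td><td>7</td></tr>
</table>
"""

print(html_table_to_csv(SAMPLE_HTML))
```

The resulting CSV text can be saved to a file and opened in any spreadsheet program. Real pages usually need extra handling for nested tables, merged cells and inconsistent markup.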

Data can also be created by the public through crowdsourcing, as Henk van Ess showed in March 2012 at the Datajournalism Conference in Hamburg.^[9]

Cleaning data

Usually data is not in a format that is easy to visualize. For example, there may be too many data points, or the rows and columns may need to be sorted differently. Another issue is that, once investigated, many datasets need to be cleaned, structured and transformed. Various tools such as Google Refine, Data Wrangler and Google Spreadsheets^[10] allow uploading, extracting or formatting data.
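The kind of cleaning described here can be sketched in a few lines of Python. The example below is only an illustration under assumed conditions: the column names `region` and `count`, the function `clean_rows` and the sample data are all hypothetical, and the rules (trim whitespace, normalize capitalization, drop rows without a usable number, remove duplicates, sort by value) are one plausible choice among many.

```python
import csv
import io


def clean_rows(raw_csv):
    """Parse CSV text, drop unusable rows, normalize values, sort by count."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(raw_csv)):
        region = (row.get("region") or "").strip().title()
        count_text = (row.get("count") or "").replace(",", "").strip()
        if not region or not count_text.isdecimal():
            continue  # skip rows with a missing name or a non-numeric count
        if region in seen:
            continue  # keep only the first occurrence of each region
        seen.add(region)
        cleaned.append({"region": region, "count": int(count_text)})
    # Largest values first, so the most prominent rows lead the chart.
    return sorted(cleaned, key=lambda r: r["count"], reverse=True)


# Hypothetical messy input: stray spaces, thousands separators,
# a placeholder value, a duplicate and a row without a name.
RAW = """region,count
 north ,"1,200"
South,n/a
east,300
NORTH,1200
,50
"""

for row in clean_rows(RAW):
    print(row)
```

Which rows to drop and which to repair is an editorial decision; silently discarding records, as this sketch does, should be documented whenever the cleaned data is published.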

Visualizing data

To visualize data in the form of graphs and charts, applications such as Many Eyes or Tableau Public are available. Yahoo! Pipes and Open Heat Map^[11] are examples of tools that enable the creation of maps based on data spreadsheets. The number of options and platforms is expanding. Some new offerings provide options to search, display and embed data, an example being Timetric.^[12]
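To make the step from a table of numbers to a chart concrete, the following sketch renders label and value pairs as a horizontal bar chart in SVG, using only the Python standard library. It does not represent any of the tools mentioned above; the function `bar_chart_svg`, its parameters and the sample figures are assumptions made for this illustration.

```python
from xml.sax.saxutils import escape


def bar_chart_svg(data, bar_height=20, max_width=300, label_width=100):
    """Render (label, value) pairs as a horizontal bar chart in SVG markup."""
    # Avoid dividing by zero when the data is empty or all values are zero.
    largest = max((value for _, value in data), default=0) or 1
    height = bar_height * len(data)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{label_width + max_width}" height="{height}">'
    ]
    for i, (label, value) in enumerate(data):
        y = i * bar_height
        width = round(max_width * value / largest)  # scale to the largest bar
        parts.append(
            f'<text x="0" y="{y + bar_height - 6}" font-size="12">'
            f"{escape(label)}</text>"
        )
        parts.append(
            f'<rect x="{label_width}" y="{y + 2}" width="{width}" '
            f'height="{bar_height - 4}" fill="steelblue"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical figures, for illustration only.
SAMPLE = [("North", 1200), ("East", 300)]
print(bar_chart_svg(SAMPLE))
```

The output can be saved as an `.svg` file and opened in a browser or embedded in a page. Dedicated charting tools add axes, legends and interactivity that this sketch leaves out.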

To create meaningful and relevant visualizations, journalists use a growing number of tools. By now there are several descriptions of what to look for and how to do it. The most notable published articles are:

Joel Gunter: #ijf11: Lessons in data journalism from the New York Times, published on Journalism.co.uk (April 16, 2011)^[13]
Steve Myers: Using Data Visualization as a Reporting Tool Can Reveal Story’s Shape, published on Poynter (April 10, 2009, updated March 4, 2011), including a link to a tutorial by Sarah Cohen^[14]

As of 2011, the use of HTML5 libraries based on the canvas tag was gaining in popularity. Numerous libraries make it possible to graph data in a growing variety of forms; one example is RGraph.^[15] As of 2011 there was also a growing list of JavaScript libraries for visualizing data.

Publishing data stories

There are different options for publishing data and visualizations. A basic approach is to attach the data to single stories, similar to embedding web videos. More advanced concepts allow creating single dossiers, e.g. to display a number of visualizations, articles and links to the data on one page. Often such specials have to be coded individually, as many content management systems are designed to display single posts based on the date of publication.

Distributing data

Providing access to existing data is another phase, and one that is gaining importance. Think of the sites as "marketplaces" (commercial or not) where datasets can be found easily by others. Especially if the insights for an article were gained from open data, journalists should provide a link to the data they used so that others can investigate it (potentially starting another cycle of interrogation, leading to new insights).

Providing access to data and enabling groups to discuss what information could be extracted is the main idea behind Buzzdata,^[16] a site using the concepts of social media such as sharing and following to create a community for data investigations.

Other platforms (which can be used both to gather and to distribute data):

Help Me Investigate (created by Paul Bradshaw)^[17]
Kasabi (in public beta as of August 2011)^[18]
Timetric^[19]

Measuring the impact of data stories

A final step of the process is to measure how often a dataset or visualization is viewed.

In the context of data-driven journalism, the extent of such tracking, such as collecting user data or any other information that could be used for marketing reasons or other uses beyond the control of the user, should be viewed as problematic. One newer, non-intrusive option to measure usage is a lightweight tracker called PixelPing. The tracker is the result of a project by ProPublica and DocumentCloud.^[20] There is a corresponding back-end solution to collect the data. The software is open source and can be downloaded via GitHub.^[21]
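The idea of counting views without collecting anything about the viewer can be sketched as follows. This is not PixelPing's code or API; the class `HitTally` and its methods are hypothetical and only illustrate the general approach of keeping per-item counts in memory and handing them to a back end in batches.

```python
from collections import Counter


class HitTally:
    """Count views per key in memory and hand the totals over in batches."""

    def __init__(self):
        self._hits = Counter()

    def record(self, key):
        """Register one view of `key`; nothing about the viewer is stored."""
        self._hits[key] += 1

    def flush(self):
        """Return the counts gathered so far and start a fresh tally."""
        counts, self._hits = dict(self._hits), Counter()
        return counts


# Hypothetical usage: keys identify a dataset or visualization.
tally = HitTally()
tally.record("riots-map")
tally.record("riots-map")
tally.record("war-logs-chart")
print(tally.flush())
```

In a real deployment, `record` would be called by the handler that serves a tiny tracking image, and `flush` would run on a timer to send the aggregated counts to storage. Because only a key and a count are kept, no per-user data is retained.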

Examples

There is a growing list of examples of how data-driven journalism can be applied:

The Guardian, one of the pioneering media companies in this space (see: Data journalism at the Guardian: what is it and how do we do it?),^[22] has compiled an extensive list of data stories; see: All of our data journalism in one spreadsheet.^[23]

Another prominent use of data-driven journalism relates to the release by the whistle-blower organization WikiLeaks of the Afghan War Diary, a compendium of 91,000 secret military reports covering the war in Afghanistan from 2004 to 2010.^[24] Three global broadsheets, namely The Guardian, The New York Times and Der Spiegel, dedicated extensive sections^[25]^[26]^[27] to the documents. The Guardian's reporting included an interactive map pointing out the type, location and casualties caused by 16,000 IED attacks;^[28] The New York Times published a selection of reports that permits rolling over underlined text to reveal explanations of military terms;^[29] and Der Spiegel provided hybrid visualizations (containing both graphs and maps) on topics like the number of deaths related to insurgent bomb attacks.^[30] For the Iraq War logs release, The Guardian used Google Fusion Tables to create an interactive map of every incident where someone died,^[31] a technique it used again for the England riots of 2011.^[32]

External links

DataDrivenJournalism.net
The Data Journalism Handbook
Data Journalism from Scratch (数据新闻学，从零开始; website)

References

[1] Lorenz, Mirko. Data driven journalism: What is there to learn? Edited conference documentation, based on presentations of participants. Amsterdam, the Netherlands. 2010-08-24. Retrieved 2012-11-18. Archived from the original on 2019-06-09.
[2] Lorenz, Mirko (2010). Data driven journalism: What is there to learn? Presented at IJ-7 Innovation Journalism Conference, 7–9 June 2010, Stanford, CA (archived at the Internet Archive).
[3] Bradshaw, Paul (1 October 2010). How to be a data journalist. The Guardian (archived at the Internet Archive).
[4] van Ess, Henk (2012). Gory details of data driven journalism (archived at the Internet Archive).
[5] Archived copy. Retrieved 2011-08-17. Archived from the original on 2011-08-24.
[6] Archived copy. Retrieved 2012-11-17. Archived from the original on 2020-09-19.
[7] World Bank Data API. Retrieved 2012-11-17. Archived from the original on 2016-06-23.
[8] http://needlebase.com/ (accessed February 10, 2012; archived at the Internet Archive)
[9] Archived copy. Retrieved 2012-11-17. Archived from the original on 2021-02-25.
[10] Archived copy. Retrieved 2012-11-17. Archived from the original on 2010-04-21.
[11] Archived copy. Retrieved 2012-11-17. Archived from the original on 2012-11-23.
[12] Archived copy. Retrieved 2012-11-17. Archived from the original on 2019-01-31.
[13] Archived copy. Retrieved 2012-11-17. Archived from the original on 2011-08-22.
[14] Archived copy. Retrieved 2012-11-17. Archived from the original on 2014-09-20.
[15] Archived copy. Retrieved 2012-11-17. Archived from the original on 2021-04-22.
[16] Archived copy. Retrieved 2011-08-17. Archived from the original on 2011-08-12.
[17] Archived copy. Retrieved 2012-11-17. Archived from the original on 2021-04-13.
[18] Archived copy. Retrieved 2012-11-17. Archived from the original on 2019-12-22.
[19] Archived copy. Retrieved 2021-05-18. Archived from the original on 2019-01-31.
[20] Archived copy. Retrieved 2012-11-17. Archived from the original on 2016-12-21.
[21] Archived copy. Retrieved 2012-11-17. Archived from the original on 2020-11-22.
[22] Rogers, Simon (2011). http://www.guardian.co.uk/news/datablog/2011/jul/28/data-journalism (archived at the Internet Archive)
[23] Evans, Lisa (2011). http://www.guardian.co.uk/news/datablog/2011/jan/27/data-store-office-for-national-statistics (archived at the Internet Archive)
[24] Kabul War Diary, 26 July 2010, WikiLeaks (archived at the Internet Archive)
[25] Afghanistan The War Logs, 26 July 2010, The Guardian (archived at the Internet Archive)
[26] The War Logs, 26 July 2010, The New York Times (archived at the Internet Archive)
[27] The Afghanistan Protocol: Explosive Leaks Provide Image of War from Those Fighting It, 26 July 2010, Der Spiegel (archived at the Internet Archive)
[28] Afghanistan war logs: IED attacks on civilians, coalition and Afghan troops, 26 July 2010, The Guardian (archived at the Internet Archive)
[29] Text From a Selection of the Secret Dispatches, 26 July 2010, The New York Times (archived at the Internet Archive)
[30] Deathly Toll: Death as a result of insurgent bomb attacks, 26 July 2010, Der Spiegel (archived at the Internet Archive)
[31] Wikileaks Iraq war logs: every death mapped, 22 October 2010, Guardian Datablog (archived at the Internet Archive)
[32] UK riots: every verified incident - interactive map, 11 August 2011, Guardian Datablog (archived at the Internet Archive)
