Digital Transformations in the Arts and Humanities: Big Data Workshop

I spent today at the fascinating AHRC Big Data workshop:
http://www.ahrc.ac.uk/News-and-Events/Events/Pages/Big-Data-Workshop.aspx

I made notes of what I saw as the headline issues, relating to the forthcoming funding call and what the AHRC considers of interest in the context of Big Data. The workshop was intended to influence the call. Herewith my evolving summary:

Morning Session

Emma Wakelin

  • Likely to be more focus on infrastructure via AHRC in coming years
  • £4M new call for capital research projects around Big Data: not necessarily kit
  • The main driver is an asset that will be produced – something tangible that will extend beyond the project, e.g. a toolkit or an exhibition of digital artwork
  • Call released next week (week of 1 July 2013)
  • Two strands – small and exploratory (up to £100k) and bigger (up to £600k)
  • Proliferation of born-digital information, e.g. social media, purchasing and location data

Andrew Prescott

Working definitions of Big Data at the workshop:

  • Big data exceeds the capacity of the desktops and networks you have
  • Data is so large that existing methods don’t work
  • Gartner – high volume, velocity and variety

  Key issues:

  • Hypothesis vs data-driven research
  • How valid are predictive and probabilistic techniques in arts and humanities research?
  • Data quality issues – do we lose any sense of the context and stratigraphy of the data?
  • Danger of thinking data=truth
  • Arts and Humanities scholars have been dealing with the complexities and uncertainties of large datasets for a long time, stimulating new technologies in the process. In his introduction Andrew Prescott showed and discussed a range of AHRC-relevant Big Data examples:
  • e.g. a letter from Gladstone to Disraeli – there are 160,000 political and literary papers associated with Gladstone;
  • e.g. the George W Bush Presidential Library – 200 million emails and 4 million photographs;
  • e.g. the A Thousand Words project at the Texas Advanced Computing Center, which demonstrates the need for some investment in equipment;
  • e.g. mapping means and enroller in the context of linguistics;
  • e.g. census information;
  • e.g. audio visual content;
  • e.g. visiblearchive.blogspot.com and e.g. mtchl.net/cex at the University of Canberra;
  • e.g. legal records www.aalt.law.uh.edu
  • e.g. artistic responses to Big Data such as The Obelisk (2012) at the Open Data Institute, including the potential of artistic responses to influence novel interfaces;
  • e.g. Arts and Humanities interests in, and interactions with, causation vs correlation meta-questions and applications such as prediction;
  • e.g. sentiment analysis – the Asia Trend Map, www.asiatrendmap.jp

Tim Hitchcock

His talk introduced the following examples:

  • oldbaileyonline.org
  • www.londonlives.org
  • www.connectedhistories.org
  • Locating London’s Past – www.locatinglondon.org
  • These were all about creating bodies of material to be interrogated

Recently Tim Hitchcock has disconnected the data underlying these resources from their representation via websites. This is designed to support the longevity of the data and to respond to ever-changing web design styles and technologies; a sustainable model is needed for revisiting them. He is now analysing the data, e.g. the distribution of trial lengths in words via word counts, including references to murders in the Old Bailey Online. This allowed the identification of an interesting outlying cluster related to plea bargaining.
He has also started mapping coarse-grained semantics onto these resources, e.g. relationships between words associated with violence and specific types of trial at different times. He then went on to predicting trial outcomes from the record of the trial, and relating this to changing legal and related contexts. He is also interested in how you map texts onto their topographic or topological location – for example, mapping word occurrences from legal documents onto streets, which provides a point of reference for how an urban space might work. We should have increasing access to context, e.g. the idea of the macroscope by Börner. He concluded by emphasising the need for Arts and Humanities scholars to look at the original through the lens of Big Data, e.g. The Real Face of White Australia project.
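As a sketch of the word-count analysis described here, the snippet below measures trial length as a simple word count and flags unusually short records – the kind of outlying cluster that pointed to plea bargaining. The transcripts and the threshold are invented for illustration; this is not the project's actual pipeline.

```python
from statistics import mean

# Hypothetical trial transcripts; the real analysis ran over the
# full Old Bailey Online corpus.
trials = {
    "t1": "the prisoner was indicted for stealing a silver watch",
    "t2": "guilty",
    "t3": "the jury heard three witnesses describe the assault at length "
          "before returning a verdict of not guilty",
}

# Trial length measured simply as word count.
lengths = {tid: len(text.split()) for tid, text in trials.items()}

# Flag unusually short trials (an illustrative threshold: less than
# half the mean length), the sort of outlying cluster of very short
# records that suggested plea bargaining.
avg = mean(lengths.values())
short = [tid for tid, n in lengths.items() if n < avg / 2]
print(lengths, short)
```

Real trial records would be read from files, but the shape of the analysis – count, compare against the distribution, inspect the outliers – is the same.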

Michael Magruder

His practice centres on remixing, remediating and translating Big Data rather than making it, covering both curated and realtime data:

  • Working in virtual spaces
  • Use of representative media such as screens and browsers
  • Bringing the born digital off the screen by immersing the spectator
  • Displaying within traditional art artefacts – sculpture, painting in galleries
  • His artistic analysis of Big Data has focused on interdisciplinary collaboration.
Case studies:

Living media repositories: Data Flower (VRML + Java, 2010) generates flora, though not in the sense of generative architecture. He prefers non-deterministic methods based on code plus a changing environment, using A-life models textured with realtime samples of Flickr images tagged “flower”.

Scientific concepts and contexts:

Data_sea v1.0 (VRML + Java, 2009) links broadcast media and astronomy by visualising the “radiosphere” – the spread of radio broadcasts through the universe – and looking at where these signals intersect with exoplanets. The work took abstract information and provided a visceral experience. He is now looking at how to visualise ATLAS detector data from CERN, using artistic responses as a means to assist data analysis through visualisation.

Data archives:

Extending the idea of artwork as interface and uncovering hidden narratives e.g.

  • BBC World Service Radio Archive (prototype) http://worldservice.prototyping.bbc.co.uk
  • (in) remembrance [11-M] – data mining blogs, torrent sites, user generated information, news to compile an archive; issues of the horrific, personal, emotive data underlying the archive and its expression.
  • Digital ethics is a key component:
  • Access
  • Control
  • Exploitation
  • Manipulation
  • Ownership
  • Privacy
  • Rights
  • Subversion
  • Suppression
  • Referred to Big Data in the context of the Holocaust

Round table breakout session

I am afraid that, looking back, these notes are rather random; hopefully there is something of value in there. They need a bit of text mining :-)

  • MOOCs might be an interesting area to examine in the context of Big Data [note: section about this in Big Data book]. This led to a discussion about AHRC and education.
  • I thought that knowing when to switch from Hadoop-based to structured, semantic-web-based analyses is a crucial aspect to communicate – again I raised the @Viktor_MS example of MasterCard data processing
  • Intellectual property to consider – including our own
  • To what extent can provenance of information be derived from the data, in addition to encoded via e.g. PROV?
  • AHRC is interested in Big Data ethics issues
  • What, if any, are the new questions that Big Data poses in an AHRC context? Is Big Data in fact anything new for AHRC scholarship and is the causation/ correlation distinction illusory/ overblown? What do archaeological correlation questions look like compared to causation ones?
  • Big Data exposes the pattern for human interpretation – it doesn’t always provide the answer by doing the pattern recognition itself
  • Really important to articulate the implications of a correlation focus
  • Capturing all interactions with AHRC-funded archives could be a very significant step forward in learning about AHRC questions and means for examining them
  • Varying temporal resolution poses interesting challenges for data aggregation
  • AHRC is interested in the implications of the kinds of data that may be being gathered about us, in addition to the kinds of data we are generating for AHRC-related research.
  • Do the imperfections of AHRC-type data preclude Big Data approaches? E.g. what is the archaeological or historical equivalent of mining Twitter? Do we have enough data that we can mash up to apply the same methods? Could it be that Arts and Humanities problems are perfect for Big Data approaches because they are messy and difficult to structure?
  • Will humanities scholars of the future have anything to say when everything is life-logged?

Afternoon

Farida Vis

View the slides from the presentation on SlideShare.

Where do images fit in the era of Big Data? (@flygirltwo) She discussed #readingtheriots and the 2.6M tweets from 700,000 accounts sent during the riots (donated by Twitter), focusing on the role of rumours, the role of bots, etc. Did incitement take place? No – e.g. #riotcleanup. What was the role of different actors on Twitter? (I was reminded here of work by @raminetinati.)

What is the role of data visualisations, and how do we critically interrogate them? Is it dangerous for a visualisation to help us understand (or think we understand) complex information? Images are not looked at a great deal in social media analysis. [One example that occurred to me was http://eprints.soton.ac.uk/352460]. Fascinating analysis by Farida of types of image sharing, e.g. reuse of Google Streetmap data. You can see content being shared through different channels, e.g. high-quality camera footage going out via Flickr and smartphone images via Twitpic. She is also interested in deleted content, e.g. an image of someone with looted material that was deleted by the person but had already proliferated online.

People were also taking pictures of their TV screens as they watched the riots and sharing these online; journalists joined in, and some people even pretended to be at the riots by sharing photos from TV. There was also a lot of altered material, e.g. Tottenham presented as a war zone using Tour of Duty game imagery. The quality of the imagery mattered: blurriness, for instance, reinforced the perceived veracity of an image. There is a really interesting aspect here of collective viewing practices, referencing John Berger (1992) – “I have decided that this image is worth sharing”. Direct visualisation rather than abstraction, e.g. Lev Manovich, and reflecting on previous ideas, e.g. Aby Warburg’s Mnemosyne. What is the relationship between the algorithm and visibility, e.g. EdgeRank making an image visible?

Storyful’s guidelines for social media image verification ask, e.g., who the photographer is and whether the image has been altered. Always try to get the sequence of images around the specific image. TinEye (reverse image search) would also be a possibility.
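Reverse image search of the kind TinEye offers typically rests on perceptual hashing: images that survive re-encoding or mild edits hash to nearby values, so reuse can be spotted even when files differ byte-for-byte. The toy sketch below implements a difference hash over hand-made "grayscale" grids; the pixel values are invented for illustration, and real systems work on resized versions of actual images.

```python
def dhash(pixels):
    """Difference hash: one bit per horizontally adjacent pixel pair,
    set when the left pixel is brighter than its right neighbour."""
    bits = []
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits.append(1 if left > right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))

# Two hypothetical 3x4 grayscale grids; the second is the first with all
# pixels slightly brightened, as re-encoding or mild edits might cause.
original = [[10, 20, 30, 25], [5, 50, 40, 60], [80, 70, 20, 10]]
altered  = [[12, 22, 31, 27], [7, 52, 41, 62], [82, 72, 22, 12]]

# Brightness gradients are unchanged, so the hashes match: a small
# Hamming distance suggests the same underlying image.
distance = hamming(dhash(original), dhash(altered))
print(distance)
```

Because the hash encodes only relative brightness between neighbours, a uniformly brightened copy produces an identical hash, which is exactly the robustness verification workflows rely on.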

Keith May

Providing an overview of the existing Big Data landscape in archaeology:

  • Digital archiving
  • GIS and remote sensing as early big data
  • Semantic web and linked data
  • Use of CIDOC CRM
  • Recent projects with AHRC funding e.g. STAR, STELLAR, HESTIA2, SENESCHAL


Mark Flashman

  • Introducing the World Service Radio Archive and crowd annotation in linked data e.g. http://worldservice.prototyping.bbc.co.uk
  • Also some work on speaker recognition – finding voices within the audio, e.g. From Our Own Correspondent, where voices are allocated a URI. They were able to process three years of audio (50,000 items) and generate all the tags in less than two weeks for less than $10,000.
  • Also doing work experimenting with visualisation of these data

Adam Farquhar and James Baker

Talking about massive digital resources and activities in the British Library:

  • 19th-century books: 86k books, OCRed
  • 19th-century newspapers: 2m pages, 68% OCR accuracy
  • British National Bibliography
  • Personal digital archives: Hamilton, Maynard Smith, Wendy Cope
  • Wikipedia e.g. @generalising who gave excellent workshop @sotonDH recently
  • Also mentioned labs.bl.uk competition that closes tomorrow (26/06/2013)
  • And many more

Dominic Oldman

Talking about ResearchSpace – www.researchspace.org
 
e.g. co-referencing through context such as concepts, names and places

Dan Pett

Talking about the Portable Antiquities Scheme – finds.org.uk
 
900,000 objects, each with 200 pieces of metadata; 400,000 images; and 360 research projects. All data are linked through DBpedia, Ordnance Survey URIs, GeoNames etc., and are about to be released as CIDOC CRM.
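The URI linking described here can be illustrated with a couple of hand-written N-Triples connecting a find record to an external gazetteer. The record ID and URL pattern below are made up for the example; the GeoNames URI for London is real, but the vocabulary terms are assumptions for illustration and this is not the scheme's actual output.

```python
# A hypothetical find record; the ID and record URL pattern are invented.
record_uri = "https://finds.org.uk/record/EXAMPLE-0001"  # made up
place_uri = "http://sws.geonames.org/2643743/"           # GeoNames: London

# Two N-Triples: a note on the object, and a link to the GeoNames place.
triples = [
    f'<{record_uri}> <http://www.w3.org/2004/02/skos/core#note> "coin" .',
    f"<{record_uri}> <http://www.geonames.org/ontology#locatedIn> "
    f"<{place_uri}> .",
]
print("\n".join(triples))
```

Publishing records as triples like these is what lets a find be joined up with anything else – DBpedia entries, Ordnance Survey places – that uses the same URIs.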


Valerie Johnson

Example National Archives datasets include the UK government web archive. They are very interested in Big Data applications to their data, in improving searchability, and in how you deal with poor-quality (meta)data. They are also interested in conceptual issues, such as whether this is changing archives, research methods, etc.

Torsten Reimer

@torstenreimer talked about JISC support for Big Data, e.g. JANET, data centres, digital repositories, technical advisory services (e.g. cloud and grid computing) and tool development. He mentioned the re-launched website: www.jisc.ac.uk. Key message: think big about data – not about big data, i.e. be driven by the research. And if you do have a technical need, new activities from JISC are being planned now and they would welcome suggestions.

Break out session two

I took down the following key issues discussed in the break-out session:

  • AHRC Collaborative skills development activity – there will be a focus on quantitative methods, and also more generally skills around new technologies and new ways of working, e.g. digital literacy.
  • How can the AHRC support more rapid development of effective multi- and inter-disciplinary working? Very positive discussion about sandpits.
  • One legacy of the Big Data investment could be a repository of training materials.
  • What measures of success? E.g. the toolkit for measuring success produced by the OII – TIDSR – http://microsites.oii.ox.ac.uk/tidsr
  • Use of AHRC-funded resources in teaching would be a good thing to support, e.g. examples integrated as part of MOOCs.

Bill Thompson

Final breakout session

I didn’t take many notes from the final breakout but you can get the gist from the tweeted images. The key questions for the session were:

  • What can big data research do to benefit arts organisations?
  • What are the challenges of working with arts organisations in context of big data? What are the barriers?

Main things I noted as it went along were:

  • More funding via R&D fund (I think) to support these kinds of interactions will be announced next week