The Dash Partners Meeting

This past Thursday, 30 University of California system librarians, developers, and colleagues from nine of the ten campuses assembled at UCLA’s Charles E. Young Library for a discussion of the Dash service. If you weren’t aware, Dash is a University of California project to create a platform that allows researchers to easily describe, deposit and share their research data publicly. The group assembled to talk about the project’s progress and future plans. See the full agenda.

Introductions & Expectations

UC Curation Center (UC3) Director Trisha Cruse kicked off the meeting by asking attendees to introduce themselves and describe what they want to learn during the meeting. Responses had the following themes:

  • Better understanding of what the Dash service is, how it works, and what it offers for researchers.
  • Participation ideas: how the campuses can work together as a group, and what that work looks like.
  • How we will prioritize development and work together as a cohesive group to determine the trajectory of the service.
  • An understanding of how the campuses are implementing Dash: how they plan to reach out to faculty, how the service should be talked about on the campus, what outreach might look like, how this service can fit into the overall research infrastructure, and campus rollout/adoption plans.
  • Future plans for the Dash service.

Overview of the Dash Service

The team then provided an overview of the Dash service, demonstrating how to log in, describe, and upload a dataset to Dash. Four campus instances of Dash went live (beta) on Tuesday 23 September, and campuses were provided with instructions on how to help test the new system. Stephen Abrams covered the technical infrastructure of the Dash service, describing the relationship between the Merritt repository, the EZID identifier service, the DataONE network, and each of the campus Dash instances (slides).

Yours truly followed with a description of DataONE Dash, a unique instance of the service that will replace the existing DataUp Tool (slides). This instance will be available to anyone with a Google login, and all data submitted to DataONE Dash will be in the ONEShare repository (a DataONE Member Node) and therefore discoverable in the DataONE system. Emily Lin of UC Merced pointed out that some UC Dash contributors might also want their datasets discoverable in DataONE; an enhancement was suggested that would allow UC Dash users to check a box, indicating they would like their work indexed by DataONE.

Stephen then discussed the cost model that is pending approval for Dash (slides). This model is based on recovering the cost for storage only; there is no service fee for UC users. The model indicates that UC could provide researchers, staff, and graduate students 10 GB of storage in Dash for a total of $290,000/year for the entire system. Sharon Farb of UCLA suggested that we determine what storage solutions are already in place on the various campuses, and coordinate our efforts with those extant solutions. Colleagues from UCSF pointed out that budgets are tight for research labs, and charging for storage may be a significant hurdle to their participation. They need a concrete answer regarding costs now; options may be for each campus to pay up front, or for the UC Office of the President to pay for the system. Individual researcher charges would be the responsibility of each campus; CDL has no plans to take on that responsibility.

I followed Stephen with an overview of data governance in Dash (slides). Dash will offer only CC-BY for UC researchers; DataONE Dash will offer only CC-0. The existing DataShare system at UCSF (on which Dash is based) uses a contract (i.e., a data use agreement); however, this option will not be available moving forward since it inhibits data reuse and complicates Dash implementation. The decision to use CC-BY for Dash is based on conversations with UC General Counsel, which is currently evaluating the UC Data Policy. The UC Regents technically own data produced by UC researchers, which complicates how licenses can be used in Dash.

Development Contributions

Marisa Strong then described how campuses can get involved in the development process. She identified the different components of the Dash service, which include three code lines (all in GitHub under an MIT license):

  1. dash-xtf, which houses the search and browse functionality;
  2. dash-ingest, the rails client for ingest to Merritt; and
  3. dash-harvester, a python script for harvesting metadata.

Instructions on how to contribute code are available on the Dash wiki, including how to set up a local test environment.

Matthew McKinley from UC Irvine then described their group’s development efforts in working on the Dash code lines to implement geospatial metadata fields. He described the process for forking the code, implementing the new feature in a local branch, then merging that branch back into the main code line via a pull request.

Plans for Development with Requested Funding

UC3 has submitted a proposal to the Alfred P. Sloan Foundation, requesting funds to continue development of the Dash service. If approved, the grant would fund one year of development focused on the following:

  • Streamlined and improved user interface / user experience
  • Development of embedded widgets for deposit and search functionality in Dash
  • Generalization of Dash protocols so they can be layered on top of any repository
  • Expanded functions, including parsing spreadsheets for cleaning and best practices (similar to previous DataUp functionality)
  • Support for more metadata schemas, e.g., EML, FGDC

This work would happen in parallel with the existing Dash application, allowing continuous service while development is ongoing. Declan Fleming of UCSD asked whether UC3 efforts would be better spent using existing infrastructures and tools, such as Fedora. The UC3 team said that they would like to talk further about better possible approaches to the Dash system, and encouraged attendees to share ideas prior to the start of development efforts (if funded).

Dash Enhancements: Identification & Prioritization

The group went through the existing enhancements suggested for Dash, available on GitHub Issues. There were 18 existing enhancements, and the group then suggested an additional 51. Attendees then broke into three groups to prioritize the 69 enhancements for future development. Enhancements that floated to the top included:

  • embargoes (restricted access) for datasets
  • metrics/feedback for data depositors and users (e.g., dataset-level metrics)
  • integration with tools and software such as GitHub, ResearchGate, R, and eScholarship
  • improvements to metadata, including ORCID and FundRef integration

This exercise is only the beginning of the process; the UC3 group plans to tidy up the list and re-share with the group after the meeting for continued discussion. This process will be documented on GitHub and via the Dash listserv. Stay tuned!

Next Steps & Wrap-up

The meeting ended with a discussion about how the campuses would stay informed, what contributions each campus might make to Dash, and how the cross-campus partnership should take shape moving forward. Communication lines will include the Dash Facebook page, Twitter account (@UC3Dash), and the GitHub page. Trisha facilitated a final around-the-room, where attendees could share final thoughts. Common thoughts included excitement for the Dash service, meeting campus partners and hearing about development plans moving forward.

The UCLA campus as it appeared in 1929. Enrollment was 6,175. Contributed to Calisphere by UC Berkeley.


DataUp is Merging with Dash!

Exciting news! We are merging the DataUp tool with our new data sharing platform, Dash.

About Dash

Dash is a University of California project to create a platform that allows researchers to easily describe, deposit and share their research data publicly. Currently the Dash platform is connected to the UC3 Merritt Digital Repository; however, we have plans to make the platform compatible with other repositories using protocols such as SWORD and OAI-PMH. The Dash project is open-source and we encourage community discussion and contribution to our GitHub site.
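For readers unfamiliar with OAI-PMH, here is a rough sketch of what a harvesting request against a repository looks like. This is only an illustration of the protocol, not Dash code: the base URL, the sample response, and the function names are made up for the example. Only the `ListRecords` verb, the `metadataPrefix` parameter, and the XML namespace come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint.

    base_url is a hypothetical repository endpoint supplied by the caller.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Return the header identifiers found in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iter(OAI_NS + "identifier")]


# A hand-written sample response (illustrative only, not real repository output).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://repo.example.org/oai"))
    print(record_identifiers(SAMPLE_RESPONSE))
```

A real harvester would also need to handle resumption tokens and error responses, which this sketch leaves out.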

About the Merge

There is significant overlap in functionality for Dash and DataUp (see below), so we will merge these two projects to enable better support for our users. This merge is funded by an NSF grant (available on eScholarship) supplemental to the DataONE project.

The new service will be an instance of our Dash platform (to be available in late September), connected to the DataONE repository ONEShare. Previously the only way to deposit datasets into ONEShare was via the DataUp interface, thereby limiting deposits to spreadsheets. With the Dash platform, this restriction is removed and any dataset type can be deposited. Users will be able to log in with their Google ID (other options being explored). There are no restrictions on who can use the service, and therefore no restrictions on who can deposit datasets into ONEShare, and the service will remain free. The ONEShare repository will continue to be supported by the University of New Mexico in partnership with CDL/UC3. 

The NSF grant will continue to fund a developer to work with the UC3 team on implementing the DataONE-Dash service, including enabling login via Google and other identity providers, ensuring that metadata produced by Dash will meet the conditions of harvest by DataONE, and exploring the potential for implementing spreadsheet-specific functionality that existed in DataUp (e.g., the best practices check). 

Benefits of the Merge

  • We will be leveraging work that UC3 has already completed on Dash, which has fully-implemented functionality similar to DataUp (upload, describe, get identifier, and share data).
  • ONEShare will continue to exist and be a repository for long tail/orphan datasets.
  • Because Dash is an existing UC3 service, the project will move much more quickly than if we were to start from “scratch” on a new version of DataUp in a language that we can support.
  • Datasets will get DataCite digital object identifiers (DOIs) via EZID.
  • All data deposited via Dash into ONEShare will be discoverable via DataONE.

FAQ about the change

What will happen to DataUp as it currently exists?

The current version of DataUp will continue to exist until November 1, 2014, at which point we will discontinue the service and the dataup.org website will be redirected to the new service. The DataUp codebase will still be available via the project’s GitHub repository.

Why are you no longer supporting the current DataUp tool?

We have limited resources and can’t properly support DataUp as a service due to a lack of local experience with the C#/.NET framework and the Windows Azure platform. Although DataUp and Dash were originally started as independent projects, over time their functionality converged significantly. It is more efficient to continue forward with a single platform and we chose to use Dash as a more sustainable basis for this consolidated service. Dash is implemented in the Ruby on Rails framework that is used extensively by other CDL/UC3 service offerings.

What happens to data already submitted to ONEShare via DataUp?

All datasets now in ONEShare will be automatically available in the new Dash discovery environment alongside all newly contributed data. All datasets also continue to be accessible directly via the Merritt interface at https://merritt.cdlib.org/m/oneshare_dataup.

Will the same functionality exist in Dash as in DataUp?

Users will be able to describe their datasets, get an identifier and citation for them, and share them publicly using the Dash tool. The initial implementation of DataONE-Dash will not have capabilities for parsing spreadsheets and reporting on best practices compliance. Also, the user will not be able to describe column-level (i.e., attribute) metadata via the web interface. Our intention, however, is to develop these functions and other enhancements in the future. Stay tuned!

Still want help specifically with spreadsheets?

  • We have pulled together some best practices resources: Spreadsheet Help 
  • Check out the Morpho Tool from the KNB – free, open-source data management software you can download to create/edit/share spreadsheet metadata (both file- and column-level). Bonus – The KNB is part of the DataONE Network.

 

It's the dawn of a new day for DataUp! From Flickr by David Yu.


The First UC Libraries Code Camp

This post was co-authored by Stephen Abrams.

Military camp on Coronado Island, California. Contributed to Calisphere by the San Diego History Center. Click on the image for more information.


So 30 coders walk into a conference center in Oakland… No, it’s not a bad joke in need of a punch line; it describes the start of the first UC Libraries Code Camp, which took place in downtown Oakland last week. These coders were all from the University of California system (8 out of 10 campuses were represented!) and work with or for the UC libraries. CDL sponsored the event and was well represented among the attendees.

The event consisted of two days of lively collaborative brainstorming on ways to provide better, more sustainable library services to the UC community. Camp participants represented a variety of library roles (curatorial, development, and IT), which brought complementary perspectives to common problems and solutions. The camp was organized according to the participatory unconference format, in which topics of discussion were arrived at through group consensus. The final schedule included 10 breakout sessions on topics as diverse as the UC Libraries Digital Collection (UCLDC), data visualization, agile methodology, cloud computing, and use of APIs. There was also a plenary session of “dork shorts” in which campus representatives gave summary presentations on selected services and initiatives of common interest.

The conference agenda, with notes from the various breakouts, is available on the event website. For those of us who work in the very large and expansive UC system, get-togethers like this one are crucial for ensuring we are efficiently and effectively supporting the UC community.

Of Note

  • We established a GitHub organization: UCLT. Join by emailing your GitHub username to uc3@ucop.edu.
  • We are establishing a Listserv: uclibrarytech-l@ucop.edu
  • Next code camp to take place in the south, in January or February 2015 (we need a southern campus to volunteer!).

Next Steps

  1. Establish a new Common Knowledge Group for Libraries Information Technologists. We need to draft a charter and establish the initial principles of the group. Status: in progress, being led by Rosalie Lack, CDL
  2. Help articulate the need for more resources (staff, knowledge, skills, funding) that would allow libraries to better support data and the researchers creating/managing data. Status: database of skills table is being filled out. Will help guide discussions about library resources across the UC.
  3. Build up a database of UC libraries technologists; help share expertise and skills. Status: table being filled out. Will be moved to GitHub wiki once completed.
  4. Establish a collaborative space for us to share war stories, questions, concerns, approaches to problems, etc. Status: GitHub Organization created. Those interested should join by emailing us at uc3@ucop.edu with their GitHub username.
  5. Have more Code Camp style events, and rotate locations between campuses and regions (e.g., North versus South). Status: can plan these via GitHub organization + listserv
  6. Keep UC Code Camp conversations going, drilling down into some specific topics via virtual conferencing. Status: can plan these via GitHub organization + listserv. Can create specific “teams” within the GitHub organization to help organize more specific groups within the organization.
  7. Develop teams of IT + librarians to help facilitate outreach and education on campuses.
  8. Have CDL visit campuses more often to run informational sessions.
  9. Have space for sharing outreach and education materials around data management, tools and services available, etc. Status: can use GitHub organization or …?