Implementation of Graph Database using Neo4j on Historical Dataset
Japan’s attack on Pearl Harbor in 1941 drew the United States into World War II and set off a wave of shock and fear across the country. It prompted the U.S. government to round up Japanese-Americans and send them to internment camps. Between 110,000 and 120,000 Japanese-Americans, 70 percent of them born in the United States, were forced to leave their homes on the West Coast and were incarcerated in makeshift camps in desolate areas until after the end of World War II.
The National Archives and Records Administration (NARA) is the repository of our data. The dataset contains paper records of internal security cases and the associated paper index cards for the 10 Relocation Centers. These records have not been released to the public because some of them are subject to access restrictions. The National Archives wants to use the metadata extracted from the index cards to identify cards that hold information about internees under 18 years of age, which should not be released to the public.
For this project, we are restricting our study to one of the objectives:
“Explore and discover the untold stories hidden in the forest of index cards and analyze the social network among the people who were sent to these internment camps.”
- Study the Japanese-American WWII incarceration camps
- Study the index cards and capture their contents in MS Excel
- Model a graph database in Neo4j using CSV files
- Visualize the data to detect patterns or correlations and understand the data’s significance
The index cards contain the following data:
- Name of the Person
- The case report ID
- The relevant Page number in the report
- The subject of the case report (offenses such as a Riot)
- The Japanese-American or Japanese internee name
- The residence ID in the camp (9999 - D)
- Remarks section
Below is an example of one of these cards:
The index cards have different styles. This is most likely because they index a record series made up of files created and used in different offices and at different internment camps. Once the initial analysis of about 500 index cards was complete, we created an Excel sheet that contains the structured text output. This is saved as a CSV file for input into the graph database model.
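As a rough illustration of what that CSV step might look like, the sketch below parses card rows and counts how often each field is missing. The column names, the `CR-001` style IDs, and the sample rows are all hypothetical placeholders (the actual headers of our sheet are not reproduced here); only the list of fields and the `9999-D` residence format come from the cards described above.

```python
import csv
import io

# Hypothetical column names (an assumption, not the real sheet headers).
FIELDS = ["name", "case_report_id", "page", "subject",
          "internee_name", "residence_id", "remarks"]

# Made-up placeholder rows, not real records.
SAMPLE_CSV = """name,case_report_id,page,subject,internee_name,residence_id,remarks
Person A,CR-001,12,Riot,Person A,9999-D,
Person B,CR-001,14,Riot,Person B,,witness
Person C,,,,Person C,9999-D,
"""

def load_cards(text):
    """Parse CSV text into a list of dicts, mapping blank cells to None."""
    reader = csv.DictReader(io.StringIO(text))
    cards = []
    for row in reader:
        card = {}
        for field in FIELDS:
            value = (row.get(field) or "").strip()
            card[field] = value or None
        cards.append(card)
    return cards

def missing_counts(cards):
    """Count, per field, how many cards have no value for it."""
    return {f: sum(1 for c in cards if c[f] is None) for f in FIELDS}

cards = load_cards(SAMPLE_CSV)
for field, count in missing_counts(cards).items():
    print(f"{field}: {count} missing of {len(cards)}")
```

Mapping blank cells to `None` up front makes the gaps between card styles explicit before anything is loaded into the graph.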
We created two Dashboards for our dataset. The graphs and dashboards can be viewed here:
Why Graph Database
The index cards correspond to the roughly 120,000 Japanese-Americans who were sent to these camps during World War II, which makes for a very large dataset. With conventional technologies such as a relational database or an Excel sheet, it would be very difficult to visualize the social network among these people and store the metadata of each individual at the same time.
Hence, we use the Neo4j platform, a graph database tool that physically stores data together with its relationships. The nodes in the graph can be people, organizations, or events (essentially the entities we extracted in GATE and stored in the database). The edges can represent family connections or the co-appearance of people on an incident card. Both nodes and relationships can be further tagged with attribute-value pairs. After “stitching” nodes together with a number of computed relationships, a social network can be built. This allows for a variety of analyses, including clustering, finding the shortest path between two nodes, calculating various measures of centrality and closeness, and recognizing hidden relationships in the network. The results of this type of network analysis may have strong social impacts, and when we are ready, we hope to engage with experts and survivors who can help guide the process in a meaningful and ethical way, taking into account the underlying sensitivities and navigating the inherent collection biases and propaganda.
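To make the idea of "stitching" concrete, here is a minimal in-memory sketch in plain Python rather than Neo4j. It links people who appear on the same case report, then computes degree centrality and a breadth-first shortest path. The names, report IDs, and function names are invented for illustration, and the choice of co-appearance on a report as the linking rule is an assumption about one possible computed relationship.

```python
from collections import defaultdict, deque
from itertools import combinations

# Made-up (person, case report) pairs; not real records.
appearances = [
    ("Person A", "CR-001"),
    ("Person B", "CR-001"),
    ("Person B", "CR-002"),
    ("Person C", "CR-002"),
    ("Person D", "CR-003"),
]

def build_coappearance_graph(appearances):
    """Connect every pair of people who share a case report."""
    by_report = defaultdict(set)
    people = set()
    for person, report in appearances:
        by_report[report].add(person)
        people.add(person)
    graph = {p: set() for p in people}
    for members in by_report.values():
        for a, b in combinations(sorted(members), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def degree_centrality(graph):
    """Fraction of the other nodes each node is directly linked to."""
    n = len(graph)
    if n < 2:
        return {p: 0.0 for p in graph}
    return {p: len(nbrs) / (n - 1) for p, nbrs in graph.items()}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nbr in sorted(graph[node]):
            if nbr not in previous:
                previous[nbr] = node
                queue.append(nbr)
    return None

graph = build_coappearance_graph(appearances)
centrality = degree_centrality(graph)
path = shortest_path(graph, "Person A", "Person C")
print(path)
```

In this toy data, Person A and Person C never share a report, yet they are linked through Person B. This is the kind of indirect relationship the network view is meant to surface.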
Implementation using Neo4j
The two most important components of the project were extracting as much information as we could from these index cards and storing it in our database for better and faster queries and analytics. Since the index cards do not follow one specific format, we had a lot of missing data. If we tried to create one complex graph that includes all attributes at once, we would lose a lot of important information in the process of omitting null values. Hence, we broke our problem statement down into the following questions:
- How many different events occurred, and who participated in them? Did any person participate in more than one event?
- At which residence address did each Japanese-American live? Who were their family members?
- Which case file corresponds to each person?
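The reasoning behind splitting the problem this way can be sketched in plain Python. Each question gets its own edge list and drops only the cards missing the fields that question needs, instead of dropping every card with any gap. The field names and sample cards are hypothetical, and treating a shared residence ID as a hint of family ties is an assumption made for illustration.

```python
from collections import defaultdict

# Made-up placeholder cards; None marks a field missing from the card.
cards = [
    {"name": "Person A", "subject": "Riot", "residence_id": "9999-D", "case_report_id": "CR-001"},
    {"name": "Person A", "subject": "Strike", "residence_id": None, "case_report_id": "CR-002"},
    {"name": "Person B", "subject": None, "residence_id": "9999-D", "case_report_id": None},
    {"name": "Person C", "subject": "Riot", "residence_id": None, "case_report_id": "CR-001"},
]

def edges(cards, source, target):
    """Return sorted (source, target) pairs, skipping cards missing either field."""
    return sorted({(c[source], c[target]) for c in cards if c[source] and c[target]})

def group(pairs, key_index):
    """Group pairs by the element at key_index, collecting the other element."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair[key_index]].append(pair[1 - key_index])
    return dict(grouped)

# One edge list per question.
participated_in = edges(cards, "name", "subject")
lived_at = edges(cards, "name", "residence_id")
has_case = edges(cards, "name", "case_report_id")

# Question 1: participants per event, and people with more than one event.
events = group(participated_in, 1)
multi_event = sorted(p for p, ev in group(participated_in, 0).items() if len(ev) > 1)

# Question 2: people sharing a residence ID (assumed hint of family ties).
households = group(lived_at, 1)

# Question 3: case files per person.
case_files = group(has_case, 0)

# For comparison: cards that would survive if every field were required.
complete = sum(1 for c in cards if all(c.values()))
print(events, multi_event, households, case_files, complete)
```

In this toy data only one of the four cards has every field, yet each question still draws on two or three cards. That is the information a single all-attributes graph would have discarded.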