All right everyone, it is noon, so I figured we'd get started. Thank you again for joining us for the second annual summer training series hosted by the National Data Archive on Child Abuse and Neglect. We are housed in the Bronfenbrenner Center for Translational Research at Cornell University. We have had presenters from Cornell, such as our presenter today, Michael Dineen, who works at the data archive and is one of our lead analysts. But we've also had presentations from people who are involved in the various stages of data collection and oversight, including the Children's Bureau, so we've been really lucky this year to have such wide support. And here's our summer series schedule. As you can see, we are reaching the end of our series. First we had an introduction to the archive and the services we offer. We then had overviews of the NCANDS, AFCARS, and NYTD data, which make up our administrative cluster, and the series this summer focuses on that administrative cluster. We had strategies for data management, which introduced the idea of linking and included a little bit of code on linking, but today we're going in-depth on these three specific data sets and how we link them. Michael's going to go through both the theory of it and the practice. And then for our concluding session we will have Frank Edwards back, who was here for data management, and he'll be giving us an example of a paper that he did using our administrative data. One of the unique facets of our data is the ability to link it across people's CPS history and their foster care experiences, and then, for those who are likely to age out of foster care without finding a permanent placement, we can also link in their NYTD data, which follows them for three waves.
As I said, this is a unique facet of our data and it's a big strength, but it's also a complicated process, which is why our website has both videos and PDFs going over the details of how to link. There are some big benefits to linking: it allows researchers to ask unique questions, and it adds detail and context to an individual's data in each system. Sometimes it can be hard to tell whether the results you are looking at in the foster care system are really a result of differences in children's child protective services history or not. Linking gives us that rich detail, and it can be all-encompassing for a specific population. However, there are also some difficulties with linking. The data structure can differ between the data sets. Michael will talk a little bit about that, and I think Frank talked about it as well. NYTD and AFCARS are organized by child, whereas NCANDS is organized by report incident. There can also be entry or recording errors in administrative data, so Michael will talk about the ways he works around that to make sure you have the right child. And you really have to be familiar with the ins and outs of multiple data sets. Those of you who have been here for the whole series had in-depth sessions on the three data sets, but there are also a lot of user guides and codebooks on our website, many of them created by Michael, and they are extremely helpful, so if you're going to attempt linking in your research I really suggest that you start there. After the session, if you're going forward with that, the first step is always to download our user guides and codebooks, and then also the PDFs and videos on linking that we have available on the website. This session will also be posted in early fall, so it can be a resource to come back to as well.
And so I'm going to pass it over to Michael Dineen. He's one of our lead analysts, he works specifically with these data sets, and he is our go-to for linking. So thank you very much for being here, Michael, and take it away.
Hello everybody. I'm going to lead you through this process. The administrative data sets that we hold here at NDACAN include the NCANDS, which is the National Child Abuse and Neglect Data System. That's what we call the child file, and it has something like 3.5 million to 4 million records per year of child maltreatment reports. Then we have AFCARS, which is the Adoption and Foster Care Analysis and Reporting System. That has two files: an adoption file and a foster care file. The adoption file covers kids who were adopted out of foster care. Not all adoptions in the United States, just those where there was state involvement in the adoption. Other entities are allowed to contribute to the foster care file, but that's pretty rare. Apparently some of the Native American reservations are going to be coming online as contributors to the adoption and foster care files, maybe with the 2018 data, but I'm not sure. That is in the future. The NYTD is the third big administrative data set. That also has two files. There's a services file for all the kids who receive NYTD services, or really they're called Chafee services, under the Chafee Act, which funded the whole NYTD system and the services related to it. Services are provided to kids who are likely to age out of foster care. They haven't aged out yet, but they will age out, and some states have extended foster care so that kids can continue to receive services up to the age of 21.
So, having received the services and also having experienced foster care, there's a second part of the NYTD, which is an outcomes survey that they give to kids who turn 17 in a particular year and then follow up at age 19 and age 21. I want to give you some definitions that I'll be using. By a table I mean a format where data are arranged in rows and columns. It can be called a flat file or a data set, but a table is rows and columns, so you're talking two dimensions. Rows are cases. A case could be a child, or, as in the case of NCANDS, it could be a child on a report. Sometimes there's one variable that defines a row uniquely, and sometimes it takes more than one variable, but every row in a table should have one or more variables that define that row uniquely and separate it from all other rows. No two rows can have the same combination of values for the one or more variables that define the record. So a case can be defined by one or more columns. Columns can be called variables, and we'll use "variables" in this context, because rows and columns are database management terminology and variables is more social research terminology, so we'll stick with that because of the audience we're with today. Then, generally, about linking tables. If the variables you need are in two different tables, then you can link the tables, provided that the variables you want are about the same entity. An entity is the object that the variables contain information about. What I'm calling AFCARS data sets are any data sets we have that have an AFCARS ID in them. So that's foster care, adoption, NYTD, and the child file. The common entity in all these files is a child.
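The idea that one or more variables must define each row uniquely can be checked mechanically before any linking is attempted. The following is a minimal sketch in plain Python (not the Stata used later in the session); the function name `duplicate_keys` and the column names `report_id` and `child_id` are hypothetical stand-ins for illustration, not the actual NDACAN variable names.

```python
from collections import Counter


def duplicate_keys(rows, key_cols):
    """Return the key values that appear on more than one row."""
    counts = Counter(tuple(row[col] for col in key_cols) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical report-child rows: one child can appear on several reports.
child_file = [
    {"report_id": "R1", "child_id": "C1"},
    {"report_id": "R1", "child_id": "C2"},
    {"report_id": "R2", "child_id": "C1"},
]

# child_id alone does not define a row uniquely here ...
print(duplicate_keys(child_file, ["child_id"]))
# ... but the combination of report_id and child_id does.
print(duplicate_keys(child_file, ["report_id", "child_id"]))
```

An empty list means the chosen columns work as a row identifier; a non-empty list shows which cases would have to be resolved first.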
Both tables must have columns that define the entity. There's an important identifier, which I call the state foster care ID, that is in all these files, and in order to link tables you have to have that. You have to have a linking variable, and in our case it's this particular variable, the state foster care ID. So in order to link any two tables, you have to have a common variable, and that common variable has to be the one that defines a row in the file. Now, the steps in linking. First of all, I want to tell you that you don't just take two tables and link them. You have to prepare the tables separately ahead of time: you take each table and do some preparation to set it up so that you'll be able to link. First you want to clarify your hypothesis. What variables are you going to need to do your analysis, and what data sets have those variables? Then you'll remove all the unneeded variables from each data set, just to simplify things. Then you'll resolve the tables to one row per child. This is speaking specifically of our data sets, because for us it's always going to be a child. In more general terms, you resolve your data sets so that a row is the same kind of row that's in the other table. Let's say it was a hospital database and you had patients in a billing table and patients in a medical table. You'd have to resolve both to the patient level in order to link them. Then you save the results of your reduction of the data into a table, and then you link the restructured tables. So that's the process: figure out what variables you need, figure out what data sets you'll need, remove the unneeded variables, resolve the tables to one row per child in our case, and save the results, which have one row per child and a limited number of variables.
And then you're ready to link these restructured tables. So, resolving the AFCARS foster care file to one row per child. The annual foster care file already comes with one row per child, but you might not want all the rows. You might just want kids who entered, or kids who were in care at the end of the year, or kids who aged out, or kids who are waiting for adoption, or other things, and you can take out the rows that don't meet the inclusion criteria you are interested in. It'll still be one row per child, but it'll be fewer rows. If you're using multiple years of the foster care file, then a child will be present in the data set for each year that they're in foster care. Every annual foster care file is a complete census of all kids who passed through foster care that year: they either entered, or exited, or were in care at the end of the year. So every child who was served by foster care is in the file annually, and if a child is in care for multiple years, they'll be in multiple years of the AFCARS file. If you want to use multiple years of the foster care file and you want to resolve to one record per child, you still have to eliminate some of those records. Almost always with the foster care file you'll want to use the most recent appearance, because if they exited it's going to have their exit, and if they're still in care it will show that. And the foster care file always has their first entry into foster care, and it always has the most recent entry date into foster care.
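Keeping only the most recent appearance of each child across stacked annual files can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the archive's own procedure: the column names `stfcid` (standing in for the state foster care ID), `fy` (fiscal year), and `exited` are hypothetical, and the rows are made up.

```python
def most_recent_per_child(rows, id_col="stfcid", year_col="fy"):
    """Keep, for each child, the row from the latest fiscal year."""
    latest = {}
    for row in rows:
        kept = latest.get(row[id_col])
        if kept is None or row[year_col] > kept[year_col]:
            latest[row[id_col]] = row
    return list(latest.values())


# Two annual files stacked on top of each other (made-up rows).
stacked = [
    {"stfcid": "A1", "fy": 2012, "exited": 0},
    {"stfcid": "A1", "fy": 2013, "exited": 1},
    {"stfcid": "B2", "fy": 2013, "exited": 0},
]

for row in most_recent_per_child(stacked):
    print(row)
```

Child A1 appears in both years, and only the 2013 row, which carries the exit, is kept.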
So you'll have that no matter what year you use. The last year is going to have the information you need about when they entered foster care, and it's going to have information like what services they received and any risk factors that existed at the time they were removed from the home. That's going to be in every year, so it's going to be in the last one as well as any other year that they are in foster care. In other words, a lot of information in the foster care file is duplicated from year to year for the same child, because a lot of it is information that was gathered at the time they entered foster care, and it's preserved in their record as they go through the years in foster care. Then the NYTD outcomes file is the survey that we do for the kids who either aged out of foster care or were at risk of aging out of foster care. We now have full outcomes sets, all three waves, for the fiscal year 2011 cohort and the fiscal year 2014 cohort. Again, what I mean by the fiscal year 2011 cohort is kids who turned 17 in fiscal year 2011. They were resurveyed at age 19, which would have been 2013, and again at age 21, which would have been 2015. So their responses to the survey questions at all three waves are in the data set, for the 2011 cohort in one and for the 2014 cohort in the other. We also have the baseline survey, wave one, for the 2017 cohort available, but that only has one wave. The data set comes in what's called long or stacked form, in that if a child responded to all three waves, they'll have three entries in the table. If they only responded to one, they'll have one record in there, and if they didn't respond at all they'll still have one record in there, because the outcomes file has at least demographic information for everyone who turned 17 in the baseline year. So it's arranged long, with up to three records per child.
To get one row per child, what people do in order to preserve the answers to the questions is convert the data set to wide form. So instead of having the wave one, wave two, and wave three answers to, say, the education question all in the same column, you'll have three different columns for each question. If there's a question about whether they were incarcerated, you would have incarcerated1, incarcerated2, and incarcerated3: three columns, one for each of the three waves of the incarcerated question. In general I'll be talking about limiting columns and limiting rows. On limiting rows: the kids who weren't in the cohort, in other words who did not respond at all to the wave one survey, will have no outcomes data, and you may or may not want those kids. You may want to keep them if you want to compare the demographics of those who responded and those who did not, but if you're not interested in that you can take them out. Now, the NYTD services file, and resolving it to one row per child. The services file can have two records per child per year because it's on a six-month basis: there's a record for each six-month period in which the child received a service. The 2018 services file has data from 2011 through 2018, so as far as eliminating rows goes, you might want to eliminate certain years. Then, if you want one record per child, you can take the two six-month records and consolidate them. For example, you could add a counter that says how many six-month periods in that year they received services, one or two. Another thing you might consider is whether you only want kids who received some particular service. Now, resolving the NCANDS child file. The child file has one row per report-child. A report can have multiple children on it, and a child can be on different reports, so you need both the report ID and the child ID to uniquely define a row in the child file.
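The long-to-wide conversion described above for the outcomes file can be sketched in a few lines. This is a hedged Python illustration, not the Stata `reshape` command used in the demonstration; the function `reshape_wide` and the column names (`stfcid`, `wave`, `fy14cohort`, `homeless`, `incarc`) are stand-ins chosen for the example, and the rows are invented.

```python
def reshape_wide(rows, id_col, wave_col, value_cols, waves=(1, 2, 3)):
    """Turn up to three stacked rows per child into one row per child."""
    wide = {}
    for row in rows:
        out = wide.get(row[id_col])
        if out is None:
            # Columns that do not vary by wave are copied once.
            out = {col: val for col, val in row.items()
                   if col != wave_col and col not in value_cols}
            # Every wave column exists up front; None marks a missing wave.
            for col in value_cols:
                for wave in waves:
                    out[f"{col}{wave}"] = None
            wide[row[id_col]] = out
        # The wave number becomes a suffix on each answer column.
        for col in value_cols:
            out[f"{col}{row[wave_col]}"] = row[col]
    return list(wide.values())


long_rows = [
    {"stfcid": "A1", "wave": 1, "fy14cohort": 1, "homeless": 0, "incarc": 0},
    {"stfcid": "A1", "wave": 2, "fy14cohort": 1, "homeless": 1, "incarc": 0},
    {"stfcid": "A1", "wave": 3, "fy14cohort": 1, "homeless": 0, "incarc": 1},
    {"stfcid": "B2", "wave": 1, "fy14cohort": 0, "homeless": None, "incarc": None},
]

for row in reshape_wide(long_rows, "stfcid", "wave", ["homeless", "incarc"]):
    print(row)
```

The wave column disappears as a column of its own and survives only as the suffix on `homeless1`, `homeless2`, `homeless3`, and so on, which is why dropping unneeded variables first keeps the wide table manageable.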
Each fiscal year is a completely separate data set. We don't combine them prior to sending them out, but you can combine them. In other words, you would be stacking them, one on top of the other. As I said, a child can appear on more than one report, but they can also appear in more than one year. To get one row per child, it may be necessary to use summary functions to populate variables. For example, you might want to know the number of reports a child has appeared on. Another thing you might want is whether the question was ever answered positively for them. Say you were interested in parental drug abuse. You would turn that into a dummy variable, yes or no, did they have that, and then you would take the maximum of the one or zero. If you're summarizing at the level of a child, you'd take the max of the FC drug question: did the child's parents ever have a problem with drugs? You would use the maximum of a one-or-zero dummy variable rather than a sum or a count. So that's each of the files and how to resolve them to one record per child, a general overview of the strategies involved. Now we have the tables resolved to one row per child, we have the rows we want, and we have the variables we want. We have our two data sets set up for linking. This slide just repeats what I said at the very beginning, as a summary before I go into more detail. I'm going to give you a demonstration using Stata, and I'm going to go through the steps we listed: clarifying the hypothesis, deciding what data sets we'll need, removing unneeded variables, resolving the tables to one row per child, saving the results to a table, and then linking the tables.
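The two summary functions mentioned for the child file, a count of reports and the maximum of a one-or-zero dummy, can be sketched like this. It is a minimal Python illustration under assumptions: `child_id`, `report_id`, and `drug_dummy` are hypothetical names, and the dummy is assumed to have already been recoded to 1 (yes) and 0 (no).

```python
def summarize_by_child(rows, child_col, flag_col):
    """Collapse report-child rows to one row per child."""
    summary = {}
    for row in rows:
        child = summary.setdefault(
            row[child_col],
            {child_col: row[child_col], "n_reports": 0, "ever_flag": 0},
        )
        # One input row is one report the child appeared on.
        child["n_reports"] += 1
        # The max of a 0/1 dummy is 1 if the answer was ever yes.
        child["ever_flag"] = max(child["ever_flag"], row[flag_col])
    return list(summary.values())


reports = [
    {"child_id": "C1", "report_id": "R1", "drug_dummy": 0},
    {"child_id": "C1", "report_id": "R2", "drug_dummy": 1},
    {"child_id": "C2", "report_id": "R1", "drug_dummy": 0},
]

for row in summarize_by_child(reports, "child_id", "drug_dummy"):
    print(row)
```

Taking the maximum rather than the sum keeps the result a yes-or-no indicator no matter how many reports a child appears on.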
And we're going to do it with a purely hypothetical hypothesis, which is that unstable experiences in foster care predict more negative outcomes. Given that hypothesis, our dependent variables are going to come from the 2014 NYTD outcomes file: has the child experienced homelessness, has the child experienced substance abuse, have they been incarcerated, and have they had children. The independent variables, or predictor variables, are going to come from the foster care file: the number of placements, the total removals from their home, and their lifetime length of stay in foster care. So we're hypothesizing that the more placements they've had, the more times they've been removed, and the longer they've been in foster care, the more negative their outcomes will be. I don't know if this is the case or not. I don't even really hypothesize that; I'm just using it as something to demonstrate the process. The demographic variables are in both the NYTD file and the foster care file, but if you ever link the NYTD and the foster care file, I highly recommend that you always use the demographics from the foster care file, because in NYTD you might get disagreements across the three waves about their age, their gender, their race, or things like that. They might not always put the same thing. The foster care file at least has the same one every time, so you won't have to deal with that. Okay, so I'm going to do the Stata walk-through, for which there will be a ".do" file, which is a Stata file. I don't know how many of you use Stata, but what I want you to see in this is not so much how to do it in Stata as how to do it with a statistical program. I'm going to bring up Stata right now.
This is the Stata interface, and I'm going to be doing one step at a time rather than running a batch of things. I'm also going to do it with commands and not with the menu items like "combine data sets" and "merge two data sets". You could go through it that way, but that won't tell you a whole lot about what's going on under the hood. I'll be using and keeping files in this temp folder called "D:\Temp". So the first thing I'm going to do, once I've opened Stata, is find that folder. I'm changing the directory to "temp", so now Stata is looking at the directory I just showed you. This one, "D:\Temp". Next I'm going to say use cohort. Wait a minute. Okay. Oh, I misspelled it, I didn't realize. Okay. So you see here that we now have variables, and to see the data set itself, here's the table we are dealing with. It's got waves, so you'll see the age 17 baseline, that's wave one. If I go down you'll see it's got the age 19 follow-up, and further down the age 21 follow-up. So it's got the three waves stacked. They're different rows. And then there's the state foster care ID, which is our linking variable, and all these different questions that they've asked the children: current enrollment, substance abuse, incarceration, and so forth. And then these, like are they in the cohort, were they eligible for the 19 and 21 follow-ups, are they in a sample state, are also NYTD variables. So we are using that data set right now. First I'm going to eliminate all the variables that I don't need, so I'm going to say "keep". There's keep and drop in Stata, and which one you use just depends on how many you are keeping. You use keep if you're dropping more than you are keeping, and vice versa.
So we're going to keep "wave", we need that. We need the state foster care ID. "homeless" is one of our dependent variables, and then "subabuse", incarceration, and children. Then we need to know whether they're in the cohort or not, which is a dummy variable. And then "elig19", whether they were eligible at 19, and whether they were eligible at 21. So again, we're going to keep those. And you'll see that up here this list is going to be different. Now we only have these variables left, and you can see the table is much smaller. For the next thing we're going to do: we still don't have one row per child. We've eliminated the columns we don't want, but now we need to resolve it to one row per child, and we're going to do that by going wide. In Stata that's called reshape to wide. And these are the variables. Okay, those are the variables we want. We don't list wave in here, and we wouldn't have to list eligible at 21 and eligible at 19 if they had the same value every time, but they don't, because eligibility wasn't asked at wave one. There's no value for eligible at 19 at wave one; it only comes into play at the second wave they need to be eligible for, which is the 19. Then we have an I and a J. I is rows and J is columns. Our I variable, for rows, is going to be the state foster care ID. In other words, we're going to have one row, I, for each state foster care ID, and that's our goal state, which is to have one row per child. And then the J is going to be wave, so our J columns are going to be one for each of the three waves. What we're going to end up with, for each of these variables that changes by wave, for homeless, subabuse, and so on, is homeless1, subabuse1, incarc1, and then homeless2, and then homeless3, and so forth.
The J value is going to be added onto the back of those variable names. So let's run this, and it gives us a summary. It says note J is one, two, three; those were the three waves. The number of observations was 52,000 in the long file and 23,780 in the wide file. We've gone from nine variables to 20 variables. That's one reason you should eliminate the variables you don't need, because otherwise you're going to end up with a lot of columns that you don't need. Wave was dropped because it was transferred from a column of its own to a tag that goes onto the variables. You'll see here with homeless that the one, two, and three that used to be wave are now an identifier for which wave of the homeless question we are dealing with. Same thing with subabuse and so forth. So these are the variables we just created from those variables using wave. Let's look at that file. We still have the state foster care ID, and notice we don't have wave anymore. We have homeless1, subabuse1, and so forth, homeless2, subabuse2, and then we have eligible 19 one, two, and three, but FY 14 cohort doesn't have a one, two, three because it's the same for every wave. Okay, so now we've processed our NYTD outcomes file. We've eliminated the columns, and we've restructured it so that we have one row per child. So it's in a state that we will be able to link. Now we want to save it, and I'm going to call it "Cohort14_wide". And now I'm going to clear the memory, so I don't have any variables here anymore. I've started over again. Now we're going to do the same thing with the foster care file. Wait a minute, I'm looking at questions here. Katrina Brewsaugh's question is "The codebook says LifeLOS can only be computed for children with 1 or 2 removals (TotRem). So, wouldn't it be invalid as a variable for youth with >2 removals?". The answer is yes, it wouldn't be valid for those youth.
It would only capture two of their removals. On the other hand, more than 95% of the kids in foster care have one or two removals. It's less than about 5% of the kids who have more than two removals. So you can't deal with kids who have three or more removals, but you can deal with 95% of the kids using LifeLOS. And Nathan asks, "In the outcomes data file, there was a category called blank for each outcome, is that the same as missing data?" Yes. That's missing, or it just didn't apply. Okay, those are the questions for now, so we're going to go back to getting the foster care file. Our Stata is pointed to our temp folder, and I can say use FC2014v7. And now we have variables here, and I can show you the table. This is the foster care table. It has state, FIPS code, record number, demographic variables, whether they were ever adopted, and so forth. Lots of foster care variables. The first thing we're going to do, hold on a sec, is say what we're going to keep. We're going to keep, let's see, fiscal year; the state foster care ID, that's our linking variable; and state. This "state", S T A T E, is a numeric variable; it's the FIPS code for the state. "st" is the two-letter code for the state, like CO for Colorado or FL for Florida, just so it's much easier to see. FIPS code is the county identifier. Then we want our demographics: raceethn, where Hispanicity is treated like a race, and then race itself, which doesn't take Hispanic origin into account. Then number of placements, total removals, LifeLOS, in at end, and exited. So now we've reduced the number of variables, the number of columns, in the data set we're using. And just to see what it looks like: many fewer columns.
We get those out of the way, and then we don't have to flip it to wide because it already has one row per child, so we're just going to save this. I'm going to call it FC14_vars, for foster care 14 variables. Okay, and then I'm going to clear again. There's nothing; now we're back to zero. Now we're going to do the linking. The way it's done in Stata: I'm going to pull FC14_vars back up, my foster care variables, and then the command for linking the files is called merge. We're going to merge 1:1, one to one, because it's going to match row to row. There's one row for a child in the NYTD file and one row for that child in the foster care file, it's the same child, so those two rows are going to be connected together. There are cases where, if you had a child multiple times in one table and only once in another table, that would be called a one-to-many link, but that's not what we're doing. There's one-to-many, or many-to-one, or many-to-many. Those are kinds of links, but we are just doing the one-to-one. That's the simple one. And here are the items we're trying to bring across. We are merging one-to-one using the state foster care ID as the merge variable, and then the file we want to merge with is Cohort14_wide, which we created earlier and which is all set up for us. The foster care file, the one we have open, is called the master file. The file we are bringing in, Cohort14_wide, is called the using file. And we're going to keep only the records in the master file that match records in the using file.
We're only going to keep records that match. We're not going to keep all the records in the foster care file, because the vast majority of the foster care file is kids who weren't 17 in 2014. There would be so many more, and I can't conceive of any reason we would want to keep them or use them for something. We could do this part separately, but we're going to include it in the merge, so there goes the merge. Okay, error. Okay, maybe I shouldn't. Let's just do this. Okay, so here's the result of our match. There were 628,000 records that were not matched from the master; those are all the foster care records I spoke about. This is actually better, I'm glad it worked out this way, because you get to see what would have happened. From the using file we have 1,166. The using file is NYTD, so there are 1,166 records in NYTD that aren't matched in the foster care file. But we have 22,000 that matched, so now I can drop the rest. I can say drop if this variable that Stata creates, called _merge, equals one. Okay, so I've dropped the roughly 628,000 and the 1,166, and all we have left are the ones with the correct merge. So let's look at the data, and look at the end. _merge is always three. These first columns are from the AFCARS file; inatend and exited are AFCARS variables. Then you have the state foster care ID repeated for the second data set, the NYTD data set, and then we have our homeless1 and so forth: all our outcomes file variables that have been set to wide, the three versions of each of the outcomes questions. So that's the merged data set. Now you would be ready to work on it in whatever way you want. So we're going to save that. There, that's the demonstration using Stata.
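The logic of the one-to-one merge and its match indicator can be made concrete with a small sketch. This is plain Python written to mirror the behaviour described in the demonstration (codes 1 for master only, 2 for using only, 3 for matched), not Stata itself; `merge_one_to_one`, `stfcid`, `numplep`, and `homeless1` are illustrative names, and the rows are invented.

```python
def merge_one_to_one(master, using, key):
    """Merge two one-row-per-child tables and tag each row's match status."""
    master_by_key = {row[key]: row for row in master}
    using_by_key = {row[key]: row for row in using}
    if len(master_by_key) != len(master) or len(using_by_key) != len(using):
        raise ValueError(f"{key} does not uniquely identify rows")
    merged = []
    for k, row in master_by_key.items():
        if k in using_by_key:
            # Matched: combine columns; master values win on any overlap.
            merged.append({**using_by_key[k], **row, "_merge": 3})
        else:
            merged.append({**row, "_merge": 1})  # master only
    for k, row in using_by_key.items():
        if k not in master_by_key:
            merged.append({**row, "_merge": 2})  # using only
    return merged


foster = [  # master: every child in foster care (made-up rows)
    {"stfcid": "A1", "numplep": 3},
    {"stfcid": "B2", "numplep": 1},
    {"stfcid": "Z9", "numplep": 2},
]
nytd = [  # using: the wide outcomes file (made-up rows)
    {"stfcid": "A1", "homeless1": 0},
    {"stfcid": "B2", "homeless1": 1},
    {"stfcid": "Q7", "homeless1": 0},
]

merged = merge_one_to_one(foster, nytd, "stfcid")
matched = [row for row in merged if row["_merge"] == 3]
print(len(merged), len(matched))
```

Filtering on the indicator afterwards, as in the last lines, corresponds to dropping the unmatched records once the counts of each kind have been inspected.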
Maybe next year we can demonstrate using some other statistical program, but it would have been too long to do it in SAS and SPSS all at the same time. I just hope that I've been able to get across the process and underscore the importance of setting up your tables ahead of time, before you do the merge. That's been the key thing I wanted to get across to you. So I'm going to close Stata now, and here's our temp folder, which has these new data sets in it. You would be using this data set to do your analysis now, based on the hypothesis that we set up. And this is the .do file that has all the commands I was running. So if you have any questions, let me know now. It looks like we have some questions. I don't think you answered this one: did you answer the one about the encrypted rec numbers? From Katrina Brewsaugh: "some states have encrypted recnumbers that look like webdings or characters other than numbers and letters. how do those get matched? or are they dropped in the merge?" No, those are valid foster care IDs. When states do their encryption, the encryption process is one of character substitution, and each encryption algorithm has a pool of characters that it chooses from to do the substitution. Some states just use the ASCII range. The easiest way to describe the ASCII range is that it's your keyboard characters; all the characters on your keyboard are in the first, I think it's 128, characters of the ASCII range. Then there's an extended ASCII range, which has 256 characters, and then there's the UTF-8 range, which has over a million possible characters. So if they use UTF-8 as their character pool, you can get a lot of odd-looking characters that aren't familiar to us and that do look like webdings or other characters. But they are not.
They are actually valid characters in an encrypted file, and as odd-looking as that foster care ID is, it will be the same in all the different files, so it's valid for linking. The software will match the characters and recognize that they're the same ID. So you don't need to drop those rows that have the funny-looking encrypted IDs. Those are valid encrypted IDs.
I'll preview the next session just to give people a moment to write down any questions they have. Next week we are getting Frank Edwards back, and he is going to be going through a research example. For those of you who were here for data management, many of the strategies from that session will be used in this presentation, and that'll be our last presentation for the summer. For those of you who have to go early, thank you so much for attending, and we hope to see you next week. For those of you who are finishing up the last 10 minutes with us, we have another question. Nathan asks, "Can you explain again why so many records were not matched during the merge?"
Well, we're matching the NYTD outcomes file against the AFCARS foster care file. The NYTD outcomes file has youth who turned 17 in fiscal year 2014, and of those, about half responded to the survey. So not everyone responded to the survey. There are something like 16,000 or 24,000 records, depending on whether you keep the people who didn't respond. Okay, that's on the NYTD side; let's say we have 24,000 records. Then on the foster care side we have every child in foster care, not just kids who turned 17 in 2014. The 2014 AFCARS file has kids of all ages, and it's something like 800,000 records. So you have lots and lots of records in the foster care file that can't possibly be in the outcomes file, because that only has kids who were 17 years old. All the other ages of kids are in the foster care file, and they couldn't be matched and were dropped because they weren't in the outcomes file.
So I hope I didn't lose everyone going through that Stata demonstration. Maybe if you're familiar with Stata it would be easier. Okay, well, if there are no more questions then I'll see you next year with another webinar, because that's the end of the webinars for me this year. Thank you everybody for participating and for having the interest in our data sets. So I'll sign off and hand it over to Erin again.
Yes, and a big thank you, Michael, for doing this for us. We know that your time is really valuable and that you do so much analytic work for the archive, so your coming out to do these sessions is something we all really appreciate. And you're clearly getting thank-yous from our participants. This will be part of the full series that'll be posted in early fall. We're mainly just going through to transcribe the sessions and do some of the formatting to post them on the website. If you are not on our listserv, I highly recommend it. It's on the NDACAN website, and that's where we will be announcing when all these go live. So thanks everyone for participating. We'll see Michael next year, and for those of you continuing with the series, I will see you next week.