I'm trying to parse a large XML file that is divided into smaller sub-documents. Each sub-document follows the same general format and shares certain element tags. I loop through the sub-documents and extract key elements using a series of conditionals on each element's tag, attrib, and/or text. Here's the relevant snippet of my code; the break condition is only there so I can test incrementally and isolate problem points:
data = []
for event, elem in etree.iterparse(xmlFile, events=("start", "end")):
    if event == "start":
        # if beginning of sub-document
        if elem.tag == "n-document":
            # increase counter
            count = count + 1
            # if the counter is too large, break
            if count > limit:
                break
            # add a dictionary to hold relevant sub-document content
            data.append({})
        # if the tag is n-field
        if elem.tag == "n-field":
            # if the name is judge, get the name
            if elem.attrib["name"] == "judge.docket":
                data[count]["Judge(s)"] = elem.text[len("Judge(s): "):]
Of course, I add other key-value pairs to the dictionary for each sub-document, but the judge is the one relevant to my question. The code works for the first 53 sub-documents: I can cleanly extract the judge's name from each one and add it to the corresponding dictionary in the list of dictionaries that I call "data". But on the 54th sub-document, I get the following TypeError involving None:
data[count]["Judge(s)"] = elem.text[len("Judge(s): "):]
TypeError: 'NoneType' object has no attribute '__getitem__'
My understanding of the error is that elem.text is None, so there is nothing to slice; indeed, when I remove [len("Judge(s): "):] from the code, it enters "None" as the judge name. What confuses me is that there is nothing special about this sub-document's XML formatting: it follows the same general outline as the 53 sub-documents before it. Here's the XML section that causes the problem. Like every other sub-document, it is wrapped in an "n-document" element; I just didn't include those tags:
<n-metadoc contenttype="Dockets">
<n-field name="attorney.docket">Plaintiff Attorney(s): JONATHAN LYNWOOD ABRAM</n-field>
<n-field name="attorney.docket">Defendant Attorney(s): SHERYL LYNN FLOYD</n-field>
<n-field name="judge.docket">Judge(s): JUDGE MARGARET M. SWEENEY</n-field>
<n-field name="party.name">Plaintiff(s): AG-INNOVATIONS, INC., AND; LARRY FAILLACE AND; LINDA FAILLACE AND; HOUGHTON FREEMAN</n-field>
<n-field name="party.name">Defendant(s): USA</n-field>
<n-field name="jury.demand">NONE</n-field>
...[XML omitted for brevity]
</n-metadoc>
Also, for what it's worth, I can extract all the other information in the 54th sub-document, just not the judge name.
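Since the slice fails only when elem.text is None, a small guard would at least turn the crash into a visible missing value. This is a minimal sketch in Python; strip_prefix is a hypothetical helper name, not part of the snippet above or of any library, and it does not explain why the text is missing in the first place:

```python
def strip_prefix(text, prefix):
    """Return text without a leading prefix; pass None through unchanged."""
    if text is None:
        # Nothing to slice: report the missing value instead of raising.
        return None
    if text.startswith(prefix):
        return text[len(prefix):]
    # Prefix absent: keep the text as it is rather than cutting blindly.
    return text


with_text = strip_prefix("Judge(s): JUDGE MARGARET M. SWEENEY", "Judge(s): ")
without_text = strip_prefix(None, "Judge(s): ")
```

With this guard, a sub-document whose judge text is unavailable gets None in its dictionary rather than stopping the loop.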
It's also worth noting that my extractions of the previous 53 sub-documents aren't all clean--every once in a while one of the entries in the dataset when converted to a DataFrame is "NaN." I'm assuming it's the same problem, but maybe I'm wrong. (I am relatively new to all of this.)
There also definitely isn't any element with "name"="judge.docket" before this snippet I pasted--the only thing in this sub-document before the snippet is a metadata block without any attributes. Does anyone know what the problem is?
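One possible cause, offered as an assumption rather than a confirmed diagnosis: the xml.etree.ElementTree documentation notes that on a "start" event iterparse only guarantees that the opening tag has been read, so the attributes are set but the text may not be populated yet. Whether the text happens to be available could depend on where the parser's read buffer ends, which would fit a failure that shows up only on some sub-documents. The sketch below reads the text on "end" events instead. It assumes the standard-library xml.etree.ElementTree (the etree in the snippet may be lxml), and the function name extract_judges, the root wrapper tag, and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

PREFIX = "Judge(s): "


def extract_judges(source, limit=None):
    """Collect one dict per n-document, reading text only on "end" events."""
    data = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "n-document":
            # Stop once enough sub-documents have been opened (for testing).
            if limit is not None and len(data) >= limit:
                break
            data.append({})
        elif event == "end" and elem.tag == "n-field":
            # On "end" the element is fully parsed, so elem.text is final.
            if data and elem.get("name") == "judge.docket":
                text = elem.text or ""
                if text.startswith(PREFIX):
                    text = text[len(PREFIX):]
                data[-1]["Judge(s)"] = text
        elif event == "end" and elem.tag == "n-document":
            # Release the finished sub-document to keep memory use low.
            elem.clear()
    return data


# Hypothetical sample: a root wrapper with two sub-documents, the second
# containing an empty judge field.
SAMPLE = (
    b'<root>'
    b'<n-document><n-metadoc contenttype="Dockets">'
    b'<n-field name="judge.docket">Judge(s): JUDGE MARGARET M. SWEENEY</n-field>'
    b'</n-metadoc></n-document>'
    b'<n-document><n-metadoc contenttype="Dockets">'
    b'<n-field name="judge.docket"/>'
    b'</n-metadoc></n-document>'
    b'</root>'
)

judges = extract_judges(io.BytesIO(SAMPLE))
first_only = extract_judges(io.BytesIO(SAMPLE), limit=1)
```

Using data[-1] instead of a separate counter keeps the index tied to the most recently opened sub-document, so it cannot drift out of step with the list.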