Saturday, August 1, 2015

NoneType Error When Extracting Element.Text, But Text Is Definitely There

I'm trying to parse a large XML file that is divided into smaller sub-documents. Each sub-document follows the same general format and uses the same element tags. I loop through the file and extract key elements with a series of conditionals on each element's tag, attributes, and/or text. Here's the relevant snippet of my code. The break condition is only there so I can test incrementally and isolate problem points:

data = []
for event, elem in etree.iterparse(xmlFile, events=("start", "end")):
    if event == "start":

        # if beginning of sub-document
        if elem.tag == "n-document":

            # increase counter
            count = count + 1

            # if the counter is too large, break
            if count > limit:
                break

            # add a dictionary to hold relevant sub-document content
            data.append({})

        # if the tag is n-field
        if elem.tag == "n-field":

            # if the name is judge.docket, get the judge's name
            if elem.attrib["name"] == "judge.docket":
                data[count]["Judge(s)"] = elem.text[len("Judge(s): "):]

Of course, I add other key-value pairs to the dictionary for each sub-document, but the judge is the relevant one for my question. It works for the first 53 sub-documents: I can cleanly extract the judge's name from each and add it to the corresponding dictionary in the list of dictionaries that I call "data". But on the 54th sub-document, I get the following NoneType error:

data[count]["Judge(s)"] = elem.text[len("Judge(s): "):]
TypeError: 'NoneType' object has no attribute '__getitem__'

My understanding of the error is that there is nothing to slice, and indeed, when I remove [len("Judge(s): "):] from the code, it stores None as the judge's name. What confuses me is that there is nothing special about this sub-document in terms of XML formatting: it follows the same general outline as the 53 sub-documents before it. Here's the XML section that runs into issues. Like every other sub-document, it is wrapped in an "n-document" element; I just didn't include those tags:

<n-metadoc contenttype="Dockets">
<n-field name="attorney.docket">Plaintiff Attorney(s): JONATHAN LYNWOOD ABRAM</n-field>
<n-field name="attorney.docket">Defendant Attorney(s): SHERYL LYNN FLOYD</n-field>
<n-field name="judge.docket">Judge(s): JUDGE MARGARET M. SWEENEY</n-field>
<n-field name="party.name">Plaintiff(s): AG-INNOVATIONS, INC., AND; LARRY FAILLACE AND; LINDA FAILLACE AND; HOUGHTON FREEMAN</n-field>
<n-field name="party.name">Defendant(s): USA</n-field>
<n-field name="jury.demand">NONE</n-field>
...[XML omitted for brevity]
</n-metadoc>

Also, for what it's worth, I can extract all the other information in the 54th sub-document, just not the judge name.

It's also worth noting that my extractions from the previous 53 sub-documents aren't all clean: every once in a while, one of the entries shows up as "NaN" when the data is converted to a DataFrame. I'm assuming it's the same problem, but maybe I'm wrong. (I am relatively new to all of this.)
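To check whether the occasional missing entries come from the same empty text, the slice could be wrapped in a guard that records None instead of crashing. This is only a minimal sketch: `strip_label` is a hypothetical helper name, not part of the snippet above, and unlike the original slice it only removes the label when the text really starts with it.

```python
LABEL = "Judge(s): "

def strip_label(text, label=LABEL):
    """Return text without the leading label; pass None through unchanged."""
    if text is None:
        # Nothing to slice: keep the gap explicit instead of raising TypeError.
        return None
    if text.startswith(label):
        return text[len(label):]
    # Assumption: text without the expected label is kept as-is.
    return text

print(strip_label("Judge(s): JUDGE MARGARET M. SWEENEY"))
print(strip_label(None))
```

With a guard like this, any sub-document whose text was not available would end up with an explicit None, which makes those cases easy to count.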

There also definitely isn't any element with name="judge.docket" before the snippet I pasted: the only thing in this sub-document before the snippet is a metadata block without any attributes. Does anyone know what the problem is?
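One thing that may be worth checking, going by the ElementTree documentation: on a "start" event, iterparse only guarantees that the opening tag has been seen, so elem.attrib is set but elem.text may not have been read yet. The text is complete by the matching "end" event. That would be consistent with the failure showing up only occasionally, though I have not confirmed that this is the cause here. Below is a minimal sketch of the same loop reading the text on "end" events. It assumes the standard-library xml.etree.ElementTree (the original may be using lxml), and `SAMPLE`, `extract_judges`, and the second judge's name are made up for illustration.

```python
import io
import xml.etree.ElementTree as etree

# Made-up sample with the same tag layout as the sub-documents in the question.
SAMPLE = b"""<root>
<n-document><n-metadoc contenttype="Dockets">
<n-field name="attorney.docket">Plaintiff Attorney(s): JONATHAN LYNWOOD ABRAM</n-field>
<n-field name="judge.docket">Judge(s): JUDGE MARGARET M. SWEENEY</n-field>
</n-metadoc></n-document>
<n-document><n-metadoc contenttype="Dockets">
<n-field name="judge.docket">Judge(s): JUDGE B</n-field>
</n-metadoc></n-document>
</root>"""

def extract_judges(source, limit=None):
    """Collect one dict per n-document, reading element text on 'end' events."""
    data = []
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "n-document":
            # Stop once the requested number of sub-documents is collected.
            if limit is not None and len(data) >= limit:
                break
            data.append({})
        elif event == "end" and elem.tag == "n-field":
            # By the 'end' event the element's text has been fully parsed.
            if elem.get("name") == "judge.docket" and elem.text and data:
                data[-1]["Judge(s)"] = elem.text[len("Judge(s): "):]
        elif event == "end" and elem.tag == "n-document":
            # Free memory held by the finished sub-document.
            elem.clear()
    return data

records = extract_judges(io.BytesIO(SAMPLE))
print(records)
```

Using data[-1] instead of a separate counter keeps the index tied to the most recently appended dictionary, so the two cannot drift apart.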
